Data Ingestion API

The Data Ingestion API allows you to upload files, connect to cloud storage providers, and monitor the processing status of your documents.

Upload Files

Upload one or more files for processing and embedding.

POST /api/data-ingestion/upload-files

Request Body

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| files | array | Array of files to upload (multipart/form-data) | Required |
| chunk_size | integer | Size of text chunks for processing | 500 |
| overlap_pct | integer | Percentage of overlap between chunks | 10 |
| embedding_model | string | Model to use for generating embeddings | "default-embedding" |
| pdf_extractor | string | PDF extraction method (pypdf2, pdfplumber, pymupdf, pymupdf4llm) | "pypdf2" |

Response

Returns a status object with information about the uploaded files and processing status.
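
As a rough illustration of the upload call, the sketch below builds a multipart/form-data request using only the Python standard library. The host `https://api.example.com` is a placeholder, the helper names are hypothetical, and sending the chunking options as form fields next to the files is an assumption; the deployed API may expect them elsewhere.

```python
import urllib.request
import uuid


def build_multipart(fields, files):
    """Encode form fields and files as a multipart/form-data body.

    fields: dict of field name -> value (converted with str()).
    files: list of (filename, content_bytes), all sent under the "files" field.
    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    for filename, content in files:
        header = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
            f'Content-Type: application/octet-stream\r\n\r\n'
        ).encode("utf-8")
        parts.append(header + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_upload_request(base_url, files, **options):
    """Build (but do not send) a POST request to the upload-files endpoint."""
    # Assumption: options such as chunk_size travel as form fields.
    body, content_type = build_multipart(options, files)
    return urllib.request.Request(
        f"{base_url}/api/data-ingestion/upload-files",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


request = build_upload_request(
    "https://api.example.com",  # placeholder host, not a real deployment
    [("report.pdf", b"%PDF-1.4 ...")],
    chunk_size=500,
    overlap_pct=10,
    pdf_extractor="pdfplumber",
)
# To actually send it: urllib.request.urlopen(request)
```

Omitted options are simply left out of the body, so the defaults listed in the table would apply.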

Connect to Cloud Storage

Connect to a cloud storage provider and ingest files.

POST /api/data-ingestion/connect-cloud

Request Body

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| provider | string | Cloud provider (gdrive, dropbox, onedrive, s3, azure, gdocs, web, excel) | Required |
| access_token | string | Access token for the cloud provider | Required |
| refresh_token | string | Refresh token (if applicable) | null |
| token_expiry | string | Token expiry timestamp | null |
| folder_path | string | Path to folder in cloud storage | null |
| chunk_size | integer | Size of text chunks for processing | 500 |
| overlap_pct | integer | Percentage of overlap between chunks | 10 |
| embedding_model | string | Model to use for generating embeddings | "default-embedding" |
| pdf_extractor | string | PDF extraction method | "pypdf2" |

Response

Returns a status object with information about the connected cloud storage and processing status.
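
The following sketch shows one way to assemble a connect-cloud request. It assumes the body is sent as JSON, which the reference does not state explicitly; the host, the helper name, and the example values are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host

# Provider identifiers as listed in the parameter table.
PROVIDERS = {"gdrive", "dropbox", "onedrive", "s3", "azure", "gdocs", "web", "excel"}


def build_connect_cloud_request(provider, access_token, **optional):
    """Build (but do not send) a POST request to the connect-cloud endpoint."""
    if provider not in PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    payload = {"provider": provider, "access_token": access_token}
    # Leave out unset optional fields so the documented defaults apply.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return urllib.request.Request(
        f"{BASE_URL}/api/data-ingestion/connect-cloud",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption: JSON body
        method="POST",
    )


request = build_connect_cloud_request(
    "s3",
    "<ACCESS_TOKEN>",
    folder_path="reports/2024",
    refresh_token=None,  # dropped from the payload
    chunk_size=800,
)
```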

Get Ingestion Status

Get the status of a specific ingestion process.

GET /api/data-ingestion/status/{file_id}

Path Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| file_id | string | ID of the file to check status for |

Response

Returns the current status of the ingestion process for the specified file.
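
Because ingestion presumably runs asynchronously, a client will usually poll this endpoint until processing finishes. The sketch below is one possible polling loop. The response shape is not documented here, so the `status` key and the terminal values `completed` and `failed` are assumptions, as are the host and function names.

```python
import json
import time
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.example.com"  # placeholder host

# Assumption: the response has a "status" key that ends in one of these values.
TERMINAL_STATES = {"completed", "failed"}


def fetch_status(file_id):
    """GET the status object for one file and decode it as JSON."""
    url = f"{BASE_URL}/api/data-ingestion/status/{quote(file_id, safe='')}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_ingestion(file_id, fetch=fetch_status, interval=2.0,
                       timeout=300.0, sleep=time.sleep):
    """Poll until the file reaches a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch(file_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"ingestion of {file_id} did not finish in {timeout}s")
        sleep(interval)
```

Passing `fetch` and `sleep` as arguments keeps the loop testable without network access; in normal use, `wait_for_ingestion("some-file-id")` is enough.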

List Ingested Files

Get all ingested files and their status.

GET /api/data-ingestion/files

Response

Returns a list of all ingested files with their processing status.

Delete All Ingested Files

Delete all ingested files and their associated data.

DELETE /api/data-ingestion/files

Response

Returns a confirmation of the deletion operation.

Delete Specific Ingested File

Delete an ingested file and all its associated data.

DELETE /api/data-ingestion/files/{file_id}

Path Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| file_id | string | ID of the file to delete |

Response

Returns a confirmation of the deletion operation.
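
The single-file and delete-all endpoints differ only by the trailing `{file_id}`, so a client may want to guard against an empty ID. The sketch below shows one such guard; the host and helper name are hypothetical, and whether the server would treat an empty ID as a delete-all request is not specified here.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.example.com"  # placeholder host


def build_delete_file_request(file_id):
    """Build (but do not send) a DELETE request for one ingested file."""
    if not file_id or not file_id.strip():
        # An empty ID would produce a URL that resembles the delete-all
        # endpoint, so refuse it on the client side.
        raise ValueError("file_id must be a non-empty string")
    return urllib.request.Request(
        f"{BASE_URL}/api/data-ingestion/files/{quote(file_id, safe='')}",
        method="DELETE",
    )


request = build_delete_file_request("file-123")
# To actually send it: urllib.request.urlopen(request)
```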

Ingest URL

Ingest a document from a URL.

POST /api/data-ingestion/ingest-url

Request Body

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| url | string | URL of the document to ingest | Required |
| filename | string | Custom filename for the document | null |
| chunk_size | integer | Size of text chunks for processing (100-2000) | 500 |
| overlap_pct | integer | Percentage of overlap between chunks (0-50) | 10 |
| embedding_model | string | Model to use for generating embeddings | "default-embedding" |
| pdf_extractor | string | PDF extraction method | "pypdf2" |

Response

Returns a status object with information about the ingested URL and processing status.
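
Since this endpoint documents explicit ranges for `chunk_size` (100-2000) and `overlap_pct` (0-50), a client can check them before making the call. The sketch below validates the values and builds a payload; the function name is hypothetical, and sending the payload as JSON is an assumption.

```python
import json


def build_ingest_url_payload(url, filename=None, chunk_size=500, overlap_pct=10,
                             embedding_model="default-embedding",
                             pdf_extractor="pypdf2"):
    """Validate the documented ranges and return the request payload as a dict."""
    if not 100 <= chunk_size <= 2000:
        raise ValueError("chunk_size must be between 100 and 2000")
    if not 0 <= overlap_pct <= 50:
        raise ValueError("overlap_pct must be between 0 and 50")
    payload = {
        "url": url,
        "chunk_size": chunk_size,
        "overlap_pct": overlap_pct,
        "embedding_model": embedding_model,
        "pdf_extractor": pdf_extractor,
    }
    if filename is not None:
        payload["filename"] = filename
    return payload


payload = build_ingest_url_payload(
    "https://example.com/whitepaper.pdf",
    filename="whitepaper.pdf",
    chunk_size=1000,
    overlap_pct=20,
)
body = json.dumps(payload).encode("utf-8")  # assumption: JSON request body
```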

Reprocess File

Retry processing a file that already exists in the system.

POST /api/data-ingestion/files/{file_id}/reprocess

Path Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| file_id | string | ID of the file to reprocess |

Query Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| chunk_size | integer | Size of text chunks for processing | 500 |
| overlap_pct | integer | Percentage of overlap between chunks | 10 |
| embedding_model | string | Model to use for generating embeddings | "default-embedding" |
| pdf_extractor | string | PDF extraction method | "pypdf2" |

Response

Returns a status object with information about the reprocessing operation.
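
Unlike the other POST endpoints, reprocessing takes its options as query parameters rather than in the body. A minimal sketch of building such a URL is shown below; the host, helper name, and the example file ID are hypothetical.

```python
import urllib.request
from urllib.parse import quote, urlencode

BASE_URL = "https://api.example.com"  # placeholder host


def build_reprocess_request(file_id, **params):
    """Build (but do not send) a POST request with options in the query string."""
    url = f"{BASE_URL}/api/data-ingestion/files/{quote(file_id, safe='')}/reprocess"
    # Skip unset options so the documented defaults apply on the server.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    if query:
        url = f"{url}?{query}"
    return urllib.request.Request(url, method="POST")


request = build_reprocess_request(
    "file-123",
    chunk_size=1000,
    pdf_extractor="pymupdf",
)
# To actually send it: urllib.request.urlopen(request)
```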

Update Ingestion Settings

Update default ingestion settings.

POST /api/data-ingestion/settings

Request Body

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| chunk_size | integer | Size of text chunks for processing (100-2000) | 500 |
| overlap_pct | integer | Percentage of overlap between chunks (0-50) | 10 |
| embedding_model | string | Model to use for generating embeddings | "default-embedding" |
| pdf_extractor | string | PDF extraction method | "pypdf2" |

Response

Returns the updated settings configuration.