Data Ingestion
API endpoints for uploading and processing documents
Data Ingestion API
The Data Ingestion API allows you to upload files, connect to cloud storage providers, and monitor the processing status of your documents.
Upload Files
Upload multiple files for processing and embedding.
Request Body
| Parameter | Type | Description | Default |
|---|---|---|---|
| files | array | Array of files to upload (multipart/form-data) | Required |
| chunk_size | integer | Size of text chunks for processing | 500 |
| overlap_pct | integer | Percentage of overlap between chunks | 10 |
| embedding_model | string | Model to use for generating embeddings | "default-embedding" |
| pdf_extractor | string | PDF extraction method (pypdf2, pdfplumber, pymupdf, pymupdf4llm) | "pypdf2" |
Response
Returns a status object with information about the uploaded files and processing status.
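As a rough illustration of the upload parameters, the sketch below assembles the non-file form fields with their documented defaults and rejects an unknown `pdf_extractor`. The helper name `build_upload_fields` is hypothetical, and sending every value as a string form field is an assumption about how the multipart body is encoded.

```python
# PDF extraction methods listed in the parameter table.
PDF_EXTRACTORS = ("pypdf2", "pdfplumber", "pymupdf", "pymupdf4llm")


def build_upload_fields(chunk_size=500, overlap_pct=10,
                        embedding_model="default-embedding",
                        pdf_extractor="pypdf2"):
    """Return the non-file form fields for an upload request.

    Hypothetical helper: defaults mirror the parameter table. Values are
    converted to strings because multipart form fields are text
    (an assumption about the expected encoding).
    """
    if pdf_extractor not in PDF_EXTRACTORS:
        raise ValueError(f"unsupported pdf_extractor: {pdf_extractor!r}")
    return {
        "chunk_size": str(chunk_size),
        "overlap_pct": str(overlap_pct),
        "embedding_model": embedding_model,
        "pdf_extractor": pdf_extractor,
    }


fields = build_upload_fields(pdf_extractor="pdfplumber")
print(fields)
```

With an HTTP client such as `requests`, these fields would typically be passed as `data=` alongside a `files=` argument. The endpoint URL is not stated in this reference, so it is left out here.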
Connect to Cloud Storage
Connect to a cloud storage provider and ingest files.
Request Body
| Parameter | Type | Description | Default |
|---|---|---|---|
| provider | string | Cloud provider (gdrive, dropbox, onedrive, s3, azure, gdocs, web, excel) | Required |
| access_token | string | Access token for the cloud provider | Required |
| refresh_token | string | Refresh token (if applicable) | null |
| token_expiry | string | Token expiry timestamp | null |
| folder_path | string | Path to folder in cloud storage | null |
| chunk_size | integer | Size of text chunks for processing | 500 |
| overlap_pct | integer | Percentage of overlap between chunks | 10 |
| embedding_model | string | Model to use for generating embeddings | "default-embedding" |
| pdf_extractor | string | PDF extraction method | "pypdf2" |
Response
Returns a status object with information about the connected cloud storage and processing status.
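The request body above could be assembled along these lines. This is a minimal sketch: `build_cloud_request` is a hypothetical helper, and sending the body as JSON is an assumption, since this reference does not state the content type.

```python
import json

# Providers listed in the parameter table.
PROVIDERS = ("gdrive", "dropbox", "onedrive", "s3",
             "azure", "gdocs", "web", "excel")


def build_cloud_request(provider, access_token, **optional):
    """Build a request body using the documented defaults (hypothetical helper)."""
    if provider not in PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    body = {
        "provider": provider,
        "access_token": access_token,
        "refresh_token": None,
        "token_expiry": None,
        "folder_path": None,
        "chunk_size": 500,
        "overlap_pct": 10,
        "embedding_model": "default-embedding",
        "pdf_extractor": "pypdf2",
    }
    # Reject names that are not in the parameter table.
    unknown = set(optional) - set(body)
    if unknown:
        raise TypeError(f"unknown parameters: {sorted(unknown)}")
    body.update(optional)
    return body


body = build_cloud_request("s3", "example-token", folder_path="reports/2024")
print(json.dumps(body, indent=2))  # None values serialize as JSON null
```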
Get Ingestion Status
Get the status of a specific ingestion process.
Path Parameters
| Parameter | Type | Description |
|---|---|---|
| file_id | string | ID of the file to check status for |
Response
Returns the current status of the ingestion process for the specified file.
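Because ingestion runs as a separate process, a client will usually poll this endpoint until the file reaches a final state. The sketch below shows one way to do that. The `status` field name and the `"completed"` and `"failed"` values are assumptions, since the response schema is not given here, and `fetch_status` stands in for whatever function performs the actual HTTP call.

```python
import time


def wait_for_ingestion(file_id, fetch_status,
                       done_states=("completed", "failed"),
                       interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll fetch_status(file_id) until it reports a final state.

    fetch_status is a caller-supplied callable returning a dict.
    The "status" key and the done_states values are assumed, not documented.
    """
    for _ in range(max_attempts):
        status = fetch_status(file_id)
        if status.get("status") in done_states:
            return status
        sleep(interval)
    raise TimeoutError(f"{file_id} not finished after {max_attempts} checks")


# Usage with a stubbed fetcher instead of a real HTTP call.
_responses = iter([{"status": "processing"}, {"status": "completed"}])
result = wait_for_ingestion("file-123", lambda fid: next(_responses),
                            sleep=lambda seconds: None)
print(result)
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to exercise without network access.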
List Ingested Files
Get all ingested files and their status.
Response
Returns a list of all ingested files with their processing status.
Delete All Ingested Files
Delete all ingested files and their associated data.
Response
Returns a confirmation of the deletion operation.
Delete Specific Ingested File
Delete an ingested file and all its associated data.
Path Parameters
| Parameter | Type | Description |
|---|---|---|
| file_id | string | ID of the file to delete |
Response
Returns a confirmation of the deletion operation.
Ingest URL
Ingest a document from a URL.
Request Body
| Parameter | Type | Description | Default |
|---|---|---|---|
| url | string | URL of the document to ingest | Required |
| filename | string | Custom filename for the document | null |
| chunk_size | integer | Size of text chunks for processing (100-2000) | 500 |
| overlap_pct | integer | Percentage of overlap between chunks (0-50) | 10 |
| embedding_model | string | Model to use for generating embeddings | "default-embedding" |
| pdf_extractor | string | PDF extraction method | "pypdf2" |
Response
Returns a status object with information about the ingested URL and processing status.
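The documented ranges for `chunk_size` (100-2000) and `overlap_pct` (0-50) can be checked on the client before a request is sent. The sketch below does that and also shows how the two values might relate. Treating the bounds as inclusive and computing the overlap as a percentage of `chunk_size` are both assumptions, and the unit of `chunk_size` is not specified in this reference.

```python
def overlap_size(chunk_size: int = 500, overlap_pct: int = 10) -> int:
    """Validate the documented ranges and return the overlap per chunk.

    The result is in the same (unspecified) unit as chunk_size.
    """
    if not 100 <= chunk_size <= 2000:
        raise ValueError(f"chunk_size must be in 100-2000, got {chunk_size}")
    if not 0 <= overlap_pct <= 50:
        raise ValueError(f"overlap_pct must be in 0-50, got {overlap_pct}")
    # Assumption: overlap_pct is applied to chunk_size.
    return chunk_size * overlap_pct // 100


print(overlap_size())          # defaults: 10% of 500
print(overlap_size(2000, 50))  # upper bounds: 50% of 2000
```

Under these assumptions the defaults give an overlap of 50 units per 500-unit chunk.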
Reprocess File
Retry processing a file that already exists in the system.
Path Parameters
| Parameter | Type | Description |
|---|---|---|
| file_id | string | ID of the file to reprocess |
Query Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| chunk_size | integer | Size of text chunks for processing | 500 |
| overlap_pct | integer | Percentage of overlap between chunks | 10 |
| embedding_model | string | Model to use for generating embeddings | "default-embedding" |
| pdf_extractor | string | PDF extraction method | "pypdf2" |
Response
Returns a status object with information about the reprocessing operation.
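Unlike the other endpoints, reprocessing takes its options as query parameters and the file ID as a path parameter. A minimal sketch of building such a URL follows. The route is not given in this reference, so the caller supplies a template, and the one in the usage example (`https://api.example.com/ingest/{file_id}/reprocess`) is purely hypothetical.

```python
from urllib.parse import quote, urlencode

# Query parameters listed in the table above.
QUERY_PARAMS = ("chunk_size", "overlap_pct", "embedding_model", "pdf_extractor")


def reprocess_url(path_template, file_id, **overrides):
    """Fill {file_id} into a caller-supplied template and append overrides.

    Omitted parameters are left off so the server-side defaults apply.
    """
    unknown = set(overrides) - set(QUERY_PARAMS)
    if unknown:
        raise TypeError(f"unknown query parameters: {sorted(unknown)}")
    # Percent-encode the ID so it is safe inside a path segment.
    url = path_template.replace("{file_id}", quote(file_id, safe=""))
    if overrides:
        url += "?" + urlencode(sorted(overrides.items()))
    return url


print(reprocess_url("https://api.example.com/ingest/{file_id}/reprocess",
                    "doc 1", chunk_size=800, overlap_pct=20))
```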
Update Ingestion Settings
Update default ingestion settings.
Request Body
| Parameter | Type | Description | Default |
|---|---|---|---|
| chunk_size | integer | Size of text chunks for processing (100-2000) | 500 |
| overlap_pct | integer | Percentage of overlap between chunks (0-50) | 10 |
| embedding_model | string | Model to use for generating embeddings | "default-embedding" |
| pdf_extractor | string | PDF extraction method | "pypdf2" |
Response
Returns the updated settings configuration.
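One way to model these settings on the client is an immutable record that starts from the documented defaults and produces a full request body for each change. This is a sketch only: `IngestionSettings` and `settings_body` are hypothetical names, and sending the complete set of fields on every update is an assumption, since the reference does not say whether partial bodies are accepted.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class IngestionSettings:
    """Client-side copy of the default ingestion settings (hypothetical)."""
    chunk_size: int = 500
    overlap_pct: int = 10
    embedding_model: str = "default-embedding"
    pdf_extractor: str = "pypdf2"


def settings_body(current: IngestionSettings, **changes) -> dict:
    """Return a full request body with the given fields changed.

    dataclasses.replace raises TypeError for names that are not fields.
    """
    return asdict(replace(current, **changes))


print(settings_body(IngestionSettings(), chunk_size=1000,
                    pdf_extractor="pymupdf"))
```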