Data Ingestion Guide
Learn how to upload and process documents with Vedaya
Data Ingestion with Vedaya
This guide walks you through uploading and processing documents with Vedaya’s Data Ingestion API.
Overview
Vedaya’s Data Ingestion system allows you to:
- Upload files directly from your device
- Connect to cloud storage providers (Google Drive, Dropbox, OneDrive, etc.)
- Ingest content from URLs
- Configure processing parameters like chunking size and embedding models
Upload Files
Direct Upload
The simplest way to get started is to upload files directly.
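As a minimal sketch of a direct upload, the following builds a multipart/form-data request using only the Python standard library. The base URL, the `/ingest/upload` route, the `file` form field, and the Bearer-token header are all assumptions made for illustration; take the real values from the Data Ingestion API Reference.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path

# Hypothetical values: replace with the host and route from the API reference.
BASE_URL = "https://api.example.com"
UPLOAD_PATH = "/ingest/upload"  # assumed route, not confirmed by this guide


def build_upload_request(file_path: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST for one file."""
    path = Path(file_path)
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"

    # Single-part multipart body; the field name "file" is an assumption.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = head + path.read_bytes() + tail

    return urllib.request.Request(
        BASE_URL + UPLOAD_PATH,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Building the request separately from sending it makes the payload easy to inspect before anything goes over the network.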
The upload request returns a response with the status of your upload and processing.
Cloud Integration
You can also connect to cloud storage providers and ingest files from them.
For OAuth-based providers, we provide endpoints to handle the authentication flow. See the Cloud Integration API documentation for details.
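Once a provider is connected, a cloud ingestion request might be sketched as below. The route `/ingest/cloud`, the field names `provider` and `file_ids`, and the provider identifier spellings are hypothetical; the OAuth flow itself is not shown because its endpoints are documented separately.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder, not Vedaya's real host
CLOUD_INGEST_PATH = "/ingest/cloud"   # assumed route

# Providers named in this guide; these identifier spellings are assumptions.
PROVIDERS = {"google_drive", "dropbox", "onedrive"}


def build_cloud_ingest_request(
    provider: str, file_ids: list, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST asking for cloud files to be ingested."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    if not file_ids:
        raise ValueError("file_ids must not be empty")

    payload = {"provider": provider, "file_ids": list(file_ids)}  # assumed field names
    return urllib.request.Request(
        BASE_URL + CLOUD_INGEST_PATH,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
```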
Ingesting from URLs
Vedaya can also fetch and process content directly from a URL.
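A URL ingestion call could look roughly like this sketch, which validates the URL locally before building the request. The `/ingest/url` route and the `url` field name are assumptions, as is the Bearer-token header.

```python
import json
import urllib.request
from urllib.parse import urlparse

BASE_URL = "https://api.example.com"  # placeholder, not Vedaya's real host
URL_INGEST_PATH = "/ingest/url"       # assumed route


def build_url_ingest_request(source_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST asking for a web page to be ingested."""
    parsed = urlparse(source_url)
    # Reject anything that is not an absolute http(s) URL before calling the API.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {source_url!r}")

    payload = {"url": source_url}  # assumed field name
    return urllib.request.Request(
        BASE_URL + URL_INGEST_PATH,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
```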
Monitoring Ingestion Status
After uploading files, you can check their processing status.
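Status checks are usually done by polling. The sketch below keeps the HTTP call out of the loop by accepting any callable that returns the current status string, so it makes no assumption about the status endpoint's route. The terminal status names `completed` and `failed` are hypothetical; substitute the values the API actually returns.

```python
import time
from typing import Callable

# Assumed status names; check the API reference for the real ones.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_ingestion(
    fetch_status: Callable[[], str],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until it reports a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"ingestion still {status!r} after {timeout} seconds")
        sleep(interval)
```

In practice `fetch_status` would wrap a request to the status endpoint and extract the status field from the response.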
Customizing Ingestion Parameters
Chunking Strategy
Chunking controls how documents are split into smaller pieces for processing and retrieval:
- chunk_size: Number of characters in each chunk (default: 500)
- overlap_pct: Percentage of overlap between chunks (default: 10%)
Larger chunks contain more context but can reduce precision during retrieval. Smaller chunks improve precision but may lose context.
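To make the two parameters concrete, here is a small illustration of fixed-size character chunking with percentage overlap. It is not Vedaya's implementation, which this guide does not describe; it only shows how `chunk_size` and `overlap_pct` interact under the simplest reading of their definitions.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap_pct: float = 10.0) -> list[str]:
    """Split text into chunks of chunk_size characters that overlap by overlap_pct percent."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_pct < 100:
        raise ValueError("overlap_pct must be in [0, 100)")

    overlap = int(chunk_size * overlap_pct / 100)  # characters shared by neighbours
    step = chunk_size - overlap                    # distance between chunk starts

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

With the defaults, each chunk starts 450 characters after the previous one, so the last 50 characters of one chunk are repeated at the start of the next.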
PDF Extraction Options
Vedaya supports multiple PDF extraction methods:
- pypdf2: Fast and lightweight (default)
- pdfplumber: Better for complex layouts
- pymupdf: High performance with advanced features
- pymupdf4llm: Optimized for language model processing
Choose based on your document complexity and extraction needs.
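The chunking and extraction options can be collected and validated before a request is sent, as in this sketch. The names `chunk_size` and `overlap_pct` and the four extractor values come from this guide; the key name `pdf_extractor` is hypothetical, since the guide does not say what the parameter is called.

```python
# Extraction methods listed in this guide.
PDF_EXTRACTORS = ("pypdf2", "pdfplumber", "pymupdf", "pymupdf4llm")


def build_ingestion_params(
    chunk_size: int = 500,
    overlap_pct: float = 10,
    pdf_extractor: str = "pypdf2",
) -> dict:
    """Validate ingestion options and return them as a parameter dictionary."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_pct < 100:
        raise ValueError("overlap_pct must be in [0, 100)")
    if pdf_extractor not in PDF_EXTRACTORS:
        raise ValueError(f"pdf_extractor must be one of {PDF_EXTRACTORS}")

    return {
        "chunk_size": chunk_size,
        "overlap_pct": overlap_pct,
        "pdf_extractor": pdf_extractor,  # hypothetical key name
    }
```

For example, `build_ingestion_params(chunk_size=800, pdf_extractor="pdfplumber")` would suit documents with complex layouts where more context per chunk is wanted.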
Best Practices
- File Size: Keep individual files under 20 MB for optimal processing
- File Types: Supported formats include PDF, DOCX, TXT, CSV, PPTX, and HTML
- Batch Processing: For large collections, use multiple requests in parallel
- Error Handling: Monitor the status endpoint to catch and retry failed ingestions
- Embedding Models: Start with the default model and experiment with others as needed
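Several of these practices can be combined in a small batch helper, sketched below. It filters files by the size and type guidelines above, uploads the rest in parallel, and retries failures. The `upload` callable is a stand-in for whatever function performs the actual request. Treating 20 MB as 20 × 1024 × 1024 bytes and retrying on any exception are simplifying assumptions; a fuller version would also consult the status endpoint.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MAX_BYTES = 20 * 1024 * 1024  # 20 MB guideline, interpreted here as MiB
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".csv", ".pptx", ".html"}


def is_uploadable(path: Path) -> bool:
    """Check a file against the size and format guidelines."""
    return path.suffix.lower() in SUPPORTED_SUFFIXES and path.stat().st_size < MAX_BYTES


def ingest_batch(paths, upload, max_workers: int = 4, max_attempts: int = 3) -> dict:
    """Upload eligible files in parallel, retrying each one up to max_attempts times.

    Returns a dict mapping each attempted path to the upload result, or to the
    last exception if every attempt failed.
    """
    def attempt(path: Path):
        last_error = None
        for _ in range(max_attempts):
            try:
                return upload(path)
            except Exception as exc:  # broad on purpose: this is only a sketch
                last_error = exc
        return last_error

    candidates = [Path(p) for p in paths if is_uploadable(Path(p))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(attempt, candidates))
    return dict(zip(candidates, results))
```

Files that fail the pre-flight check are skipped silently here; logging or returning them separately would make the batch easier to audit.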
For more details on available endpoints and parameters, see the Data Ingestion API Reference.