Data Ingestion with Vedaya

This guide shows how to upload and process documents with Vedaya’s Data Ingestion API.

Overview

Vedaya’s Data Ingestion system allows you to:

  • Upload files directly from your device
  • Connect to cloud storage providers (Google Drive, Dropbox, OneDrive, etc.)
  • Ingest content from URLs
  • Configure processing parameters such as chunk size and embedding model

Upload Files

Direct Upload

The simplest way to get started is by uploading files directly:

import requests

url = "https://vedaya-backend.fly.dev/api/data-ingestion/upload-files"

# Form fields that control how the file is processed
data = {
    'chunk_size': 500,
    'overlap_pct': 10,
    'embedding_model': 'default-embedding',
    'pdf_extractor': 'pypdf2'
}

# Authenticate with your API key
headers = {
    'Authorization': 'Bearer YOUR_API_KEY'
}

# Open the file in a context manager so the handle is closed after the upload
with open('document.pdf', 'rb') as f:
    files = [('files', ('document.pdf', f, 'application/pdf'))]
    response = requests.post(url, headers=headers, files=files, data=data)

print(response.json())

The response reports the upload result and the current processing status:

{
  "status": "success",
  "message": "Files uploaded successfully",
  "file_ids": ["f12345678"],
  "processing_status": "queued"
}
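Before passing the returned IDs to later calls, it can help to check the response body first. The sketch below reads the fields shown in the sample response above. It assumes that a failed upload is signalled by a `status` value other than "success", which this guide does not confirm, and `extract_file_ids` is a hypothetical helper name.

```python
def extract_file_ids(body: dict) -> list:
    """Return the file IDs from an upload response, or raise if it did not succeed."""
    # Assumption: any status other than "success" means the upload failed.
    if body.get("status") != "success":
        raise RuntimeError(f"Upload failed: {body.get('message', 'no message')}")
    return list(body.get("file_ids", []))


# The sample response from this guide
example = {
    "status": "success",
    "message": "Files uploaded successfully",
    "file_ids": ["f12345678"],
    "processing_status": "queued",
}
print(extract_file_ids(example))
```

With a live request, you would pass `response.json()` to the helper instead of the sample dictionary.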

Cloud Integration

Connect to cloud storage providers to ingest files:

import requests

url = "https://vedaya-backend.fly.dev/api/data-ingestion/connect-cloud"

payload = {
    "provider": "gdrive",
    "access_token": "YOUR_ACCESS_TOKEN",
    "refresh_token": "YOUR_REFRESH_TOKEN",
    "folder_path": "My Documents/Data",
    "chunk_size": 500,
    "overlap_pct": 10
}

headers = {
    'Authorization': 'Bearer YOUR_API_KEY'
}

# json= serializes the payload and sets the Content-Type: application/json header
response = requests.post(url, headers=headers, json=payload)
print(response.json())

For OAuth-based providers, Vedaya exposes endpoints that handle the authentication flow. See the Cloud Integration API documentation for details.

Ingesting from URLs

To fetch and process content from a URL:

import requests

url = "https://vedaya-backend.fly.dev/api/data-ingestion/ingest-url"

payload = {
    "url": "https://example.com/document.pdf",
    "filename": "example-document",
    "chunk_size": 500,
    "overlap_pct": 10
}

headers = {
    'Authorization': 'Bearer YOUR_API_KEY'
}

# json= serializes the payload and sets the Content-Type: application/json header
response = requests.post(url, headers=headers, json=payload)
print(response.json())

Monitoring Ingestion Status

After uploading files, you can check their processing status:

import requests

file_id = "f12345678"  # The ID returned from upload
url = f"https://vedaya-backend.fly.dev/api/data-ingestion/status/{file_id}"

headers = {
  'Authorization': 'Bearer YOUR_API_KEY'
}

response = requests.get(url, headers=headers)
print(response.json())
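Because processing is queued, a single status call rarely tells you when a file is ready. A minimal polling sketch follows. It assumes the status response carries a `processing_status` field like the upload response does, and that "completed" and "failed" are the terminal values. This guide only shows "queued", so check the API reference for the real values. `wait_for_ingestion` is a hypothetical helper name.

```python
import time
from typing import Callable

# Assumed terminal values; this guide only documents "queued".
TERMINAL_STATES = {"completed", "failed"}


def wait_for_ingestion(fetch_status: Callable[[], dict],
                       interval_s: float = 2.0,
                       timeout_s: float = 300.0) -> dict:
    """Call fetch_status() repeatedly until a terminal state or the timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch_status()
        if body.get("processing_status") in TERMINAL_STATES:
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError("ingestion did not finish before the timeout")
        time.sleep(interval_s)
```

In practice, `fetch_status` could be `lambda: requests.get(url, headers=headers).json()`, using the `url` and `headers` from the status example above. Passing the fetch step in as a callable keeps the loop easy to test without network access.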

Customizing Ingestion Parameters

Chunking Strategy

Chunking controls how documents are split into smaller pieces for processing and retrieval:

  • chunk_size: Number of characters in each chunk (default: 500)
  • overlap_pct: Percentage of overlap between chunks (default: 10%)

Larger chunks contain more context but can reduce precision during retrieval. Smaller chunks improve precision but may lose context.
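To see how the two parameters interact, here is an illustrative character-based sliding window. It is a sketch of the general technique, not Vedaya's actual chunker, which may split on different boundaries. `chunk_text` is a hypothetical function name.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap_pct: float = 10) -> list:
    """Split text into chunks of chunk_size characters that overlap by overlap_pct percent."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_pct < 100:
        raise ValueError("overlap_pct must be in the range [0, 100)")

    overlap = int(chunk_size * overlap_pct / 100)  # characters shared by neighbours
    step = chunk_size - overlap                    # distance between chunk starts

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample, chunk_size=500, overlap_pct=10)
print([len(c) for c in chunks])  # [500, 500, 300]
```

With the defaults, each chunk starts 450 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.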

PDF Extraction Options

Vedaya supports multiple PDF extraction methods:

  • pypdf2: Fast and lightweight (default)
  • pdfplumber: Better for complex layouts
  • pymupdf: High performance with advanced features
  • pymupdf4llm: Optimized for language model processing

Choose based on your document complexity and extraction needs.
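A mistyped extractor name is easy to miss until the request fails, so one option is to validate it client-side before uploading. The sketch below uses only the four extractor names listed above and the form fields from the direct upload example. `build_upload_params` is a hypothetical helper, and how the server treats unknown values is not covered by this guide.

```python
SUPPORTED_PDF_EXTRACTORS = ("pypdf2", "pdfplumber", "pymupdf", "pymupdf4llm")


def build_upload_params(chunk_size: int = 500,
                        overlap_pct: int = 10,
                        pdf_extractor: str = "pypdf2",
                        embedding_model: str = "default-embedding") -> dict:
    """Build the form fields for an upload, rejecting unknown extractor names."""
    if pdf_extractor not in SUPPORTED_PDF_EXTRACTORS:
        raise ValueError(
            f"Unknown pdf_extractor {pdf_extractor!r}; "
            f"expected one of {', '.join(SUPPORTED_PDF_EXTRACTORS)}"
        )
    return {
        "chunk_size": chunk_size,
        "overlap_pct": overlap_pct,
        "embedding_model": embedding_model,
        "pdf_extractor": pdf_extractor,
    }


# Example: switch to pdfplumber for a document with a complex layout
print(build_upload_params(pdf_extractor="pdfplumber"))
```

The returned dictionary can be passed as the `data` argument of the upload request.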

Best Practices

  1. File Size: Keep individual files under 20 MB for optimal processing
  2. File Types: Supported formats include PDF, DOCX, TXT, CSV, PPTX, and HTML
  3. Batch Processing: For large collections, use multiple requests in parallel
  4. Error Handling: Monitor the status endpoint to catch and retry failed ingestions
  5. Embedding Models: Start with the default model and experiment with others as needed
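Several of these practices can be combined in a small batch helper. The sketch below screens files against the size and format guidelines, then uploads in parallel with a simple retry. It assumes 20 MB means 20 × 1024 × 1024 bytes and that a failed request raises an exception. `is_uploadable` and `upload_all` are hypothetical names, and `upload_one` stands for whatever function you write to send a single file.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePath
from typing import Callable, Iterable

MAX_BYTES = 20 * 1024 * 1024  # assumed reading of the 20 MB guideline
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".csv", ".pptx", ".html"}


def is_uploadable(name: str, size_bytes: int) -> bool:
    """Check a file against the format and size guidelines."""
    return PurePath(name).suffix.lower() in SUPPORTED_SUFFIXES and size_bytes < MAX_BYTES


def upload_all(paths: Iterable[str],
               upload_one: Callable[[str], dict],
               max_workers: int = 4,
               max_attempts: int = 3) -> dict:
    """Upload files in parallel; map each path to its result or its last error."""
    paths = list(paths)

    def with_retry(path: str):
        last_error = None
        for _ in range(max_attempts):
            try:
                return upload_one(path)
            except Exception as exc:  # retry on any failure raised by the upload
                last_error = exc
        return last_error

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(with_retry, paths))
    return dict(zip(paths, results))
```

Here the retry triggers on exceptions raised by `upload_one`. A retry driven by the status endpoint would hook in at the same place. Keep `max_workers` modest, since this guide does not state any rate limits.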

For more details on available endpoints and parameters, see the Data Ingestion API Reference.