Data Ingestion with Vedaya

This guide shows how to upload and process documents with Vedaya’s Data Ingestion API.

Overview

Vedaya’s Data Ingestion system allows you to:

Upload files directly from your device
Connect to cloud storage providers (Google Drive, Dropbox, OneDrive, etc.)
Ingest content from URLs
Configure processing parameters such as chunk size and embedding model

Upload Files

Direct Upload

The simplest way to get started is by uploading files directly:

import requests

url = "https://vedaya-backend.fly.dev/api/data-ingestion/upload-files"

# Processing parameters, sent as form fields alongside the file
data = {
    'chunk_size': 500,
    'overlap_pct': 10,
    'embedding_model': 'default-embedding',
    'pdf_extractor': 'pypdf2'
}

# Authenticate the request with your API key
headers = {
    'Authorization': 'Bearer YOUR_API_KEY'
}

# Open the file in a context manager so the handle is closed after the upload
with open('document.pdf', 'rb') as f:
    files = [
        ('files', ('document.pdf', f, 'application/pdf'))
    ]
    response = requests.post(url, headers=headers, files=files, data=data)

print(response.json())

The endpoint returns a response that reports the upload result and the processing status:

{
  "status": "success",
  "message": "Files uploaded successfully",
  "file_ids": ["f12345678"],
  "processing_status": "queued"
}
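
As a minimal sketch of handling this response, the helper below reads the `file_ids` list from a body shaped like the example above. The names `extract_file_ids` and `UploadError` are hypothetical, and the shape of an error body is an assumption: only the `status` and `message` fields shown in the example are read.

```python
class UploadError(RuntimeError):
    """Raised when an upload response does not report success."""


def extract_file_ids(body):
    """Return the file IDs from a successful upload response body."""
    if body.get("status") != "success":
        # Assumption: a failed upload still carries a "message" field.
        raise UploadError(body.get("message", "upload failed"))
    return list(body.get("file_ids", []))


# The sample body is copied from the response shown in this guide.
sample = {
    "status": "success",
    "message": "Files uploaded successfully",
    "file_ids": ["f12345678"],
    "processing_status": "queued",
}
print(extract_file_ids(sample))
```

Keeping the returned IDs is useful because the status endpoint described later in this guide is addressed by file ID.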

Cloud Integration

Connect to cloud storage providers to ingest files:

import requests

url = "https://vedaya-backend.fly.dev/api/data-ingestion/connect-cloud"

payload = {
  "provider": "gdrive",
  "access_token": "YOUR_ACCESS_TOKEN",
  "refresh_token": "YOUR_REFRESH_TOKEN",
  "folder_path": "My Documents/Data",
  "chunk_size": 500,
  "overlap_pct": 10
}

headers = {
  'Authorization': 'Bearer YOUR_API_KEY'
}

# json= serializes the payload and sets the Content-Type: application/json header
response = requests.post(url, headers=headers, json=payload)
print(response.json())

For OAuth-based providers, we provide endpoints to handle the authentication flow. See the Cloud Integration API documentation for details.

Ingesting from URLs

To fetch and process content from a URL:

import requests

url = "https://vedaya-backend.fly.dev/api/data-ingestion/ingest-url"

payload = {
  "url": "https://example.com/document.pdf",
  "filename": "example-document",
  "chunk_size": 500,
  "overlap_pct": 10
}

headers = {
  'Authorization': 'Bearer YOUR_API_KEY'
}

# json= serializes the payload and sets the Content-Type: application/json header
response = requests.post(url, headers=headers, json=payload)
print(response.json())

Monitoring Ingestion Status

After uploading files, you can check their processing status:

import requests

file_id = "f12345678"  # The ID returned from upload
url = f"https://vedaya-backend.fly.dev/api/data-ingestion/status/{file_id}"

headers = {
  'Authorization': 'Bearer YOUR_API_KEY'
}

response = requests.get(url, headers=headers)
print(response.json())
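
Because processing is queued, a single status check is often not enough. The following is a rough polling sketch, not a documented client. It assumes the status response carries a `processing_status` field and that `completed` and `failed` are terminal values; only `queued` appears in this guide. The helper names `wait_for_ingestion` and `fetch_status_http` are hypothetical.

```python
import time

# Assumed terminal states; only "queued" is shown in the guide.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_ingestion(fetch_status, file_id, interval=2.0, timeout=300.0,
                       sleep=time.sleep):
    """Call fetch_status(file_id) until a terminal state or the timeout."""
    deadline = time.monotonic() + timeout
    while True:
        body = fetch_status(file_id)
        if body.get("processing_status") in TERMINAL_STATES:
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(f"ingestion of {file_id} did not finish in time")
        sleep(interval)


def fetch_status_http(file_id, api_key="YOUR_API_KEY"):
    """Fetch the status body from the endpoint shown above."""
    import requests  # imported lazily so the polling helper has no dependencies

    url = f"https://vedaya-backend.fly.dev/api/data-ingestion/status/{file_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()
```

Calling `wait_for_ingestion(fetch_status_http, "f12345678")` would poll the real endpoint. Passing the fetch function as an argument keeps the loop easy to test with a stub.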

Customizing Ingestion Parameters

Chunking Strategy

Chunking controls how documents are split into smaller pieces for processing and retrieval:

chunk_size: Number of characters in each chunk (default: 500)
overlap_pct: Percentage of overlap between chunks (default: 10%)

Larger chunks contain more context but can reduce precision during retrieval. Smaller chunks improve precision but may lose context.
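
To make the two parameters concrete, here is an illustrative character-based chunker. It is a sketch of the general technique only, and Vedaya's actual splitting logic may differ. The function name `chunk_text` is hypothetical, and the overlap is assumed to be measured as a percentage of `chunk_size`.

```python
def chunk_text(text, chunk_size=500, overlap_pct=10):
    """Split text into chunks of chunk_size characters that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_pct < 100:
        raise ValueError("overlap_pct must be in the range [0, 100)")

    overlap = chunk_size * overlap_pct // 100  # characters shared by neighbors
    step = chunk_size - overlap                # distance between chunk starts

    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


text = "".join(chr(97 + i % 26) for i in range(1000))
chunks = chunk_text(text)
print([len(c) for c in chunks])
```

With the defaults, each chunk starts 450 characters after the previous one, so the last 50 characters of one chunk are repeated at the start of the next.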

PDF Extraction Options

Vedaya supports multiple PDF extraction methods:

pypdf2: Fast and lightweight (default)
pdfplumber: Better for complex layouts
pymupdf: High performance with advanced features
pymupdf4llm: Optimized for language model processing

Choose based on your document complexity and extraction needs.
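
A typo in the extractor name is easy to make, so it can help to validate parameters on the client before uploading. The sketch below builds the form fields used in the direct upload example. The helper `build_upload_params` is hypothetical, and how the server treats an unknown extractor name is not covered by this guide.

```python
# Extractor names as listed in this guide.
PDF_EXTRACTORS = {"pypdf2", "pdfplumber", "pymupdf", "pymupdf4llm"}


def build_upload_params(chunk_size=500, overlap_pct=10,
                        embedding_model="default-embedding",
                        pdf_extractor="pypdf2"):
    """Return the form fields for an upload, rejecting unknown extractors."""
    if pdf_extractor not in PDF_EXTRACTORS:
        allowed = ", ".join(sorted(PDF_EXTRACTORS))
        raise ValueError(f"unknown pdf_extractor {pdf_extractor!r}; use one of: {allowed}")
    return {
        "chunk_size": chunk_size,
        "overlap_pct": overlap_pct,
        "embedding_model": embedding_model,
        "pdf_extractor": pdf_extractor,
    }


params = build_upload_params(pdf_extractor="pdfplumber")
print(params)
```

The resulting dictionary can be passed as the `data` argument of the upload request.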

Best Practices

File Size: Keep individual files under 20 MB for optimal processing
File Types: Supported formats include PDF, DOCX, TXT, CSV, PPTX, and HTML
Batch Processing: For large collections, use multiple requests in parallel
Error Handling: Monitor the status endpoint to catch and retry failed ingestions
Embedding Models: Start with the default model and experiment with others as needed
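
Several of these practices can be combined in a small client-side helper. The sketch below checks the size and type of each file, then uploads the accepted files in parallel. It rests on a few assumptions: the names `preflight` and `upload_batch` are hypothetical, 20 MB is read as 20 * 1024 * 1024 bytes, and `upload_one` stands for any function that uploads a single path, for example a wrapper around the direct upload request.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MAX_BYTES = 20 * 1024 * 1024  # assumed reading of the 20 MB guideline
SUPPORTED = {".pdf", ".docx", ".txt", ".csv", ".pptx", ".html"}


def preflight(path):
    """Return a reason to skip the file, or None if it looks acceptable."""
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED:
        return f"unsupported type: {p.suffix or '(none)'}"
    if p.stat().st_size > MAX_BYTES:
        return "larger than 20 MB"
    return None


def upload_batch(paths, upload_one, max_workers=4):
    """Upload accepted files in parallel; return (results, skipped)."""
    skipped = {}
    accepted = []
    for path in paths:
        reason = preflight(path)
        if reason is None:
            accepted.append(path)
        else:
            skipped[path] = reason

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(upload_one, path) for path in accepted}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:
                # Keep the error so the caller can decide whether to retry.
                results[path] = exc
    return results, skipped
```

A thread pool suits this case because each upload spends most of its time waiting on the network. Failed entries in `results` hold the exception, which leaves the retry policy to the caller.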

For more details on available endpoints and parameters, see the Data Ingestion API Reference.
