# Academic Database RAG System

A Retrieval-Augmented Generation (RAG) system designed for academic document search and retrieval. The system uses vector embeddings to enable semantic search across academic content, providing relevant context for AI-powered question answering.

## Features

- **Vector Search**: Semantic search using cosine similarity on document embeddings
- **Document Ingestion**: Automatic processing and chunking of Markdown documents
- **REST API**: HTTP API for querying similar documents
- **SQLite Vector Database**: Efficient storage of documents and their embeddings
- **Ollama Integration**: Uses Ollama's embedding models to generate vector representations
- **Docker Support**: Containerized deployment with persistent storage
- **TypeScript**: Fully typed codebase running on Bun, a modern JavaScript runtime

## Architecture

The system consists of several key components:

1. **Document Processing** (`insert-embeddings.ts`): Reads Markdown files, chunks content, generates embeddings with Ollama, and stores them in SQLite
2. **Vector Database** (`database.ts`): SQLite-based storage for documents and their vector embeddings
3. **Search Engine** (`busca.ts`): Performs semantic search using cosine similarity
4. **API Server** (`api.ts`): RESTful API for querying the system
5. **Utilities** (`utils.ts`): Helper functions for similarity calculations

## Prerequisites

- Bun runtime (v1.3.6 or later)
- Ollama with the `nomic-embed-text:latest` embedding model
- SQLite (handled automatically by Bun)

## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd base-de-dados-academia
   ```

2. Install dependencies:

   ```bash
   bun install
   ```

3. Start Ollama and pull the embedding model:

   ```bash
   ollama serve
   ollama pull nomic-embed-text:latest
   ```

## Usage

### Running the API Server

Start the REST API server:

```bash
bun run api.ts
```

The server listens on port 3000 (configurable via the `PORT` environment variable).

### Adding Documents

1. Place new Markdown (`.md`) files in the `arquivos/novos/` directory
2. Run the document ingestion script:

   ```bash
   bun run insert-embeddings.ts
   ```

This will:

- Process all `.md` files in `arquivos/novos/`
- Generate embeddings for each document chunk
- Store them in the vector database
- Move successfully processed files to `arquivos/processados/`
- Move failed files to `arquivos/erro/`
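The chunking and embedding steps above can be sketched roughly as follows. This is a simplified illustration, not the actual `insert-embeddings.ts` code; it assumes Ollama's standard `/api/embeddings` endpoint, the `nomic-embed-text:latest` model named in this README, and the ~2000-character chunk size described in "How It Works".

```typescript
// Split document content into fixed-size chunks (~2000 characters,
// matching the chunk size described in this README).
function chunkText(text: string, maxLen = 2000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxLen) {
    chunks.push(text.slice(i, i + maxLen));
  }
  return chunks;
}

// Request an embedding for one chunk from Ollama's /api/embeddings endpoint.
async function embed(
  text: string,
  baseUrl = "http://localhost:11434",
): Promise<number[]> {
  const res = await fetch(`${baseUrl}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text:latest", prompt: text }),
  });
  if (!res.ok) throw new Error(`embedding request failed: ${res.status}`);
  const data = await res.json();
  return data.embedding;
}
```

Each chunk is then inserted into SQLite alongside its source-file metadata, as described above.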

### Testing the Search

Run the example search script:

```bash
bun run index.ts
```

This demonstrates searching for documents similar to a sample query.

## API Reference

### POST /api/embeddings

Search for documents similar to a given prompt.

**Request body:**

```json
{
  "prompt": "What is the suffix of a markdown file?",
  "topK": 3,
  "limiarSimilaridade": 0.5
}
```

**Parameters:**

- `prompt` (required): The search query text
- `topK` (optional): Number of top results to return (default: 3)
- `limiarSimilaridade` (optional): Minimum similarity threshold (default: 0.5)
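A TypeScript client call might look like the sketch below. `buildSearchRequest` and `searchEmbeddings` are hypothetical helper names, not part of the repository; the payload fields and defaults match the parameters above.

```typescript
interface SearchRequest {
  prompt: string;
  topK: number;
  limiarSimilaridade: number;
}

// Build the request payload, applying the documented defaults.
function buildSearchRequest(
  prompt: string,
  topK = 3,
  limiarSimilaridade = 0.5,
): SearchRequest {
  return { prompt, topK, limiarSimilaridade };
}

// POST the payload to the /api/embeddings endpoint and return the parsed JSON.
async function searchEmbeddings(baseUrl: string, req: SearchRequest) {
  const res = await fetch(`${baseUrl}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`search failed: ${res.status}`);
  return res.json();
}
```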

**Response:**

```json
{
  "contexto": "Documentos relevantes:\n\n--- Documento 1: file.md (similaridade: 0.85) ---\nContent...",
  "resultados": [
    {
      "documento": {
        "nome": "file.md",
        "caminho": "/path/to/file.md",
        "conteudo": "Content...",
        "tamanho": 1234,
        "embedding": [0.1, 0.2, ...]
      },
      "similaridade": 0.85
    }
  ]
}
```
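Based on this response shape, `types.ts` presumably defines interfaces along the following lines (field names are taken from the example above; the actual definitions in the repository may differ):

```typescript
// Sketch of the document record returned by the API.
interface Documento {
  nome: string;        // file name
  caminho: string;     // file path
  conteudo: string;    // chunk content
  tamanho: number;     // content size
  embedding: number[]; // embedding vector
}

// Sketch of one search result: a document plus its similarity score.
interface Resultado {
  documento: Documento;
  similaridade: number; // cosine similarity in [-1, 1]
}
```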

**Example using curl:**

```bash
curl -X POST http://localhost:3000/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{"prompt": "on docker?", "topK": 3, "limiarSimilaridade": 0.5}'
```

## Configuration

The system can be configured with the following environment variables:

- `PORT`: API server port (default: `3000`)
- `DB_PATH`: Path to the SQLite database file (default: `./embeddings.sqlite`)
- `OLLAMA_BASE_URL`: Ollama API endpoint (default: `http://localhost:11434`)
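Resolving these variables with the documented defaults could look like this sketch (`loadConfig` is a hypothetical helper, not a function from the repository; the variable names and defaults match the list above):

```typescript
// Hypothetical helper: reads the three documented environment variables,
// falling back to the defaults listed in this README.
function loadConfig(env: Record<string, string | undefined> = process.env) {
  return {
    port: Number(env.PORT ?? 3000),
    dbPath: env.DB_PATH ?? "./embeddings.sqlite",
    ollamaBaseUrl: env.OLLAMA_BASE_URL ?? "http://localhost:11434",
  };
}
```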

## Docker Deployment

### Build the Image

```bash
docker build --pull -t rag-academia-server .
```

### Run the Container

```bash
docker run \
  --restart=always \
  -v $(pwd)/embeddings.sqlite:/tmp/embeddings.sqlite \
  --name rag-academia-server \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -e DB_PATH=/tmp/embeddings.sqlite \
  -e PORT=3000 \
  --network=host \
  -d \
  rag-academia-server
```

**Notes:**

- Mount the database file as a volume to persist data across container restarts
- Use `host.docker.internal` to reach Ollama running on the host; note that with `--network=host` the container shares the host's network stack, so `http://localhost:11434` also works
- Adjust `OLLAMA_BASE_URL` if Ollama runs in a separate container

## How It Works

1. **Document Ingestion:**
   - Markdown files are read from `arquivos/novos/`
   - Content is split into chunks of roughly 2000 characters
   - Each chunk is converted to a vector embedding using Ollama
   - Embeddings are stored in SQLite along with document metadata

2. **Query Processing:**
   - The user query is converted to an embedding
   - Cosine similarity is computed against all stored embeddings
   - The top-K most similar documents are returned
   - Results are formatted into a context string for LLM consumption

3. **Similarity Calculation:**
   - Uses cosine similarity: `cos(θ) = (A · B) / (||A|| × ||B||)`
   - Values range from -1 to 1; higher values indicate greater similarity
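The formula above translates directly into code. Below is a sketch of the kind of helper `utils.ts` likely provides (the actual function name and signature in the repository may differ):

```typescript
// Cosine similarity: cos(θ) = (A · B) / (||A|| × ||B||).
// Computes the dot product and both norms in a single pass.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```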

## Project Structure

```
├── api.ts                 # REST API server
├── busca.ts               # Search and similarity functions
├── database.ts            # SQLite vector database operations
├── index.ts               # Example usage script
├── insert-embeddings.ts   # Document ingestion pipeline
├── types.ts               # TypeScript type definitions
├── utils.ts               # Utility functions
├── arquivos/              # Document storage
│   ├── novos/             # New documents to process
│   ├── processados/       # Successfully processed documents
│   └── erro/              # Documents that failed processing
├── prompts/               # Prompt templates
├── Dockerfile             # Container configuration
├── package.json           # Dependencies and scripts
└── tsconfig.json          # TypeScript configuration
```

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## License

This project is licensed under the MIT License - see the LICENSE file for details.