In today’s world, companies generate mountains of documents daily. Finding a single file among shared network drives, cloud storage, and local folders can feel like searching for a needle in a haystack. Fortunately, with tools like Ollama—a local AI model—you can build a powerful, privacy-friendly search engine tailored for your company’s needs.
This guide will provide an in-depth, step-by-step walkthrough for setting up a local AI-powered search engine capable of indexing and searching documents across your company’s network drives, cloud storage services, and local file systems. We’ll cover everything from preparing your hardware and software environment to setting up document parsing, building an AI-driven search backend, creating a user-friendly web interface, and automating document indexing. Whether you’re managing a small business or a large enterprise, this guide will help you build a reliable, privacy-focused search solution that integrates seamlessly with your existing infrastructure.
Step 1: Understand the Requirements
Before we jump into the technical setup, it’s important to ensure your system and environment are properly prepared. Here’s a detailed breakdown of what you’ll need:
- Hardware Requirements: A local machine with sufficient processing power and memory. At least 16GB RAM is strongly recommended, and having an SSD will speed up indexing operations.
- Operating System: Linux (preferred), macOS, or Windows.
- Ollama: This platform will be used for running AI models locally. Ensure it’s installed and functional.
- Python Environment: Python 3.8 or higher installed, along with a virtual environment for dependency management.
- Document Parsing Tool: We’ll use Apache Tika for extracting text from various document formats.
- Database: SQLite for lightweight, local data storage. For larger datasets, PostgreSQL or MySQL can also be considered.
- Network Access: Ensure your local machine can access the network drives or cloud storage where documents are stored.
- Basic Python Knowledge: Familiarity with Python scripting, especially in file handling and database interactions.
With these requirements in place, you’ll be ready to move on to the installation and setup process.
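As a quick sanity check before installing anything, a short script can confirm the Python version and that your document share is reachable. The path below is a placeholder; substitute your own network drive or folder:

```python
import os
import sys

def check_environment(docs_path):
    """Return a list of problems found; an empty list means ready to proceed."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append(f"Python 3.8+ required, found {sys.version.split()[0]}")
    if not os.path.isdir(docs_path):
        problems.append(f"Document folder not reachable: {docs_path}")
    return problems

if __name__ == "__main__":
    issues = check_environment("/mnt/company-share/documents")  # placeholder path
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment looks ready.")
```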
Step 2: Install Required Tools
What is Ollama?
Ollama is an open-source tool designed to run large language models (LLMs) locally on your computer or server. Instead of relying on cloud-based AI services, Ollama processes everything on your hardware. This makes it an ideal choice for organizations that prioritize data privacy, offline access, and cost control.
Why Use Ollama?
- Data Privacy: Your data stays on your machine, reducing the risk of leaks or third-party access.
- Speed: No network round-trips. On capable hardware, responses arrive as fast as your machine can generate them.
- Cost-Efficient: No recurring costs for cloud-based AI services.
- Flexibility: Supports various language models, including fine-tuned ones tailored to specific tasks.
- Offline Access: Perfect for environments without reliable internet connections.
How Does Ollama Work?
At its core, Ollama serves as a lightweight interface for running LLMs like LLaMA (Large Language Model Meta AI). It can handle natural language queries, text generation, and complex analysis tasks entirely on your system.
For example, if you want to ask a question about your documents, Ollama processes your query, compares it to the indexed text, and provides a meaningful answer.
Here’s a simple example of running an AI query with Ollama:
import ollama

response = ollama.chat(model='llama2', messages=[
    {'role': 'user', 'content': 'Summarize the key points of this document.'}
])
print(response['message']['content'])
This code sends a message to an AI model running locally via Ollama and returns a text response.
By integrating Ollama with tools like Apache Tika for document parsing and SQLite for data storage, you can create a powerful, self-contained AI search engine.
2.1 Install Ollama
Ollama allows you to run AI models locally without relying on the cloud. It serves as the backbone of your search engine, handling natural language processing and search queries.
Installing Ollama on Linux/MacOS:
# Download and install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Verify the installation
ollama run llama2
This command tests the installation by running a sample model.
Installing Ollama on Windows:
Visit the official Ollama website and download the Windows installer. Follow the on-screen instructions.
2.2 Set Up a Virtual Environment
It’s good practice to isolate project dependencies.
python3 -m venv ollama_env
source ollama_env/bin/activate # On Windows use: ollama_env\Scripts\activate
2.3 Install Required Python Libraries
We’ll need the following libraries for parsing, database interaction, and AI communication:
pip install tika ollama requests flask
- tika: The Python client for Apache Tika; extracts text from documents.
- sqlite3: Manages document indexing and storage. It ships with Python's standard library, so there is nothing extra to install.
- ollama: Connects your Python script with the Ollama AI model.
- requests: Handles network communication.
- flask: Builds a user-friendly web interface.
2.4 Verify Installations
Run the following commands to ensure everything is installed correctly:
ollama --version
python --version
pip list | grep ollama
If all checks pass, you’re ready for the next step.
Step 3: Set Up Document Parsing and Indexing
We’ll use Apache Tika to extract text from the many formats a typical company accumulates (PDFs, DOCX, TXT, and more). Essentially, Tika acts as a bridge between raw documents and usable, searchable text data.
What is Apache Tika?
Apache Tika is an open-source library developed by the Apache Software Foundation. Its primary role is to parse and extract text, metadata, and structured content from diverse document formats. It supports hundreds of file types, including PDFs, Word documents, Excel spreadsheets, plain text files, and even multimedia files.
Why Use Apache Tika?
- Versatility: It supports a wide range of document formats, reducing the need for multiple parsing tools.
- Automation: It can be easily integrated into scripts to automate document processing.
- Scalability: Suitable for both small projects and large-scale enterprise systems.
- Search Integration: Extracted text can be fed into databases or AI models for search and analysis.
- Open Source: No licensing costs make it accessible for companies of all sizes.
- Metadata Extraction: Apart from text, it can extract metadata like author names, document creation dates, and more.
In this guide, Tika serves as the backbone for transforming your company’s document repository into searchable, AI-ready data.
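As a taste of what Tika returns, the Python tika client exposes a single parser.from_file call that yields both the extracted text and the metadata described above. This is a minimal sketch: the file name is a placeholder, and the first call downloads and starts a local Tika server, so Java must be available on the machine:

```python
def extract_text(parsed):
    """Normalize Tika's parse result: 'content' may be None or whitespace-padded."""
    return (parsed.get('content') or '').strip()

def demo():
    # Requires the tika package and Java; the first call starts a local
    # Tika server. The file name below is a placeholder.
    from tika import parser

    parsed = parser.from_file('quarterly_report.pdf')
    print(extract_text(parsed)[:200])
    # 'metadata' is a plain dict: author, content type, creation date, etc.
    print(parsed.get('metadata', {}).get('Content-Type'))
```

Call demo() after pointing it at one of your own files; the extract_text helper guards against the None content Tika returns for files it cannot parse.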
3.1 Create a Parsing Script
Create a Python script called index_documents.py:
import os
import sqlite3

from tika import parser

# Database setup
conn = sqlite3.connect('documents.db')
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS documents
             (id INTEGER PRIMARY KEY, name TEXT, content TEXT)''')

# Document indexing
def index_documents(folder_path):
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            file_path = os.path.join(root, file)
            print(f'Indexing: {file_path}')
            raw = parser.from_file(file_path)
            # Tika returns None for content it cannot extract
            content = raw.get('content') or ''
            c.execute('INSERT INTO documents (name, content) VALUES (?, ?)',
                      (file, content))
    conn.commit()

# Example usage
index_documents('/path/to/company/documents')
conn.close()
Run the script to index your documents:
python index_documents.py
This script extracts text from files in the specified folder and stores them in a database.
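To confirm the run actually populated the database, a small check against documents.db (using the same schema the indexing script creates) prints how many rows were stored:

```python
import sqlite3

def count_indexed(db_path='documents.db'):
    """Return how many documents have been indexed so far."""
    conn = sqlite3.connect(db_path)
    try:
        # Same schema as index_documents.py, so this works on a fresh db too
        conn.execute('CREATE TABLE IF NOT EXISTS documents '
                     '(id INTEGER PRIMARY KEY, name TEXT, content TEXT)')
        (count,) = conn.execute('SELECT COUNT(*) FROM documents').fetchone()
        return count
    finally:
        conn.close()

if __name__ == "__main__":
    print(f'{count_indexed()} documents indexed')
```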
Step 4: Build the Search Functionality with Ollama
4.1 Search Backend Script
Create a new Python script called search_documents.py:
import sqlite3

import ollama

def search_documents(query):
    conn = sqlite3.connect('documents.db')
    c = conn.cursor()
    c.execute('SELECT name, content FROM documents')
    documents = c.fetchall()
    conn.close()

    # Note: sending every document in one prompt only works for small
    # collections; larger corpora will exceed the model's context window.
    combined_text = '\n'.join(f'{name}: {content}' for name, content in documents)
    prompt = f"Search for: {query} in the following documents:\n{combined_text}"

    response = ollama.chat(model='llama2',
                           messages=[{'role': 'user', 'content': prompt}])
    print(response['message']['content'])

# Example usage
search_documents('Project X budget report')
Run the search query:
python search_documents.py
Continue building the user-friendly interface, automating indexing, and securing your setup. Each step will ensure your search engine remains efficient, secure, and scalable.
User-Friendly Interface: A web interface (e.g., using Flask) makes the search engine accessible to non-technical team members. Users can perform searches via a browser instead of running Python scripts.
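A minimal Flask front end might look like the sketch below. The run_search function here is a placeholder: in practice you would refactor the logic from search_documents.py into a function that returns the answer text instead of printing it.

```python
from html import escape

from flask import Flask, request

app = Flask(__name__)

def run_search(query):
    """Placeholder: wire this to the Ollama-backed search from search_documents.py."""
    return f"(results for: {query})"

@app.route('/', methods=['GET'])
def search_page():
    query = request.args.get('q', '')
    result = run_search(query) if query else ''
    # A deliberately plain page: one search box plus the answer text.
    # escape() prevents user input from being interpreted as HTML.
    return (f'<form method="get">'
            f'<input name="q" value="{escape(query)}" placeholder="Search documents...">'
            f'<button type="submit">Search</button>'
            f'</form>'
            f'<pre>{escape(result)}</pre>')

# Start locally with: flask --app search_web run
```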
Automating Indexing: Scheduling regular indexing tasks (e.g., using cron on Linux or Task Scheduler on Windows) ensures newly added documents are always searchable without manual intervention.
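As a concrete example on Linux, a crontab entry (added via crontab -e) that reruns the indexer nightly at 2 a.m. might look like this; the virtual environment and script paths are placeholders for your own layout:

```shell
# Re-index the document share every night at 02:00
0 2 * * * /home/ollama/ollama_env/bin/python /home/ollama/index_documents.py >> /var/log/doc-index.log 2>&1
```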
Securing the Setup: Implementing authentication, encryption, and controlled access prevents unauthorized users from accessing sensitive documents or search results.
Justyna Flisk
Justyna Flisk, Senior Editor at Decoded.cc, combines her expertise as a Senior Software Engineer and AI R&D Manager to deliver sharp, forward-thinking content on technology and artificial intelligence. With a passion for innovation, Justyna bridges the gap between technical depth and clear storytelling, ensuring readers stay ahead in the fast-evolving AI landscape.