TLDR: I'm using RAG techniques to mold an already prompt-injected model to search indexed files (SUPPORTED_EXTENSIONS = ('.pdf', '.docx', '.txt', '.md', '.html', '.htm')) using LlamaIndex + Ollama. The full index should be complete in about 2 weeks.

I used LibreOffice to convert older .doc files into .docx using this method:

    sudo apt install libreoffice -y
    find "/DOCUMENTS/PATH/" -name "*.doc" -exec libreoffice --headless --convert-to docx {} --outdir {}_converted \;

I used tesseract-ocr & imagemagick to convert all images (.jpg/.png/.bmp) into OCR PDFs using this method:

    sudo apt install tesseract-ocr imagemagick -y
    pip install pytesseract pillow
    find "/DOCUMENTS/PATH/" -name "*.jpg" -o -name "*.png" -o -name "*.bmp" […] pdf; done

Here's the full process from scratch:

Install Python venv if you don't have it:

    sudo apt install python3-full python3-venv -y

Create a folder for your project:

    mkdir rag-project
    cd rag-project

Create a folder for your documents:

    mkdir your_docs

Drop whatever files you want the model to learn from into this folder. Skip this if you already have them in a certain folder.

Create the virtual environment:

    python3 -m venv rag-env

Activate it:

    source rag-env/bin/activate

You should see (rag-env) at the start of your terminal prompt.

Install the packages:

    pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama

Pull the embedding model in Ollama:

    ollama pull nomic-embed-text

Create the script:

*You can use your preferred method of editing here; nano is a little hard to navigate for the uninitiated.*

    nano rag.py

Paste the following script in rag.py:

    import os
    import shutil
    import subprocess

    from llama_index.core import (
        Settings,
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )
    from llama_index.embeddings.ollama import OllamaEmbedding
    from llama_index.llms.ollama import Ollama

    # --- UPDATE THE TOP 2 ---
    DOCS_PATH = "/DOCUMENTS/PATH/"
    MODEL_NAME = "MODEL_NAME"
    EMBED_MODEL = "nomic-embed-text"
    INDEX_PATH = "./saved_index"
    TEMP_INDEX_PATH = "./saved_index_tmp"
    BATCH_SIZE = 5
    MAX_FILE_SIZE_MB = 50
    SUPPORTED_EXTENSIONS = ('.pdf', '.docx', '.txt', '.md', '.html', '.htm')
    # --------------------

    Settings.llm = Ollama(model=MODEL_NAME, request_timeout=120.0)
    Settings.embed_model = OllamaEmbedding(model_name=EMBED_MODEL)

    # Get all files recursively, skipping unsupported types and oversized files
    all_files = []
    skipped_size = 0
    skipped_type = 0
    for root, dirs, files in os.walk(DOCS_PATH):
        for file in files:
            full_path = os.path.join(root, file)
            if not file.lower().endswith(SUPPORTED_EXTENSIONS):
                skipped_type += 1
                continue
            size_mb = os.path.getsize(full_path) / (1024 * 1024)
            if size_mb > MAX_FILE_SIZE_MB:
                skipped_size += 1
                continue
            all_files.append(full_path)

    total_files = len(all_files)
    total_batches = (total_files + BATCH_SIZE - 1) // BATCH_SIZE  # ceiling division
    print(f"Found {total_files} files to index.")
    print(f"Skipped {skipped_size} files over {MAX_FILE_SIZE_MB}MB.")
    print(f"Skipped {skipped_type} unsupported file types.")
    print(f"Total batches: {total_batches}\n")

    # Load or create index
    if os.path.exists(INDEX_PATH):
        print("Loading existing index...")
        storage_context = StorageContext.from_defaults(persist_dir=INDEX_PATH)
        index = load_index_from_storage(storage_context)
    else:
        print("Creating new index...")
        index = None

    # Overall progress bar
    def overall_progress(current, total, bar_length=40):
        percent = current / total if total > 0 else 0
        filled = int(bar_length * percent)
        bar = '█' * filled + '░' * (bar_length - filled)
        files_done = min(current * BATCH_SIZE, total_files)
        total_str = str(total)
        current_str = str(current).rjust(len(total_str))
        files_done_str = str(files_done).rjust(len(str(total_files)))
        print(f'\rOverall: [{bar}] {current_str}/{total_str} batches | '
              f'{files_done_str}/{total_files} files | {percent * 100:.1f}%',
              end='', flush=True)

    # Safe save function: write to a temp folder first, then swap it in
    def safe_save(index):
        index.storage_context.persist(persist_dir=TEMP_INDEX_PATH)
        if os.path.exists(INDEX_PATH):
            shutil.rmtree(INDEX_PATH)
        shutil.copytree(TEMP_INDEX_PATH, INDEX_PATH)
        shutil.rmtree(TEMP_INDEX_PATH)

    # Process in batches
    completed = 0
    for i in range(0, total_files, BATCH_SIZE):
        batch = all_files[i:i + BATCH_SIZE]
        overall_progress(completed, total_batches)
        try:
            documents = []
            for file in batch:
                documents += SimpleDirectoryReader(input_files=[file]).load_data()
            if index is None:
                index = VectorStoreIndex.from_documents(documents)
            else:
                for doc in documents:
                    index.insert(doc)
            safe_save(index)
            completed += 1
        except Exception as e:
            print(f"\nSkipping batch due to error: {e}")
            completed += 1
            continue

    overall_progress(total_batches, total_batches)
    print(f"\n\nAll done! {total_files} files indexed.\n")

    # Nothing to query if no index was loaded or built
    if index is None:
        raise SystemExit("Nothing was indexed; check DOCS_PATH.")

    # Query engine with fallback
    query_engine = index.as_query_engine(
        similarity_top_k=3,
        response_mode="compact"
    )

    print("Ready! Type your questions (ctrl+c to quit)\n")
    while True:
        question = input("You: ")
        try:
            response = query_engine.query(question)
            if not str(response).strip() or "empty response" in str(response).lower():
                raise ValueError("No relevant docs found")
            print(f"\nAssistant: {response}\n")
        except Exception:
            # Fall back to the plain model when the index has no usable answer
            result = subprocess.run(
                ["ollama", "run", "MODEL_NAME"],
                input=question,
                capture_output=True,
                text=True,
                env={**os.environ, "OLLAMA_NOHISTORY": "1"}
            )
            print(f"\nAssistant: {result.stdout.strip()}\n")

UPDATE THE FOLLOWING IN BOLD:

DOCS_PATH = "**/DOCUMENTS/PATH/**"

MODEL_NAME = "**MODEL_NAME**"

& THE MODEL_NAME ON THIS LINE TOWARDS THE BOTTOM OF THE SCRIPT:

["ollama", "run", "**MODEL_NAME**"],

THEN SAVE WITH Ctrl+O, Enter, Ctrl+X. (If you're using nano.)
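Side note: the script asks for the model name in two places. A minimal sketch of one way to avoid the second edit, assuming the same `ollama run` fallback, is to build the command from the `MODEL_NAME` constant. The helper names `build_fallback_cmd` and `ask_fallback` are hypothetical and not part of the script above.

```python
import os
import subprocess

MODEL_NAME = "MODEL_NAME"  # placeholder, same constant as at the top of rag.py


def build_fallback_cmd(model_name: str) -> list[str]:
    """Return the CLI command used for the plain-Ollama fallback."""
    return ["ollama", "run", model_name]


def ask_fallback(question: str, model_name: str = MODEL_NAME) -> str:
    """Send the question to `ollama run` on stdin and return its reply."""
    result = subprocess.run(
        build_fallback_cmd(model_name),
        input=question,
        capture_output=True,
        text=True,
        env={**os.environ, "OLLAMA_NOHISTORY": "1"},
    )
    return result.stdout.strip()
```

With something like this in place, the `except` branch could call `ask_fallback(question)` and only the constant at the top would need updating.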
RUN THE RAG PROJECT USING THE FOLLOWING COMMAND:

    python rag.py

Next time you want to use it:

    cd rag-project
    source rag-env/bin/activate
    python rag.py

Included in the script is an env variable:

    env={**os.environ, "OLLAMA_NOHISTORY": "1"}

This allows sessions to log zero history for war crimes.

Also included in the script is a batch limit (5): it saves the index after every batch, so if the run is cancelled or runs out of memory, the batches already completed are kept. I also set the file size limit to 50MB for the first pass, leaving the larger PDFs for last. For those I'm going to edit the script to do batches of (1), because it requires a ton of RAM to complete these indexing tasks. Batch size 5 seems to be the sweet spot for me; if you're running 32GB or 64GB you can probably get away with (10) and (20-25) respectively.

This is all fairly new to me, so all credit for the heavy-lifting coding goes to Claude. So far my model is running uninhibited with just the injection utilizing a modelfile, but once this is finished it'll be equipped with the past 25 years of my incessant doc hoarding problem. I'm creating a monster.

submitted by /u/prozak4kidz
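One gap worth noting: the script saves the index after each batch, but it does not appear to record which files are already in it, so a restarted run would walk the full file list again and likely insert the same documents twice. A minimal sketch of one way to resume, assuming a small JSON manifest kept next to the index, is below. The file name `indexed_files.json` and the helpers `load_done`, `mark_done`, and `pending_files` are hypothetical, not part of the original script.

```python
import json
import os

MANIFEST_PATH = "./indexed_files.json"  # hypothetical location for the manifest


def load_done(manifest_path: str = MANIFEST_PATH) -> set[str]:
    """Return the set of file paths recorded as already indexed."""
    if not os.path.exists(manifest_path):
        return set()
    with open(manifest_path, "r", encoding="utf-8") as fh:
        return set(json.load(fh))


def mark_done(paths, manifest_path: str = MANIFEST_PATH) -> None:
    """Add paths to the manifest, writing via a temp file and os.replace."""
    done = load_done(manifest_path) | set(paths)
    tmp_path = manifest_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp_path, manifest_path)  # swap the new manifest in one step


def pending_files(all_files, manifest_path: str = MANIFEST_PATH) -> list[str]:
    """Keep only files that are not in the manifest, preserving order."""
    done = load_done(manifest_path)
    return [path for path in all_files if path not in done]
```

In the batch loop this would mean running `all_files` through `pending_files` before batching and calling `mark_done(batch)` right after `safe_save(index)`, so a restart only sees files that never made it into a saved index.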
Originally posted by u/prozak4kidz on r/ArtificialInteligence
