How to Build a B2B Document Extractor: Rule-Based vs. LLM Approaches
Introduction
Extracting structured data from B2B documents—such as purchase orders, invoices, or delivery notes—is a common challenge. Two primary approaches exist: a traditional rule-based method using pytesseract for OCR and regex for parsing, and a modern LLM-based method using Ollama with LLaMA 3. This guide walks you through building both versions of the same document extractor, comparing their strengths and tradeoffs using a realistic B2B order scenario. By the end, you'll be able to choose the right approach for your own projects.

What You Need
- Python 3.8+ installed on your machine
- pytesseract – Python wrapper for Tesseract OCR engine
- Tesseract OCR engine installed separately (see Tesseract OCR documentation)
- Ollama – local LLM server (download from ollama.com)
- LLaMA 3 model (run ollama pull llama3 after installing Ollama)
- Python libraries: pdf2image, Pillow, requests (re is part of Python's standard library and needs no install)
- A sample B2B PDF (e.g., a purchase order with fields: company name, date, line items, totals)
Step-by-Step Instructions
Step 1: Set Up the Environment
Create a new Python virtual environment and install all required packages:
pip install pytesseract pdf2image Pillow requests
Ensure Tesseract OCR is installed globally (sudo apt install tesseract-ocr on Linux, or download the Windows installer). pdf2image also depends on Poppler (sudo apt install poppler-utils on Linux). Then install and start Ollama, and pull the LLaMA 3 model:
ollama pull llama3
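Before moving on, it can help to confirm that the external tools are actually reachable from Python. The sketch below is one possible check, not part of the original setup. It assumes the executables are named tesseract, pdftoppm (the Poppler binary that pdf2image calls) and ollama, which is typical but may differ on your system. The helper name check_tools is hypothetical.

```python
import shutil


def check_tools(tools=("tesseract", "pdftoppm", "ollama")):
    """Return {tool_name: bool} telling whether each executable is on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    # Executable names are assumptions; adjust them for your installation.
    for tool, found in check_tools().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```

A missing entry usually means the tool is not installed or its folder is not on PATH.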
Step 2: Convert PDF to Images
B2B documents are often scanned PDFs. Use pdf2image to turn each page into an image. Write a function that:
- Takes the PDF path as input
- Converts pages to images using convert_from_path
- Returns a list of PIL Image objects
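The function described above could look roughly like this. It is a minimal sketch: the name pdf_to_images and the dpi=300 default are assumptions, and the up-front file check is an addition for clearer error messages.

```python
from pathlib import Path


def pdf_to_images(pdf_path, dpi=300):
    """Convert each page of a PDF into a PIL Image and return them as a list."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")
    # Imported lazily so the path check above works even without pdf2image.
    from pdf2image import convert_from_path
    # dpi=300 is an assumed setting: more detail for OCR, at the cost of memory.
    return convert_from_path(str(path), dpi=dpi)
```

Usage would be along the lines of pages = pdf_to_images("order.pdf"), where order.pdf is a placeholder file name.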
Step 3: Perform OCR with pytesseract
For each image, call pytesseract.image_to_string() to extract raw text. This step is identical for both rule-based and LLM approaches, as they both need the text first. Store the extracted text per page.
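A small wrapper for this step might look like the sketch below. The name ocr_pages and the optional ocr parameter are assumptions made for illustration. The parameter lets you swap in a different OCR callable (or a stub in tests), and by default the function calls pytesseract.image_to_string.

```python
def ocr_pages(images, ocr=None, lang="eng"):
    """Run OCR on each page image and return one text string per page."""
    if ocr is None:
        # Imported lazily so a custom `ocr` callable works without pytesseract.
        import pytesseract

        def ocr(image):
            return pytesseract.image_to_string(image, lang=lang)

    return [ocr(image) for image in images]
```

Keeping the text per page, rather than joining everything into one string, makes it easier to see which page a bad extraction came from.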
Step 4: Build the Rule-Based Extractor
Use regular expressions and string logic to locate fields like Order Number, Date, Client Name, and Line Items. For example:
- Search for patterns like r'Order\s*#:\s*(\S+)'
- Use a list of known product names for line items
- Parse multi-line blocks for tables
This method is fast and predictable, but fragile if the document format changes.
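As an illustration, a rule-based extractor might be sketched as follows. Everything about the layout is an assumption: the labels "Date:" and "Client:", the ISO date format, the three-column line-item rows, and the sample text are all hypothetical. For brevity, line items are matched with a row pattern here rather than a list of known product names.

```python
import re

# Assumed labels and formats; adjust each pattern to your own template.
FIELD_PATTERNS = {
    "order_number": re.compile(r"Order\s*#:\s*(\S+)", re.IGNORECASE),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "client_name": re.compile(r"Client:\s*(.+)", re.IGNORECASE),
}

# Assumed row layout: "<item name>  <quantity>  <unit price>" on one line.
LINE_ITEM = re.compile(
    r"^(?P<item>.+?)[ \t]+(?P<quantity>\d+)[ \t]+(?P<price>\d+\.\d{2})[ \t]*$",
    re.MULTILINE,
)


def extract_with_rules(text):
    """Extract header fields and line items; missing fields become None."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1).strip() if match else None
    result["line_items"] = [
        {
            "item": m["item"].strip(),
            "quantity": int(m["quantity"]),
            "price": float(m["price"]),
        }
        for m in LINE_ITEM.finditer(text)
    ]
    return result


# Hypothetical OCR output used only to exercise the patterns.
SAMPLE = """Order #: PO-1042
Date: 2024-03-15
Client: Acme Industrial Ltd
Steel Bolt M8  100  0.25
Copper Wire 2mm  20  14.90
Total: 323.00
"""

if __name__ == "__main__":
    print(extract_with_rules(SAMPLE))
```

The fragility is easy to see: if a supplier prints "PO Number" instead of "Order #", the order_number pattern returns None until someone adds a new rule.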

Step 5: Build the LLM-Based Extractor
Instead of writing rules, send the extracted text to LLaMA 3 via Ollama’s API. Send a structured prompt that asks the model to extract specific fields in JSON format:
prompt = f"""
Extract the following information from this purchase order:
- order_number
- date
- client_name
- line_items (array of objects with 'item', 'quantity', 'price')
Return only valid JSON.
Text:
{text}
"""
Use the requests library to call Ollama:
response = requests.post('http://localhost:11434/api/generate', json={'model':'llama3', 'prompt':prompt, 'stream':False})
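With stream set to False, Ollama returns a single JSON object; the model's output is the string in its response field. Parse that string as JSON to get the extracted fields.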
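Putting the prompt, the API call and the parsing together might look like the sketch below. The helper names (build_prompt, parse_model_json, extract_with_llm) and the 120-second timeout are assumptions. The parser is deliberately tolerant because models sometimes wrap the JSON in extra prose despite the "Return only valid JSON" instruction.

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as above


def build_prompt(text):
    """Build the extraction prompt for one document's OCR text."""
    return (
        "Extract the following information from this purchase order:\n"
        "- order_number\n"
        "- date\n"
        "- client_name\n"
        "- line_items (array of objects with 'item', 'quantity', 'price')\n"
        "Return only valid JSON.\n"
        f"Text:\n{text}\n"
    )


def parse_model_json(raw):
    """Parse the model's reply, tolerating prose around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])


def extract_with_llm(text, model="llama3", timeout=120):
    """Send the OCR text to a local Ollama server and return the parsed fields."""
    import requests  # imported here so the helpers above work without it

    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": build_prompt(text), "stream": False},
        timeout=timeout,  # assumed value; local generation can be slow
    )
    response.raise_for_status()
    # With stream=False the generated text is in the "response" field.
    return parse_model_json(response.json()["response"])
```

Parsing can still fail if the model emits malformed JSON, so callers should be ready to catch ValueError and retry or fall back.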
Step 6: Compare Outputs
Run both extractors on the same set of PDFs and compare:
- Accuracy: Which fields are correct?
- Robustness: How does each handle missing data or typos?
- Speed: Rule-based usually finishes in seconds; LLM may take 10–30 seconds per page.
The original experiment showed that the rule-based approach failed on a slightly different document format, while the LLM gracefully adapted—but hallucinated one item.
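One way to make this comparison repeatable is to score each extractor against hand-labelled expected values and time each run. The sketch below assumes each extractor returns a dict. The function names and the exact-match scoring are illustrative choices, not the method used in the original experiment.

```python
import time


def compare_fields(expected, actual):
    """Return {field: bool} marking which expected fields match exactly."""
    return {field: actual.get(field) == value for field, value in expected.items()}


def accuracy(expected, actual):
    """Fraction of expected fields that were extracted correctly."""
    checks = compare_fields(expected, actual)
    return sum(checks.values()) / len(checks) if checks else 0.0


def timed(extractor, text):
    """Run an extractor and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = extractor(text)
    return result, time.perf_counter() - start
```

Exact matching is strict: a hallucinated line item or a reformatted date counts as a miss, which is usually what you want when the output feeds an ordering system.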
Tips for Success
- Preprocess images: Apply thresholding or deskewing before OCR to improve accuracy. Both approaches consume the OCR text, and regex rules are especially sensitive to misread characters.
- Optimize LLM prompts: Include example outputs and specify format clearly to reduce hallucinations.
- Fallback strategy: Use rule-based extraction for well-known templates and LLM as a fallback for unknown documents.
- Test with diverse samples: Don’t rely on a single document; vary fonts, layouts, and printing quality.
- Monitor costs: Local LLMs have no per-token fee but need capable hardware (they run much faster on a GPU); cloud LLMs charge per token.
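The fallback strategy from the tips above can be sketched in a few lines. The required-field list and the function name are assumptions. Any two callables that take text and return a dict can be plugged in as the rule-based and LLM extractors.

```python
# Assumed minimum set of fields for a result to count as usable.
REQUIRED_FIELDS = ("order_number", "date", "client_name")


def extract_with_fallback(text, rule_extractor, llm_extractor, required=REQUIRED_FIELDS):
    """Try the rules first; fall back to the LLM if a required field is missing.

    Returns (result, source) where source is "rules" or "llm".
    """
    result = rule_extractor(text)
    if all(result.get(field) for field in required):
        return result, "rules"
    return llm_extractor(text), "llm"
```

Returning the source alongside the result makes it easy to log how often the LLM path is taken, which is a useful signal that a new template deserves its own rules.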
By following these steps, you can build your own B2B document extractor and decide which approach best fits your needs. For a deep dive into the original comparison, see the full article.