Publisher Content Ingestion
Built a publisher content ingestion pipeline from scratch using a multi-modal, graph-based agentic workflow. It ingests publisher content and helps publishers generate questions that test depth of knowledge.
THE PROBLEM
Publishers produce thousands of educational worksheets and textbooks across multiple languages and subjects. Extracting structured questions from these PDFs manually is expensive, error-prone, and doesn't scale.
Meanwhile, teachers need fresh, high-quality questions that test genuine depth of knowledge — not just surface-level recall. Generating such questions requires understanding subjects deeply, including formulas, multi-language content, and cognitive complexity levels.
Multi-Language
Support for Gujarati, Spanish, and more
4 Subjects
Maths, Chemistry, Biology, and Physics
Depth of Knowledge
Questions testing cognitive levels 1–5
SYSTEM ARCHITECTURE
FastAPI Application
Health
GET /health
GET /
API Router /api/v1/*
POST /upload
POST /mistral/extract
GET /documents
GET /documents/{id}
POST /documents/publish
Generation /api/v1/gen/*
POST /generate-questions
POST /generate-and-publish
GET /generated-questions
GET /generated-questions/stats
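The endpoint groups above can be read as prefix plus path. The sketch below only composes the full routes from the listing; it is not the FastAPI application itself. `ROUTE_GROUPS` and `full_routes` are hypothetical names, and mounting the health endpoints at the root is an assumption based on how they are listed.

```python
from typing import Dict, List, Tuple

# Prefix -> (method, path) pairs, copied from the endpoint listing.
# Assumption: the health endpoints are mounted at the root with no prefix.
ROUTE_GROUPS: Dict[str, List[Tuple[str, str]]] = {
    "": [("GET", "/health"), ("GET", "/")],
    "/api/v1": [
        ("POST", "/upload"),
        ("POST", "/mistral/extract"),
        ("GET", "/documents"),
        ("GET", "/documents/{id}"),
        ("POST", "/documents/publish"),
    ],
    "/api/v1/gen": [
        ("POST", "/generate-questions"),
        ("POST", "/generate-and-publish"),
        ("GET", "/generated-questions"),
        ("GET", "/generated-questions/stats"),
    ],
}


def full_routes(groups: Dict[str, List[Tuple[str, str]]]) -> List[str]:
    """Join each group prefix with its paths, e.g. 'POST /api/v1/upload'."""
    return [
        f"{method} {prefix}{path}"
        for prefix, routes in groups.items()
        for method, path in routes
    ]


routes = full_routes(ROUTE_GROUPS)
```

The two pipeline entry points resolve to `POST /api/v1/mistral/extract` and `POST /api/v1/gen/generate-questions`.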
Extraction Pipeline
AgentOrchestrator — LangGraph
Generation Pipeline
GenerationOrchestrator — LangGraph
Shared Services
PostgreSQL Database
Documents | Questions | DocumentImages
EXTRACTION PIPELINE
Extracts structured questions from uploaded educational worksheets and PDFs using an 8-agent sequential LangGraph workflow.
START
PDF input
OCR Extractor
Mistral Document AI — raw text, images, language detection
Structure Analyzer
Document format, question count, chunking strategy
Subject Classifier
Subjects, primary subject, has_formulas flag
has_formulas? (yes: run Universal Formula Cleanup; no: skip to Text Normalizer)
Universal Formula Cleanup
GPT-4o — Convert formulas to LaTeX
Text Normalizer
Spacing, blanks, LaTeX formatting fixes
Question Extractor
GPT-4o — Chunked structured question extraction
Language Specialist
Spelling correction, multi-language support
Final Validator
Schema checks, confidence scoring
END
Structured questions + quality report
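The flow above, including the has_formulas branch, can be sketched in plain Python. This is a simplified stand-in for the LangGraph workflow, not the actual orchestrator: every agent is a placeholder that only records that it ran, and the formula heuristic in `subject_classifier` is a hypothetical substitute for the real classifier.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def make_step(name: str) -> Callable[[State], State]:
    """Return a placeholder agent that only records that it ran."""
    def step(state: State) -> State:
        return {**state, "trace": state["trace"] + [name]}
    return step


def subject_classifier(state: State) -> State:
    # Hypothetical heuristic: treat '=' or '^' as a sign of formulas.
    flagged = any(symbol in state["raw_text"] for symbol in ("=", "^"))
    return {
        **state,
        "has_formulas": flagged,
        "trace": state["trace"] + ["subject_classifier"],
    }


def run_extraction(raw_text: str) -> State:
    """Run the agents in order, skipping formula cleanup when not needed."""
    state: State = {"raw_text": raw_text, "trace": []}
    state = make_step("ocr_extractor")(state)
    state = make_step("structure_analyzer")(state)
    state = subject_classifier(state)
    if state["has_formulas"]:
        # Conditional branch: only formula-bearing documents take this step.
        state = make_step("universal_formula_cleanup")(state)
    for name in ("text_normalizer", "question_extractor",
                 "language_specialist", "final_validator"):
        state = make_step(name)(state)
    return state


with_formulas = run_extraction("Solve x^2 = 4")
without_formulas = run_extraction("Name the largest planet.")
```

A document with formulas passes through all 8 agents; one without formulas passes through 7.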
GENERATION PIPELINE
Generates new questions from textbook and reference content. The pipeline runs 9 agents plus 4 subject-specific modules, and a quality-gated retry loop (capped at 3) accepts only output that scores ≥ 75.
START
PDF input
OCR Extractor
Mistral Document AI
Subject Classifier
Primary subject detection
Language Specialist
Language detection
Content Analyzer
Concepts, patterns, student errors, numerical values
Cognitive Level Planner
Levels 1–5, question types, concept combinations
Question Generator
4 subject modules (Maths, Chemistry, Biology, Physics); each uses Portkey + subject-specific prompts
Answer Verifier
SymPy-based mathematical answer verification
Distractor Generator
Plausible wrong MCQ answer options
Quality Validator
Rubric-based scoring — pass threshold ≥ 75
END
Generated questions + quality scores
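The quality gate can be sketched as a bounded retry loop. This is a minimal illustration, not the GenerationOrchestrator: `generate_with_quality_gate`, the stand-in generator, and the stand-in scorer are hypothetical. It assumes the cap of 3 from the comparison table counts total attempts, and that the best-scoring candidate is returned when nothing passes.

```python
from typing import Callable, Dict, Tuple

PASS_THRESHOLD = 75  # pass threshold stated for the Quality Validator
MAX_ATTEMPTS = 3     # assumption: the cap of 3 counts total attempts


def generate_with_quality_gate(
    generate: Callable[[int], Dict],
    score: Callable[[Dict], float],
) -> Tuple[Dict, float, int]:
    """Regenerate until a candidate passes the gate or attempts run out.

    Returns (candidate, score, attempts_used). If nothing passes, the
    best-scoring candidate is returned; that fallback is an assumption.
    """
    best: Dict = {}
    best_score = float("-inf")
    for attempt in range(1, MAX_ATTEMPTS + 1):
        candidate = generate(attempt)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
        if candidate_score >= PASS_THRESHOLD:
            return candidate, candidate_score, attempt
    return best, best_score, MAX_ATTEMPTS


# Stand-in generator and rubric: each attempt scores higher than the last.
fake_scores = {1: 60.0, 2: 70.0, 3: 80.0}
result = generate_with_quality_gate(
    generate=lambda attempt: {"attempt": attempt},
    score=lambda candidate: fake_scores[candidate["attempt"]],
)
```

Here the first two attempts fall below 75, so the loop returns the third candidate.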
AGENT INVENTORY
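Extraction-Only Agents
- Structure Analyzer
- Universal Formula Cleanup
- Text Normalizer
- Question Extractor
- Final Validator
Generation-Only Agents
- Content Analyzer
- Cognitive Level Planner
- Question Generator
- Answer Verifier
- Distractor Generator
- Quality Validator
Shared Agents (Both Pipelines)
- OCR Extractor
- Subject Classifier
- Language Specialist
Subject Generation Modules
- Maths
- Chemistry
- Biology
- Physics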
SERVICES LAYER
LLM Providers
- MistralService — Document OCR
- OpenAIService — GPT-4o + Vision
- PortkeyService — Claude Sonnet, multi-provider routing
Processing
- MathSolverService — SymPy verification
- ErrorSolverService — Distractor calculation
- KaTeXFormatter — LaTeX/KaTeX formatting
- ImageAnalyzer — GPT-4o Vision
- FormulaEnrichment — Enrich formulas
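One way distractor calculation could work is to replay a problem with common student mistakes and keep the distinct wrong results. The sketch below is an illustration under assumptions, not the ErrorSolverService: `ERROR_MODELS`, `build_distractors`, and the example problem "a * b + c" are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical error models for the expression "a * b + c". Each returns
# the answer a student would reach by making one specific mistake.
ERROR_MODELS: Dict[str, Callable[[int, int, int], int]] = {
    "added before multiplying": lambda a, b, c: a * (b + c),
    "dropped the constant": lambda a, b, c: a * b,
    "added instead of multiplying": lambda a, b, c: a + b + c,
    "sign error on the constant": lambda a, b, c: a * b - c,
}


def build_distractors(a: int, b: int, c: int, count: int = 3) -> List[int]:
    """Collect up to `count` distinct wrong answers for a * b + c."""
    correct = a * b + c
    distractors: List[int] = []
    for model in ERROR_MODELS.values():
        wrong = model(a, b, c)
        # Skip mistakes that happen to give the right answer or a duplicate.
        if wrong != correct and wrong not in distractors:
            distractors.append(wrong)
        if len(distractors) == count:
            break
    return distractors


options = build_distractors(3, 4, 5)
```

Filtering out values equal to the correct answer matters for edge cases such as c = 0, where several error models collapse onto the right result.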
Storage & Publishing
- ImageStorageService — uploads/images/{doc_id}/
- AgentDocumentService — Extraction graph runner
- SchemaConverterService — Internal to Quizizz schema
- QuizizzPublisherService — Publish to platform
DATABASE SCHEMA
documents
questions
document_images
STATE FLOW
DocumentState (Extraction)
QuestionGenerationState
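In LangGraph, each node returns a partial update that is merged into the shared state. The sketch below shows that pattern for the extraction state only. The field names are assumptions inferred from the agent descriptions, not the actual `DocumentState` definition, and `apply_update` is a hypothetical helper that stands in for the graph's merge step.

```python
from typing import List, TypedDict


class DocumentState(TypedDict, total=False):
    # Field names are assumptions inferred from the agent descriptions.
    raw_text: str
    language: str
    primary_subject: str
    has_formulas: bool
    questions: List[dict]
    confidence: float


def apply_update(state: DocumentState, update: DocumentState) -> DocumentState:
    """Merge one agent's partial output into the state without mutating it."""
    merged: DocumentState = {**state, **update}
    return merged


# Each agent contributes only the keys it is responsible for.
initial: DocumentState = {"raw_text": "2 + 2 = ?"}
after_ocr = apply_update(initial, {"language": "en"})
after_classifier = apply_update(
    after_ocr, {"primary_subject": "Maths", "has_formulas": True}
)
```

Returning a new dictionary at each step keeps earlier snapshots intact, which makes it easier to inspect what each agent changed.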
PIPELINE COMPARISON
| Aspect | Extraction Pipeline | Generation Pipeline |
|---|---|---|
| Purpose | Extract existing questions from worksheets | Generate new questions from reference content |
| Entry Point | POST /api/v1/mistral/extract | POST /api/v1/gen/generate-questions |
| Orchestrator | AgentOrchestrator | GenerationOrchestrator |
| Agents | 8 sequential agents | 9 agents + 4 subject modules |
| LLM Usage | Mistral (OCR) + GPT-4o (extraction) | Mistral (OCR) + Portkey/Claude (gen) + SymPy |
| Routing | Formula cleanup bypass for non-science | Quality gate retry loop (≤3, score ≥ 75) |
| Output | Structured questions from document text | Novel questions with verified answers & distractors |
Built with Python, LangGraph, FastAPI, and PostgreSQL