Bring your own Content

Publisher Content Ingestion

Built a publisher content ingestion pipeline from scratch as a multi-modal, graph-based agentic workflow. It ingests content from publishers and helps them generate questions that test depth of knowledge.

Python | TypeScript | LangGraph | Agentic Workflow | Multi Modal | FastAPI | PostgreSQL | Mistral AI | OpenAI GPT-4o | Portkey

THE PROBLEM

Publishers produce thousands of educational worksheets and textbooks across multiple languages and subjects. Extracting structured questions from these PDFs manually is expensive, error-prone, and doesn't scale.

Meanwhile, teachers need fresh, high-quality questions that test genuine depth of knowledge — not just surface-level recall. Generating such questions requires understanding subjects deeply, including formulas, multi-language content, and cognitive complexity levels.

Multi-Language

Support for Gujarati, Spanish, and more

4 Subjects

Maths, Chemistry, Biology, and Physics

Depth of Knowledge

Questions testing cognitive levels 1-5

SYSTEM ARCHITECTURE

FastAPI Application

Health

GET /health

GET /

API Router /api/v1/*

POST /upload

POST /mistral/extract

GET /documents

GET /documents/{id}

POST /documents/publish

Generation /api/v1/gen/*

POST /generate-questions

POST /generate-and-publish

GET /generated-questions

GET /generated-questions/stats
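
The route layout above can be summarized as a small lookup table. This is only a sketch: the methods and paths are taken from the list above, while the `ROUTES` keys, the `endpoint_url` helper, and the base URL in the usage note are hypothetical names chosen for illustration.

```python
from typing import Dict, Tuple

API_PREFIX = "/api/v1"
GEN_PREFIX = "/api/v1/gen"

# Methods and paths come from the endpoint list; the dictionary keys are made up.
ROUTES: Dict[str, Tuple[str, str]] = {
    "health": ("GET", "/health"),
    "root": ("GET", "/"),
    "upload": ("POST", f"{API_PREFIX}/upload"),
    "extract": ("POST", f"{API_PREFIX}/mistral/extract"),
    "list_documents": ("GET", f"{API_PREFIX}/documents"),
    "get_document": ("GET", f"{API_PREFIX}/documents/{{id}}"),
    "publish_documents": ("POST", f"{API_PREFIX}/documents/publish"),
    "generate_questions": ("POST", f"{GEN_PREFIX}/generate-questions"),
    "generate_and_publish": ("POST", f"{GEN_PREFIX}/generate-and-publish"),
    "list_generated": ("GET", f"{GEN_PREFIX}/generated-questions"),
    "generated_stats": ("GET", f"{GEN_PREFIX}/generated-questions/stats"),
}


def endpoint_url(base_url: str, name: str, **path_params: str) -> Tuple[str, str]:
    """Return (HTTP method, absolute URL) for a named route."""
    method, path = ROUTES[name]
    return method, base_url.rstrip("/") + path.format(**path_params)
```

For example, `endpoint_url("http://localhost:8000", "get_document", id="42")` resolves the `{id}` placeholder in the document route; the host and port are placeholders, not values from the project.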

Extraction Pipeline

AgentOrchestrator — LangGraph

Generation Pipeline

GenerationOrchestrator — LangGraph

Shared Services

Mistral OCR | OpenAI GPT-4o | Portkey (Claude) | MathSolver (SymPy) | KaTeX Formatter | Image Analyzer

PostgreSQL Database

Documents | Questions | DocumentImages

EXTRACTION PIPELINE

Extracts structured questions from uploaded educational worksheets and PDFs using an 8-agent sequential LangGraph workflow.

START

PDF input

OCR Extractor

Mistral Document AI — raw text, images, language detection

Structure Analyzer

Document format, question count, chunking strategy

Subject Classifier

Subjects, primary subject, has_formulas flag

has_formulas?

YES: Formula Cleanup
NO: Skip to Normalizer

Universal Formula Cleanup

GPT-4o — Convert formulas to LaTeX

Text Normalizer

Spacing, blanks, LaTeX formatting fixes

Question Extractor

GPT-4o — Chunked structured question extraction

Language Specialist

Spelling correction, multi-language support

Final Validator

Schema checks, confidence scoring

END

Structured questions + quality report
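
The control flow above, a fixed sequence with one conditional branch on `has_formulas`, can be sketched in plain Python. This is a simplified illustration and not the project's LangGraph code: each agent is replaced by a stub that only records that it ran, and the function and variable names are assumptions.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def make_stub(name: str) -> Callable[[State], State]:
    """Stand-in for a real agent: it only appends its name to a trace."""
    def step(state: State) -> State:
        return {**state, "trace": state.get("trace", []) + [name]}
    return step


BEFORE_BRANCH: List[str] = ["ocr_extractor", "structure_analyzer", "subject_classifier"]
AFTER_BRANCH: List[str] = [
    "text_normalizer",
    "question_extractor",
    "language_specialist",
    "final_validator",
]


def run_extraction(state: State) -> State:
    for name in BEFORE_BRANCH:
        state = make_stub(name)(state)
    # Conditional edge: formula cleanup runs only when has_formulas is set.
    if state.get("has_formulas"):
        state = make_stub("universal_formula_cleanup")(state)
    for name in AFTER_BRANCH:
        state = make_stub(name)(state)
    return state
```

In the real pipeline the Subject Classifier would set `has_formulas`; here the caller supplies it so the two paths (eight steps with formulas, seven without) are easy to compare.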

GENERATION PIPELINE

Generates new questions from textbook or reference content behind a quality gate. The workflow uses 9 agents plus 4 subject-specific modules, and a retry loop (≤3 retries) reruns generation until the rubric score reaches ≥ 75.

START

PDF input

OCR Extractor

Mistral Document AI

Subject Classifier

Primary subject detection

Language Specialist

Language detection

Content Analyzer

Concepts, patterns, student errors, numerical values

Cognitive Level Planner

Levels 1–5, question types, concept combinations

Question Generator

Physics
Chemistry
Mathematics
Biology

Each uses Portkey + subject-specific prompts

Answer Verifier

SymPy-based mathematical answer verification

Distractor Generator

Plausible wrong MCQ answer options

Quality Validator

Rubric-based scoring — pass threshold ≥ 75

Pass → END
Fail → Retry (≤3)

END

Generated questions + quality scores
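
The quality gate at the end of this flow can be sketched as a bounded retry loop. The threshold of 75 and the limit of 3 come from the description above; reading "≤3" as one initial attempt plus up to three retries is an assumption, and the `generate` and `score` callables are hypothetical stand-ins for the Question Generator and Quality Validator.

```python
from typing import Any, Callable, Dict, List

PASS_THRESHOLD = 75  # rubric score needed to pass
MAX_RETRIES = 3      # assumed to mean retries after the first attempt


def generate_with_quality_gate(
    generate: Callable[[int], List[str]],
    score: Callable[[List[str]], float],
) -> Dict[str, Any]:
    """Regenerate until the score passes or the retry budget is spent."""
    result: Dict[str, Any] = {}
    for attempt in range(1 + MAX_RETRIES):
        questions = generate(attempt)
        quality = score(questions)
        result = {
            "generated_questions": questions,
            "quality_score": quality,
            "validation_passed": quality >= PASS_THRESHOLD,
            "retries": attempt,
        }
        if result["validation_passed"]:
            break
    return result
```

When every attempt fails, the sketch returns the last attempt with `validation_passed` set to `False`; how the real pipeline handles an exhausted retry budget is not stated in the source.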

AGENT INVENTORY

Extraction-Only Agents

Structure Analyzer: Detect document format, question boundaries, chunking strategy
Universal Formula Cleanup: Convert formulas to LaTeX via GPT-4o
Text Normalizer: Fix spacing, blanks, LaTeX formatting
Question Extractor: GPT-4o structured question extraction
Final Validator: Schema validation + confidence scoring

Generation-Only Agents

Content Analyzer: Extract concepts, patterns, common student errors
Cognitive Level Planner: Plan cognitive levels 1–5 and question types
Question Generator: Route to subject-specific generation modules
Answer Verifier: SymPy-based mathematical answer verification
Distractor Generator: Generate plausible wrong MCQ options
Quality Validator: Rubric-based quality scoring (pass ≥ 75)

Shared Agents (Both Pipelines)

OCR Extractor: Mistral Document AI OCR + language detection
Subject Classifier: Chemistry/physics/math symbol-based classification
Language Specialist: Multi-language spelling correction
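
Symbol-based classification of the kind the Subject Classifier performs might look roughly like the sketch below. The regular expressions, the `classify` function, and the way `has_formulas` is derived are all illustrative assumptions; the project's actual rules are not documented here.

```python
import re
from typing import Any, Dict, List

# Hypothetical symbol patterns per subject, for illustration only.
SUBJECT_PATTERNS: Dict[str, List[str]] = {
    "chemistry": [r"\b(?:H2O|CO2|NaCl|mol)\b"],
    "physics": [r"\bm/s\b", r"\b(?:kg|joule|newton)\b"],
    "mathematics": [r"[∫∑√]", r"\b(?:sin|cos|tan|log)\b"],
}

# Assumed formula markers: an equals sign, a caret, common math symbols, or \frac.
FORMULA_PATTERN = re.compile(r"[=^∫∑√]|\\frac")


def classify(text: str) -> Dict[str, Any]:
    """Count pattern hits per subject and flag formula-like content."""
    hits = {
        subject: sum(len(re.findall(pattern, text)) for pattern in patterns)
        for subject, patterns in SUBJECT_PATTERNS.items()
    }
    subjects = sorted((s for s, n in hits.items() if n > 0), key=lambda s: -hits[s])
    return {
        "subjects": subjects,
        "primary_subject": subjects[0] if subjects else None,
        "has_formulas": bool(FORMULA_PATTERN.search(text)),
    }
```

The returned keys mirror the `subjects`, `primary_subject`, and `has_formulas` fields that appear in the pipeline state.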

Subject Generation Modules

Physics Module: Mechanics, vectors, energy, dimensional analysis
Chemistry Module: Physical, organic, inorganic chemistry
Mathematics Module: Algebra, calculus, domain/range
Biology Module: NEET-style recall and reasoning

SERVICES LAYER

LLM Providers

  • MistralService — Document OCR
  • OpenAIService — GPT-4o + Vision
  • PortkeyService — Claude Sonnet, multi-provider routing

Processing

  • MathSolverService — SymPy verification
  • ErrorSolverService — Distractor calculation
  • KaTeXFormatter — LaTeX/KaTeX formatting
  • ImageAnalyzer — GPT-4o Vision
  • FormulaEnrichment — Enrich formulas
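
As a rough idea of what SymPy-based verification in `MathSolverService` could involve, the sketch below checks whether a generated answer is symbolically equal to a reference answer. The `answers_match` function is a hypothetical name, and the real service may verify answers differently.

```python
import sympy


def answers_match(claimed: str, expected: str) -> bool:
    """Return True when the two expressions simplify to the same value."""
    difference = sympy.sympify(claimed) - sympy.sympify(expected)
    return sympy.simplify(difference) == 0
```

`answers_match("(x + 1)**2", "x**2 + 2*x + 1")` treats the expanded and factored forms as the same answer. Note that `sympy.sympify` evaluates its input string, so it should only be given trusted or sanitized text.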

Storage & Publishing

  • ImageStorageService — uploads/images/{doc_id}/
  • AgentDocumentService — Extraction graph runner
  • SchemaConverterService — Internal to Quizizz schema
  • QuizizzPublisherService — Publish to platform

DATABASE SCHEMA

documents

id (PK)
filename
raw_text
processed_text
context_json
quality_score
status
created_at
updated_at

questions

id (PK)
document_id (FK)
question_data_json
created_at

document_images

id (PK)
document_id (FK)
image_path
image_type
created_at
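
The three tables above can be sketched as DDL. To keep the example self-contained it runs against an in-memory SQLite database rather than PostgreSQL, and every column type, constraint, and default is an assumption; only the table names, column names, and key relationships come from the schema listing.

```python
import sqlite3

# SQLite stand-in for the PostgreSQL schema; column types are assumed.
SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    raw_text TEXT,
    processed_text TEXT,
    context_json TEXT,
    quality_score REAL,
    status TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE questions (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    question_data_json TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE document_images (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    image_path TEXT NOT NULL,
    image_type TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three tables on the given connection."""
    conn.executescript(SCHEMA)
```

Both `questions` and `document_images` hang off `documents` through `document_id`, so a single document row can own many extracted questions and many stored images.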

STATE FLOW

DocumentState (Extraction)

pdf_bytes
raw_text
document_structure
subjects
has_formulas
formula_conversions
normalized_text
final_text
questions[]
quality_report
overall_confidence

QuestionGenerationState

pdf_bytes
raw_text
primary_subject
extracted_concepts
key_topics
problem_solving_patterns
common_student_errors
question_plan
target_complexity_distribution
generated_questions[]
quality_scores
validation_passed
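
LangGraph state is commonly declared as a `TypedDict`, so the two field lists above might map onto classes like the following. The field names are taken from the lists; every type annotation is a guess made for illustration.

```python
from typing import Any, Dict, List, TypedDict


class DocumentState(TypedDict, total=False):
    """Extraction pipeline state; all types are assumptions."""
    pdf_bytes: bytes
    raw_text: str
    document_structure: Dict[str, Any]
    subjects: List[str]
    has_formulas: bool
    formula_conversions: List[Dict[str, str]]
    normalized_text: str
    final_text: str
    questions: List[Dict[str, Any]]
    quality_report: Dict[str, Any]
    overall_confidence: float


class QuestionGenerationState(TypedDict, total=False):
    """Generation pipeline state; all types are assumptions."""
    pdf_bytes: bytes
    raw_text: str
    primary_subject: str
    extracted_concepts: List[str]
    key_topics: List[str]
    problem_solving_patterns: List[str]
    common_student_errors: List[str]
    question_plan: Dict[str, Any]
    target_complexity_distribution: Dict[str, float]
    generated_questions: List[Dict[str, Any]]
    quality_scores: Dict[str, float]
    validation_passed: bool
```

`total=False` lets a state start with only `pdf_bytes` and gain fields as each agent runs, which matches the way the flow fills in the state step by step.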

PIPELINE COMPARISON

Aspect | Extraction Pipeline | Generation Pipeline
Purpose | Extract existing questions from worksheets | Generate new questions from reference content
Entry Point | POST /api/v1/mistral/extract | POST /api/v1/gen/generate-questions
Orchestrator | AgentOrchestrator | GenerationOrchestrator
Agents | 8 sequential agents | 9 agents + 4 subject modules
LLM Usage | Mistral (OCR) + GPT-4o (extraction) | Mistral (OCR) + Portkey/Claude (gen) + SymPy
Routing | Formula cleanup bypassed when has_formulas is false | Quality gate retry loop (≤3, score ≥ 75)
Output | Structured questions from document text | Novel questions with verified answers & distractors
Built with Python, LangGraph, FastAPI, and PostgreSQL