Bring your own Content

Publisher Content Ingestion

Built a publisher content ingestion pipeline from scratch as a multi-modal, graph-based agentic workflow. It ingests content from publishers and helps them generate questions that test depth of knowledge.

Python, TypeScript, LangGraph, Agentic Workflow, Multi-Modal, FastAPI, PostgreSQL, Mistral AI, OpenAI GPT-4o, Portkey

THE PROBLEM

Publishers produce thousands of educational worksheets and textbooks across multiple languages and subjects. Extracting structured questions from these PDFs manually is expensive, error-prone, and doesn't scale.

Meanwhile, teachers need fresh, high-quality questions that test genuine depth of knowledge — not just surface-level recall. Generating such questions requires understanding subjects deeply, including formulas, multi-language content, and cognitive complexity levels.

Multi-Language

Support for Gujarati, Spanish, and more

4 Subjects

Maths, Chemistry, Biology, and Physics

Depth of Knowledge

Questions testing cognitive levels 1-5

SYSTEM ARCHITECTURE

FastAPI Application

Health

GET /health

GET /

API Router /api/v1/*

POST /upload

POST /mistral/extract

GET /documents

GET /documents/{id}

POST /documents/publish

Generation /api/v1/gen/*

POST /generate-questions

POST /generate-and-publish

GET /generated-questions

GET /generated-questions/stats
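
The route listing above can be summarized as a small lookup table. This is a minimal sketch in plain Python rather than FastAPI: the prefixes and paths are copied from the listing, while the names `ROUTES` and `full_paths` are hypothetical and only serve the illustration.

```python
# Route prefixes and paths copied from the listing above.
# ROUTES and full_paths are illustrative names, not part of the project.
ROUTES = {
    "": [("GET", "/health"), ("GET", "/")],
    "/api/v1": [
        ("POST", "/upload"),
        ("POST", "/mistral/extract"),
        ("GET", "/documents"),
        ("GET", "/documents/{id}"),
        ("POST", "/documents/publish"),
    ],
    "/api/v1/gen": [
        ("POST", "/generate-questions"),
        ("POST", "/generate-and-publish"),
        ("GET", "/generated-questions"),
        ("GET", "/generated-questions/stats"),
    ],
}


def full_paths() -> list[tuple[str, str]]:
    """Join each prefix with its paths to get (method, full path) pairs."""
    return [
        (method, prefix + path)
        for prefix, paths in ROUTES.items()
        for method, path in paths
    ]
```

Calling `full_paths()` yields pairs such as `("POST", "/api/v1/mistral/extract")`, which matches the entry points named in the pipeline comparison.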

Extraction Pipeline

AgentOrchestrator — LangGraph

Generation Pipeline

GenerationOrchestrator — LangGraph

Shared Services

Mistral OCR, OpenAI GPT-4o, Portkey (Claude), MathSolver (SymPy), KaTeX Formatter, Image Analyzer

PostgreSQL Database

Documents | Questions | DocumentImages

EXTRACTION PIPELINE

Extracts structured questions from uploaded educational worksheets and PDFs using an 8-agent sequential LangGraph workflow.

START

PDF input

OCR Extractor

Mistral Document AI — raw text, images, language detection

Structure Analyzer

Document format, question count, chunking strategy

Subject Classifier

Subjects, primary subject, has_formulas flag

has_formulas?

YES: Formula Cleanup

NO: Skip to Normalizer

Universal Formula Cleanup

GPT-4o — Convert formulas to LaTeX

Text Normalizer

Spacing, blanks, LaTeX formatting fixes

Question Extractor

GPT-4o — Chunked structured question extraction

Language Specialist

Spelling correction, multi-language support

Final Validator

Schema checks, confidence scoring

END

Structured questions + quality report
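
The flow above is a linear chain with one conditional branch on `has_formulas`. The sketch below models that routing in plain Python instead of LangGraph, so it shows only the control flow. The agents are placeholders that record their own names, and in this sketch the caller supplies `has_formulas` rather than the Subject Classifier computing it. All function and variable names here are assumptions.

```python
from typing import Callable

State = dict  # stand-in for the DocumentState described in the State Flow section


def make_step(name: str) -> Callable[[State], State]:
    """Build a placeholder agent that only appends its name to a trace."""
    def step(state: State) -> State:
        return {**state, "trace": state.get("trace", []) + [name]}
    return step


# Agent order taken from the flow above.
BEFORE_BRANCH = ["ocr_extractor", "structure_analyzer", "subject_classifier"]
AFTER_BRANCH = [
    "text_normalizer",
    "question_extractor",
    "language_specialist",
    "final_validator",
]


def run_extraction(state: State) -> State:
    for name in BEFORE_BRANCH:
        state = make_step(name)(state)
    # Conditional edge: formula cleanup runs only when has_formulas is set.
    if state.get("has_formulas"):
        state = make_step("universal_formula_cleanup")(state)
    for name in AFTER_BRANCH:
        state = make_step(name)(state)
    return state
```

With `has_formulas` set, all 8 agents run. Without it, the trace goes straight from `subject_classifier` to `text_normalizer`.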

GENERATION PIPELINE

Generates new questions from textbook and reference content behind a quality gate. The workflow uses 9 agents plus 4 subject-specific modules, and a question that scores below 75 is sent back for regeneration, up to 3 retries.

START

PDF input

OCR Extractor

Mistral Document AI

Subject Classifier

Primary subject detection

Language Specialist

Language detection

Content Analyzer

Concepts, patterns, student errors, numerical values

Cognitive Level Planner

Levels 1–5, question types, concept combinations

Question Generator

Physics

Chemistry

Mathematics

Biology

Each uses Portkey + subject-specific prompts

Answer Verifier

SymPy-based mathematical answer verification

Distractor Generator

Plausible wrong MCQ answer options

Quality Validator

Rubric-based scoring — pass threshold ≥ 75

Pass → END

Fail → Retry (≤3)

END

Generated questions + quality scores
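
The quality gate at the end of this flow can be sketched as a bounded retry loop. This is a plain-Python illustration, not the LangGraph implementation. It assumes "≤3" means up to three retries after the first attempt, and the callables `generate` and `score` are hypothetical stand-ins for the Question Generator and Quality Validator.

```python
from typing import Callable

PASS_THRESHOLD = 75  # from the Quality Validator description
MAX_RETRIES = 3      # from "Retry (≤3)"; read here as retries after the first attempt


def generate_with_quality_gate(
    generate: Callable[[int], dict],
    score: Callable[[dict], float],
) -> dict:
    """Regenerate until the score passes or the retries are used up."""
    attempt = 0
    while True:
        question = generate(attempt)
        quality = score(question)
        passed = quality >= PASS_THRESHOLD
        if passed or attempt >= MAX_RETRIES:
            return {
                "question": question,
                "quality_score": quality,
                "validation_passed": passed,
                "retries": attempt,
            }
        attempt += 1
```

Returning the last attempt with `validation_passed` set to false, instead of raising, lets the caller decide whether to drop or flag a question that never cleared the gate. That choice is part of the sketch, not something the source specifies.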

AGENT INVENTORY

Extraction-Only Agents

Structure Analyzer: Detect document format, question boundaries, chunking strategy

Universal Formula Cleanup: Convert formulas to LaTeX via GPT-4o

Text Normalizer: Fix spacing, blanks, LaTeX formatting

Question Extractor: GPT-4o structured question extraction

Final Validator: Schema validation + confidence scoring
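
To make the idea of question boundaries and chunking concrete, here is a minimal sketch that splits OCR text at numbered question starts and groups the result into fixed-size chunks. The boundary pattern and both function names are assumptions. The actual Structure Analyzer's rules are not shown in the source.

```python
import re

# Assumed boundary: a line starting with "1.", "2)", "Q3." and similar.
QUESTION_START = re.compile(r"^\s*(?:Q\s*)?\d+[.)]\s+", re.MULTILINE)


def split_questions(text: str) -> list[str]:
    """Split text at question starts; text before the first start is dropped."""
    starts = [m.start() for m in QUESTION_START.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def chunk(questions: list[str], size: int) -> list[list[str]]:
    """Group questions into chunks of at most `size` items."""
    return [questions[i:i + size] for i in range(0, len(questions), size)]
```

Each chunk could then be sent to the extraction model in its own request, which keeps prompts short on long worksheets.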

Generation-Only Agents

Content Analyzer: Extract concepts, patterns, common student errors

Cognitive Level Planner: Plan cognitive levels 1–5 and question types

Question Generator: Route to subject-specific generation modules

Answer Verifier: SymPy-based mathematical answer verification

Distractor Generator: Generate plausible wrong MCQ options

Quality Validator: Rubric-based quality scoring (pass ≥ 75)
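
One way to produce plausible wrong options is to apply typical mistakes to the correct answer. The sketch below illustrates that idea with exact fractions. The error models and the function name are hypothetical, and the source does not state how the Distractor Generator actually works.

```python
from fractions import Fraction

# Hypothetical error models; each maps the correct answer to a typical wrong one.
ERROR_MODELS = {
    "sign_error": lambda x: -x,
    "doubled": lambda x: x * 2,
    "halved": lambda x: x / 2,
    "off_by_factor_of_ten": lambda x: x * 10,
}


def make_distractors(correct: Fraction, count: int = 3) -> list[Fraction]:
    """Return up to `count` distinct wrong answers, none equal to `correct`."""
    seen = {correct}
    distractors: list[Fraction] = []
    for model in ERROR_MODELS.values():
        candidate = model(correct)
        if candidate not in seen:
            seen.add(candidate)
            distractors.append(candidate)
        if len(distractors) == count:
            break
    return distractors
```

Tracking `seen` matters for edge cases: when the correct answer is 0, every model above returns 0 again, so the function returns an empty list instead of duplicating the key.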

Shared Agents (Both Pipelines)

OCR Extractor: Mistral Document AI OCR + language detection

Subject Classifier: Chemistry/physics/math symbol-based classification

Language Specialist: Multi-language spelling correction
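
Symbol-based classification can be illustrated with a few regular expressions that count subject-specific symbols. The patterns below are assumptions chosen for the example, since the real classifier's rules are not given. Biology has no symbol pattern in this sketch, so text without matches gets no primary subject.

```python
import re

# Hypothetical symbol patterns; the real classifier's rules are not shown.
SUBJECT_PATTERNS = {
    "chemistry": re.compile(r"\b(?:H2O|CO2|NaCl|mol)\b|→|⇌"),
    "physics": re.compile(r"\b(?:m/s|kg|N|J)\b|Δ"),
    "mathematics": re.compile(r"[∫∑√]|\\frac|\bdx\b"),
}
# Any of these characters is treated as a sign that formulas are present.
FORMULA_HINT = re.compile(r"[=^∫∑√→⇌]|\\frac")


def classify(text: str) -> dict:
    """Count symbol hits per subject and pick the subject with the most hits."""
    counts = {s: len(p.findall(text)) for s, p in SUBJECT_PATTERNS.items()}
    subjects = [s for s, n in counts.items() if n > 0]
    primary = max(subjects, key=counts.get) if subjects else None
    return {
        "subjects": subjects,
        "primary_subject": primary,
        "has_formulas": bool(FORMULA_HINT.search(text)),
    }
```

The returned keys mirror the `subjects`, `primary_subject`, and `has_formulas` fields listed in the pipeline state.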

Subject Generation Modules

Physics Module: Mechanics, vectors, energy, dimensional analysis

Chemistry Module: Physical, organic, inorganic chemistry

Mathematics Module: Algebra, calculus, domain/range

Biology Module: NEET-style recall and reasoning
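
Routing to a subject module amounts to a lookup keyed by the primary subject. In this minimal sketch the focus areas come from the list above, while the prompt layout and the name `build_prompt` are assumptions. The real modules send subject-specific prompts through Portkey, which is not reproduced here.

```python
# Focus areas copied from the module list above.
SUBJECT_FOCUS = {
    "physics": "mechanics, vectors, energy, dimensional analysis",
    "chemistry": "physical, organic, inorganic chemistry",
    "mathematics": "algebra, calculus, domain/range",
    "biology": "NEET-style recall and reasoning",
}


def build_prompt(subject: str, concept: str, level: int) -> str:
    """Assemble a subject-specific prompt header (hypothetical layout)."""
    if subject not in SUBJECT_FOCUS:
        raise ValueError(f"no generation module for subject: {subject}")
    if not 1 <= level <= 5:
        raise ValueError("cognitive level must be between 1 and 5")
    return (
        f"Subject: {subject} ({SUBJECT_FOCUS[subject]})\n"
        f"Concept: {concept}\n"
        f"Cognitive level: {level}"
    )
```

Failing fast on an unknown subject or an out-of-range level keeps a bad plan from reaching the model call.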

SERVICES LAYER

LLM Providers

MistralService — Document OCR
OpenAIService — GPT-4o + Vision
PortkeyService — Claude Sonnet, multi-provider routing

Processing

MathSolverService — SymPy verification
ErrorSolverService — Distractor calculation
KaTeXFormatter — LaTeX/KaTeX formatting
ImageAnalyzer — GPT-4o Vision
FormulaEnrichment — Enrich formulas

Storage & Publishing

ImageStorageService — uploads/images/{doc_id}/
AgentDocumentService — Extraction graph runner
SchemaConverterService — Internal to Quizizz schema
QuizizzPublisherService — Publish to platform

DATABASE SCHEMA

documents

id (PK)

filename

raw_text

processed_text

context_json

quality_score

status

created_at

updated_at

questions

id (PK)

document_id (FK)

question_data_json

created_at

document_images

id (PK)

document_id (FK)

image_path

image_type

created_at
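
The three tables above can be written out as DDL. The sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server. Table and column names come from the schema above, but every column type, default, and constraint is an assumption.

```python
import sqlite3

# Column names from the schema above; types and defaults are assumed.
SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    raw_text TEXT,
    processed_text TEXT,
    context_json TEXT,
    quality_score REAL,
    status TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE questions (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    question_data_json TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE document_images (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    image_path TEXT,
    image_type TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three tables and turn on foreign-key enforcement."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
```

Both child tables reference `documents.id`, so a question or image row cannot exist without its parent document.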

STATE FLOW

DocumentState (Extraction)

pdf_bytes

raw_text

document_structure

subjects

has_formulas

formula_conversions

normalized_text

final_text

questions[]

quality_report

overall_confidence
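
Graph state of this kind is commonly declared as a `TypedDict` whose fields are filled in as agents run. The sketch below lists the fields named above. The value types and the `merge` helper are assumptions, and `merge` only imitates how a partial update from one agent would be folded into the shared state.

```python
from typing import Any, TypedDict


class DocumentState(TypedDict, total=False):
    """Extraction state; field names from the list above, types assumed."""
    pdf_bytes: bytes
    raw_text: str
    document_structure: dict[str, Any]
    subjects: list[str]
    has_formulas: bool
    formula_conversions: list[dict[str, str]]
    normalized_text: str
    final_text: str
    questions: list[dict[str, Any]]
    quality_report: dict[str, Any]
    overall_confidence: float


def merge(state: DocumentState, update: DocumentState) -> DocumentState:
    """Return a new state with one agent's partial update applied."""
    return {**state, **update}
```

Declaring the class with `total=False` reflects that early agents see a state in which later fields, such as `questions`, are not yet present.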

QuestionGenerationState

pdf_bytes

raw_text

primary_subject

extracted_concepts

key_topics

problem_solving_patterns

common_student_errors

question_plan

target_complexity_distribution

generated_questions[]

quality_scores

validation_passed

PIPELINE COMPARISON

Aspect	Extraction Pipeline	Generation Pipeline
Purpose	Extract existing questions from worksheets	Generate new questions from reference content
Entry Point	POST /api/v1/mistral/extract	POST /api/v1/gen/generate-questions
Orchestrator	AgentOrchestrator	GenerationOrchestrator
Agents	8 sequential agents	9 agents + 4 subject modules
LLM Usage	Mistral (OCR) + GPT-4o (extraction)	Mistral (OCR) + Portkey/Claude (gen) + SymPy
Routing	Formula cleanup skipped when has_formulas is false	Quality-gate retry loop (up to 3 retries, pass at score ≥ 75)
Output	Structured questions from document text	Novel questions with verified answers & distractors


Built with Python, LangGraph, FastAPI, and PostgreSQL