Prep
Content · May 20, 2026

How we ship 5,000 MCQs a week without losing quality

The AI + SME + reviewer pipeline, the lint rules that catch 90% of bad questions, and the one rule that catches the last 10%.

M. Joseph
Head of Content
12 min read

You can't hand-write 100,000 MCQs to seed a global exam-prep platform. We tried. It scales to about 300 questions per writer per week — at $5/question that's $50k for a complete NEET bank, with a 6-month timeline. We needed faster + cheaper, and quality couldn't drop.

Here's the pipeline we settled on. It produces ~5,000 MCQs per week at a defect rate (questions later flagged + revised) of about 4%. Lint auto-rejects roughly 70% of drafts, so about two-thirds of what would otherwise be a reviewer's week vanishes before they look at anything.

1 · The funnel

Every MCQ flows through five stages:

· <b>Generate</b>: an LLM prompt carrying the syllabus topic, difficulty, and source-PYQ pattern. Each prompt outputs 100 candidates.
· <b>Lint</b>: automated rules reject ~70% on the spot. The checks cover distractor length parity, banned patterns, duplicate-stem similarity against the existing bank, answer-position balance, and source-attribution presence.
· <b>SME review</b>: a human subject-matter expert reviews the remaining ~30 candidates. About 80% of those are accepted (with minor edits) and 20% are rejected.
· <b>Cohort calibration</b>: SME-accepted MCQs run through a 100-attempt anonymous calibration period, during which difficulty + discrimination are computed. If accuracy is < 5% or > 95%, the MCQ is quarantined for re-review.
· <b>Public</b>: only after cohort calibration does the MCQ become part of the default question bank.
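The funnel numbers above can be turned into a quick yield estimate. This is a minimal sketch, not our production code: the function names are hypothetical, and it assumes the ~70% lint rejection and ~80% SME acceptance rates hold for every prompt. The calibration check treats the 5% and 95% thresholds as strict inequalities, as stated in the list.

```python
import math

CANDIDATES_PER_PROMPT = 100  # from the funnel description
LINT_PASS_PCT = 30           # lint rejects ~70%
SME_ACCEPT_PCT = 80          # SMEs accept ~80% of what reaches them
WEEKLY_TARGET = 5000         # MCQs per week


def accepted_per_prompt(candidates=CANDIDATES_PER_PROMPT,
                        lint_pass_pct=LINT_PASS_PCT,
                        sme_accept_pct=SME_ACCEPT_PCT):
    """Expected MCQs surviving lint and SME review for one prompt."""
    after_lint = candidates * lint_pass_pct / 100
    return after_lint * sme_accept_pct / 100


def prompts_needed(weekly_target=WEEKLY_TARGET):
    """Prompts required per week to hit the target, rounded up."""
    return math.ceil(weekly_target / accepted_per_prompt())


def calibration_verdict(correct, attempts, min_attempts=100,
                        low=0.05, high=0.95):
    """Hypothetical gate for the cohort-calibration stage."""
    if attempts < min_attempts:
        return "calibrating"   # not enough data yet
    accuracy = correct / attempts
    if accuracy < low or accuracy > high:
        return "quarantine"    # too hard or too easy: re-review
    return "public"
```

Under these assumptions each prompt yields about 24 accepted MCQs, so a 5,000-question week needs roughly 209 prompts before calibration losses.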

2 · The 50 lint rules

The lint stage is the unsexy magic. We have 50+ deterministic rules that reject obviously-bad MCQs without involving a human. The high-ROI ones:

· <b>Distractor length parity</b>: if option B is 3× longer than the others, students pick it. This is a real effect: on badly written MCQs, "longest = correct" works about 60% of the time. We auto-reject if max length > 1.8× median.
· <b>Answer-position balance</b>: over a generated batch of 100, the correct answer should land on each position roughly 25% of the time. If 50% are "B", the LLM is biased; reject the batch and reprompt.
· <b>Banned-word patterns</b>: "always", "never", "all of the above", "none of the above". Options containing these are almost always smelly. Auto-flag, then a human reviews.
· <b>Duplicate-stem detection</b>: cosine similarity (MiniLM embeddings) against the existing bank. > 0.92 = reject, 0.85–0.92 = flag.
· <b>Negative phrasing without bold</b>: "Which of the following is NOT…" with "NOT" in regular weight. Auto-fix: bold the negation, re-render.
· <b>Source attribution</b>: every PYQ-derived question must cite (exam, year, paper). No citation = reject.
· <b>Answer must be in option set</b>: sounds obvious, but the LLM occasionally writes "E" or returns an answer that matches no option letter.
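Several of these rules fit in a few lines each. The sketch below is illustrative only: the function names and the `options` dict shape (letter to option text) are assumptions, the 50% cutoff in `position_balance_ok` is inferred from the "if 50% are B" example, and `duplicate_verdict` takes a precomputed cosine similarity rather than calling an embedding model.

```python
import re
from collections import Counter
from statistics import median

BANNED = re.compile(
    r"\b(always|never|all of the above|none of the above)\b",
    re.IGNORECASE,
)


def length_parity_ok(options, max_ratio=1.8):
    """Pass if the longest option is at most 1.8x the median length."""
    lengths = [len(text) for text in options.values()]
    return max(lengths) <= max_ratio * median(lengths)


def position_balance_ok(answer_keys, max_share=0.5):
    """Pass if no single position holds max_share or more of a batch."""
    counts = Counter(answer_keys)
    return max(counts.values()) / len(answer_keys) < max_share


def banned_pattern_hits(options):
    """Return option letters to flag for human review."""
    return [key for key, text in options.items() if BANNED.search(text)]


def duplicate_verdict(similarity):
    """Map a stem's best cosine similarity against the bank to a verdict."""
    if similarity > 0.92:
        return "reject"
    if similarity >= 0.85:
        return "flag"
    return "pass"


def answer_in_options(answer_key, options):
    """Catch answers like "E" on a four-option question."""
    return answer_key in options
```

For example, with options `{"A": "Mitochondria", "B": "Ribosome", "C": "Golgi body", "D": "The organelle that always produces ATP through oxidative phosphorylation"}`, the length-parity check fails and option D is flagged for the word "always".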

3 · The SME review tool

SME reviewers see one MCQ at a time in a focused UI. Five buttons: <b>Accept</b>, <b>Accept with edits</b>, <b>Request rewrite</b>, <b>Reject — duplicate</b>, <b>Reject — wrong</b>. Time-per-MCQ averages 92 seconds.

Reviewers are paid per accepted MCQ (₹50/question for India SMEs, $0.75 for global). At 30 accepted/hour that's ₹1,500/hour — competitive with private tutoring rates, and the work is async + remote.
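The hourly figure can be sanity-checked from the numbers already quoted. This is a rough sketch that assumes a reviewer spends the full hour reviewing at the 92-second average and that the ~80% SME acceptance rate from the funnel applies; the function name is hypothetical.

```python
SECONDS_PER_MCQ = 92        # average review time per MCQ
ACCEPT_RATE = 0.80          # share of reviewed MCQs accepted (funnel figure)
PAY_PER_ACCEPTED_INR = 50   # India SME rate per accepted MCQ


def hourly_earnings(seconds_per_mcq=SECONDS_PER_MCQ,
                    accept_rate=ACCEPT_RATE,
                    pay_per_accepted=PAY_PER_ACCEPTED_INR):
    """Return (reviewed/hour, accepted/hour, INR/hour) for one reviewer."""
    reviewed = 3600 / seconds_per_mcq
    accepted = reviewed * accept_rate
    return reviewed, accepted, accepted * pay_per_accepted


reviewed, accepted, pay = hourly_earnings()
print(f"{reviewed:.1f} reviewed/h, {accepted:.1f} accepted/h, ₹{pay:.0f}/h")
```

That works out to roughly 39 reviewed and 31 accepted MCQs per hour, in line with the 30 accepted/hour and ₹1,500/hour quoted above.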

4 · The one rule that catches the last 10%

After the lint + SME review, we still ship MCQs that have subtly-wrong answers. Not stupidly wrong — subtly. The kind where the textbook says A but the more rigorous source says B, and the SME didn't catch it.

The rule that found these: <b>cohort accuracy + report rate</b>. After 100 attempts, if accuracy is very high (>95%) but more than 5% of attempters tap "report this question", the MCQ is auto-quarantined. The pattern: students get it "right" by elimination but suspect the question is wrong.
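Expressed as code, the rule is a two-condition check. The sketch below is an illustration under assumptions: the `McqStats` fields and function name are hypothetical, and the report rate is taken as reports divided by attempts, which treats each attempt as a distinct attempter.

```python
from dataclasses import dataclass


@dataclass
class McqStats:
    attempts: int  # total calibration attempts
    correct: int   # attempts answered correctly
    reports: int   # "report this question" taps


def should_quarantine(stats, min_attempts=100,
                      accuracy_floor=0.95, report_ceiling=0.05):
    """High accuracy plus a high report rate suggests a subtly wrong key."""
    if stats.attempts < min_attempts:
        return False  # wait for the full calibration window
    accuracy = stats.correct / stats.attempts
    report_rate = stats.reports / stats.attempts
    return accuracy > accuracy_floor and report_rate > report_ceiling
```

An MCQ with 97 correct answers and 8 reports out of 100 attempts is quarantined; the same MCQ with only 5 reports is not, because the threshold is "more than 5%".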

This rule alone caught 28% of the post-lint defects in the first 6 months.

5 · The cost structure

Per accepted MCQ (averaged over a 5,000-question week): LLM API ≈ ₹4, SME review ≈ ₹50, ops overhead ≈ ₹6. Total ≈ ₹60/MCQ.

That compares to ₹420/MCQ for a full hand-write, roughly 7× the cost. The economics finally work, quality matches, and output is about 16× faster. The trade: we now run an SME reviewer queue and a lint engineering team. Both are easy to scale; the original "hire 50 writers and wait" plan was not.
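The arithmetic behind these figures is short enough to write down. This sketch uses only numbers quoted in the post; reading the 16× figure as weekly pipeline output divided by one writer's weekly output (5,000 vs. about 300) is an assumption, and the variable names are hypothetical.

```python
# Per-accepted-MCQ costs in INR, as quoted above.
COSTS_INR = {"llm_api": 4, "sme_review": 50, "ops_overhead": 6}
HAND_WRITTEN_INR = 420

PIPELINE_PER_WEEK = 5000   # MCQs per week from the pipeline
WRITER_PER_WEEK = 300      # MCQs per week from one hand-writer

pipeline_cost = sum(COSTS_INR.values())            # total INR per MCQ
cost_ratio = HAND_WRITTEN_INR / pipeline_cost      # how much cheaper
throughput_ratio = PIPELINE_PER_WEEK / WRITER_PER_WEEK

print(f"₹{pipeline_cost}/MCQ, {cost_ratio:.0f}× cheaper, "
      f"{throughput_ratio:.1f}× one writer's weekly output")
```

SME review accounts for ₹50 of the ₹60, so the reviewer rate, not the LLM bill, dominates the cost per question.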

6 · What we explicitly DON'T do

· We don't auto-publish LLM output. Every MCQ gets a human eye before it reaches a student.
· We don't generate explanations + questions in the same prompt. The explanation is a separate LLM call after the question is human-accepted, and the explanation is reviewed by a different SME.
· We don't use LLMs to write questions in subjects where the source material is contested (history, current affairs interpretation, controversial economics). Those are still hand-written.

7 · The dashboard

Every week, /admin/content/quality shows the funnel: how many MCQs were generated, lint-rejected, SME-rejected, accepted, calibrated, and made public. It also shows the post-publish flag rate, the median time-to-fix on flagged MCQs, and the SME accept rate by reviewer. The dashboard is read-only to ops, editable by Head of Content. It is the single most-looked-at internal page after the on-call console.
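The dashboard metrics reduce to a few aggregations. The sketch below is hypothetical throughout: the record shapes, stage names, and function names are assumptions made for illustration, not the schema behind /admin/content/quality.

```python
from collections import defaultdict
from statistics import median


def funnel_conversion(counts):
    """Share of items surviving each stage, relative to the previous one.

    counts: dict of stage name -> surviving count, in funnel order.
    """
    stages = list(counts.items())
    return {
        name: count / prev_count
        for (_, prev_count), (name, count) in zip(stages, stages[1:])
    }


def accept_rate_by_reviewer(decisions):
    """decisions: iterable of (reviewer_id, accepted) pairs."""
    totals = defaultdict(lambda: [0, 0])  # reviewer -> [accepted, seen]
    for reviewer, accepted in decisions:
        totals[reviewer][1] += 1
        if accepted:
            totals[reviewer][0] += 1
    return {r: acc / seen for r, (acc, seen) in totals.items()}


def median_hours_to_fix(fix_hours):
    """Median time-to-fix across flagged MCQs, in hours."""
    return median(fix_hours)
```

Feeding `funnel_conversion` the counts `{"generated": 1000, "lint_passed": 300, "sme_accepted": 240}` gives a 30% lint pass rate and an 80% SME accept rate, matching the funnel in section 1.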

"AI didn't replace our writers. It moved them from typing to judging."