The Stanford finding

The study, from Stanford researchers and covered by Stanford HAI, is the cleanest demonstration of detector bias on record. In Stanford HAI's words: while the detectors were "near-perfect" in evaluating essays written by US-born eighth-graders, they classified more than half of TOEFL essays (61.22%) written by non-native English students as AI-generated. Every one of those TOEFL essays was written by a human.

A coin flip would have been fairer. For a multilingual student, submitting an essay through an AI detector is not a neutral integrity check. It is a test rigged against the way they write, before they type a word.

Why detectors penalize learner English

Most AI detectors lean heavily on perplexity: how statistically surprising each next word is to a language model. Fluent native writers produce idiosyncratic, low-probability phrasing without trying. Language learners do the opposite, and rationally so: they reach for the vocabulary they're sure of, the sentence shapes they were taught, the safe transition words from the rubric. The result is exactly the smooth, predictable prose the model associates with machines.
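To make the mechanism concrete, here is a minimal sketch of perplexity as the exponential of the average negative log-probability per token. The per-token probabilities and the cutoff of 5.0 are invented for illustration; no real detector's model, scores or threshold are used here. It shows how a run of safe, high-probability word choices scores as "more predictable" than idiosyncratic phrasing, regardless of who wrote it.

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp(average negative log-probability per token)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Hypothetical probabilities a language model might assign to each next word.
learner_essay = [0.60, 0.55, 0.70, 0.50, 0.65]  # safe, high-frequency choices
native_essay = [0.20, 0.05, 0.35, 0.10, 0.25]   # idiosyncratic phrasing

# Hypothetical cutoff: text below this perplexity is treated as "AI-like".
THRESHOLD = 5.0

for name, probs in [("learner", learner_essay), ("native", native_essay)]:
    ppl = perplexity(probs)
    verdict = "flagged as AI" if ppl < THRESHOLD else "passes as human"
    print(f"{name}: perplexity={ppl:.2f} -> {verdict}")
```

Both samples are human by construction; only the predictability of the word choices differs, and that alone decides which one gets flagged.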

In other words, the detector punishes precisely the strategies a good ESL curriculum teaches. Write clearly, use the structures you know, don't take risks you can't control: that is both excellent advice for an English learner and a recipe for getting flagged.

It's not only language learners

The same statistical logic catches any student whose writing is more regular than the model's idea of "human." Bloomberg Businessweek documented the case of Moira Olmsted, an autistic student at Central Methodist University whose assignment was flagged by a detector. She explained that her writing style, shaped by her neurodivergence, reads as formulaic. She received a zero and a warning that the next flag would get no second review. In the first court case to overturn an AI accusation, Newby v. Adelphi, the wrongly accused student also had documented learning and neurological disorders.

Students with autism, ADHD, or dyslexia are often explicitly taught to write to pattern. Students drilled on five-paragraph structure are taught the same. The flag falls heaviest on the students who followed instructions best.

The equity problem in one sentence

An AI detector doesn't measure whether a student cheated. It measures how far their English sits from the model's statistical center, and your multilingual and neurodivergent students live furthest from that center through no fault of their own.

Process verification doesn't have this problem

The bias lives in the method: any system that judges the finished prose inherits the prose's demographics. The way out is to stop judging prose. Process verification watches what the student did, not how the student sounds: keystroke rhythm, pauses, revisions, pastes, sessions. A Korean-born sophomore hand-typing her essay produces the same human-shaped typing record as anyone else, because the record captures the act of writing, not the accent of the result.

There is no perplexity anywhere in that pipeline. Nothing about vocabulary, sentence structure, or fluency enters the evidence at all. That isn't a bias reduction; it is a change of subject, from "does this text sound human to a model?" to "did a human visibly write this text?"
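As an illustration of what process evidence can look like, here is a minimal sketch that summarizes a writing session from behaviour alone. The `Event` type, its `kind` values ("key", "delete", "paste") and the 2-second pause cutoff are hypothetical assumptions, not Manupropria's actual data format or scoring. What the sketch does show is structural: the inputs are timestamps and character counts, so vocabulary, sentence structure and fluency have no way to reach the summary.

```python
from dataclasses import dataclass

@dataclass
class Event:
    t_ms: int       # milliseconds since session start (events assumed sorted)
    kind: str       # hypothetical kinds: "key", "delete", "paste"
    chars: int = 1  # characters inserted or removed by this event

def summarize(events: list[Event], pause_ms: int = 2000) -> dict:
    """Summarize a session from keystroke behaviour only; no text is read."""
    typed = sum(e.chars for e in events if e.kind == "key")
    pasted = sum(e.chars for e in events if e.kind == "paste")
    deletions = sum(1 for e in events if e.kind == "delete")
    # Gaps between consecutive events; long gaps suggest pauses to think.
    gaps = [b.t_ms - a.t_ms for a, b in zip(events, events[1:])]
    long_pauses = sum(1 for g in gaps if g >= pause_ms)
    inserted = typed + pasted
    return {
        "typed_chars": typed,
        "pasted_chars": pasted,
        "paste_share": pasted / inserted if inserted else 0.0,
        "deletions": deletions,
        "long_pauses": long_pauses,
    }

# A short hand-typed session: steady typing, one pause, one revision.
session = [
    Event(0, "key"), Event(180, "key"), Event(390, "key"),
    Event(3400, "key"),     # a pause to think
    Event(3550, "delete"),  # a revision
    Event(3700, "key"),
]
print(summarize(session))
```

A learner typing simple sentences and a native speaker typing ornate ones produce the same kind of record here; a session dominated by one large paste looks different no matter how the pasted prose reads.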

If you teach multilingual students

The practical advice is short. First, treat any detector score on an English language learner's work as close to meaningless; Stanford's data puts the false-positive rate on exactly that population above 60%, worse than a coin flip. Second, if your school requires some integrity process, push for one based on writing-process evidence, which protects your students instead of profiling them. North Carolina's Department of Public Instruction already advises that detectors "have proven not to be dependable" and should never be the only factor.

Manupropria was built for that second path: your students write in a canvas that records the process, and every one of them, in whatever English they have today, walks away with the same affirmative proof of authorship.