Why AI-Text Detectors Do Not Work and Why Their Results Cannot Be Trusted
With the expansion of artificial intelligence (AI) and the increasing use of AI-generated text, teachers, publishers, and other professionals have begun to face the problem of determining whether a text is AI-generated, human-written, or the result of a combination of both.
AI-text detectors are tools that claim to distinguish between human-written and machine-generated text. But do they actually do what they claim?
Top academic institutions, employers, and publishers have begun adopting them as gatekeeping instruments. The research evidence, however, shows that these tools are unreliable in ways that make their use actively harmful.
The core reason that text gets flagged as AI-generated is technical.
Strip away the marketing language and what these detectors actually measure is adherence to standard written English. Text that follows grammatical rules and formal organizational conventions scores high for AI probability. Text with colloquialisms, fragments, and informal phrasing scores low.
The tools are not detecting AI. They are detecting writing quality and penalizing students for demonstrating exactly the skills they were taught. In other words, if you do not want to be flagged, you should write worse, with plenty of colloquialisms and slang. Wow!
This flaw can be demonstrated by testing the tools against material that predates generative AI entirely. As I discuss in my book (Ahmed, 2026, pp. 69-72), three short stories written in 1995 on a word processor, with no AI assistance because no such assistance existed for creative writing at the time, were fed through a leading detector. All three were flagged at over 80 percent AI probability. A system that cannot distinguish between human writing from 1995 and AI writing from 2026 is not measuring what it claims to measure.
I use AI to check my English grammar. I accept most of its suggestions for better sentence structure, and then the text is flagged as AI-generated: sometimes as hybrid, sometimes even as 100% AI.
Why? Because I used AI for a grammar check. Haven’t we had that for decades in other forms, from the grammar checkers built into word processors to tools such as Grammarly? And what about editors? Don’t they correct your text when needed?
Other researchers have replicated this pattern. Classic essays, published novels from the 1960s, and historical documents written long before computers existed are routinely flagged by commercial detectors.
The reason is consistent: well-written, formally structured text produces the same statistical patterns whether it was written by a person thirty years ago or generated by a model today, because the models were trained on human writing and learned its patterns.
The false positive problem is particularly severe for non-native speakers of English. Like me. Liang et al. (2023) evaluated seven widely used GPT-detection systems using two datasets: essays written by native English-speaking US eighth-grade students and essays written by non-native speakers for the TOEFL exam. The detectors correctly classified most of the native-speaker essays, with a mean false positive rate of 5.1 percent. For the TOEFL essays, the mean false positive rate was 61.3 percent, and all seven detectors unanimously flagged 19.8 percent of those human-written essays as AI-generated. The study concluded that non-native speakers’ writing tends toward lower lexical richness and syntactic diversity, which the detectors read as machine output. These tools embed and amplify a structural bias against multilingual writers.
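To make the size of that gap concrete, the two mean false positive rates reported by Liang et al. (2023) can be compared directly. The sketch below is only arithmetic on those two published figures. The function names are my own, and the group size of 100 essays is a hypothetical round number chosen for illustration, not the size of either dataset.

```python
# Mean false positive rates quoted above from Liang et al. (2023).
NATIVE_FPR = 0.051  # US eighth-grade essays
TOEFL_FPR = 0.613   # TOEFL essays by non-native speakers


def expected_misflagged(n_human_essays: int, false_positive_rate: float) -> float:
    """Average number of human-written essays wrongly flagged as AI."""
    return n_human_essays * false_positive_rate


def disparity_ratio(rate_a: float, rate_b: float) -> float:
    """How many times larger rate_a is than rate_b."""
    return rate_a / rate_b


# Hypothetical group size, used only to put the rates on a common scale.
GROUP_SIZE = 100

print("Native speakers wrongly flagged per 100 essays:",
      round(expected_misflagged(GROUP_SIZE, NATIVE_FPR), 1))
print("TOEFL writers wrongly flagged per 100 essays:",
      round(expected_misflagged(GROUP_SIZE, TOEFL_FPR), 1))
print("Disparity ratio:", round(disparity_ratio(TOEFL_FPR, NATIVE_FPR), 1))
```

On these figures, a human-written TOEFL essay is about twelve times as likely to be wrongly flagged as a human-written native-speaker essay.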
OpenAI itself released an AI classifier in January 2023 and withdrew it on July 20, 2023, citing its low rate of accuracy. The company’s own documentation stated that the classifier correctly identified AI-written text only 26 percent of the time while incorrectly labeling human-written text as AI-generated 9 percent of the time (OpenAI, 2023). Applied to a classroom of 30 students who all wrote their own work, a 9 percent false positive rate means that roughly three of them (30 × 0.09 = 2.7 on average) face wrongful accusation on any given assignment.
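The classroom arithmetic can be checked with a short sketch. It assumes that all 30 students wrote their own work and, as a simplification of my own, that each essay is wrongly flagged independently with probability 0.09. The function names are illustrative and are not taken from OpenAI's documentation.

```python
def expected_false_flags(n_students: int, false_positive_rate: float) -> float:
    """Average number of honest students wrongly flagged on one assignment."""
    return n_students * false_positive_rate


def prob_at_least_one_false_flag(n_students: int, false_positive_rate: float) -> float:
    """Chance that at least one honest student is wrongly flagged.

    Assumes each essay is flagged independently of the others,
    which is a simplifying assumption, not a measured property.
    """
    prob_nobody_flagged = (1.0 - false_positive_rate) ** n_students
    return 1.0 - prob_nobody_flagged


CLASS_SIZE = 30
FALSE_POSITIVE_RATE = 0.09  # figure reported by OpenAI (2023)

print("Expected wrongful flags:",
      round(expected_false_flags(CLASS_SIZE, FALSE_POSITIVE_RATE), 2))
print("Probability of at least one wrongful flag:",
      round(prob_at_least_one_false_flag(CLASS_SIZE, FALSE_POSITIVE_RATE), 3))
```

Under the independence assumption, the chance that at least one honest student in the class is flagged on a given assignment comes out at about 94 percent, so a wrongful accusation is the normal case rather than a rare accident.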
The perverse incentive that follows from this is real and documented. Students learn quickly that formal, well-structured writing is more likely to be flagged. The rational response is to degrade the work: include sentence fragments, add casual phrasing, introduce minor errors. Institutions are inadvertently teaching students that excellence is dangerous and that the goal is plausible mediocrity rather than genuine quality. Is this progress or regression? That’s what we have to think about.
The business model behind the detection industry compounds the problem. Several companies sell both detection services and “humanization” services that rewrite AI-generated text to pass detection. That is the main point: to sell you fear and take your money. A company that profits from flagging text has a financial incentive to flag aggressively, producing more worried users and more demand for the paid solution. The humanization process itself uses AI to fool the AI detector, which shows that the underlying signal the detectors rely on is neither stable nor meaningful.
There is also a fundamental epistemological problem. Running the same text through multiple detection tools produces wildly divergent results. One tool may return 85 percent AI probability while another returns 23 percent on the same passage. These are not measurement errors around a true value. There is no ground truth to converge on, because the tools are applying different statistical models to a problem that has no clean solution.
Better approaches to academic integrity exist and do not require these tools. Teachers who know their students can identify suspicious work through patterns no algorithm captures: a sudden unexplained jump in sophistication, inconsistency with the student’s voice in class discussion, or a polished final product with no visible process of development. Oral assessment, where students present and defend their work, exposes misuse immediately. Process documentation, including drafts and outlines, provides evidence of genuine engagement that cannot be faked by submitting a single AI-generated output.
Accepting detector output as evidence in disciplinary proceedings means accepting a tool with a documented and significant error rate, a demonstrated bias against non-native speakers, a conflicted commercial incentive to over-flag, and no methodology that has been independently validated against real-world academic writing.
In legal and scientific reasoning, a measurement instrument must demonstrate validity before its outputs are used to make consequential decisions about people. AI-text detectors have not met this standard. Their continued use in contexts that carry real stakes for real individuals is not justified by the available evidence.
References
Ahmed, O. (2026). If humans can learn from books, why can’t AI? Reconsidering the training data debate. Intelligentia Nova Press.
Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). GPT detectors are biased against non-native English writers. Patterns, 4(7), Article 100779. https://doi.org/10.1016/j.patter.2023.100779
OpenAI. (2023). New AI classifier for indicating AI-written text. https://openai.com/index/new-ai-classifier-for-indicating-ai-written-text/
How to cite this post:
Ahmed, O. (2026). Why AI-text detectors do not work and why their results cannot be trusted. Notes on Language and Artificial Intelligence, I(2). https://turkoloji.net/notes/i2-why-ai-text-detectors-do-not-work-and-why-their-results-cannot-be-trusted/