I(3) AI Systems and the Limits of Bilingual Sentence Interpretation: A Linguistic Perspective

Artificial intelligence systems that process natural language now achieve remarkable performance in monolingual tasks. Their capacity to interpret bilingual or mixed-language sentences, however, remains structurally constrained. This is especially true of the Macedonian Turkish dialects, which fall within the Balkan Sprachbund. The limitation is not primarily a matter of vocabulary size or training data volume. It concerns how meaning is represented in statistical language models, and small, underrepresented languages are affected even more.

A useful set of examples, collected by the author from the Turkish dialect of Skopje, comes from mixed varieties of Turkish influenced by Macedonian, particularly in urban bilingual environments:

[1] “Bügün gittım snimanyeye Televiziyada.”

In its intended interpretation, this sentence refers to attending a television recording session today. The lexical item “snimanye” corresponds to the Macedonian noun “снимање,” meaning “recording,” while “televiziyada” refers to a television broadcasting institution or studio context and carries the locative case suffix. In the Turkish dialects of North Macedonia, locative marking is often used in dative function (Ахмед, 2004, p. 57). A human bilingual speaker reconstructs the meaning by integrating Turkish morphosyntax with Macedonian lexical insertions and pragmatic knowledge of media production environments.

Additional examples from Skopje Turkish dialect usage further illustrate this phenomenon:

[2] “Klikala te buni.”

Intended meaning: “Click exactly this one.” This reflects a hybrid structure where a Macedonian derived verb form is integrated into a Turkish communicative imperative context.

[3] “Maksıma ver tsırtalasın, em bitti dovasi.”

Intended meaning: “Give it to a child to draw, and that’s it.” Here, “tsırtalamak” derives from the Macedonian verb “црта” (to draw), integrated into a Turkish verbal framework, while the clause structure reflects the conversational compression typical of bilingual speech.

The verbs “klikalamak” and “tsırtalamak” in examples [2] and [3] are copied verbs (Ahmed, 2016). Currently available large language models do not translate sentences of this type consistently or accurately.

These examples demonstrate that bilingual speech in contact zones is not random lexical mixing but a structured communicative system shaped by long-term language contact, cognitive economy, pragmatic inference, and a sense of belonging to a community.

Current AI systems typically process such input through token segmentation and probabilistic association. While they may approximate meaning under favorable conditions, they do not reliably reconstruct the intended cross linguistic mapping when orthography is inconsistent or when lexical boundaries shift across languages within a single clause. As Grosjean (1982) argues in his foundational work on bilingualism, bilingual speech is not a mixture of two monolingual systems but a fully integrated communicative mode. This integration poses a structural challenge for models trained primarily on monolingual distributions or artificially segmented multilingual corpora.
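
The effect of shifting lexical boundaries on token segmentation can be illustrated with a minimal sketch. It assumes a greedy longest-match segmenter and a tiny, hypothetical vocabulary of standard Turkish pieces. Neither corresponds to the tokenizer of any real model, and the name `greedy_segment` is invented for illustration. The point is only that a form such as “snimanyeye” from example [1] breaks into many small pieces when the vocabulary was built from monolingual material.

```python
def greedy_segment(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right.

    Characters not covered by any vocabulary piece are emitted as
    single-character tokens (a simple fallback, assumed for this sketch).
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece matches here: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical toy vocabulary containing standard Turkish pieces only.
TOY_VOCAB = {"git", "tim", "bugün", "da", "ye", "ma"}

# Standard Turkish "gittim" ("I went") splits cleanly into stem + suffix.
standard = greedy_segment("gittim", TOY_VOCAB)
print(standard)  # ['git', 'tim']

# The hybrid form from example [1] fragments into mostly single characters.
hybrid = greedy_segment("snimanyeye", TOY_VOCAB)
print(hybrid)  # ['s', 'n', 'i', 'ma', 'n', 'ye', 'ye']
```

In this toy setting the Macedonian stem is scattered across pieces that carry no meaning of their own, and one piece (“ma”) happens to coincide with an unrelated Turkish element. Real subword vocabularies are far larger, so the degree of fragmentation in an actual model may differ.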

From a computational perspective, modern large language models are built on the transformer architecture, which processes token sequences through learned statistical regularities via self attention mechanisms (Vaswani et al., 2017). Although multilingual pretraining improves robustness, it does not guarantee stable interpretation of hybrid forms where phonetic spelling, code switching, and pragmatic compression occur simultaneously. Zhang et al. (2023) found that multilingual large language models are not yet competent code switchers.

Muysken (2000) similarly demonstrates that code switching is rule governed and context sensitive, not random mixture.

This implies that correct interpretation requires not only lexical alignment but also discourse level inference, which remains an area of partial competence for current AI systems. Newer evaluation work supports this directly. Mohamed et al. (2025) note that existing benchmarks concentrate on surface level tasks such as language identification, sentiment, and part of speech tagging, leaving deeper semantic and reasoning capacities largely untested. Sheth et al. (2025) catalog ongoing evaluation efforts across more than 300 studies and identify open problems that remain unresolved despite continued progress in multilingual pretraining. These findings, drawn from evaluations published in 2025, reinforce the original argument empirically. The structural limitation is not primarily about scale or vocabulary coverage. It concerns how statistical models represent meaning when two linguistic systems are genuinely fused rather than alternated, a distinction that matters directly for dialect contact zones such as the Balkan Sprachbund.

In all the examples above, a human bilingual speaker reconstructs meaning through a combination of phonological approximation, shared cultural context, and pragmatic inference. Such a speaker associates “snimanye” with recording contexts and “tsırtalamak” with drawing activity, and interprets mixed imperatives such as “Klikala te buni” within a situational frame of digital interaction.

The limitation observed here is not the absence of bilingual data in training, but the absence of grounded pragmatic interpretation that dynamically integrates cross linguistic cues in real time. As a result, AI systems may produce plausible monolingual interpretations while missing the intended bilingual communicative act.

This distinction is critical for applications in low resource language contexts, dialectal variation, and informal digital communication, where hybrid linguistic forms are common. It also suggests that future improvements in AI language understanding will require deeper integration of discourse modeling, pragmatic inference, and structured representations of bilingual speech behavior.


References

Ахмед, О. (2004). Морфосинтакса на турските говори од Охридско Преспанскиот регион. Необјавена докторска дисертација. Филолошки факултет „Блаже Конески“, Универзитет „Св. Кирил и Методиј“, Скопје.

Ahmed, O. (2016). Copied verbs in Turkish dialects of Macedonia. In E. Á. Csató, B. Karakoç, & A. Menz (Eds.), The Uppsala Meeting: Proceedings of the 16th International Conference on Turkish Linguistics (pp. 9–18). Harrassowitz Verlag.

Grosjean, F. (1982). Life with two languages: An introduction to bilingualism. Harvard University Press.

Mohamed, A., Zhang, Y., Vazirgiannis, M., & Shang, G. (2025). Lost in the mix: Evaluating LLM understanding of code-switched text. arXiv. https://arxiv.org/abs/2506.14012

Muysken, P. (2000). Bilingual speech: A typology of code-mixing. Cambridge University Press.

Sheth, R., Sinha, S. R., Patil, M., Beniwal, H., & Singh, M. (2025). Beyond monolingual assumptions: A survey of code-switched NLP in the era of large language models across modalities. arXiv. https://arxiv.org/abs/2510.07037

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30 (pp. 5998–6008).

Zhang, R., Cahyawijaya, S., Cruz, J. C. B., Winata, G., & Aji, A. F. (2023). Multilingual large language models are not (yet) code-switchers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 12567–12582). Association for Computational Linguistics.


How to cite this post:

Ahmed, O. (2026). AI systems and the limits of bilingual sentence interpretation: A linguistic perspective. Notes on Language and Artificial Intelligence, I(3). https://turkoloji.net/notes/i3-ai-systems-and-the-limits-of-bilingual-sentence-interpretation-a-linguistic-perspective/


I(2) Why AI-Text Detectors Do Not Work and Why Their Results Cannot Be Trusted

With the expansion of artificial intelligence (AI) and the increasing use of AI-generated text, teachers, publishers, and other professionals have begun to face the problem of determining whether a text is AI-generated, human-written, or the result of a combination of both.

AI-text detectors are tools that claim to distinguish between human-written and machine-generated text. But are they really what they claim to be?

Top academic institutions, employers, and publishers have begun adopting them as gatekeeping instruments. The research evidence, however, shows that these tools are unreliable in ways that make their use actively harmful.

The core problem behind AI-text being flagged is technical.

Strip away the marketing language and what these detectors actually measure is adherence to standard written English. Text that follows grammatical rules and formal organizational conventions scores high for AI probability. Text with colloquialisms, fragments, and informal phrasing scores low.

The tools are not detecting AI. They are detecting writing quality, and penalizing students for demonstrating exactly the skills they were taught. In other words, if you don’t want to be flagged, you should write worse, with plenty of colloquialisms and slang. Wow!

This flaw can be demonstrated by testing the tools against material that predates generative AI entirely. As I discuss in my book (Ahmed, 2026, pp. 69–72), three short stories written in 1995 on a word processor, with no AI assistance because no such assistance existed for creative writing at the time, were fed through a leading detector. All three were flagged at over 80 percent AI probability. A system that cannot distinguish between human writing from 1995 and AI writing from 2026 is not measuring what it claims to measure.

I use AI to check my English grammar. I mostly accept AI’s suggestions for better sentence structures, but then that text is flagged as AI-generated. Sometimes hybrid. Sometimes even 100%.

Why? Because I used AI for a grammar check. Haven’t we had that for decades in other forms, such as Grammarly? What about editors? Don’t they correct your text when needed?

Other researchers have replicated this pattern. Classic essays, published novels from the 1960s, and historical documents written long before computers existed are routinely flagged by commercial detectors.

The reason is consistent: well-written, formally structured text produces the same statistical patterns whether it was written by a person thirty years ago or generated by a model today, because the models were trained on human writing and learned its patterns.

The false positive problem is particularly severe for non-native speakers of English. Like me. Liang et al. (2023) evaluated seven widely used GPT-detection systems using two datasets: essays written by native English-speaking US eighth-grade students and essays written by non-native speakers for the TOEFL exam. The detectors correctly classified most of the native-speaker essays, with a mean false positive rate of 5.1 percent. For the TOEFL essays, the mean false positive rate was 61.3 percent, and all seven detectors unanimously flagged 19.8 percent of those human-written essays as AI-generated. The study concluded that non-native speakers’ writing tends toward lower lexical richness and syntactic diversity, which the detectors read as machine output. These tools embed and amplify a structural bias against multilingual writers.
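
To make “lexical richness” concrete, here is a minimal sketch using the type-token ratio (unique words divided by total words). This is only one simple proxy, chosen here as an assumption; it is not necessarily the measure used by Liang et al. (2023) or by any detector. The function name and both sample sentences are invented for illustration.

```python
import re


def type_token_ratio(text):
    """Return unique words divided by total words (0.0 for empty text)."""
    # Lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented samples, not taken from the TOEFL or US student datasets.
repetitive = "The test is good. The test is fair. The test is good for students."
varied = ("The exam seems reasonable, although several pupils "
          "found its later sections demanding.")

print(type_token_ratio(repetitive))  # 0.5
print(type_token_ratio(varied))      # 1.0
```

Both samples are entirely human-written, yet the first scores much lower. A detector that treats low lexical variety as a sign of machine output would therefore lean toward flagging the first writer.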

OpenAI itself released an AI classifier in January 2023 and withdrew it on July 20, 2023, citing its low rate of accuracy. The company’s own documentation stated that the classifier correctly identified AI-written text only 26 percent of the time while incorrectly labeling human-written text as AI-generated 9 percent of the time (OpenAI, 2023). Applied to a classroom of 30 students who all wrote their own work, a 9 percent false positive rate means that, on average, roughly three of them (30 × 0.09 = 2.7) face a wrongful accusation on any given assignment.
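
The classroom arithmetic can be checked with a short sketch. It rests on two simplifying assumptions that are not in the OpenAI documentation: every student wrote their own work, and the detector’s errors on different essays are independent. The function names are invented for illustration.

```python
def expected_false_flags(n_students, false_positive_rate):
    """Average number of honest students wrongly flagged per assignment."""
    return n_students * false_positive_rate


def prob_at_least_one_false_flag(n_students, false_positive_rate):
    """Chance that at least one honest student is flagged.

    Assumes flags on different essays are independent events.
    """
    return 1.0 - (1.0 - false_positive_rate) ** n_students


expected = expected_false_flags(30, 0.09)
at_least_one = prob_at_least_one_false_flag(30, 0.09)

print(round(expected, 1))      # 2.7
print(round(at_least_one, 2))  # 0.94
```

Under these assumptions, a class of 30 honest writers has about a 94 percent chance of producing at least one wrongful flag on every single assignment.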

The perverse incentive that follows from this is real and documented. Students learn quickly that formal, well-structured writing is more likely to be flagged. The rational response is to degrade the work: include sentence fragments, add casual phrasing, introduce minor errors. Institutions are inadvertently teaching students that excellence is dangerous and that the goal is plausible mediocrity rather than genuine quality. Is this progress or regression? That’s what we have to think about.

The business model behind the detection industry compounds the problem. Several companies sell both detection services and “humanization” services that rewrite AI-generated text to pass detection. That is the main point: to sell you fear and take your money. A company that profits from flagging text has a financial incentive to flag aggressively, producing more worried users and more demand for the paid solution. The humanization process itself uses AI to fool the AI detector, which demonstrates that the underlying signal the detectors rely on is neither stable nor meaningful.

There is also a fundamental epistemological problem. Running the same text through multiple detection tools produces wildly divergent results. One tool may return 85 percent AI probability while another returns 23 percent on the same passage. These are not measurement errors around a true value. There is no ground truth to converge on, because the tools are applying different statistical models to a problem that has no clean solution.

Better approaches to academic integrity exist and do not require these tools. Teachers who know their students can identify suspicious work through patterns no algorithm captures: a sudden unexplained jump in sophistication, inconsistency with the student’s voice in class discussion, or a polished final product with no visible process of development. Oral assessment, where students present and defend their work, exposes misuse immediately. Process documentation, including drafts and outlines, provides evidence of genuine engagement that cannot be faked by submitting a single AI-generated output.

Accepting detector output as evidence in disciplinary proceedings means accepting a tool with a documented and significant error rate, a demonstrated bias against non-native speakers, a conflicted commercial incentive to over-flag, and no methodology that has been independently validated against real-world academic writing.

In legal and scientific reasoning, a measurement instrument must demonstrate validity before its outputs are used to make consequential decisions about people. AI-text detectors have not met this standard. Their continued use in contexts that carry real stakes for real individuals is not justified by the available evidence.


References

Ahmed, O. (2026). If humans can learn from books, why can’t AI? Reconsidering the training data debate. Intelligentia Nova Press.

Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). GPT detectors are biased against non-native English writers. Patterns, 4(7), Article 100779. https://doi.org/10.1016/j.patter.2023.100779

OpenAI. (2023). New AI classifier for indicating AI-written text. https://openai.com/index/new-ai-classifier-for-indicating-ai-written-text/


How to cite this post:

Ahmed, O. (2026). Why AI-text detectors do not work and why their results cannot be trusted. Notes on Language and Artificial Intelligence, I(2). https://turkoloji.net/notes/i2-why-ai-text-detectors-do-not-work-and-why-their-results-cannot-be-trusted/