{"id":116,"date":"2026-06-24T17:22:40","date_gmt":"2026-06-24T17:22:40","guid":{"rendered":"https:\/\/turkoloji.net\/notes\/?p=116"},"modified":"2026-06-24T17:32:28","modified_gmt":"2026-06-24T17:32:28","slug":"i3-ai-systems-and-the-limits-of-bilingual-sentence-interpretation-a-linguistic-perspective","status":"publish","type":"post","link":"https:\/\/turkoloji.net\/notes\/i3-ai-systems-and-the-limits-of-bilingual-sentence-interpretation-a-linguistic-perspective\/","title":{"rendered":"I(3) AI Systems and the Limits of Bilingual Sentence Interpretation: A Linguistic Perspective"},"content":{"rendered":"<p>Artificial intelligence systems that process natural language have now achieved remarkable performance in monolingual tasks. However, this is not the case under all circumstances. Their capacity to interpret bilingual or mixed language sentences remains structurally constrained. This is especially true when it comes to the Macedonian Turkish dialects that fall within the boundaries of the <strong>Balkan Sprachbund<\/strong>. This limitation is not primarily a matter of vocabulary size or training data volume. It concerns how meaning is represented in statistical language models. Small and underrepresented languages are even more affected.<\/p>\n<p>A useful set of examples, collected by the author in the Turkish dialect of Skopje, can be observed in mixed Turkish and Macedonian influenced varieties, particularly in related urban bilingual environments:<\/p>\n<p>[1] <strong>\u201cB\u00fcg\u00fcn gitt\u0131m snimanyeye Televiziyada.\u201d<\/strong><\/p>\n<p>In its intended interpretation, this sentence refers to attending a television recording session today. 
The lexical item <strong>\u201csnimanye\u201d<\/strong> corresponds to the Macedonian noun <strong>\u201c\u0441\u043d\u0438\u043c\u0430\u045a\u0435,\u201d<\/strong> meaning <strong>\u201crecording,\u201d<\/strong> while \u201cteleviziyada\u201d refers to a television broadcasting institution or studio context and carries the locative case suffix. In the Turkish dialects of North Macedonia, the locative is often used in place of the dative (\u0410\u0445\u043c\u0435\u0434, 2004, p. 57). A human bilingual speaker reconstructs the meaning by integrating Turkish morphosyntax with Macedonian lexical insertions and pragmatic knowledge of media production environments.<\/p>\n<p>Additional examples from Skopje Turkish dialect usage further illustrate this phenomenon:<\/p>\n<p>[2] <strong>\u201cKlikala te buni.\u201d<\/strong><\/p>\n<p>Intended meaning: \u201cClick exactly this one.\u201d This reflects a hybrid structure in which a verb form derived from Macedonian is integrated into a Turkish imperative construction.<\/p>\n<p>[3] <strong>\u201cMaks\u0131ma ver ts\u0131rtalas\u0131n, em bitti dovasi.\u201d<\/strong><\/p>\n<p>Intended meaning: \u201cGive it to a child to draw, and that\u2019s it.\u201d Here, <strong>\u201cts\u0131rtalamak\u201d<\/strong> derives from the Macedonian verb <strong>\u201c\u0446\u0440\u0442\u0430\u201d (to draw)<\/strong>, integrated into a Turkish verbal framework, while the clause structure reflects conversational compression typical of bilingual speech.<\/p>\n<p>The verbs <strong>&#8220;klikalamak&#8221;<\/strong> and <strong>&#8220;ts\u0131rtalamak&#8221;<\/strong> in examples [2] and [3] are copied verbs (Ahmed, 2016). 
Sentences of this type <strong>are not<\/strong> consistently and accurately translated by currently available large language models.<\/p>\n<p>These examples demonstrate that bilingual speech in contact zones is not random lexical mixing but a structured communicative system shaped by long term language contact, cognitive economy, pragmatic inference, and a sense of belonging to a community.<\/p>\n<p>Current AI systems typically process such input through token segmentation and probabilistic association. While they may approximate meaning under favorable conditions, they do not reliably reconstruct the intended cross linguistic mapping when orthography is inconsistent or when lexical boundaries shift across languages within a single clause. As Grosjean (1982) argues in his foundational work on bilingualism, bilingual speech is not a mixture of two monolingual systems but a fully integrated communicative mode. This integration poses a structural challenge for models trained primarily on monolingual distributions or artificially segmented multilingual corpora.<\/p>\n<p>From a computational perspective, modern large language models are built on the transformer architecture, which processes token sequences through learned statistical regularities via self attention mechanisms (Vaswani et al., 2017). Although multilingual pretraining improves robustness, it does not guarantee stable interpretation of hybrid forms where phonetic spelling, code switching, and pragmatic compression occur simultaneously. Zhang et al. (2023) found that multilingual large language models are not yet competent code switchers.<\/p>\n<p>Muysken (2000) similarly demonstrates that code switching is rule governed and context sensitive, not a random mixture.<\/p>\n<p>This implies that correct interpretation requires not only lexical alignment but also discourse level inference, which remains an area of partial competence for current AI systems. Newer evaluation work supports this directly. Mohamed et al. 
(2025) note that existing benchmarks concentrate on surface level tasks such as language identification, sentiment analysis, and part of speech tagging, leaving deeper semantic and reasoning capacities largely untested. Sheth et al. (2025) catalog ongoing evaluation efforts across more than 300 studies and identify open problems that remain unresolved despite continued progress in multilingual pretraining. These findings, drawn from evaluations published in 2025, reinforce the original argument empirically. The structural limitation is not primarily about scale or vocabulary coverage. It concerns how statistical models represent meaning when two linguistic systems are genuinely fused rather than alternated, a distinction that matters directly for dialect contact zones such as the <strong>Balkan Sprachbund<\/strong>.<\/p>\n<p>In all the provided examples, a human bilingual speaker reconstructs meaning through a combination of phonological approximation, shared cultural context, and pragmatic inference. The speaker associates \u201csnimanye\u201d with recording contexts and \u201cts\u0131rtalamak\u201d with drawing activity, and interprets mixed imperatives such as \u201cKlikala te buni\u201d within a situational frame of digital interaction.<\/p>\n<p>The limitation observed here is not the absence of bilingual data in training, but the absence of grounded pragmatic interpretation that dynamically integrates cross linguistic cues in real time. As a result, AI systems may produce plausible monolingual interpretations while missing the intended bilingual communicative act.<\/p>\n<p>This distinction is critical for applications in low resource language contexts, dialectal variation, and informal digital communication, where hybrid linguistic forms are common. 
It also suggests that future improvements in AI language understanding will require deeper integration of discourse modeling, pragmatic inference, and structured representations of bilingual speech behavior.<\/p>\n<hr \/>\n<p><strong>References<\/strong><\/p>\n<p>\u0410\u0445\u043c\u0435\u0434, \u041e. (2004). <em>\u041c\u043e\u0440\u0444\u043e\u0441\u0438\u043d\u0442\u0430\u043a\u0441\u0430 \u043d\u0430 \u0442\u0443\u0440\u0441\u043a\u0438\u0442\u0435 \u0433\u043e\u0432\u043e\u0440\u0438 \u043e\u0434 \u041e\u0445\u0440\u0438\u0434\u0441\u043a\u043e \u041f\u0440\u0435\u0441\u043f\u0430\u043d\u0441\u043a\u0438\u043e\u0442 \u0440\u0435\u0433\u0438\u043e\u043d<\/em>. \u041d\u0435\u043e\u0431\u0458\u0430\u0432\u0435\u043d\u0430 \u0434\u043e\u043a\u0442\u043e\u0440\u0441\u043a\u0430 \u0434\u0438\u0441\u0435\u0440\u0442\u0430\u0446\u0438\u0458\u0430. \u0424\u0438\u043b\u043e\u043b\u043e\u0448\u043a\u0438 \u0444\u0430\u043a\u0443\u043b\u0442\u0435\u0442 \u201e\u0411\u043b\u0430\u0436\u0435 \u041a\u043e\u043d\u0435\u0441\u043a\u0438\u201c, \u0423\u043d\u0438\u0432\u0435\u0440\u0437\u0438\u0442\u0435\u0442 \u201e\u0421\u0432. \u041a\u0438\u0440\u0438\u043b \u0438 \u041c\u0435\u0442\u043e\u0434\u0438\u0458\u201c, \u0421\u043a\u043e\u043f\u0458\u0435.<\/p>\n<p>Ahmed, O. (2016). Copied verbs in Turkish dialects of Macedonia. In E. \u00c1. Csat\u00f3, B. Karako\u00e7, &amp; A. Menz (Eds.), <em data-start=\"278\" data-end=\"372\">The Uppsala Meeting: Proceedings of the 16th International Conference on Turkish Linguistics<\/em> (pp. 9\u201318). Harrassowitz Verlag.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal\">Mohamed, A., Zhang, Y., Vazirgiannis, M., &amp; Shang, G. (2025). <em>Lost in the mix: Evaluating LLM understanding of code-switched text<\/em>. arXiv. 
<a class=\"underline underline underline-offset-2 decoration-1 decoration-current\/40 hover:decoration-current focus:decoration-current\" href=\"https:\/\/arxiv.org\/abs\/2506.14012\">https:\/\/arxiv.org\/abs\/2506.14012<\/a><\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal\">Sheth, R., Sinha, S. R., Patil, M., Beniwal, H., &amp; Singh, M. (2025). <em>Beyond monolingual assumptions: A survey of code-switched NLP in the era of large language models across modalities<\/em>. arXiv. <a class=\"underline underline underline-offset-2 decoration-1 decoration-current\/40 hover:decoration-current focus:decoration-current\" href=\"https:\/\/arxiv.org\/abs\/2510.07037\">https:\/\/arxiv.org\/abs\/2510.07037<\/a><\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal\">Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, \u0141., &amp; Polosukhin, I. (2017). Attention is all you need. In <em>Advances in Neural Information Processing Systems 30<\/em> (pp. 5998\u20136008).<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal\">Zhang, R., Cahyawijaya, S., Cruz, J. C. B., Winata, G., &amp; Aji, A. F. (2023). Multilingual large language models are not (yet) code-switchers. In <em>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing<\/em> (pp. 12567\u201312582). Association for Computational Linguistics.<\/p>\n<hr \/>\n<p><strong>How to cite this post:<\/strong><\/p>\n<p>Ahmed, O. (2026). AI systems and the limits of bilingual sentence interpretation: A linguistic perspective. <em>Notes on Language and Artificial Intelligence, I<\/em>(3). 
<a href=\"https:\/\/turkoloji.net\/notes\/i3-ai-systems-and-the-limits-of-bilingual-sentence-interpretation-a-linguistic-perspective\/\">https:\/\/turkoloji.net\/notes\/i3-ai-systems-and-the-limits-of-bilingual-sentence-interpretation-a-linguistic-perspective\/<\/a><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Artificial intelligence systems that process natural language have now achieved remarkable performance in monolingual tasks. However, this is not the case under all circumstances. Their capacity to interpret bilingual or mixed language sentences remains structurally constrained. This is especially true when it comes to the Macedonian Turkish dialects that fall within the boundaries of the Balkan Sprachbund. This limitation is not primarily a matter of vocabulary size or training data volume. It concerns how meaning is represented in statistical language models. Small and underrepresented languages are even more affected. A useful set of examples, collected by the author in 
the&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,4,8],"tags":[],"class_list":["post-116","post","type-post","status-publish","format-standard","hentry","category-ai","category-dialects","category-volume-i"],"_links":{"self":[{"href":"https:\/\/turkoloji.net\/notes\/wp-json\/wp\/v2\/posts\/116","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/turkoloji.net\/notes\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/turkoloji.net\/notes\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/turkoloji.net\/notes\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/turkoloji.net\/notes\/wp-json\/wp\/v2\/comments?post=116"}],"version-history":[{"count":2,"href":"https:\/\/turkoloji.net\/notes\/wp-json\/wp\/v2\/posts\/116\/revisions"}],"predecessor-version":[{"id":120,"href":"https:\/\/turkoloji.net\/notes\/wp-json\/wp\/v2\/posts\/116\/revisions\/120"}],"wp:attachment":[{"href":"https:\/\/turkoloji.net\/notes\/wp-json\/wp\/v2\/media?parent=116"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/turkoloji.net\/notes\/wp-json\/wp\/v2\/categories?post=116"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/turkoloji.net\/notes\/wp-json\/wp\/v2\/tags?post=116"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}