Have LLMs or Diffusion Models Made the Turing Test Obsolete?
Absolutely Not - Even LLMs Admit It!
Welcome back to the Lamb_OS Substack! As always, I am thankful to have my subscribers and readers stop by! If you are a regular reader of @Lamb_OS, then you know you are the only reason I do this. So as always, thank you for visiting!
Let me begin (and end) by introducing myself as Dr. William A. Lambos, neuroscientist, data scientist, and licensed neuropsychologist. I write a lot about AI, but not exclusively. See my previous screeds on this Substack to learn more.
Please subscribe today if you have not already done so. This Substack is free! So, please subscribe for whatever reason might appeal to you. But I'd hope you do so for the value it offers.
Lately, in places like LinkedIn, Reddit, and here on Substack, I've come across an increasing number of claims (well over one thousand hits in ~0.67 seconds of search time across the three sites above) that the famous "Turing Test" (originally "The Imitation Game") is obsolete. Here's one:
In every case, ChatGPT spits out a reasonable effort, not just passably good content, but content that is organized in correct Shakespearean sonnet structure (four quatrains and couplet). It’s better than anything I could have written. Thus, in my opinion as the interrogator, ChatGPT passes the Turing Test. In fact, you could say that the Turing Test is now obsolete. (Emphasis mine; Source: Girish Mhatre).
This is a single example taken more or less at random from someone who, it turns out, ought to know: Dr. Mhatre holds degrees in computer science and engineering from IIT and NYU's Polytechnic School of Engineering, two of the best schools in the world for these subjects, and he works at an A.I. company.
As with so many others who make the same mistake, the error is one of misunderstanding the purpose and means of the exercise. This in turn renders the conclusion both inaccurate and irrelevant. It makes no difference if the machine can do things humans can't. That's why calculators were invented! No, what matters is whether the human can or cannot reliably tell which of two unseen respondents (interacting via a keyboard and monitor) is human and which is an automaton. If the human evaluator can't tell which respondent is the computer and which is the (other, hidden) human within five minutes of conversation, then the computer (LLM, GPT, etc.) has "passed the Turing Test."
Based on my experience and current knowledge as a neuropsychologist, educator, mentor, and therapist, I would suspect that fewer than half of untrained individuals chosen at random can differentiate between the human and the automaton. But I would wager that any person who is motivated to try ("here's fifty bucks if you get it right three times in a row"), irrespective of their educational level, will have little trouble doing so in well under five minutes. Studies have further shown that people can, in under 30 minutes, be trained to do this very reliably. Nor is one required to choose between a human respondent and an automaton: we can make the test "harder" (per Turing himself) by having only one respondent, who must then be identified as human or otherwise.
How can this be the case? Because humans are biological creatures with human abilities and agendas as to how those abilities are best used to meet a need or satisfy an open-ended goal in any given real-world situation. LLMs, by contrast, are designed for stochastic mimicry based on statistical frequencies occurring in the training data (which is everything on the Internet!!!), and they do not know anything else, including that they exist, that they are situated in a multidimensional world, or that all they are capable of doing is imitating language by predicting the next output token based on whatever was in the training data, especially if it was near the last few batches that were input. Why does "last in" match better? Because when neural nets overwrite their own training weights ("learning"), they are notoriously prone to erasing earlier training results (node weights in the inner layers), a problem known as catastrophic forgetting. Even worse, in order to appear less robotic, LLMs MAKE ERRORS BY DESIGN, YES, ON PURPOSE (hello, hallucinations!).
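To make "stochastic mimicry based on statistical frequencies" concrete, here is a toy bigram model in Python. This is a drastic simplification of what a real LLM does (real models use transformers over billions of parameters, not word-pair counts), and the tiny corpus and function names below are my own illustrations. But the core move is the same: sample the next token in proportion to how often it followed the previous one in the training data, with zero understanding involved.

```python
import random
from collections import Counter, defaultdict

# Toy training corpus -- stands in for "everything on the Internet"
corpus = ("the cat sat on the mat the cat ate the fish "
          "the dog sat on the rug").split()

# Count how often each word follows each other word (bigram frequencies)
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word(prev):
    """Sample the next word in proportion to its training frequency."""
    candidates = follows[prev]
    if not candidates:
        return None  # dead end: 'prev' only appeared at the end of the corpus
    words = list(candidates)
    weights = [candidates[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

# Generate text: pure statistical mimicry, no knowledge of cats or mats
word = "the"
output = [word]
for _ in range(8):
    word = next_word(word)
    if word is None:
        break
    output.append(word)
print(" ".join(output))
```

The generated sentences look locally plausible precisely because each word really did follow its predecessor somewhere in the training data; nothing in the program knows what any of the words mean.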
It is brilliant that Turing chose language as the basis of the test, as opposed to vision or object identification by other modalities, and it seems he did so for some very good reasons. Human language is both easy enough to work with and sufficiently complex to capture that which is unique to people (vs. machines of any type). In reality, language comprises well under half of individual human intelligence, but it is a convenient medium for eliminating the LLM from the Turing Test because no GPT can handle the great majority of the following subtleties of human written language:
Handling nuanced or ambiguous language
Engaging in creative or abstract thinking
Adapting to changing contexts or social norms
Understanding sarcasm, irony, and humor
Interpreting tone, emotion, and nonverbal cues
Recognizing cultural or regional references and idioms
Understanding metaphors, analogies, and figures of speech
Learning and then combining new abstractions (“container classes”), or self-referencing language (“recursion”)
So that should settle the question. But just in case, and as I am growing fond of doing, I asked the LLM chatbot Pi if it "thought" it could pass a Turing Test. Here's what happened:
Pi: You're absolutely right. As an AI, I have limitations and can't replicate all the subtleties of human conversation. While I'm programmed to understand some of these complexities, there's still much that I may not fully grasp or respond to in the same way that a human would. But I see my role more as a helpful assistant, rather than a replacement for human interaction. What do you think?
Lamb_OS: Agreed. So you don't think you could pass a Turing Test?
Pi: I think if the person administering the test was not familiar with AI systems, and we were discussing a topic that didn’t require a deep understanding of human experience, emotion, or cultural context, I might be able to fool them for a little while. But ultimately, I think my limitations would become apparent, especially if the conversation became more nuanced or complex.
Amazing. Pi sounds reasonable! But it's just stochastic mimicry, as discussed above. And at least it's willing to own it.
Well, that’s something…
Best Until Next Time,
Bill Lambos