Wednesday, June 25, 2025

Aye-Yai-AI

I've been interacting with large language models (LLMs) for about a year now, and over that time they have become both more useful and more frustrating. Many of the complaints about the early versions of ChatGPT (extreme hallucinations, stilted prose) don't hold water in the way they used to. The most recent version of ChatGPT (4.5) does a fairly impressive job of pulling in information from online sources when that is needed to improve a response. From that point of view, the output is much more usable than it used to be.

The other day I gave the preview version of Google's Gemini "Deep Research" a prompt to create a report on the "state-of-the-field" in the use of AI for educational assessment, particularly in the context of second and foreign language testing. After about five minutes, it had dutifully produced a multi-page document that included, as requested, sections on automatic item generation, speech recognition, automated scoring, and the like, complete with a section on ethical concerns and implications. Although it didn't add citations to the report itself, it did list all of the references that it consulted in creating the final product (and, if you dig into the details, a second list of references that were "read" but not used in the final draft). With a simple push of another button -- and a five-minute wait -- it even produced a short "podcast" of highlights from the report, presented by virtual hosts, complete with pauses and overlapping voices.

From a technical standpoint, this is impressive to the point of being magical. So why do I feel so uneasy about the process and the output?

My first thought was that perhaps the results weren't very accurate, though that didn't seem to be the case. Although there are still stories of AI famously creating summer reading lists that include fictitious books, most of the information appeared to be plausible. 

That said, there was certainly a degree of "superficiality" about the report -- although it was based on sources, it read more like a magazine article than an academic one. But "superficial" doesn't quite capture the issue (and review articles, almost by definition, have to stay at the 30,000-foot view if they cover a range of topics).

I still don't have a good word for my uneasiness, but I think it has to do with the fact that even though AI can scan more articles in minutes than I can hope to read in my lifetime, it has no sense of "the field" in the way that an academic would. The word "validity" doesn't conjure up the work of Messick or of other seminal figures in the ongoing debate. AI is not a discerning reader -- you are just as likely to get a reference from a blog post or a third-tier (even predatory) journal as from Language Testing (and perhaps even more likely, since the good journals tend to be behind paywalls).

At the end of the day, whether academic fields are able to survive the AI onslaught remains to be seen. (There is a wonderfully titled article from the Mayo Clinic called "The Debris Field of Scientific Publications Authored by Artificial Intelligence" that I haven't yet been able to access.) If articles are increasingly generated by unthinking AI rather than thinking humans, we run the risk of polluting the intellectual waters from which we need to drink.
