I recently played around a bit with a new LLM-powered app called Consensus. The goal of the app is to provide functionality akin to Google Scholar + ChatGPT, such that you are not merely searching for references, but also building up an understanding of the main points of those references, like a mini literature review based on your query. If you ask a "yes / no" question (e.g., "Is it bad to exercise immediately after eating?"), you will not only receive a list of references, but also a "Consensus Meter" that attempts to quantify the sentiments from those references (with a "Possibly" sentiment thrown in). In other words, at a glance you could see how the majority of the references might lean in answering the question.
After trying a few general queries to get used to the interface, I decided to try a more provocative question to see what the "consensus" would be and what kind of results would be returned. At the risk of readers misinterpreting these musings as representing a particular stance on the issue, here is what I noticed.
The prompt I used was: "Is there general consensus that biological sex (not gender expression) is essentially binary in humans?" I phrased it that way because I wanted to have a response that would give me a "Consensus Meter."
The results pulled back nine papers, five of which were used to provide the "consensus" (which was Yes: 20%, No: 80%, with 0% for Possibly). I was a little surprised at the result, as I had assumed that specifying in the prompt that I was not interested in "gender expression" would push the result in the opposite direction. (AFAIK, medical researchers don't break down research findings into multiple biological sex categories beyond the binary unless that is the particular focus of the study, but that's just an assumption as I don't read a lot of medical literature.)
I decided to look more closely at the results. I was a little disappointed to find that about half of the nine references returned had under five citations (and two had zero citations). Of course, the more recent the research, the fewer the citations (even for very high quality research), so even that metric can be misleading. Two of those low-citation references were used in the "Consensus Meter."
I don't know enough about the medical journal landscape to know which journals are more "reputable," though the app did mark some journals as "Rigorous Journal" when I asked for additional results. (It is interesting that the rigorous journal papers were not in the first batch of results which informed the "Consensus Meter" -- once created, the meter doesn't change when clicking on "additional results.") It did appear that there was a mix of types of journals: Anatomical Sciences Education, American Psychologist, Policy Insights from the Behavioral and Brain Sciences, etc., which may have biased the results towards a more applied answer than a strictly biological/medical one. Or maybe those results really are representative of the current consensus.
I tried the query again, rephrased as "Is there general consensus in the medical (not psychological) community that biological sex (not gender expression) is essentially binary in humans?" As far as I could tell (the results are not saved in the free version), that gave me essentially the same list of references and summary as the previous query, though, interestingly enough, it didn't include a "Consensus Meter" this time. There is probably some amount of caching of similar queries to save on costs for the app provider.
I tried another query more in line with an area of research that I am somewhat familiar with, second language acquisition. This time, the prompt I used was: "Does being raised in a bilingual household have cognitive advantages?" There was a period of time a decade or two ago in which there was lots of popular press about "the bilingual advantage," though more recent research has been mixed.
This time, the "Consensus Meter" was evenly split (Yes: 42%, No: 42%, Possibly: 17%). I'm not completely up-to-date myself, but this matches my sense of where things are. As before, I did notice that one of the results had no citations (though it was a recent paper, so it could be for that reason). I recognized some of the names and articles that were returned, but was surprised to see that Ellen Bialystok's work (the name I most associate with this kind of research) didn't appear until I hit the "more results" button.
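As a rough illustration of the arithmetic behind readings like these, here is a minimal Python sketch. It assumes the meter simply assigns one label per paper and rounds each share to a whole percent; how the app actually computes the meter is not something this post can confirm. The function name `consensus_meter` is hypothetical, and so is the 5/5/2 tally for the bilingual query (the 1/4 tally follows from the five papers mentioned above, under the same one-label-per-paper assumption).

```python
from collections import Counter

STANCES = ("yes", "no", "possibly")

def consensus_meter(labels):
    """Return each stance's share of the labels as a rounded whole percent."""
    total = len(labels)
    if total == 0:
        raise ValueError("need at least one labeled paper")
    counts = Counter(labels)  # stances that never occur count as 0
    return {stance: round(100 * counts[stance] / total) for stance in STANCES}

# Hypothetical tallies chosen to reproduce the readings quoted in this post.
sex_query = ["yes"] * 1 + ["no"] * 4
bilingual_query = ["yes"] * 5 + ["no"] * 5 + ["possibly"] * 2

print(consensus_meter(sex_query))
print(consensus_meter(bilingual_query))
```

Under these assumptions, a single paper out of five moves the meter by 20 points, and independently rounded shares need not add up to 100 (42 + 42 + 17 = 101), so the percentages deserve to be read loosely when the underlying sample is this small.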
I can see how this app could be useful. There is a pro feature (which I didn't try as it is extremely limited in the free version) that allows you to get more information on a particular study. For some studies, you can query the study directly (which I assume is a kind of Retrieval Augmented Generation (RAG) process for those studies that have full text), and for others you can at least get more information about sample size, methodology, etc. You can also keep asking for additional results beyond the ten (or so) that are pulled up with your first query. (To be fair, many journals now offer "you may also be interested in ..." recommendations as well, if you have access to the journal online through a university library or similar.)
I can also see how this app could be misleading. In some sense, this is not the fault of the app, but of how "big data" can distort as well as inform. To give a trivial example, that the Earth is round (or, at least, an oblate spheroid) and not flat would not be a contentious statement in scientific circles. In fact, it is so well-known that it is not even a topic of study at this point. However, if one were to use the number of YouTube videos as a metric of consensus, one might well get the false impression that the "flat Earth" hypothesis is an active area of research for astronomers.
But what about a less trivial example? If there are perverse incentives to publish at all costs to be able to have any chance of obtaining an academic career, keep your grant, etc., and this rat race results in an increasing reliance on falsified data and/or AI-generated papers, this could lead to a degradation in the quality of thought that is produced, which will feed back into the "scientific consensus" when seen through tools such as this app. There were more than 10,000 research articles retracted in 2023. Are those papers still retrieved in "big data" searches?
Author (and Harvard M.D.) Michael Crichton had this to say about consensus (the concept, not the app):
"Let's be clear: the work of science has nothing whatever to do with consensus. Consensus is the business of politics. Science, on the contrary, requires only one investigator who happens to be right, which means that he or she has results that are verifiable by reference to the real world. In science consensus is irrelevant. What is relevant is reproducible results."
But even that sentiment, with which I am sympathetic, is oversimplified. The question of what constitutes knowledge is a thorny one, and there is not a simple straight line between the results of any one experiment and "the truth," particularly when we are looking at processes that are more social than physical. Reproducible results don't necessarily validate your theory per se, i.e., your model could be "wrong" but still useful. Philosopher of science Larry Laudan has argued that it is not irrational to believe in theory A (the established theory, which tends to solve more problems) while still keeping an eye on theory B (an incommensurate theory, which can't solve as many problems yet, but might be more internally consistent, etc.).
As I often tell my students, the answer to almost any interesting question seems to be "it depends" -- but "it depends" is not the same thing as "anything goes," and therein lies the challenge. The Consensus app may be a useful addition to the researcher's toolbox, but it shouldn't be the only one.
