This is a very belated post about a presentation that I saw many months ago (summer 2020?). If memory serves, the presentation was from a NASEM (National Academies of Sciences, Engineering, and Medicine) event, but I don't recall the details or the speakers. Like so many virtual talks during these pandemic times that I had planned to view with rapt attention, I ended up listening to it in the background while working on more pressing tasks.
I do recall, however, one of the discussions, which focused on AI fairness. One thing that made me perk up was when one of the panelists noted that although we rely on supervised learning algorithms to meet many real-world prediction needs, they are limited in the scope of what they can predict: the predictions hold only for the same relationships on which the model was trained, such as the creditworthiness of a credit card applicant given demographic, income, banking history, and other such information.
This made me recall the discussions of (educational) test validity from when I was in that world. At the end of the day, we really don't care about performance on the particular assessment tasks per se ("Hey, Tommy got Item 6 correct!"), but rather the inferences that we are able to make based on performance across all of the assessment tasks in the test battery. And we don't really care about predicting performance on the test battery, but rather the inferences that we can make about real-world performance in untested domains.
In the case of language proficiency, for example, we may use a thirty-minute oral proficiency interview with a very limited number of tasks to assess how well the examinee might perform in a virtually infinite number of "real world" language interactions after the assessment. Are they ready (linguistically speaking) to lead the students on a study abroad program? To staff the front desk at an international hotel? To work as a foreign area officer? In other words, the tasks on the test do not constitute the sum total of the domain of potential tasks to which you want to make the inference.
Thinking about learning analytics, we have the same challenge. Does it really do us any good to predict student performance in a course if that requires us to never change the course assignments because those were the items used to train the predictive model? What we need is a model of student performance that treats assignments as samples from a larger pool of potential assignments. We're not there yet.
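One rough way to probe this, offered only as a sketch and not as anything from the presentation: evaluate a predictive model by holding out whole assignments rather than random rows, so that each test set contains an assignment the model has never seen. The record layout and the function name `leave_one_assignment_out` below are hypothetical, made up for illustration.

```python
def leave_one_assignment_out(records):
    """Yield (held_out, train, test) splits, one per assignment.

    Each split keeps every record for one assignment out of training,
    mimicking the arrival of a new assignment the model was not trained on.
    `records` is a list of dicts with an "assignment" key (assumed layout).
    """
    assignments = sorted({r["assignment"] for r in records})
    for held_out in assignments:
        train = [r for r in records if r["assignment"] != held_out]
        test = [r for r in records if r["assignment"] == held_out]
        yield held_out, train, test


# Toy data: two students, three assignments (invented values).
records = [
    {"student": "s1", "assignment": "essay", "score": 0.8},
    {"student": "s2", "assignment": "essay", "score": 0.6},
    {"student": "s1", "assignment": "lab", "score": 0.9},
    {"student": "s2", "assignment": "lab", "score": 0.5},
    {"student": "s1", "assignment": "quiz", "score": 0.7},
    {"student": "s2", "assignment": "quiz", "score": 0.4},
]

folds = list(leave_one_assignment_out(records))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

If a model only looks good under a random row split and falls apart under this kind of split, that is a hint that it has learned the particular assignments rather than anything about the student.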
Other parts of the discussion made me think of the distinction between norm-referenced and criterion-referenced decisions. Much of AI fairness looks at whether or not group-level rates are equitable given the respective rates in the population. Are the false negative rates proportional? That the error rates are equitable at the group level (in a norm-referenced kind of way) is little consolation if, for example, you are the one falsely denied a raise.
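To make the group-level versus individual-level point concrete, here is a minimal sketch with invented data. It computes the false negative rate per group (the share of people who deserved a "yes" but were predicted "no") and then lists the individuals who were falsely denied. The function name `false_negative_rates` and the `(group, actual, predicted)` tuple layout are my own assumptions for illustration.

```python
from collections import defaultdict


def false_negative_rates(records):
    """Return {group: false negative rate} for (group, actual, predicted) tuples.

    The rate is misses / actual positives within each group. Groups with no
    actual positives are left out, since the rate is undefined for them.
    """
    positives = defaultdict(int)
    misses = defaultdict(int)
    for group, actual, predicted in records:
        if actual:
            positives[group] += 1
            if not predicted:
                misses[group] += 1
    return {g: misses[g] / positives[g] for g in positives}


# Invented decisions: actual=True means the person merited the raise.
records = [
    ("A", True, True), ("A", True, False), ("A", True, True),
    ("A", True, True), ("A", False, False),
    ("B", True, True), ("B", True, False), ("B", True, True),
    ("B", True, True), ("B", False, True),
]

rates = false_negative_rates(records)
falsely_denied = [r for r in records if r[1] and not r[2]]

print(rates)                # the two groups have the same rate here
print(len(falsely_denied))  # yet some individuals were still wrongly denied
```

In this toy data the two groups have identical false negative rates, so a group-level audit would pass, and yet two people were still wrongly denied.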
(If this post seems a little jumbled, it is because I was trying to reconstruct my thoughts based on a couple of phrases that were jotted down here months ago with a "make a post out of this" reminder to myself.....)
