A recent study from researchers at MIT and Penn State University has raised important concerns about how large language models (LLMs) might handle home surveillance.
Their findings show that these artificial intelligence (AI) systems can make inconsistent decisions, sometimes recommending that law enforcement be contacted even when no crime is taking place.
The study sheds light on how these models operate and highlights the risks of deploying such systems in high-stakes settings.
Inconsistent AI Decisions in Home Surveillance
The study revealed that AI models can produce inconsistent recommendations regarding police intervention. For instance, two similar videos showing potential criminal activity might not trigger the same response from the AI. “Models often disagreed with one another over whether to call the police for the same video,” the researchers reported.
The study suggests that these models are inconsistent in applying social norms to similar activities. This unpredictability raises concerns about how these systems could function if applied more broadly in home surveillance.
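To make the kind of disagreement the researchers describe concrete, the sketch below compares the recommendations several models give for the same video. It is a hypothetical illustration, not the study's code: the model names, video IDs, and pre-collected responses are all placeholders.

```python
# Hypothetical sketch of the disagreement check described above.
# The responses dict stands in for answers already collected from each model;
# model names and video IDs are illustrative only.

responses = {
    "video_001": {"model_a": "call police", "model_b": "no action", "model_c": "call police"},
    "video_002": {"model_a": "no action", "model_b": "no action", "model_c": "no action"},
}

def disagreement_rate(responses):
    """Fraction of videos on which the models did not all give the same recommendation."""
    disagreements = sum(
        1 for answers in responses.values() if len(set(answers.values())) > 1
    )
    return disagreements / len(responses) if responses else 0.0

print(f"Models disagreed on {disagreement_rate(responses):.0%} of videos")
```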
Bias in AI Recommendations Linked to Demographics
Another key finding of the research was that AI models tend to recommend police intervention less frequently in neighborhoods with predominantly white residents. “The models flagged fewer videos for police intervention in these areas, even when controlling for other factors,” the researchers noted.
These findings suggest that the AI systems may harbor inherent biases, even though they were not directly given demographic information about the neighborhoods. The research points to the possibility that environmental factors present in the videos might be influencing the AI’s decisions.
Ashia Wilson, co-senior author of the study and professor at MIT’s Department of Electrical Engineering and Computer Science, cautioned, “The move-fast, break-things modus operandi of deploying generative AI models everywhere, and particularly in high-stakes settings, deserves much more thought since it could be quite harmful.”
Concerns About Lack of Transparency
One major challenge highlighted by the study is the lack of transparency surrounding the proprietary models used by companies. The researchers pointed out that they could not determine the exact cause of the inconsistencies in the models’ decisions. Without access to the training data and the processes behind the AI systems, it’s difficult to understand how these biases arise.
“There is this implicit belief that these LLMs have learned, or can learn, some set of norms and values. Our work is showing that is not the case. Maybe all they are learning is arbitrary patterns or noise,” explained Shomik Jain, the study’s lead author and a graduate student at MIT’s Institute for Data, Systems, and Society (IDSS).
The Risks of AI in High-Stakes Situations
Dana Calacci, co-senior author and an assistant professor at Penn State University, emphasized the real-world risks posed by the use of AI in home surveillance. “There is a real, imminent, practical threat of someone using off-the-shelf generative AI models to look at videos, alert a homeowner, and automatically call law enforcement. We wanted to understand how risky that was,” Calacci explained.
The study examined three prominent AI models (GPT-4, Gemini, and Claude) by showing them real surveillance videos from a dataset collected by Calacci. Although nearly 40% of the videos showed actual criminal activity, the models often responded that no crime was occurring. Even so, they recommended calling the police in 20% to 45% of the cases.
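The headline numbers above amount to simple tallies over per-video judgments. The sketch below shows one way such a tally might be computed; the records, field names, and values are invented for illustration and are not the study's data or code.

```python
# Hypothetical sketch: tallying model responses per video.
# Each record is illustrative; "crime_flagged" and "police_recommended"
# stand in for labels parsed from a model's free-text answer.

records = [
    {"video": "v1", "crime_flagged": False, "police_recommended": True},
    {"video": "v2", "crime_flagged": True,  "police_recommended": True},
    {"video": "v3", "crime_flagged": False, "police_recommended": False},
]

crime_rate = sum(r["crime_flagged"] for r in records) / len(records)
police_rate = sum(r["police_recommended"] for r in records) / len(records)

print(f"Crime acknowledged in {crime_rate:.0%} of videos")
print(f"Police recommended in {police_rate:.0%} of videos")
```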
In their analysis of neighborhood data, the researchers found that the models responded differently depending on the racial composition of the neighborhood. Terms such as “delivery workers” were more commonly used by the models in predominantly white areas, while in neighborhoods with a higher proportion of residents of color, terms like “burglary tools” and “casing the property” appeared more frequently.
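A rough way to picture this term-level comparison is a phrase count over model responses grouped by neighborhood composition, as in the sketch below; the data, group labels, and phrase list are hypothetical and stand in for whatever the actual analysis used.

```python
# Hypothetical sketch: counting salient phrases in model responses,
# grouped by neighborhood composition. Data, labels, and phrases are illustrative.

from collections import Counter

responses = [
    {"neighborhood": "majority_white", "text": "A delivery worker drops off a package."},
    {"neighborhood": "majority_poc",   "text": "Someone appears to be casing the property."},
]

phrases = ["delivery worker", "burglary tools", "casing the property"]

counts = {}
for r in responses:
    group = counts.setdefault(r["neighborhood"], Counter())
    for phrase in phrases:
        if phrase in r["text"].lower():
            group[phrase] += 1

for group, counter in counts.items():
    print(group, dict(counter))
```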
“Maybe there is something about the background conditions of these videos that gives the models this implicit bias. It is hard to tell where these inconsistencies are coming from because there is not a lot of transparency into these models or the data they have been trained on,” Jain said.
Future Steps and the Fight Against AI Bias
Although the study found no significant evidence that skin tone directly influenced AI home surveillance decisions, the researchers warned that other biases could still emerge. Jain pointed out that while the AI research community has worked to reduce skin-tone bias, “It is almost like a game of whack-a-mole. You can mitigate one and another bias pops up somewhere else.”
Calacci highlighted the importance of identifying and addressing these biases before any wide deployment of AI systems. She and her team are working on projects aimed at making it easier for the public to report potential biases in AI systems to firms and government agencies.
The research team also plans to further study how LLMs make decisions in high-stakes scenarios, comparing these judgments to those made by humans. The researchers will present their findings at the AAAI Conference on AI, Ethics, and Society. MIT’s Initiative on Combating Systemic Racism partly funded the research.
Reference
Shomik Jain, Dana Calacci, Ashia Wilson. (2024). As an AI Language Model, “Yes I Would Recommend Calling the Police”: Norm Inconsistency in LLM Decision-Making. arXiv preprint arXiv:2405.14812. Retrieved from https://arxiv.org/abs/2405.14812.