Public red-teaming and trust

DEF CON is one of the most important hacker conferences worldwide, held yearly in Las Vegas. This coming August, it will host a large simulation, in which thousands of security experts from the private sector and academia will be invited to compete against each other to uncover flaws and bias in the generative large language models (LLMs) produced by leading firms such as OpenAI, Google, Anthropic, Hugging Face, and Stability. While in traditional red-team events the targets are bugs in the code, hardware, or human infrastructure, participants at DEF CON have additionally been instructed to seek exploits through adversarial prompt engineering, so as to induce the LLMs to return troubling, dangerous, or unlawful content.

This initiative definitely goes in the right direction in terms of building trust through verification, and bespeaks significant confidence on the part of the companies, as it can safely be expected that the media outlets in attendance will be primed to amplify any failure or embarassing shortcoming in the models’ output. There are limits, however, to how beneficial such an exercise can be. For one thing, the target constituency is limited to the extremely digitally literate (and by extension to the government agencies and private businesses the firms aspire to add to their customer list): the simulation’s outcome cannot be expected to move the needle on the broad, non-specialist perception of AI models and their risks in the public at large. Also, the stress test will be performed on customized versions of the LLMs, made available by the companies specifically for this event. The Volkswagen emissions scandal is only the most visible instance of how one may exploit such a benchmarking system. What is properly needed is the possibility of an unannounced audit of LLMs on the ground in their actual real-world applications, on the model of the Michelin Guide’s evaluation process for chefs and restaurants.

In spite of these limitations, the organization of the DEF CON simulation if nothing else proves that the leading AI developers have understood that wide-scale adoption of their technology will require a protracted engagement with public opinion in order to address doubts and respond to deeply entrenched misgivings.