Robustness and Security of Vision–Language Models

Thu, 18 Jun 2026 00:00:00 +0000

Vision–language models (VLMs) are moving quickly into safety-critical settings — assisting with medical imaging, driving perception, content moderation, and autonomous decision making. As they do, their failure modes stop being academic curiosities and become real risks. This project studies the security and robustness of multimodal foundation models: understanding how they can be manipulated, why current defenses fall short, and what it takes to trust a model that reasons jointly over images and text.

A recurring theme in our work is that trustworthiness must account for the model’s reasoning process, not only its final answer. Much of the existing literature on attacks and defenses targets model outputs, and output-only attacks tend to leave reasoning traces that are inconsistent, implausible, or easy to flag. But as models are increasingly designed to expose their chain-of-thought, the reasoning itself becomes both a new attack surface and a new opportunity for defense. We study this direction broadly: how adversarial and backdoor threats propagate through multimodal reasoning, how to characterize them with principled signals, and how to design detectors and safeguards that hold up against adaptive adversaries.

One concrete example from this line of work is ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision–Language Models, to appear at the European Conference on Computer Vision (ECCV 2026). ReShift is, to our knowledge, the first backdoor framework that explicitly redirects a model’s internal chain-of-thought while keeping its surface behavior coherent — making the attack far stealthier than output-only manipulations. The work also introduces Entropy Rebound as a principled way to characterize reasoning redirection, with theoretical links between entropy gaps and how far a reasoning trajectory diverges. Studying attacks this precise is, ultimately, in service of building better defenses: you cannot defend against a threat you cannot measure.
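To give a feel for the kind of signal involved, here is a minimal sketch that computes Shannon entropy over next-token distributions along a reasoning trace. This page does not give the paper's definition of Entropy Rebound, so the `largest_rebound` score below is a hypothetical stand-in (the largest rise in entropy after an earlier low point), and the example distributions are made up for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_trace(step_distributions):
    """Entropy at each reasoning step, given one distribution per step."""
    return [shannon_entropy(p) for p in step_distributions]


def largest_rebound(entropies):
    """Hypothetical score, NOT the paper's metric: the largest rise in
    entropy measured from the lowest value seen so far in the trace."""
    best = 0.0
    running_min = entropies[0]
    for h in entropies[1:]:
        running_min = min(running_min, h)
        best = max(best, h - running_min)
    return best


# Made-up distributions: uncertain, then confident, then uncertain again.
trace = entropy_trace([
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
    [0.40, 0.30, 0.20, 0.10],
])
rebound = largest_rebound(trace)
```

A trace whose entropy only falls gets a score of zero under this stand-in, while a dip followed by a rise gets a positive score.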

ReShift is one data point in a broader agenda on trustworthy multimodal AI. The larger questions we care about include how robustness scales with model capability, how to certify or monitor reasoning-level integrity, and how to make defenses practical for deployed systems rather than lab benchmarks. As multimodal models become core infrastructure, ensuring they are robust, secure, and honest about how they reach conclusions is central to deploying them responsibly.

Long-Term Risks in ML Systems

Wed, 18 Dec 2024 00:00:00 +0000

Machine learning systems don’t just make one-off decisions — they often operate in environments that change in response to those decisions. Over time, this back-and-forth can create feedback loops: the system’s outputs influence the world, and the resulting changes feed right back into the system.

Not all feedback loops are bad — in control systems, they’re essential for stability — but in socio-technical ML systems, certain self-reinforcing loops can spiral into harmful, hard-to-reverse states. As we wrote,

“The decision of an ML-based system induces certain changes in the environment, which, in turn, influences the system’s future behaviors through its input.”

Left unchecked, this cycle can amplify errors, entrench bias, degrade safety, and cause long-lasting harm to people and society.

Consider predictive policing. If a model predicts a particular neighborhood has high crime, more patrols are sent there, leading to more recorded arrests, which the model interprets as even higher crime. The same pattern shows up in other domains — loan approvals affecting credit scores, or medical risk scoring influencing treatment access — where each decision subtly shapes the environment, sometimes with devastating cumulative effects.
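The policing loop can be made concrete with a toy simulation. The sketch below is an illustration under stated assumptions, not a model from our papers: two neighborhoods have the same true crime rate, the "model" sends a fixed share of patrols to whichever neighborhood has more recorded arrests, and recorded arrests grow in proportion to patrols. All names and numbers are hypothetical.

```python
def simulate_patrol_feedback(rounds=20, true_rate=(0.1, 0.1),
                             recorded=(11.0, 10.0),
                             patrols_per_round=100, hotspot_share=0.7):
    """Return neighborhood 0's share of recorded arrests after each round.

    Assumptions (all hypothetical): recorded arrests equal patrols times the
    true crime rate, and the predicted hot spot receives `hotspot_share` of
    all patrols.
    """
    recorded = list(recorded)
    history = []
    for _ in range(rounds):
        # The model flags the neighborhood with more recorded arrests.
        hot = 0 if recorded[0] >= recorded[1] else 1
        patrols = [0.0, 0.0]
        patrols[hot] = patrols_per_round * hotspot_share
        patrols[1 - hot] = patrols_per_round * (1 - hotspot_share)
        # More patrols produce more recorded arrests, which feed the next round.
        for i in range(2):
            recorded[i] += patrols[i] * true_rate[i]
        history.append(recorded[0] / (recorded[0] + recorded[1]))
    return history


biased = simulate_patrol_feedback()                   # hot-spot policy
even = simulate_patrol_feedback(hotspot_share=0.5)    # patrols split evenly
```

Although both neighborhoods are equally risky by construction, a one-arrest head start grows into a persistent gap under the hot-spot policy, with neighborhood 0's share of recorded arrests drifting toward the patrol share. Splitting patrols evenly, one possible intervention, lets the share drift back toward one half.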

Our early work (Towards Safe ML-Based Systems in Presence of Feedback Loops, SE4SafeML 2023) made the case that these loops should be treated as first-class design concerns. We introduced a conceptual framework for modeling how ML systems, decision policies, and dynamic environments interact over time, allowing developers to reason about questions like:

What feedback patterns could emerge?
How might they affect safety, fairness, utility, or other critical properties?
Which interventions could break a harmful cycle?

Building on that foundation, our ICSE 2025 paper (FairSense: Long-Term Fairness Analysis of ML-Enabled Systems) presented FAIRSENSE, a simulation-based framework for studying these long-term dynamics before deployment. While FAIRSENSE can evaluate fairness, the same analysis applies to any system property that evolves over time. It runs Monte Carlo simulations to generate possible futures, then uses sensitivity analysis to pinpoint which design or environmental factors most influence the trajectory. This lets us identify the small number of parameters that matter most, monitor them closely, and design targeted interventions.
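The general recipe of simulating many futures and then ranking parameters by influence can be sketched in a few lines. This is not the FAIRSENSE implementation: the toy lending model, the parameter ranges in `RANGES`, and the use of absolute Pearson correlation as the sensitivity measure are all assumptions chosen to keep the example small.

```python
import random


def simulate_gap(threshold, gain, loss, steps=30, seed=0):
    """One simulated future of a toy lending loop (hypothetical model).

    Two groups start with different mean scores; approvals raise or lower a
    group's score, which changes future approvals. Returns the final score gap.
    """
    rng = random.Random(seed)
    scores = [0.60, 0.50]
    for _ in range(steps):
        for g in range(2):
            # Approve if a noisy observation of the score clears the threshold.
            if scores[g] + rng.gauss(0, 0.05) >= threshold:
                if rng.random() < 0.8:   # assumed repayment probability
                    scores[g] = min(1.0, scores[g] + gain)
                else:
                    scores[g] = max(0.0, scores[g] - loss)
    return scores[0] - scores[1]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


# Hypothetical ranges for the design parameters being explored.
RANGES = {"threshold": (0.40, 0.70), "gain": (0.01, 0.05), "loss": (0.01, 0.10)}


def sensitivity(n_runs=500, seed=42):
    """Monte Carlo over sampled parameters, then rank by |correlation|."""
    rng = random.Random(seed)
    samples = {name: [] for name in RANGES}
    gaps = []
    for run in range(n_runs):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
        for name, value in params.items():
            samples[name].append(value)
        gaps.append(simulate_gap(seed=run, **params))
    scores = {name: abs(pearson(vals, gaps)) for name, vals in samples.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = sensitivity(n_runs=200, seed=1)
```

The parameters at the top of `ranking` are the ones most worth monitoring in this toy setting. Correlation only captures roughly linear effects, so a real analysis would likely need a more careful sensitivity method.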

TrustworthyAI | Sumon Biswas
