Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models | Open Awesome



Overview

Shows that text embedding models have severely biased output distributions and exploits this bias to find universal adversarial suffixes ("magic words") that bypass embedding-based LLM safeguards. The attacks transfer across models and languages, and the paper also proposes a training-free debiasing defense.

Quick Stats

Stars: 0

Forks: 0

Contributors: 0

Included in

Prompt Injection (453)

Related Projects

Safety in Embodied AI: Risks, Attacks, and Defenses

A survey of risks, attacks, and defenses in embodied AI, covering 500+ papers across perception, cognition, planning, interaction, and agentic systems.

Stars: 118

Forks: 3

Last commit: 1 day ago

Attention Tracker: Detecting Prompt Injection Attacks in LLMs

NAACL 2025 Findings paper that detects prompt injection by tracking shifts in attention distribution. It requires no modification to the underlying model, so it can be deployed as a wrapper around any LLM.

Stars: 0

Forks: 0

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

Oct 2025 paper that systematically breaks 12 published defenses using gradient descent, reinforcement learning, random search, and human-guided exploration. Most of the defenses originally claimed near-zero attack success rates; the adaptive attacks exceeded a 90% success rate against most of them.

Stars: 0

Forks: 0

The Landscape of Prompt Injection Threats in LLM Agents (SoK)

Feb 2026 systematization-of-knowledge paper with a unified taxonomy covering attack payload strategies (heuristic vs. optimisation-based) and defense intervention stages (text, model, execution). It also introduces the AgentPI benchmark for context-dependent agent tasks, a setting that prior benchmarks ignored.

Stars: 0

Forks: 0