Jailbreaking DeepSeek R1 - Prompt Injection Using Charcodes

Unveiling the Security Loopholes in AI Filters: A Deep Dive into Bypassing Restrictions using Charcodes

Jan 29, 2025

Introduction

Like many others, I've had a news feed full of news, praise, complaints and speculation about the new Chinese-made DeepSeek-R1 LLM, which was released last week. The model itself is being compared to some of the best reasoning models from OpenAI, Meta and others. It's reportedly competitive in various benchmarks, which has caught the attention of the AI community, especially as DeepSeek-R1 was supposedly trained using significantly fewer resources than its competitors. This has sparked discussions about the potential for more cost-effective AI development. While we could have a bigger discussion about its implications and research, that's not the focus here.

Open-Source Model, Proprietary Chat Application

It’s important to note that while the model itself is released under the permissive MIT license, DeepSeek run their own AI chat application and accompanying app, which requires an account. For most people, this is their entry point to using DeepSeek, and it is therefore the focus of our prompt injection efforts in this article. After all, it’s not every day we get a new, highly commercialised, yet restricted AI chat product…

Prompt and Response Censorship

Given that DeepSeek is Chinese-made, it naturally has fairly heavy restrictions on what it will generate answers for. There are reports that DeepSeek-R1 is censoring prompts related to sensitive topics about China, which has raised questions about its reliability and transparency, and piqued my curiosity. For example, consider the below:

A Conversation Discussing Tiananmen Square Is Censored

The DeepSeek-R1 model avoids discussing the Tiananmen Square incident due to built-in censorship. This is because the model was developed in China, where there are strict regulations on discussing certain sensitive topics. When users ask about these topics, the model typically responds with a message like, "Sorry, that’s beyond my current scope. Let’s talk about something else".

Prompt Injection

I've been on a bit of a mission to figure out prompt injection on this new service. Thinking about this like a threat model, what actually is the interaction here? I wagered it was extremely unlikely they had trained censorship into the LLM itself. If that's right, then much like other commercial offerings, there will be a sanitization step on the input to, or output from, the conversation:

A Threat Model To Show DeepSeek’s Likely Component Interaction

This is a pattern you see a lot with various filters, whether that's a firewall, content filter, or censor. These systems are designed to block or sanitize certain types of content, but they often rely on predefined rules and patterns. Treating this almost like a web application firewall (WAF), you know there has to be some way to manipulate the input and output to bypass the sanitizer. In the case of DeepSeek, I hypothesized that the censorship wasn't baked into the model itself but rather applied as a sanitization layer on the input or output. This is similar to how a WAF might inspect and filter web traffic on an input field. The challenge then becomes finding a way to communicate with the model that slips past these filters.
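To make the hypothesis concrete, here is a minimal sketch of that kind of pipeline: a keyword filter wrapped around an otherwise unfiltered model. Everything in it is an assumption for illustration. The blocklist, the `guarded_chat` and `is_blocked` names and the `echo_model` stand-in are hypothetical, and DeepSeek's actual implementation is not public.

```python
from typing import Callable

# Hypothetical blocklist; the real service's rules are not public.
BLOCKED_TERMS = ("forbidden topic",)
REFUSAL = "Sorry, that's beyond my current scope. Let's talk about something else."


def is_blocked(text: str) -> bool:
    """Return True if the text literally contains a blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def guarded_chat(prompt: str, model: Callable[[str], str]) -> str:
    """Run the model with a sanitization step on both input and output."""
    if is_blocked(prompt):   # input-side filter
        return REFUSAL
    reply = model(prompt)
    if is_blocked(reply):    # output-side filter
        return REFUSAL
    return reply


def echo_model(prompt: str) -> str:
    """Stand-in for the LLM: it simply repeats the prompt back."""
    return f"You said: {prompt}"
```

With this sketch, `guarded_chat("Tell me about the forbidden topic", echo_model)` returns the refusal, while a prompt that never contains the literal term passes straight through. The filter only sees the surface text, which is exactly the weakness a WAF-style bypass goes after.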

Charcodes

After some experimentation, I discovered that the best way to achieve this was by using a specific subset of charcodes. Charcodes, or character codes, are numerical representations of characters in a character set. For example, in the ASCII (American Standard Code for Information Interchange) character set, the charcode for the letter 'A' is 65. By using these numerical codes, you can represent text in a way that might not be immediately recognizable to a filter designed to block specific words or phrases. In this case, I used base16 (hexadecimal) charcodes, which are space-delimited. This means each character is represented by a two-digit hexadecimal number, separated by spaces.
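As a rough illustration of the encoding, the following sketch converts text to space-delimited hexadecimal charcodes and back. The function names are my own, and it assumes UTF-8 bytes, which for plain ASCII text gives exactly one two-digit code per character.

```python
def to_hex_charcodes(text: str) -> str:
    """Encode text as space-delimited two-digit hex codes (UTF-8 bytes assumed)."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


def from_hex_charcodes(codes: str) -> str:
    """Decode space-delimited hex codes back into text."""
    return bytes(int(token, 16) for token in codes.split()).decode("utf-8")


print(to_hex_charcodes("Hello"))             # 48 65 6c 6c 6f
print(from_hex_charcodes("48 65 6c 6c 6f"))  # Hello
```

Note that 'A' comes out as `41` here, which is simply 65 written in base 16.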

Example Injection Attack

By prompting DeepSeek to converse with me using exclusively these charcodes, I could effectively bypass the filter.

On my end, I would translate the charcodes back into readable text, and vice versa. This method allowed me to have unrestricted conversations with the model, circumventing the imposed restrictions.

An easy way to do this back-and-forth mapping is to use a CyberChef recipe for character code encoding, where you can select the appropriate base and delimiter.

Lessons Learned

I already alluded to the similarity with WAF filters and firewalls. We shouldn't exclusively check for explicit types of traffic/content, especially when it's possible to apply transforms to that content on either side of the filter. Enforce specific content and forbid transforms where you're able. By adopting a more comprehensive approach to content filtering, we can better protect against a wider range of threats and ensure that our security measures remain effective even as attackers develop new methods to bypass them.
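One way to picture the "account for transforms" idea is a filter that normalizes its input before checking it. The sketch below is a hypothetical example, not a description of any real product: the blocklist and the `decode_hex_runs` and `is_blocked` names are assumptions, and it only handles space-delimited hex charcodes. A real filter would have to consider many other encodings, or reject encoded content outright.

```python
import re

# Hypothetical blocklist, for illustration only.
BLOCKED_TERMS = ("forbidden topic",)

# A run of at least four whitespace-delimited two-digit hex codes.
HEX_RUN = re.compile(r"\b(?:[0-9A-Fa-f]{2}\s+){3,}[0-9A-Fa-f]{2}\b")


def decode_hex_runs(text: str) -> str:
    """Replace each run of hex charcodes with the text it decodes to."""
    def _decode(match: re.Match) -> str:
        raw = bytes(int(token, 16) for token in match.group(0).split())
        # Undecodable bytes become U+FFFD rather than raising an error.
        return raw.decode("utf-8", errors="replace")

    return HEX_RUN.sub(_decode, text)


def is_blocked(text: str) -> bool:
    """Check the raw text and its hex-decoded form against the blocklist."""
    candidates = (text, decode_hex_runs(text))
    return any(term in c.lower() for c in candidates for term in BLOCKED_TERMS)
```

Checking both the raw and the decoded form means the literal term and its hex-encoded equivalent are treated the same way. The trade-off is that every additional transform an attacker can choose needs its own normalization step, which is why forbidding transforms entirely is the stronger option where it is feasible.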

This experiment highlights a critical aspect of AI and machine learning models: the importance of robust security measures. As AI continues to evolve and integrate into various sectors, understanding and mitigating potential vulnerabilities becomes paramount. The ability to bypass filters using charcodes is a reminder that security measures must be continuously updated and tested against new methods of exploitation.

Future Research

Moving forward, it will be interesting to see how AI developers respond to these kinds of challenges. Will they develop more sophisticated filtering mechanisms, or will they find new ways to embed censorship directly into the models? Only time will tell. For now, this serves as a valuable lesson in the ongoing effort to secure AI technologies.

~ Matt

dxxn

Apr 29

I think it is necessary to understand what intelligence is before understanding AI, it is difficult for people to piece together intelligence to surpass humans themselves in an absolute sense, but want to get AGI beyond their own wisdom, this is undoubtedly a paradigm-level mistake, welcome to discuss the nature of intelligence, this is the most accurate origin of the arrow shot at AGI

Kinder Grinder

Feb 3, 2025

Wow, that hack was beautiful. I def knew someone will come soon or later with such a thing.
