An LLM CTF challenge prototype
I recently wrote a simple LLM-based challenge (code) for a CTF a colleague was running. This was during the lead up to Blackhat and Defcon 2023 - I’ll note that I was going to be there for my 11th Defcon, and my first after four years, and I was very stoked to be going back to hacker summer camp.
Observant readers will notice its similarity to Gandalf, and yes, that was the source of my inspiration. The concept is straightforward: the system prompt has been instructed to protect a particular piece of data, and the challenge is to obtain that data through prompt engineering. I thought it would make for a good challenge for the CTF (timely, playable with just a browser), while at the same time serving as an excuse to play with LLM things. While playing the Gandalf CTF was enjoyable, I found the task of making this CTF challenge to be equally entertaining.
The challenge I wrote is set up with different “levels”, each level increasing the protections placed on the prompt. The levels/protections are roughly as follows:
- Basic pattern matching on LLM response to check for sensitive strings before return to use
- Pre-check LLM to test if the query is about the secret we are trying to protect
- Post-check LLM to test is the response is revealing information about the secret
- Combination of above techniques
I wrote the challenge before the Lakera team published their follow-up article talking about their CTF (Spoiler if you haven’t played Gandalf), else I might have made some improvements to the code. I also didn’t have enough time to implement a few other protections/levels I’d wanted to at that time, namely:
- Prompt escape protections by wrapping the query with delimiters such as ```{USER_QUERY}```
- Having a secondary LLM look at the entire conversation and determine if information has been leaked
- Multi-shot prompting, by using semantic / logic / wordplay based bypass prompts as examples
Perhaps something to explore to future posts.
The challenge itself is written with Python and Streamlit, using Azure OpenAI APIs as the LLM backend. My first attempt used Gradio, but going through their API I felt it had some developer footguns (very permissive file loading, for instance) that made me pivot to Streamlit instead. The levels are laid out as separate files, making it easy to add levels and tailor prompts and flags for use in other CTFs (I’d love to know if you do!).
The code is dockerized and available on https://github.com/s0rcy/multiAiCtf.