
Hmm as a very stupid first pass...

0. Generate an embedding of some text, so that you have a known good embedding, this will be your target.

1. Generate an array of random tokens the length of the response you want.

2. Compute the embedding of this response.

3. Pick a random sub-section of the response and randomize the tokens in it again.

4. Compute the embedding of your new response.

5. If the embeddings are closer together, keep your random changes; otherwise discard them and go back to step 3.

6. Repeat this process until returning to step 3 stops improving your score. You'll probably also want to shrink the size of the sub-section you're randomizing as your computed embedding gets closer to the target embedding. You might also be able to be cleverer with some kind of masking strategy: say the first half of your response already was the true text of the target embedding. An ideal randomizer would notice that randomizing the first half almost always makes the result worse, and so would target the second half more often (I'm hoping that embeddings work like this?).

7. Do this N times, then use an LLM to score the results and discard the worst N-1. I expect that 99.9% of the time this strategy basically produces adversarial examples.

8. Feed this last result into an LLM and ask it to clean it up.
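The loop in steps 1-6 can be sketched as a simple hill climb. This is a toy illustration, not a tested method: `embed` below is a hypothetical stand-in (a bag-of-tokens hash), where a real attempt would call an actual sentence-embedding model, and the window-shrinking schedule is just one plausible choice.

```python
import random

VOCAB_SIZE = 100  # toy vocabulary; a real tokenizer has tens of thousands
DIM = 16          # toy embedding dimension

def embed(tokens):
    # Stand-in embedding: hash each token into a bucket and L2-normalize.
    # Replace with a call to a real embedding model in practice.
    v = [0.0] * DIM
    for t in tokens:
        v[t % DIM] += 1.0
    norm = sum(x * x for x in v) ** 0.5 or 1.0
    return [x / norm for x in v]

def cosine(a, b):
    # Both vectors are unit-length, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def hill_climb(target_emb, length, steps=5000, seed=0):
    rng = random.Random(seed)
    # Step 1: random token array of the desired length.
    tokens = [rng.randrange(VOCAB_SIZE) for _ in range(length)]
    # Step 2: embed it and score against the target.
    best = cosine(embed(tokens), target_emb)
    for _ in range(steps):
        # Step 6's schedule: shrink the randomized window as the score improves.
        span = max(1, int(length * (1.0 - best)))
        start = rng.randrange(length - span + 1)
        # Step 3: re-randomize a random sub-section.
        candidate = tokens[:]
        for j in range(start, start + span):
            candidate[j] = rng.randrange(VOCAB_SIZE)
        # Steps 4-5: keep the change only if it moves us closer.
        score = cosine(embed(candidate), target_emb)
        if score > best:
            tokens, best = candidate, score
    return tokens, best

# Toy target: the embedding of a known "true" token sequence.
true_tokens = [7, 3, 42, 9, 18, 73, 5, 61]
target = embed(true_tokens)
recovered, score = hill_climb(target, length=len(true_tokens))
```

With the toy embedding this converges quickly because the score landscape is smooth; against a real embedding model the same loop would be far slower per step and much more likely to stall in adversarial local optima, which is what steps 7-8 try to paper over.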



We'd be happy to sponsor research on this topic. If interested, email me.



