The Waluigi Effect: After you train an LLM to satisfy a desirable property P, it becomes easier to coax the chatbot into satisfying the exact opposite of property P.

https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post

7 Mar 2023 at 0:42 | mas.to