@simon @shauna @hynek - only frontier models routinely find bugs with unit tests. 3.5 wrote vacuous tests in comparison to 4 or 4o
- once it fixed the bug via monkey patching before the test ran to make it pass (malicious compliance!)
- the bots write so many unit tests that after a while quantity becomes a quality all of its own & the value comes with the next change I make, I'll see how sensitive the rest of the app was to a change in any part of the app (which points out design flaws)