I also lol'ed at "GPT-4 was evaluated on a variety of exams originally designed for humans": They seem to think this is a point of pride, but it's actually a scientific failure. No one has established the construct validity of these "exams" vis-à-vis language models.

For more on missing construct validity and how it undermines claims of 'general' 'AI' capabilities, see:

datasets-benchmarks-proceeding

>>