I also lol'ed at "GPT-4 was evaluated on a variety of exams originally designed for humans": They seem to think this is a point of pride, but it's actually a scientific failure. No one has established the construct validity of these "exams" vis-à-vis language models.

For more on missing construct validity and how it undermines claims of 'general' 'AI' capabilities, see:

datasets-benchmarks-proceeding

>>