@jszym yeah, it's definitely not a completely accurate description. The vision models are even more prone to hallucination than the plain-text ones!