One overlooked fact about the wonders of LLMs is that they are inherently stochastic. With thousands or millions of people playing with them, some will, just as in any lottery, get a lucky sequence of numbers, and the model will do something interesting or surprising.
That is why they need to be evaluated like any other stochastic widget. Individual realizations are irrelevant unless they highlight a failure. Careful prompting to accomplish a given result is akin to a sophisticated way of pulling the lever of a one-armed bandit. Cherry-picking is abundant.
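To put rough numbers on the lottery point, here is a minimal Python sketch. Everything in it is an illustrative assumption: `toy_model` is a hypothetical stand-in for "sample the model once and check whether the output is impressive", and the 0.1% hit rate is made up, not measured. It contrasts the chance that somebody in a large crowd sees a lucky output with the success rate you get by sampling many times.

```python
import random


def chance_of_at_least_one_hit(p: float, n: int) -> float:
    """Probability that at least one of n independent tries succeeds,
    when each try succeeds with probability p."""
    return 1.0 - (1.0 - p) ** n


def estimate_success_rate(sample_once, n_samples: int, seed: int = 0) -> float:
    """Evaluate a stochastic system by sampling it many times and
    reporting the fraction of successes, not a single realization."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if sample_once(rng))
    return hits / n_samples


def toy_model(rng: random.Random) -> bool:
    """Hypothetical stand-in for one model sample: 'impressive' 0.1% of the time."""
    return rng.random() < 0.001


if __name__ == "__main__":
    # Some user in a crowd of a million almost surely sees a lucky output...
    print(chance_of_at_least_one_hit(0.001, 1_000_000))
    # ...while the measured success rate stays tiny.
    print(estimate_success_rate(toy_model, 20_000))
```

Under these assumptions the first number is essentially 1 and the second stays close to the made-up hit rate, which is the gap between a cherry-picked screenshot and an actual evaluation.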
@filippie509 Interesting. This is the ‘infinite monkeys banging away on typewriters to reproduce Shakespeare’ scenario!
@filippie509 How can their performance be evaluated, anyway? The output is a string of ASCII characters that needs a human to interpret it. It is inherently a toy, like a kaleidoscope.