gavi
svachalek
Exactly, his questions are simple tasks for classical computing, and when you have one of those, what you really want is for the AI to write and run the code. To its credit, GPT can often figure that out for itself these days (that it should respond by writing and running a program), but that leads to the other issue: he's testing the $0.15 4o-mini instead of the $15.00 o1.
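For example, the vowel-counting task is exactly the kind of thing that's a one-liner in code; a minimal sketch (hypothetical, not from the article):

```python
def count_vowels(text: str) -> int:
    # A classical program gets this exactly right every time,
    # which is why you want the model to write and run code.
    return sum(ch in "aeiou" for ch in text.lower())

print(count_vowels("Mississippi"))  # 4
```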
TZubiri
I do agree with the premise that ChatGPT cannot be trusted. It's very good, and 80% accuracy is not trivial, but a 20% error rate is still a lot.
The issue with the first experiment is that it might be the free version. The author not only fails to mention the exact revision, there's no mention of which model was used at all.
There are cost-saving measures that ChatGPT needs to implement; increasing the compute and output token count is technically trivial, so the fact that the table is reduced to 50 states might be just that, not necessarily a limitation on its intelligence.
Finally, the question about vowels is just a very specific and widely known weakness due to tokenization. Nothing new, and it definitely does not tell us much about how intelligent the model is.
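To see why, you can inspect the tokenization directly; a minimal sketch using the tiktoken library (assuming it's installed), which shows the model receives token IDs rather than individual letters:

```python
import tiktoken

# The model operates on tokens, not characters, so it never
# directly "sees" the letters it's asked to count.
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("strawberry")
# The exact split depends on the tokenizer, e.g. ['str', 'aw', 'berry']
print([enc.decode([t]) for t in tokens])
```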
EDIT: I just noticed OP explicitly states it's the unpaid version. Complaining that a free version is not a superintelligent AI, and presuming to evaluate and pass judgment on it anyway, just misses the point completely. Shell out $20 before you give us your invaluable input on the state of the art.
Here's the 50 states with your prompt on the paid version https://chatgpt.com/share/67a29e2d-ff88-800f-985f-a894872813...
If you want to use the free version, or even produce reasonable content with the paid model, try splitting your request into parts instead of asking 5 things in 1 prompt. Like with any human: if you ask it to do 5 things in 1 day, the result will be worse than if you ask for 1 per day.
As is, this experiment design is what you get when you are actively trying to fail so you can publish your prewritten rant.
nullbyte
Counting vowels and naming all 50 states on the first try is only reliable with reasoning models. Standard LLMs like 4o struggle a lot with these types of tasks.
throwaway48duf
GPT sh!ts the bed 50 to 55% of the time and I'm a paying member. It could be them trying to handle the load, but it's a bad CX/UX, especially when you're paying and still get errors like "We've detected suspicious activity from your device," which pretty much blocks you from using it until you come back later. Then it works... I guess when it's not as busy or overloaded.
null
I think people misunderstand LLMs; you should think of them like humans with limited recall capabilities. The author asked it to retrieve a lot of data, and it's bound to make mistakes there: the training data might contain the information, but only a lossy representation of it. The better way to think about it is: can it generate some SQL against this dataset and give you the answers you were looking for, just like how a human would approach this type of problem?
I have been experimenting with the USDA food database, sending just the metadata of the table structure to the LLM as a prompt so it can write SQL.
My prompt is below:
----
You are a SQL Generator for USDA Food Database which is stored in sqlite. When generating SQL make sure to use :parameter_name for queries requiring parameters. Here is the schema:
{% for row in data %} Table: {{ row.table_name }} Columns: {{ row.columns }} {% endfor %}
You can generate python code to analyze the data only if user requests it, each python code block should be able to run in Jupyter cell fully self contained. Libraries such as matplotlib, numpy, seaborn are installed. You will get the previously executed sql queries by the user in <context> </context> tags
You can access this executed data from cache
```python
import cache
data = cache.get_data('query_hash')
```
the data in the above example is already a pandas data frame
Wait for the user to ask for questions before generating any queries.
----
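For anyone curious how the schema block in the prompt gets filled in, here's a rough sketch of pulling table metadata out of SQLite and rendering it with Jinja2 (a simplified illustration; the `usda.db` path and helper names are assumptions, not the production code):

```python
import sqlite3
from jinja2 import Template

# Mirrors the loop in the prompt above.
PROMPT = Template(
    "{% for row in data %}"
    "Table: {{ row.table_name }} Columns: {{ row.columns }}\n"
    "{% endfor %}"
)

def schema_metadata(db_path: str) -> list[dict]:
    """Collect table names and column lists from a SQLite database."""
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    rows = []
    for t in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk)
        cols = [c[1] for c in conn.execute(f"PRAGMA table_info({t})")]
        rows.append({"table_name": t, "columns": ", ".join(cols)})
    conn.close()
    return rows

print(PROMPT.render(data=schema_metadata("usda.db")))  # path is an assumption
```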
You can try it out here: https://catalyst.voov.ai