Race for High-Quality Data Continues

Some time ago, I watched the presentation of the GPT-4o model from OpenAI. If you haven't seen it, it might be interesting for you.

Earlier today, I was curious to see how OpenAI's GPT versions have progressed over time. Who better to ask than ChatGPT in this case? Here's what I got:

Version    Release Date
GPT-1      June 2018
GPT-2      November 2019 (partially released in February 2019)
GPT-3      June 2020
GPT-3.5    November 2022
GPT-4      March 2023
GPT-4o     May 2024
GPT-5      ?

We can see OpenAI had major releases every year or so until they hit GPT-3, which made them stand out. Then they took it slow with GPT-3.5, perhaps scared of their own success (or not being ready for it). ChatGPT tells me they have offered API access since GPT-3; I thought they only opened that up later, with GPT-3.5.

A few months later came GPT-4, which seems to me to have been worked on in parallel with GPT-3.5. Also note that GPT-4 came almost 3 years after GPT-3, whereas before, major releases came yearly.

The current version, GPT-4o, was released about a year after GPT-4.

From a ranking (based on community votes, apparently) that I found on Grok's X account (the xAI one), it looks like GPT-4o is still considered the number one LLM, with Google's Gemini the closest behind it. Since xAI bragged about the ranking on their account, it's no surprise their own model is doing well too, sitting in third place.

Strong competition is coming for OpenAI and their GPT models. We also hear that the next version, GPT-5, will be another game changer. The question is, with this intense competition, will OpenAI (or any of them) take the time to be cautious about where they lead this technology? I certainly doubt it. They are all pushing at full throttle, in my opinion.

OpenAI has the advantage of still holding a slight edge over the competition, but the BIG disadvantage of not owning massive amounts of data to train their models. Google in particular, Meta, and to some degree X don't have this issue. I wonder if X added long-form content for this very reason...

There is definitely a race to acquire more quality data to train these models. At the rate these LLMs are developing, it is predicted they will run out of quality data by the end of the decade or the early 2030s. There are workarounds after that, of course, but none as good as having access to massive amounts of quality data. By the way, "high-quality data" is the expression ChatGPT used when I asked it when LLMs will run out of training data.

I ran out of credits, otherwise I would have asked it to elaborate on the internet as a training source:

A key factor is the rate at which new, high-quality data is generated and curated. Currently, models are trained on vast datasets that include large portions of the internet, books, academic papers, and other sources. However, as LLMs grow larger, they require more data to achieve further improvements, and eventually, the available quality data may become a limiting factor.
