On Tuesday, OpenAI announced the arrival of GPT-4, its latest language-model iteration, which seemingly outperforms most human exam takers.
According to OpenAI, the improved model shows “human-level performance on various professional and academic benchmarks.” GPT-4 doesn’t merely pass a simulated exam; it achieves marks in the top 10% of test takers, whereas its predecessor GPT-3.5 scored in the bottom 10%.
The improved language model also performs well on other exams, such as SAT Math (700 out of 800). Still, it doesn’t excel universally, scoring only 2 out of 5 on AP English Language and Composition.
One consideration is that OpenAI’s GPT series is, at bottom, a set of engines that more or less regurgitate already-published material. Each model is trained on human-created data sets and reassembles them to address a query. Sometimes the answers are correct; other times they’re wrong. Recalling details for exams might not seem overly impressive for a machine, but perhaps the language model’s performance provides more damning indirect criticism of education.
OpenAI CEO Sam Altman acknowledged the engine’s limitations, saying: “it is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.”
Notably, GPT-4 is no longer merely a language model but a multi-modal model. It is designed to accept queries via text and image inputs and to answer with text-based returns. Initially, it is available to wait-listed GPT-4 API users and ChatGPT Plus subscribers in a text-only capacity; image-based input is still being refined.
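For developers coming off the waitlist, text-only access works through OpenAI’s chat-style API. A minimal sketch of assembling such a request follows; the field names and the `gpt-4` model identifier reflect OpenAI’s chat-message conventions as commonly documented, but treat the exact shape as an assumption, and note that no network call is made here:

```python
import json

def build_gpt4_request(prompt: str) -> dict:
    """Assemble a text-only chat request payload for the GPT-4 API.

    Illustrative only: field names and the model identifier are
    assumptions based on OpenAI's chat API conventions, not taken
    from the article above.
    """
    return {
        "model": "gpt-4",  # assumed model identifier
        "messages": [
            # Chat-style APIs interleave role-tagged messages.
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }

payload = build_gpt4_request("Summarize this blog post in one sentence.")
print(json.dumps(payload, indent=2))
```

In practice this payload would be sent with an authenticated HTTP POST; image inputs, once released, would presumably extend the `messages` content rather than replace it.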
Despite the addition of visual inputs, OpenAI is not being transparent about how its model was created. The startup has made the decision not to provide information regarding its size, its training methodology, or the data used in the process.
This is somewhat worrisome, and also in keeping with Peter Thiel’s characterisation of artificial intelligence: if ‘crypto’ is somewhat ‘libertarian’, then AI is certainly somewhat ‘communist’. The apple really doesn’t fall far from the tree.
In its technical paper, the company said: “given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”
Greg Brockman, president and co-founder of OpenAI, used a live YouTube video stream to compare GPT-4 and GPT-3.5. Brockman asked the models to summarize the OpenAI GPT-4 blog post in a single sentence with all of the words beginning with the letter “G.”
GPT-3.5 didn’t attempt the feat. GPT-4, meanwhile, returned something of a gimmicky answer: “GPT-4 generates ground-breaking, grandiose gains, greatly galvanizing generalized AI goals.” When Brockman instructed the model that using the term ‘AI’ doesn’t count, the revised response was all the more G-laden.
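Brockman’s constraint is easy to verify mechanically. A small sketch (the helper name is hypothetical, not from the demo) that checks whether every word in a sentence begins with a given letter:

```python
import string

def all_words_start_with(sentence: str, letter: str) -> bool:
    """Return True if every word in the sentence begins with `letter`.

    Punctuation is stripped before checking, so "ground-breaking,"
    is treated as a single word starting with 'g'.
    """
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    return all(w.lower().startswith(letter.lower()) for w in cleaned.split())

print(all_words_start_with(
    "GPT-4 generates ground-breaking, grandiose gains, "
    "greatly galvanizing generalized AI goals.",
    "G",
))  # → False, because "AI" breaks the pattern, as Brockman noted
```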
He also set up the bot to examine 16 pages of US tax code and determine the standard deduction for a married couple, Alice and Bob, with particular financial conditions. OpenAI’s model provided the correct answer along with an explanation of the calculations involved.
The new iteration appears to be better at handling image inputs as well as text-based ones, which opens the door to more sophisticated input modes down the line, such as speech recognition. Supposedly, GPT-4 should also be less likely to go off the deep end than its predecessors.
“We’ve spent six months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results on factuality, steerability, and refusing to go outside of guardrails,” the organisation says.
The ‘far from perfect’ safety levels are certainly something users are familiar with; both Microsoft and Google have had rocky debuts with their own models. OpenAI even acknowledges these issues, noting that GPT-4 “hallucinates facts and makes reasoning errors” like its predecessors, but insists the model does so to a lesser extent.
“While still a real issue, GPT-4 significantly reduces hallucinations relative to previous models (which have themselves been improving with each iteration),” the company explains. “GPT-4 scores 40 percent higher than our latest GPT-3.5 on our internal adversarial factuality evaluations.”
Despite ongoing concern about AI risks, there’s a rush to bring AI models to market. On the same day GPT-4 arrived, Anthropic, a startup formed by former OpenAI employees, introduced its own chat-based helper, Claude, for handling text summarization and generation, search, Q&A, coding, and more. Claude is likewise available only via a limited preview as of this week.