Our new flagship model that can reason across audio, vision, and text in real time

Overview

  • GPT-4o is a flagship generative AI model that can process text, speech, and video. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation.

Features

GPT-4o accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs.

  • Language Capabilities: GPT-4o matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API.
  • Vision and Audio Understanding: GPT-4o is notably stronger at vision and audio understanding than existing models.
  • Availability: GPT-4o is available for free to all ChatGPT users, including those on the free plan.

Pre-GPT-4o Voice Mode

  • Latency: 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average

Process:

  1. Audio transcription to text using a simple model
  2. Text processing using GPT-3.5 or GPT-4
  3. Text-to-audio conversion using a simple model
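The three-stage pipeline above can be sketched in Python. This is a minimal illustration with hypothetical stand-in functions (`transcribe`, `generate_reply`, and `synthesize_speech` are not real API calls); the point is that the stages run strictly in sequence, so their latencies add up:

```python
import time

# Hypothetical stand-ins for the three separate models in the old pipeline.
# Each sleep simulates that stage's processing delay (illustrative numbers only).

def transcribe(audio: bytes) -> str:
    time.sleep(0.05)                      # stage 1: speech-to-text model
    return "hello"

def generate_reply(text: str) -> str:
    time.sleep(0.05)                      # stage 2: GPT-3.5 / GPT-4 text model
    return f"Reply to: {text}"

def synthesize_speech(text: str) -> bytes:
    time.sleep(0.05)                      # stage 3: text-to-speech model
    return text.encode()

def voice_mode(audio: bytes) -> bytes:
    # Total latency is the sum of all three stages -- the source of the
    # 2.8 s / 5.4 s average delays quoted above.
    start = time.monotonic()
    reply_audio = synthesize_speech(generate_reply(transcribe(audio)))
    print(f"total latency: {time.monotonic() - start:.2f}s")
    return reply_audio

audio_out = voice_mode(b"...")
```

Because each stage is a separate model, information that never reaches the text stage (tone, speaker identity, background sound) is simply lost, which motivates the single end-to-end model described below.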

Limitations:

  • GPT-4 loses information about tone, multiple speakers, and background noises
  • Can’t output laughter, singing, or express emotion

GPT-4o’s approach:

  • Trained end-to-end across text, vision, and audio modalities
  • Single neural network processes all inputs and outputs

GPT-4o offers a significant improvement over the previous Voice Mode, with much faster response times and the ability to process multiple modalities simultaneously. This allows for more natural and human-like interactions, but there is still much to be discovered about its capabilities and limitations.


Model evaluations


GPT-4o’s Reasoning Capabilities

  • Achieves a new high score of 88.7% on 0-shot CoT MMLU (general knowledge questions)
  • Sets a new high score of 87.2% on the traditional 5-shot no-CoT MMLU
  • Evaluated using the new simple-evals library

Note: Llama3 400b is still in training, and its results are not yet available for comparison.
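At its core, an MMLU-style evaluation is just accuracy over multiple-choice questions; "0-shot CoT" means no worked examples are given, but the prompt asks the model to reason before answering. A minimal sketch of such a loop (`ask_model` is a hypothetical stand-in for a real model call, and simple-evals itself may be structured differently):

```python
# Minimal sketch of a multiple-choice accuracy eval in the MMLU style.
# `ask_model` is a hypothetical stand-in; a real harness calls the model API
# and parses the answer letter out of the model's response.

QUESTIONS = [
    {"q": "2 + 2 = ?",
     "choices": ["A) 3", "B) 4", "C) 5", "D) 6"], "answer": "B"},
    {"q": "Capital of France?",
     "choices": ["A) Rome", "B) Berlin", "C) Paris", "D) Madrid"], "answer": "C"},
]

def ask_model(prompt: str) -> str:
    # Stand-in that always answers correctly, to keep the sketch runnable.
    return "B" if "2 + 2" in prompt else "C"

def evaluate(questions) -> float:
    correct = 0
    for item in questions:
        # 0-shot CoT: no examples in the prompt, but reasoning is requested.
        prompt = (f"{item['q']}\n" + "\n".join(item["choices"])
                  + "\nThink step by step, then answer with a single letter.")
        if ask_model(prompt) == item["answer"]:
            correct += 1
    return correct / len(questions)

score = evaluate(QUESTIONS)
print(f"accuracy: {score:.1%}")
```

The reported 88.7% is this accuracy computed over the full MMLU question set.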

Model safety and limitations

GPT-4o Safety Features
  • Designed with safety in mind across modalities
  • Techniques used: filtering training data, refining model behavior through post-training
  • New safety systems created for voice outputs
  • Evaluated according to Preparedness Framework and voluntary commitments
  • Risk assessments: cybersecurity, CBRN, persuasion, and model autonomy all scored Medium or below
  • External red teaming with 70+ experts to identify risks
  • Safety interventions implemented to mitigate risks
  • Novel risks associated with audio modalities being addressed
  • Limited audio outputs at launch, with further details to be shared in system card

Limitations of GPT-4o

  • Observed limitations across all modalities (examples to be shared)
  • While limitations exist, the developers are working to address these and ensure the safe use of GPT-4o.

Model availability

  • GPT-4o is now available in ChatGPT, with text and image capabilities rolling out today
  • Free tier and Plus users (with 5x higher message limits) can access GPT-4o
  • Voice Mode with GPT-4o will be available in alpha within ChatGPT Plus in the coming weeks
  • Developers can access GPT-4o in the API as a text and vision model
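As a text-and-vision model in the API, GPT-4o takes chat messages whose content can mix text parts and image parts. The sketch below only builds the request payload (no network call is made); the shape follows the Chat Completions format at the time of writing, so treat the details as an assumption to verify against the current API reference:

```python
# Build a multimodal (text + image) Chat Completions request payload.
# No network call here; the message shape follows the OpenAI Chat Completions
# format as of GPT-4o's launch -- check the current API docs before relying on it.

payload = {
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
}

print(payload["model"])
```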

GPT-4o offers:

  • 2x faster response times
  • Half the price of GPT-4 Turbo
  • 5x higher rate limits
  • Support for audio and video capabilities will be launched for a small group of trusted partners in the API in the coming weeks
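The "half the price" claim can be made concrete with a small cost calculation. The per-token prices below are assumptions for illustration (launch-era list prices per 1M tokens; check current pricing):

```python
# Illustrative cost comparison. Prices are assumed launch-era list prices
# (USD per 1M tokens), used only to show the "half the price" relation.

PRICES = {                      # (input, output) USD per 1M tokens
    "gpt-4-turbo": (10.00, 30.00),
    "gpt-4o": (5.00, 15.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return inp * input_tokens / 1e6 + out * output_tokens / 1e6

# Example workload: 1M input tokens, 500k output tokens.
turbo = cost("gpt-4-turbo", 1_000_000, 500_000)
gpt4o = cost("gpt-4o", 1_000_000, 500_000)
print(f"GPT-4 Turbo: ${turbo:.2f}, GPT-4o: ${gpt4o:.2f}")
```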
