OpenAI Unveils GPT-Realtime Speech-To-Speech Model With Multimodal Support And Advanced Conversational Capabilities

2025-09-01 14:03:09

In Brief

OpenAI released the gpt-realtime speech-to-speech model with multimodal support, advanced conversational skills, and strong audio reasoning performance.

Artificial intelligence research organisation OpenAI announced the general availability of its Realtime API, now enhanced with features that allow developers and enterprises to build robust, production-ready voice agents. The API supports remote MCP servers, image inputs, and phone calling via Session Initiation Protocol (SIP), enabling more capable and context-aware voice applications.

Alongside the API, OpenAI has released its most advanced speech-to-speech model, gpt-realtime, designed to improve instruction following, function calling, and natural-sounding speech. The model can interpret complex prompts, switch languages mid-sentence, reproduce alphanumeric sequences accurately, and capture non-verbal cues. Two new voices, Cedar and Marin, are also available, offering more expressive and human-like intonation. Existing voices have been updated to incorporate these enhancements.

The Realtime API processes audio directly through a single model, reducing latency and preserving nuance, unlike traditional pipelines that chain separate speech-to-text and text-to-speech models. gpt-realtime has been trained in collaboration with users to excel in real-world applications such as customer support, personal assistance, and education. Benchmark evaluations show substantial improvements in reasoning, instruction adherence, and function calling accuracy compared to previous models.

Additional updates include asynchronous function calling, allowing long-running operations without interrupting ongoing conversations, further supporting seamless, production-ready voice experiences.

OpenAI Expands Realtime API With MCP Support, Image Inputs, SIP Integration, And Cost-Saving Controls For Voice Agents

OpenAI’s Realtime API now includes new features designed to simplify integration and expand capabilities for production-ready voice agents. Developers can enable remote MCP support by linking a session to an MCP server URL, allowing the API to manage tool calls automatically and access additional functionalities without manual setup.

The gpt-realtime model now supports image inputs, enabling the system to incorporate photos, screenshots, and other visuals alongside audio or text. This allows users to ask context-specific questions about what they see, while developers retain control over which images are shared and when.

Additional improvements include Session Initiation Protocol (SIP) support for connecting apps to phone networks and PBX systems, as well as reusable prompts that let developers save and deploy pre-configured instructions, tools, and example messages across multiple sessions.

The generally available Realtime API and gpt-realtime model are now accessible to all developers, with pricing reduced by 20% compared to the previous gpt-4o-realtime-preview. New controls for conversation context allow for smarter token management, reducing costs for long-running sessions. Documentation, a Playground for testing, and a Realtime API prompting guide are available to support developers in adopting these features.

GPT-3.69%

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

Reward
like
Comment
Repost
Share

Comment

0/400

No comments

MpostMediaGroup

Topic
#Double Rewards With GUSD
5k Popularity
#DOGE ETF Launch
9k Popularity
#My Top AI Coin
26k Popularity
#Gate Alpha New Listings
49k Popularity
#Altcoin Market Rebound
35k Popularity

Sitemap