OpenAI announced its newest artificial intelligence model, called GPT-4o, which will soon power some versions of the company’s ChatGPT product. The upgraded ChatGPT can swiftly respond to text, audio and video inputs from its real-time conversational partner – all while speaking with inflections and wording that convey a strong sense of emotion and personality.
The company demonstrated the emotional mimicry of the new voice mode during a livestreamed OpenAI presentation on 13 May, featuring both the ChatGPT mobile app and a new desktop app. Speaking in a female-sounding voice and responding to the name ChatGPT, the new AI’s conversational capabilities seemed more akin to the personable AI voiced by Scarlett Johansson in the 2013 science fiction film “Her” than to the more canned and robotic responses of typical voice assistant technologies.
“The new GPT-4o voice-to-voice interaction more closely parallels human-human interaction,” says Michelle Cohn at the University of California, Davis. “A big part of this is the short lag times… but an even bigger part is the level of emotional expressiveness the voice generates.”
During a conversation with company CTO Mira Murati and two other employees, the GPT-4o-powered ChatGPT coached OpenAI’s Mark Chen through his heavy, rapid breathing, saying “Whoa, slow down, you’re not a vacuum cleaner” before suggesting a breathing exercise. The AI also visually examined a drawing by OpenAI’s Barret Zoph, which included words and a heart, responding in gushing tones: “Aw, I see you wrote I love ChatGPT, that is so sweet of you.”
The new ChatGPT also verbally instructed its conversational partners on solving a simple linear equation, explained the function of computer code and interpreted a chart showing temperature lines peaking in the summer months. When prompted, the AI even retold a made-up bedtime story several times while switching between increasingly dramatic narrations and singing the ending.
The new voice mode will first become available to paid ChatGPT Plus subscribers in the coming weeks, said Sam Altman, CEO and co-founder of OpenAI, in a post on the platform X.
ChatGPT was able to recover conversationally even from the occasional technical glitch. When asked to interpret the facial expression and emotions in a selfie of OpenAI’s Zoph, the AI initially described a wooden surface from a previous image, before being prompted to evaluate the latest one.
“Ahh, there we go – it looks like you’re feeling pretty happy and cheerful with a big smile and a touch of excitement,” said ChatGPT. “Whatever is going on, it looks like you’re in a good mood. Care to share the source of those good vibes?”
When told that it was because the live demo with ChatGPT was showcasing how “useful and amazing you are”, the AI responded, “Stop it, you’re making me blush.”
But Murati acknowledged that the updated version of ChatGPT powered by GPT-4o – which the company says will eventually be made available to even free ChatGPT users – comes with new safety risks because of how it incorporates and interprets real-time information. She said that OpenAI has been working on building in “mitigations against misuse”.
“Having seamless multimodal conversations is really difficult, so the demos are impressive,” says Peter Henderson at Princeton University in New Jersey. “But as you add more modalities, safety becomes much more difficult and important – it will likely take some time to identify potential safety failure modes with such an expansion of inputs that the model makes use of.”
Henderson also described himself as “curious” to see OpenAI’s privacy terms once ChatGPT users start sharing input such as live audio and video, and whether free users can opt out of data collection that may be used to train future OpenAI models.
“Since the model appears to be hosted off-device, the fact that you could be sharing your desktop screen with the model over the internet or continually recording audio or video seems to scale up the challenge for this particular product launch, if the plan is to store and use that data,” says Henderson.
A more anthropomorphised AI chatbot also poses another risk: research by Cohn and her colleagues suggests that a bot that can fake empathy through voice conversations may sound more personable and persuasive to people. That could leave users more inclined to trust inaccurate information and prejudiced stereotypes generated by large language models such as GPT-4.
“This has important implications for how people both search for and receive guidance from large language models, particularly as they do not always generate accurate information,” says Cohn.