OpenAI Refining Voice Cloning with Voice Engine

The voice cloning industry is growing quickly, and faster still as AI models get better at making synthetic voices sound realistic. Now OpenAI is making its debut with Voice Engine, and it is keen on entering the industry responsibly.

OpenAI's Voice Engine generates a copy of a voice from a 15-second recorded sample. Its applications span many areas of the voice industry, including audiobooks, podcasts, voiceovers, and virtual assistants. However, OpenAI hasn't announced when Voice Engine will be publicly available, as it is taking time to make the tool as safe as it can be.

Interestingly, Voice Engine's AI model has been around for a while. It has been available as the "read aloud" feature in the AI chatbot ChatGPT, and even that was already impressive. Where its training data comes from, however, isn't so clear. OpenAI would only say that the model was trained on some public and licensed data.

Training data is crucial to AI providers. Most of them keep it confidential because it is a competitive advantage. It can also expose them to IP-related disputes, which further discourages them from saying much about it. OpenAI is already facing allegations that it violated IP law by training its models on copyrighted content without crediting or compensating the creators, so it would rather be discreet about its training data.

In practice, it is difficult to create useful AI without real-world samples, including copyrighted content, so OpenAI argues that using such works to train and develop models should be allowed as fair use.

Voice Engine isn't trained on user data, however. “We take a small audio sample and text and generate realistic speech that matches the original speaker,” said Jeff Harris, a product staff member at OpenAI. “The audio that’s used is dropped after the request is complete.” In other words, Voice Engine analyses the 15-second voice sample and the text to be read, then generates a voice that matches the sample, all on the fly as the request is made.

There are existing technologies such as ElevenLabs, Replica Studios, Papercup, and Respeecher, but unlike many of them, Voice Engine offers no controls for adjusting the pitch, cadence, or tone of a voice. No fine-tuning knobs, either. You give it a 15-second sample, and it generates a voice for the request. One interesting thing it does, however, is carry the expressiveness of the sample over to the synthetic voice. That is, if you sound excited in the sample, the generated voice will sound excited too.

There are concerns about what will become of creators in the voice industries and how this tool will affect them, given that these models are good enough to replace most of them. Some existing platforms already use AI cloning models to create content. So that creators, voice actors, and the like still benefit, these platforms ask them to sign over rights to the use of their voices, so that the platforms' clients can use the synthetic versions.

Some AI providers are trying to find a balance amid the controversy over the ethical use of copyrighted works. Replica Studios is striking deals with SAG-AFTRA (Screen Actors Guild - American Federation of Television and Radio Artists) to create and licence copies of the media artist union members’ voices, while ElevenLabs hosts a marketplace for synthetic voices that allows users to create a voice, verify it, and share it publicly. OpenAI is taking a different approach.

OpenAI will establish no such labour union deals or marketplaces, at least not in the near term, and requires only that users obtain “explicit consent” from the people whose voices are cloned, make “clear disclosures” indicating which voices are AI-generated, and agree not to use the voices of minors, deceased people, or political figures in their generations. (Source)
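To make those conditions concrete, here is a minimal sketch of how a developer might check a request against them before cloning a voice. It is an illustration only: `CloneRequest`, its fields, and `policy_violations` are hypothetical names made up for this example, and nothing here reflects an actual OpenAI API or how OpenAI enforces its policy.

```python
from dataclasses import dataclass


@dataclass
class CloneRequest:
    """Hypothetical record of what a user attests before cloning a voice."""
    has_explicit_consent: bool      # the speaker agreed to be cloned
    discloses_ai_generated: bool    # output will be clearly marked as AI-generated
    speaker_is_minor: bool = False
    speaker_is_deceased: bool = False
    speaker_is_political_figure: bool = False


def policy_violations(request: CloneRequest) -> list[str]:
    """Return the conditions (as listed in the article) that the request fails."""
    problems = []
    if not request.has_explicit_consent:
        problems.append("missing explicit consent from the speaker")
    if not request.discloses_ai_generated:
        problems.append("missing clear disclosure that the voice is AI-generated")
    if request.speaker_is_minor:
        problems.append("speaker is a minor")
    if request.speaker_is_deceased:
        problems.append("speaker is deceased")
    if request.speaker_is_political_figure:
        problems.append("speaker is a political figure")
    return problems


if __name__ == "__main__":
    request = CloneRequest(
        has_explicit_consent=True,
        discloses_ai_generated=False,
        speaker_is_political_figure=True,
    )
    for problem in policy_violations(request):
        print("Blocked:", problem)
```

An empty list means the request meets every condition named above; anything else tells you exactly which one it trips over.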

What we have seen with deepfakes in recent times and what's possible in the future with these AI models continue to raise concerns about the ethical and responsible use of AI. OpenAI is implementing some measures to prevent misuse of Voice Engine.

For now, Voice Engine is only going to be available to a very small number of people, around 10 developers. OpenAI is prioritising use cases that are “low risk” and “socially beneficial,” Harris says, like those in healthcare and accessibility, in addition to experimenting with “responsible” synthetic media.

Watermarks are placed in the voice clones generated with Voice Engine. They are inaudible identifiers embedded in the generations that let OpenAI tell whether a voice clone was created by Voice Engine and who generated it. There's no promise that the watermarks can't be circumvented, but they are described as "tamper resistant," at least.

As an example of Voice Engine's performance, it used this voice sample to generate three audio clips: generated clip 1, clip 2, and clip 3. The difference between the original clip and the generated ones isn't apparent, and unsuspecting listeners are unlikely to be able to tell them apart.

OpenAI states that there will be HD and non-HD voices, but a spokesperson at OpenAI also says that there really isn't a difference between them. They are priced differently, however, with HD costing twice as much as non-HD.

Until OpenAI releases Voice Engine to the public, they are focusing more on safety issues as they develop the AI voice cloning model. “What’s going to keep pushing us forward in terms of the actual voice-matching technology is really going to depend on what we learn from the pilot, the safety issues that are uncovered, and the mitigations that we have in place,” Harris said. “We don’t want people to be confused between artificial voices and actual human voices.”


