AI Comes to Audiobooks

Could this be magic? Over the past decade, audiobook sales, driven by digital audio, have exploded. Sales in 2020 exceeded $1.3 billion, up 12% over 2019. The percentage of Americans 18 and older who have listened to an audiobook is now 46%, up from 44% in 2019. One thing hasn’t changed, however: the arduous production process for audiobooks.

But what if it could be done in a fraction of the time, weeks instead of months? And what if it could be done for a fraction of the cost, hundreds of dollars instead of thousands?

Introducing AI-enabled automated audiobook creation.

The value proposition

Producing audiobooks is expensive, so the appeal of automating audiobook creation is easy to understand. The traditional process can take two hours or more in the studio for one finished hour, and the average audiobook is eight hours long. A significant factor in the cost is the talent: brand-name talent is paid $1,000 or more per finished hour (PFH) of the audiobook, and with studio time and post-production added, the cost climbs quickly.
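As a rough illustration of how those figures combine, here is a minimal sketch using only the numbers quoted above (eight finished hours, two studio hours per finished hour, $1,000 PFH for talent). The function name and parameters are hypothetical, and because the article gives each figure as a floor ("or more"), the results are lower bounds rather than a real budget.

```python
def traditional_estimate(finished_hours=8,
                         studio_hours_per_finished_hour=2,
                         talent_rate_pfh=1000):
    """Return (studio_hours, talent_fee) for a traditionally produced audiobook.

    Defaults are the article's figures; all are stated as minimums, so the
    result is a floor. Studio rental and post-production are not included
    because the article gives no rates for them.
    """
    studio_hours = finished_hours * studio_hours_per_finished_hour
    talent_fee = finished_hours * talent_rate_pfh
    return studio_hours, talent_fee


studio_hours, talent_fee = traditional_estimate()
print(f"At least {studio_hours} studio hours and ${talent_fee:,} in talent fees")
```

With the defaults, that is at least 16 hours in the studio and $8,000 in talent fees before any other cost is counted.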

At the other end of the spectrum, audiobooks created with Amazon/Audible’s ACX (Audiobook Creation Exchange) can be priced with two models: royalty share or a fee. With a royalty share, the (often self-published) author finds a voice actor willing to invest their time and talent in the project, in exchange for perhaps 20% of the royalties. Outright fees are negotiable.

Findaway, another company with an ACX-style hybrid model, notes that the “average audiobook created with Findaway Voices has about 50,000 words and costs between $1,000 and $2,000.”

Wouldn’t it be great if publishers could get rid of “the talent” and reduce the long production cycles, push a button, and, presto, an instant audiobook, ready for sale? For smaller publishers and for authors with modest sales, it unquestionably would be. How about creating the audiobook for $500? Or less?

The value proposition for automated audiobook creation is a combination of cost and convenience, where convenience means both simplifying and vastly accelerating the production process. A few startups claim to be able to produce audiobooks in days or even hours, rather than months.

Digital voice technology

Today’s focus on the AI-enabled auto-narration of audiobooks is fostered by the same technology that taught Siri to speak and Alexa to listen. Hey Siri, you almost single-handedly created a demand for user-selected artificial voices.

“There’s been some really good breakthroughs in text to speech becoming more humanlike over the last couple of years,” says Kane Simms, founder of VUX World. “You only need to listen to some of the celebrity Alexa voices, including Samuel L. Jackson, Shaquille O’Neal, and Melissa McCarthy.”

Underneath digital voice audio is text—text converted to speech. Siri knows stuff because Siri is reading from Wikipedia. Google’s Assistant easily accesses the text of millions of books scanned and indexed via Google Books.

Text to speech (TTS) creates artificially generated audio that sounds like a person talking. The perfect application for this has been voicebots on smartphones and voice assistants in the home. The holy grail has been to make their voices indistinguishable from human voices. And not just one voice but, as we see (and hear) with Siri and Google Assistant, multiple voices.

With that as a goal, making this same technology work for long audio, such as audiobooks, seems like a logical next step. The problem has been that a voice that sounds lifelike when it announces the weather doesn’t necessarily sound so lifelike after an hour of narration. Or does it?

“To create a synthetic voice that is capable of holding human attention for long periods of time—that’s quite a challenge,” Simms says. “The way that an audiobook narrator will read an audiobook, for example, is totally different to how one would answer a simple question such as ‘What time is it?’ In a book, you might need a news reader voice, an excited voice, a sad voice, a slow voice, a fast voice, a high-pitched voice, and you might need all of this on a single page.”

Bradley Metrock, CEO of Project Voice and of Digital Book World, has a different perspective. “With the current crop of high-end synthetic voices, 95% of people would not recognize that they’re artificially generated,” he says. “In 12–24 months they’ll have reached human levels.”

What kind of books are best?

There’s a general consensus that this technology works best for narrative nonfiction, with its steady cadences. Narrative is important, because more complex nonfiction, sometimes heavily illustrated and full of charts and graphs, is all but impossible for the current generation of technology to handle smoothly. Nonetheless, two of the startups, DeepZen and Scribe, are specifically targeting fiction.

The talent issue

Actors are central to audiobooks; well-known actors are a draw in themselves. A great deal of skill goes into recording an audiobook that delights the ear. It’s not just reading the book out loud, nor is it, per se, like acting. Great audiobook narrators are in a class of their own.

The professionals in this business are represented by a powerful union, SAG-AFTRA (Screen Actors Guild–American Federation of Television and Radio Artists), which describes itself as “the world’s largest labor union representing performers and broadcasters.” The organization offers the full slate of union perks: training, guaranteed minimum rates for recordings, and health and life insurance. Artificial voices require none of these benefits.

The union has two concerns about AI. The first is that replacing live actors with computerized voices is bad for business. The second, and of great concern, is AI’s increasing capacity to clone human voices, with substantial risk that the voice owner will be either undercompensated or not paid at all.

Asked to comment on how AI could affect the demand for voice actors, a union representative responded: “Audiobook narration is a human storytelling enterprise, and by and large the amazing professionals who tell these stories feel strongly about human-to-human storytelling. But short of that, they want to be sure they are fairly compensated and that they have control over the use of the digitized voices created based upon their own. They also want their fans, the consumers, to be aware that they are purchasing a nonhuman performance, not one given by their favorite narrator.”

The various vendors mostly go out of their way to reassure customers of their great respect for human narration, while devoting their R&D efforts to making live humans an unnecessary, or at least optional, component of audiobook creation.

Can Google win this game?

High-quality TTS is a holy grail for Google, as it is for most of the other big tech players, including Amazon, Apple, Facebook, IBM, and Microsoft. All of them see voice interfaces as central to the future of their software platforms. While the major use case is voice assistants, these inevitably evolve toward meatier voice challenges, like reading articles on websites, voicing podcasts, and then, just as inevitably, narrating video or book-length content.

Companies can in fact build their own TTS platforms for little or no cost using some of the underlying technologies from these platforms. The new AI company Speechki was built in part this way, which seems to make the most sense—use the best of what’s already out there, further enhance it for the requirements of long-form audio, and then focus on the specific needs of book publishers and their authors.

Google has been rolling out a plethora of voice-related software: TTS, speech to text (transcription), and, most recently, Translatotron 2, its speech-to-speech translation (S2ST) software, which combines speech recognition, machine translation, and foreign-language speech synthesis of the translated text.

Last fall Google profiled an experiment called “Convert PDFs to Audiobooks with Machine Learning” (which can be seen on the Google Cloud Tech YouTube channel) that not only parsed a scientific article with complex page formatting but also read the article aloud using DeepMind WaveNet, Google’s TTS software.

Simms thinks it’s possible Google could become a player in the AI audiobook market. “The advantage that Amazon and Google have is immense resources and the ability to gather large amounts of training data,” he says. “However, fundamentally, Amazon and Google are cloud service providers, and most of their technology will become part of their cloud offerings.”

“They’ve got the money,” Metrock says. “They can do whatever they want. You’re going to see Amazon and Google doing something in this space.”

The Audible problem

ACX is the Audible audiobook self-publishing platform, and the ACX Audio Submission Requirements section of its website includes the following warning, presented as guidance: “Your submitted audiobook must be narrated by a human. TTS recordings are not allowed. Audible listeners choose audiobooks for the performance of the material, as well as the story. To meet that expectation, your audiobook must be recorded by a human.”

With Audible controlling as much as 50% of the audiobook market (depending on the type of content), its current voice policy is a major concern for all of the companies looking to get into the narration field.

Taylan Kamis, CEO of DeepZen, says, “We see Audible accepting these titles as a ‘when’ question rather than an ‘if’ one, as the technology develops and becomes more mainstream.” DeepZen advises its customers to “take a multiyear view” and have as many titles available for distribution as possible when Audible changes its position.

No other vendor has the same restrictions as Audible, so publishers can sell content produced with AI-generated voices through some 50 retail and library vendors, including Apple Books, Google Play, Kobo, OverDrive, Scribd, Spotify, and Storytel.

The vendors

In this nascent field of automated audio conversion, two vendors appear to have moved ahead of the others: DeepZen and Speechki. There are several right behind them—and in other cases, off to the side, with variant voice technologies and services.

Interestingly, the startups devoted to rendering English-language books into English-language audiobooks are located just about anywhere other than the U.S.: Boden, Sweden; Islamabad, Pakistan; Kiev, Ukraine; and Russia, which is home to a few, including one in Siberia.

DeepZen

DeepZen was established in London in 2018, making it one of the older players in the field. It offers a self-serve portal: upload the book (EPUB format only) and the software will estimate the length and quote a price (roughly $160 PFH). With an average book estimated at 50,000 words, or about five finished hours at that rate, the price would be $800.
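The arithmetic behind that quote can be sketched in a few lines. This is an illustration only, not DeepZen's actual pricing logic: the function name is hypothetical, and the narration pace of 10,000 words per finished hour is an assumption back-calculated from the article's own figures (50,000 words costing $800 at $160 PFH implies five finished hours).

```python
def estimate_quote(word_count,
                   rate_per_finished_hour=160.0,
                   words_per_finished_hour=10_000):
    """Estimate a per-finished-hour (PFH) price from a manuscript word count.

    rate_per_finished_hour: the article's rough DeepZen figure ($160 PFH).
    words_per_finished_hour: assumed pace, inferred from the article's
    50,000-word / $800 example; real narration speeds vary.
    """
    finished_hours = word_count / words_per_finished_hour
    return finished_hours * rate_per_finished_hour


print(estimate_quote(50_000))  # the article's average book
```

Changing either default shows how sensitive the quote is to the assumed reading pace: a slower voice means more finished hours and a higher price for the same manuscript.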

It’s far from “automatic.” After the publication file is loaded, the following steps take place:

• The manuscript is reviewed and pronunciation guidance requested for any unfamiliar words.

• The first version takes five to seven days to produce.

• Corrections then take one to two working days.

• Post-processing takes one to two working days.

• The aim is to complete a project in three weeks.

According to Marzia Ghiselli, head of publisher partnerships at DeepZen, all DeepZen voices are licensed and cloned from human narrators. The company does not use generic off-the-shelf voices from Amazon or Google. Pseudonyms are used to make it clear that the audiobook is voiced with AI technology.

The one notable exception to the use of pseudonyms is with the voice of Edward Herrmann. Herrmann, who passed away in 2014, was a prolific audio narrator and one of AudioFile magazine’s respected “Golden Voices.” According to Ghiselli, “We really liked his voice and decided to try and license it. We tracked down his agent in NYC, who then acted as liaison with his estate. His wife and son saw this as a fitting way to honor his legacy.”

DeepZen was able to clone his voice using old recordings, and any DeepZen customer can have their work digitally narrated by Herrmann, provided they give his voice credit in the recording.

This certainly sidesteps the direct concerns of SAG-AFTRA: it actually brings income to Herrmann’s estate that wouldn’t be earned any other way. The worst that can be said is that a living actor didn’t get an opportunity for a particular book.

This September, DeepZen announced its first substantial contract with a publishing organization: Ingram Content Group’s publisher clients now have a special portal to access DeepZen services and receive roughly a 7% discount on production pricing.

Holly-Blue Ross is a translation rights executive at SpringerNature in the U.K. and an early customer for DeepZen. “Working with DeepZen has given us the opportunity to play a more active role in selecting our content to be produced in the audiobook format,” Ross explains. “It also offers greater insight into how specific subject areas and disciplines perform in this format. We’re hoping to produce 10 more audiobooks before the end of 2022 using the DeepZen platform.”

Speechki

Where DeepZen aims to deliver an audiobook in about three weeks, Siberia’s Speechki promises to convert a book into an audiobook “in days” for as little as $500. It also offers to “create your audiobook with artificial intelligence in 15 minutes.”

“Publishers spend thousands of dollars and at least a few weeks to produce a single audiobook,” says Dima Abramov, cofounder and CEO of Speechki. “This process is slow, expensive, and highly complex. Publishers’ unit economy doesn’t work with traditional methods of audiobook production.” Because of these factors, Abramov estimates that only 5% of published works have been released as audiobooks.

Speechki’s mission includes making audiobooks available to everyone in their language of choice. It has 251 voices in 72 languages, including more than 50 American voices. It has already processed upward of 1,000 books, mostly in Swedish and Russian.

There are two levels of voices: so-called “common voices” are $500 per book, while more natural-sounding “advanced voices” are $1,000 per book. Both include free “proof-listening” (by a human) and handling of requested corrections.

“We’ve had a lot of interest from publishers, and are running pilot programs with about a dozen U.S. companies,” says Bill Wolfsthal, a New York–based publishing consultant to Speechki. In addition to talking with all of the Big Five publishers, Speechki has met with a handful of independent and university presses, and Wolfsthal says Speechki is working to customize its service to fit specific needs.

Speechkit

Speechkit targets short-form content—blog posts, news, and podcasts. It offers an excellent do-it-yourself tool to evaluate the current state of TTS. The service is built from voices offered by Amazon, Google, Microsoft, and Yandex, then enhanced “thanks to our natural language processing algorithms,” the company says. Publishers who sign up for a free 14-day trial can test a book chapter with a range of natural voices.

Speechkit is producing audiobooks, targeting independent book publishers, but primarily for nonfiction at the moment, which it defines as “digital textbooks.” The company received early financing from Newark Venture Partners, where Audible is a senior investment partner.

Descript

Descript is a full-featured DIY solution that includes everything from screen recording and video editing to voice transcription and audio editing. Overdub is the most intriguing of its offerings. A voice actor makes a clone of their own voice, which can be used for corrections and omissions during the editing phase. A professional-level voice clone is created with 90 minutes of recorded voice (though the company says it can be generated in as few as 10 minutes).

There’s no feature available to import full book files, so the service builds books chapter by chapter.

According to Jay LeBoeuf, head of business and corporate development, Descript works with Audible, HarperCollins, International Scripture Ministries, and Pushkin Industries.

Pozotron

Pozotron is an audio proofing service that has been around for a few years, and its value is only going to increase as the industry feels its way through the uncertain quality of automated conversions. Pozotron’s system detects word mismatches, misreads, missing words, flipped words, and duplicated sentences.

Adam Fritz, Pozotron’s CEO, explains that the service will pick up on the obvious errors, so that when a human does the final quality control pass it will be easier to find the remaining mistakes. The software has three settings—normal, extra sensitive, and relaxed—so that an operator can decide how many potential errors will be flagged, and how many false positives they’re willing to contend with.
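To make the idea of automated proofing concrete, here is a minimal sketch of one way to flag differences between a manuscript and a transcript of the recorded audio. It is not Pozotron's method, which the article does not describe. It assumes a speech-to-text transcript already exists, uses the `difflib` module from Python's standard library to align the two word sequences, and the function names and issue labels are hypothetical.

```python
import difflib
import re


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def proof(manuscript, transcript):
    """Return a list of (kind, expected_words, heard_words) discrepancies."""
    expected = tokenize(manuscript)
    heard = tokenize(transcript)
    matcher = difflib.SequenceMatcher(a=expected, b=heard, autojunk=False)

    # Map difflib's edit operations onto proofing-style labels (assumed names).
    labels = {"replace": "mismatch", "delete": "missing", "insert": "extra"}

    issues = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # the narration matches the manuscript here
        issues.append((labels[tag],
                       " ".join(expected[i1:i2]),
                       " ".join(heard[j1:j2])))
    return issues


for issue in proof("The quick brown fox jumps over the lazy dog",
                   "The quick brown fox jumped over the dog"):
    print(issue)
```

A sensitivity setting like the one Fritz describes could be layered on top, for example by ignoring mismatches between near-identical words in a relaxed mode, though how any commercial tool tunes that trade-off is not covered here.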

Fritz is clear-eyed on the current limitations of automated audiobook technology. “In art there are shades of gray that the black-and-white nature of an algorithm will never solve,” he says. “It’s going to get better and better. It’s the nature of technology that it will improve. But they’re never going to replicate the human voice completely.”

Scribe Audio

Scribe Audio is another vendor that eschews “fake” machine-generated voices in favor of voices originated by human actors, standing apart from the voices packaged by Amazon, Google, and others.

According to founder and CEO Ali Zia Khan, some of Scribe’s voices have been featured in popular TV series, Hollywood movies, and even Super Bowl ads. The company works with these actors on an unspecified “revenue sharing basis.”

Working with human voices and a “multi-tiered human-in-the-loop system”—not just an algorithm synthesizing the entire book—Scribe Audio is able to model the entire spectrum of human emotions, character voices, and other parameters, “allowing us to do not just nonfiction but also fiction,” Khan says.

“Nonfiction is the low-hanging fruit,” he adds. “We are specializing in fiction.” Pricing will be less than $1,000 per title, depending on volume. Scribe plans to launch in late 2021.

Fabula

Fabula is positioned near the low end of the market: pricing starts at $1.99 per month for beginners and students, or at $499 “per produced audiobook.” The modest pricing is enabled by the voices: Fabula uses only Microsoft Azure and Amazon Polly voices, not the higher-quality voices featured by most other services.

The company is a subsidiary of Wanderword, a Swedish game company, where the voice technology was originally designed for Alexa-based gaming. Fabula offers “simultaneous voice, music, and sound effects,” it says.

Respeecher

Respeecher, based in Kiev, sits in the voice-only space. As CEO Alex Serdiuk puts it: “We’re not in text to speech, we’re in speech to speech.” Even sophisticated TTS, he says, doesn’t give you control over the speaker’s emotions.

At Respeecher, a human reads through an entire text, and the software is able to define the pattern of speech, the pace, the inflections. From those speech patterns, clone voices are matched.

Respeecher’s best-known use of its technology was as part of Disney+’s The Mandalorian. A 20-year-old Luke Skywalker was set to make an appearance in the final episode of the second season; Mark Hamill was 68 years old at the time. Using its technology, Respeecher was able to voice the dialogue of a young Skywalker.

An interesting use case for Respeecher technology is creating foreign-language versions of audiobooks using the same voice as the English-language narrator.

Accessibility

Everything about audiobooks intersects with issues of accessibility for print-disabled readers. This is strangely overlooked in many of the discussions of where automated audiobooks are headed. The essential value proposition for print-disabled readers is obvious: hearing texts they cannot see clearly.

A large accessibility ecosystem, including organizations like DAISY and Benetech Bookshare, already promotes e-books for the print disabled. DAISY provides standards and helpful tools and resources; Benetech provides books in a variety of accessible formats. There are more than a million titles on tap, but that still leaves many millions of books that have never been converted to a digital format.

The startups profiled here could potentially make a huge difference in the number of professional-quality audiobooks available for print-disabled readers.

Looking ahead

AI-enabled audiobook creation is a promising development that no publisher should ignore. Is it perfect? Certainly not. Can it be good enough? Probably, if a publisher is willing to spend the necessary time in the voice editing phase of the project. Clearly it works best for nonfiction, though a couple of vendors make a compelling case for fiction as well.

Audible’s block on the distribution of audiobooks with nonhuman narrators is a real problem that may take some time to resolve. But Audible is not the only game in town.

The vendors profiled here make it clear that they are not trying to replace narration on top-selling frontlist and backlist titles. The opportunity is in the deeper backlist, where a $500 or $1,000 audiobook investment might make financial sense.

Regardless, audiobook production is moving into a new phase, and AI-enabled technologies are at its heart.

Hearing Is Believing

It’s easy to describe this technology, but the proof is in the listening. There are several ways to get a taste:

DeepZen offers a finished book for free download, featuring the voice of Edward Herrmann:

server.glassboxx.co.uk/the-korean-war-audio.html

It also offers a range of shorter excerpts, including fiction, on its site:

deepzen.io/publisher

Speechki has a small selection:

speechki.org

Google TTS voices showcase a full range of short clips in multiple languages:

cloud.google.com/text-to-speech/docs/voices

It’s also worth visiting Balabolka (cross-plus-a.com/balabolka.htm) and TextToSpeechRobot.com to experiment with the technology firsthand, without cost.

Better still, try Microsoft’s Read Aloud feature (part of Office 2019 and Microsoft 365), which “reads an entire document... like an audiobook.”

Thad McIlroy is an electronic publishing analyst and author based on the West Coast who runs the Future of Publishing website. He is also a founding partner of Publishing Technology Partners.