Thad McIlroy – Future Of Publishing

Chasing the Secrets of The Bestseller Code

December 1st, 2016

In my last post I looked at the new book, The Bestseller Code, by Jodie Archer & Matthew Jockers. As it turns out the book also captured the attention of a colleague, Cliff Guren, who runs a publishing consultancy, Syntopical, based in Seattle. We decided to engage in an online dialog about the book, and record the conversation here and on Cliff’s blog.

Cliff: I know that you’ve already reviewed The Bestseller Code, but I think the book deserves more discussion. Maybe it is being talked about in New York conference rooms over individually brewed cups of coffee, but if so those private conversations aren’t leading to more public discussion. I worry that the publishing industry mistakenly believes that it has weathered the digital tsunami, but the ebook surge was only the leading edge of the storm. Big data and machine learning are technologies that will really pack a punch.

The Bestseller Code is a fun Nick and Nora detective story (although in this case it’s Matt and Jodie). Two smart, amiable people are compelled to figure out a mystery: why do certain books make it to the bestseller tables at Costco while others don’t? It’s the technology that has changed. The heroes in this version of the story have databases, computers, and algorithms instead of a notebook and a remarkably intelligent dog (Asta).

Reading the book helps you understand how to think with the help of big data and the latest computer science. You learn how to frame and ask questions that you might not have thought to ask. That’s what machine learning does, and it’s what we will all be doing – asking different kinds of questions – as machine learning proliferates.

Publishing should be fertile ground for new learning technologies. I would have thought that the book would prompt wider discussion and debate. Time to pass the baton: why do you think that the book isn’t trending toward bestseller status?

Thad: The code for creating bestsellers has apparently eluded the authors of this book (although, in fairness, the code they describe is for fiction only). As of today it’s in the #90,000 range for ebooks on Amazon.com and the #40,000 range for print books. And dropping. By contrast, Kevin Kelly’s The Inevitable, another book we’re both reading, is in the #4,000 range in books for its $18 hardcover.

The Bestseller Code hasn’t appeared on the New York Times bestseller list nor on the Publishers Weekly list. And, short of a miracle, it’s not going to now.

When I first heard about the book, and saw that it was serious, not some sort of pop-science put-on, I was certain it would take publishing and authoring by storm. It hasn’t.

I could argue that the science is too complex and that’s off-putting. But the authors did a pretty good job of simplifying the science, and the promo isn’t techie-scary.

I think they failed to make the book controversial. The title and marketing oversold what was actually delivered. They got scared (or their publisher did) and pulled their punches. I devoted my blog entry to trying to figure out what Archer and Jockers actually do claim in the book. And they don’t claim they can tell publishers exactly how to detect a bestseller in a slush pile. But surely that should be the point of the book. Either it’s a “bestseller detection system” or it’s of little use. They don’t offer even an “Abbreviated Bestseller Code – Try This at Home” – which readers could have tried and tested.

It’s a shame that the book is not gaining traction in the market, because I think that the science in the book is important to the future of publishing.

Cliff: “She blinded me with science!” Sorry, I couldn’t resist the Thomas Dolby reference… The science is important. It reflects significant advances in natural language processing and machine learning. What’s also interesting is the five-year journey that our heroes took to get to the succinct analysis presented in the book. They began with a list of 28,000 features and whittled that down to 2,799 that proved to be relevant. That’s a lot of trial-and-error work.
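To make the idea of "whittling down" features concrete, here is a minimal sketch of one common filtering approach: keep only the features whose average value differs noticeably between bestsellers and other books. This is not the authors' method, which the book does not spell out at this level. The feature names, the numbers, and the threshold are all invented for illustration.

```python
from statistics import mean, pstdev

# Hypothetical toy data: each row is one novel, each column one measured feature.
# Names and values are invented for illustration, not taken from the book.
FEATURES = ["freq_the", "freq_very", "freq_do", "commas_per_sentence"]
BESTSELLERS = [
    [0.050, 0.001, 0.012, 1.1],
    [0.052, 0.002, 0.011, 1.4],
    [0.049, 0.001, 0.013, 0.9],
]
OTHERS = [
    [0.051, 0.004, 0.006, 1.2],
    [0.050, 0.005, 0.007, 1.0],
    [0.050, 0.003, 0.005, 1.3],
]


def effect_size(column_a, column_b):
    """Absolute difference of group means, scaled by the overall spread."""
    spread = pstdev(column_a + column_b)
    if spread == 0:
        return 0.0
    return abs(mean(column_a) - mean(column_b)) / spread


def select_features(names, group_a, group_b, threshold=1.0):
    """Keep features whose effect size meets the (assumed) threshold."""
    kept = []
    for i, name in enumerate(names):
        col_a = [row[i] for row in group_a]
        col_b = [row[i] for row in group_b]
        if effect_size(col_a, col_b) >= threshold:
            kept.append(name)
    return kept


print(select_features(FEATURES, BESTSELLERS, OTHERS))
```

On this toy data, two of the four features separate the groups and two do not. A real project would repeat this kind of test across thousands of candidate features, and would validate the survivors on books the model had never seen.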

Most publishers are unable or unwilling to make a long-term, large-scale investment in data science and technology. We saw the same pattern with ebooks. Publishers let technology companies drive the digital agenda and make the primary investments, figuring that the technology companies would have to come back to them for content. It would be a mistake to repeat that pattern this time around.

Traditional publishing is losing market share and profits are flatlining. While the ebook sales of traditional publishers have been declining, the sales of independently published content (a.k.a. self-published books) have grown at a good clip. Traditional publishing hasn’t figured out how to successfully integrate with indie authors – but Amazon has. Machine-learning-driven evaluation and editing tools are a great on-ramp opportunity for traditional publishers, one that would help them leverage the things they are really good at: developing author brands, marketing, rights management and so forth. It will be interesting to see which, if any, publishers jump on this opportunity.

Back to you…

Thad: I want to take a moment to illustrate the complexity of the underlying science in the book. Chapter 4, titled “The Debutantes, or, Why Every Comma Matters” considers writing style. “In short, style is important” the authors note. “It is the mechanism through which plot, theme, and character get delivered.” Further they believe that the “first line of a novel can tell you a lot about a writer’s command of style.”

They reference the first sentence in Tolstoy’s Anna Karenina as one example. The famous sentence: “Happy families are all alike; every unhappy family is unhappy in its own way.”

“Tolstoy’s sentence,” they write, “is brilliant in its parallel structure. ‘Happy families are all alike; every unhappy family is unhappy in its own way.’ The simplicity on the ear complements the insight brilliantly….”

Some scholars have a different view of that memorable line.

Marian Schwartz, in the introduction to her 2014 translation of Anna Karenina, criticizes the standard text. She records the line as “All happy families resemble one another; each unhappy family is unhappy in its own way.” The “parallel structure” and “the simplicity on the ear” that Archer and Jockers cherish are less obvious in this variant.

Schwartz writes that “Tolstoy said not that happy families are ‘alike’ (odinakovye) but rather that they ‘resemble’ one another (pokhozhi drug na druga)”… thereby pointing to “a more complicated opinion about those happy families. ‘Alike’ here is pat, almost dismissive, whereas ‘resemble’ requires additional verbiage (‘one another’) and a more subtle interpretation. Tolstoy’s phrasing is deliberately dense, forcing the reader to pause and introducing nuance….”

Angels dancing on the head of a pin, perhaps, but indicative of the complexity of using computers to analyze creative content. Translation is a science and an art. The nuances between the two approaches to translating Tolstoy’s famous opening sentence may not be apparent to Archer and Jockers’ algorithms. I wonder whether their system would have rated Anna Karenina as highly if it had been fed the second translation.
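As a rough illustration of what a program can and cannot see here, the sketch below compares the two translations on a few crude surface measures: word counts, clause lengths, and shared vocabulary. It is only an assumption about the kind of low-level signal a text-mining system might start from, not a reconstruction of Archer and Jockers’ model, and the function names are made up for this example.

```python
import re

STANDARD = ("Happy families are all alike; "
            "every unhappy family is unhappy in its own way.")
SCHWARTZ = ("All happy families resemble one another; "
            "each unhappy family is unhappy in its own way.")


def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def surface_features(sentence):
    """Crude style measures: total words, distinct words, words per clause."""
    tokens = words(sentence)
    return {
        "words": len(tokens),
        "unique_words": len(set(tokens)),
        "clause_lengths": [len(words(c)) for c in sentence.split(";")],
    }


def vocabulary_overlap(a, b):
    """Jaccard similarity: shared distinct words over all distinct words."""
    set_a, set_b = set(words(a)), set(words(b))
    return len(set_a & set_b) / len(set_a | set_b)


print(surface_features(STANDARD))
print(surface_features(SCHWARTZ))
print(round(vocabulary_overlap(STANDARD, SCHWARTZ), 2))
```

The counts register that the first clause grows by a word and that some vocabulary changes, but nothing in them captures the shift from "pat" to "deliberately dense" that Schwartz describes. Whether a far richer feature set would capture it is the open question.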

Not unrelated: The accuracy of Google’s translation software is improving rapidly with the application of machine learning. Google has switched from what was called Phrase-Based Machine Translation (PBMT) to Google Neural Machine Translation (GNMT), described in a recent blog post. Notwithstanding, Google’s take on the Russian original yields the standard translation that Schwartz scorns.

The Bestseller Code brings the still-new sciences of “text mining”, “machine learning”, “natural language processing” and more to the core tasks of writing and publishing. Where all of this will lead, I can’t say. But I’m looking forward to the ride.