Publishers like to pretend that they can sniff out potential winners, but major bestsellers come out of the blue. This is because bestsellers exist in a chaotic environment. Not because publishing is a messy business (though it often is) but in the mathematical sense, where chaos implies the impossibility of making predictions beyond a short timescale. So, when US academic Matthew Jockers teamed up with former editor Jodie Archer to suggest that artificial intelligence and big data make it possible to detect the bestsellers of the future, there were questions that needed asking. Was this a wonder tool for publishers or a disaster in the making?
Chaos theory is the branch of mathematics dealing with systems where interactions between different components result in horrendous complexity. The systems themselves might be simple – a pendulum with a hinge in the middle is chaotic – or complex, such as the weather. But they all share the property that tiny changes in the way things start off result in huge differences down the line. This is why weather forecasts will never be perfect and any attempt to forecast more than ten days out is less accurate than simply saying what the weather tends to be like at that time of the year.
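That sensitivity to starting conditions is easy to see in a few lines of Python. The sketch below is not a weather model. It uses the logistic map, a standard textbook example of a simple chaotic system, and the starting value 0.2, the parameter r = 4 and the size of the nudge are all arbitrary choices made for illustration.

```python
def logistic_map(x0, r=4.0, steps=60):
    """Iterate x -> r * x * (1 - x) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two runs whose starting points differ by one part in ten billion.
a = logistic_map(0.2)
b = logistic_map(0.2 + 1e-10)

# Gap between the two runs at every step.
gaps = [abs(p - q) for p, q in zip(a, b)]

for step in (0, 10, 20, 30, 40, 50):
    print(step, gaps[step])
```

The gap starts out far too small to measure in any real system, yet within a few dozen steps the two runs bear no resemblance to each other.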
It is true that weather forecasts have got more accurate in the last twenty years, but this is because forecasters have embraced chaos. Instead of making one simulation of the weather, they create hundreds, each with subtly different starting conditions, then average the outcome. This is why weather forecasts now tell us we have, say, a ‘90 per cent chance of rain’ — because 90 per cent of the simulations say it will rain. Even so, such forecasts still fail to say anything useful more than a few days ahead.
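The ensemble idea can be sketched in Python too. This is a toy under stated assumptions, not how any forecasting agency works: the 'weather' is the logistic map (a textbook chaotic system), 'rain' is arbitrarily defined as the value ending above 0.5, and the function name, the noise level and the number of ensemble members are all invented for illustration.

```python
import random

def step(x, r=4.0):
    """One tick of the toy 'weather': the logistic map."""
    return r * x * (1.0 - x)

def rain_probability(x0, horizon, members=500, noise=1e-6, seed=1):
    """Run many simulations from slightly different starting points
    and return the fraction that end in 'rain' (value above 0.5)."""
    rng = random.Random(seed)
    rainy = 0
    for _ in range(members):
        # Each member starts from the measured value plus a tiny error.
        x = x0 + rng.uniform(-noise, noise)
        for _ in range(horizon):
            x = step(x)
        if x > 0.5:
            rainy += 1
    return rainy / members

near = rain_probability(0.2, horizon=2)   # short-range forecast
far = rain_probability(0.2, horizon=60)   # long-range forecast
print("short range:", near)
print("long range:", far)
```

At short range every member of the ensemble agrees, so the forecast is confident. At long range the members have scattered across everything the system can do, and the 'probability' is no more informative than the toy system's long-run average.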
Deciding a book’s chances of becoming a major bestseller is similarly a problem of forecasting the outcome of a complex chaotic system. Yet in their book The Bestseller Code, Jockers and Archer claim to have overcome the problem using the technique du jour: big data combined with an artificial intelligence algorithm. Let’s break that down. Big data means that rather than carefully selecting what we believe to be appropriate information, we throw into the pot as much data as we can find that is even vaguely related to the subject and use anything that seems to work. An algorithm is simply a set of instructions for making something happen. It doesn’t have to be a computer program; a recipe is an algorithm, for example. In this case, the algorithm is a program used to look for patterns in the data.
The final piece of terminology is artificial intelligence. Since the 1940s, computer scientists have been exploring the possibilities of using computers to simulate intelligence. The results have been good when focussing on very small areas (think of Deep Blue beating grandmasters at chess, or Watson winning the TV show Jeopardy!), but the effort has foundered when trying to provide general intelligence that parallels human capabilities. Recently, though, there has been much enthusiasm for ‘machine learning’ and ‘data mining’, the approach taken by Jockers and Archer.
The idea here is that rather than attempting to simulate human knowledge, we throw vast amounts of data at the computer without giving it any rules to sort the information out. Instead, the algorithm builds its own rules, learning from the data. So, for example, we don’t tell it what we think makes a bestseller. Instead we provide the system with all the data we have about books and their sales and ask it to find patterns amongst the books that did well. The software can then look for the same kinds of pattern in the content of other books – ones that have yet to be exposed to the market – and make predictions about which will become the next massive hits.
It sounds impressive, but there is a problem that is easy to miss: what economics professor Gary Smith calls ‘artificial unintelligence’. This occurs when a system ignores a mantra that all scientists should learn from day one: correlation is not causality. Suppose we have a correlation, where two things (let’s call them A and B) vary the same way with time, so that the value of A rises and falls with that of B. This does not mean that A is caused by B. It could be that B is caused by A, that both are caused by a third factor… or it could be pure coincidence.
A good example of the potential for incorrect assumption of causality comes from the post-war UK banana effect. Following the Second World War, pregnancy rates in the UK rose and fell with banana imports — but, unsurprisingly, the bananas did not cause the pregnancies. It seems likely there was a third factor (probably economic) that influenced both imports and pregnancy rates.
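A hidden third factor of this kind is simple to simulate. The Python sketch below uses entirely synthetic numbers, not real UK figures: an invented 'economy' series drives both an invented 'imports' series and an invented 'pregnancies' series, and the noise levels are arbitrary. The two series never influence each other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(42)

# Synthetic hidden driver (hypothetical, not real data).
economy = [rng.gauss(0, 1) for _ in range(200)]

# Both series follow the economy plus their own independent noise.
imports = [e + rng.gauss(0, 0.5) for e in economy]
pregnancies = [e + rng.gauss(0, 0.5) for e in economy]

raw = pearson(imports, pregnancies)

# Strip out the shared driver and correlate what is left.
imports_left = [i - e for i, e in zip(imports, economy)]
pregnancies_left = [p - e for p, e in zip(pregnancies, economy)]
adjusted = pearson(imports_left, pregnancies_left)

print("raw correlation:", round(raw, 2))
print("after removing the third factor:", round(adjusted, 2))
```

The raw correlation is strong, but it vanishes once the common driver is taken out. An algorithm that never sees the third factor has no way of making that correction.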
Pure coincidence is a particular danger due to the approach taken in data mining. The more variables that are thrown into the pot, the more likely there will be a coincidental correlation between two of them. And the whole basis of the big data approach is to let the algorithm use as many variables as possible. When you have, as Archer and Jockers put it, ‘a computer model that can read, recognize, and sift through thousands of features in thousands of books,’ it will inevitably generate plenty of meaningless correlations.
The way that, given enough data, unconnected coincidences occur is demonstrated beautifully on the Spurious Correlations website. Here we discover, for example, that between 2000 and 2009 the US per capita consumption of cheese closely followed the number of people who died as a result of becoming tangled in their bedsheets. There was an even closer match between variation in the divorce rate in Maine and the US per capita consumption of margarine. We find these correlations funny, and dismiss them, because we understand the topics. It’s clear to us that it makes no sense that one should cause the other. But an algorithm can’t do this. It merely discovers correlations. So, a data-mining algorithm attempting to predict margarine consumption, given this data, would happily base it on divorce rates in Maine.
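The arithmetic behind such coincidences can be shown with nothing but random numbers. In the Python sketch below, which is an illustration under assumed settings and not the method used by any particular website or by Archer and Jockers, a 'target' series of ten points (think of ten yearly figures) is pure noise, and so is every candidate variable compared with it. The function name and the choice of ten points are assumptions made for the example.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def best_spurious_match(n_variables, n_points=10, seed=0):
    """Strongest correlation (ignoring sign) found between a random
    target and n_variables unrelated random candidates."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best

# The more unrelated variables we try, the better the 'best' match looks.
for n in (1, 10, 100, 1000, 10000):
    print(n, round(best_spurious_match(n), 3))
```

Nothing in this data is connected to anything else, yet searching through enough candidates reliably turns up one that tracks the target closely. The match reflects how hard we looked, not any real relationship.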
Unfortunately, exactly the same thing can happen when using this approach to predict the bestsellers of the future. We don’t know what combination of factors, in what proportions, influences bestsellerdom, and because it’s a chaotic system, even if we did, the chances are that we would never have accurate enough data to produce a good sales forecast. We don’t know that the factors used to make any prediction were genuine. They fit the data because they were derived from it. Attempts can be made to hold back part of the data and test the predictions against this, but as long as thousands of parameters are being used, this will still result in spurious correlations.
As a result, any predictions based on this approach can go horribly wrong. And even if it were possible to work this way, there are clear limitations. Firstly, the system has no way of picking out good literature. It sees all the sales merits of, say, Dan Brown’s writing without any awareness of the quality. We also have to remember that just because a book is a potential bestseller does not mean that any individual reader will like it. The Bestseller Code contains a ‘list of 100 novels our computer thinks you should read’ — the algorithm’s top 100. I would be interested in reading three of them.
Another issue with the approach is that the advice produced by the system seems at odds with reality, let alone any predictions of future blockbuster titles. Archer and Jockers tell us that their model predicts that a bestseller author should avoid a number of areas. These include fantasy, very British topics, sex and description of bodies. So that would rule out The Lord of the Rings, James Bond, Harry Potter, the majority of literary novels and young adult titles. Not to mention Fifty Shades of Grey.
There seem to be two lessons for publishers and authors. Firstly, don’t put too much trust in the dark arts of AI and algorithms. And secondly, though following such rules may make it possible to churn out a cookie-cutter, mid-range title, this isn’t the way to produce great original fiction. Stick to your distinctiveness. Following the big-data route results in the bland leading the bland.