How to build working AI models when you are short on training data: AI as part of the business transformation toolkit (Part 3)
By Dr Simon Shakespeare
The sights of Artificial Intelligence (AI) can be trained on almost any business problem that will benefit from automation. But what if training data for your problem is in short supply? In this blog, Simon Shakespeare shares a few tricks for training AI with limited data, from not reinventing the algorithmic wheel to creating training data out of thin air.
For most of the last century, engineers have been faced with having too much data from sensors and other inputs and not enough processing power to handle it. Before the computer age, people had to process small amounts of data by hand. Today, the machine learning community, and anyone looking to take advantage of AI and automation in the context of business transformation (see Part 1), faces a different problem: Often, there is too little data for training and testing a model. Fortunately, there are some interesting tricks to get around this challenge and it is worth exploring a few of them.
The rapid expansion of AI, which currently leaves no industry untouched, is the upshot of several developments: algorithm improvements in convolutional neural networks (Part 2), the availability of vast amounts of low-cost processing power, and data – lots of data. As the amount of data used to train and validate a network increases, so does its accuracy. A good example is image classification, which has leaped from being an academic curiosity in the 1980s to super-human accuracy today on benchmarks such as ImageNet.
Unfortunately, creating a classification system that generalises well and captures the features you wish to identify requires a lot of data. Getting enough data to train a network is particularly challenging if you are faced with a very specific set of data from a system unlike any other, say a specialised industrial process. If, however, the data is something like audio, images, video or text, then we are in luck, because vast repositories of free data exist as open-source datasets. These will likely be similar enough to what you are doing to get started.
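As a quick illustration of how low the barrier to entry is, the snippet below pulls in one such freely available labelled image dataset (CIFAR-10) through Keras. This is a generic sketch, assuming TensorFlow is installed, and is not tied to any particular project discussed in this series.

```python
# Minimal sketch: download a free, labelled, open-source image dataset
# (CIFAR-10) as a starting point. Assumes TensorFlow/Keras is installed.
import tensorflow as tf

(train_x, train_y), (test_x, test_y) = tf.keras.datasets.cifar10.load_data()
print(train_x.shape, train_y.shape)  # (50000, 32, 32, 3) (50000, 1)
```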
Neural networks are trained by presenting them with examples of data and telling them what the data represents or contains, that is, with labelled data. Most real-life data is unlabelled, though. In such situations, traditional signal processing techniques based on clustering can be pressed into action, such as k-means, a self-organising map, or even the front end of a pre-trained feature extractor. With just a few labelled examples, each cluster can be assigned a class, and the unlabelled data falling into that cluster can then be labelled accordingly. Sometimes people are still required to classify data and, in these cases, leveraging the power and tacit knowledge of crowds to label large numbers of image samples via crowdsourcing platforms can hit the mark and be cost-effective.
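To make the clustering idea concrete, here is a minimal sketch of labelling by cluster, assuming scikit-learn and NumPy are available. The feature matrices and the handful of hand-labelled examples are random placeholders standing in for your own data; the approach, not the numbers, is the point.

```python
# Sketch: propagate a few hand-made labels to unlabelled data via clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_unlabelled = rng.normal(size=(500, 16))   # unlabelled feature vectors
X_labelled = rng.normal(size=(10, 16))      # a few hand-labelled examples
y_labelled = rng.integers(0, 3, size=10)    # their known class labels

# Cluster all of the data together (assumed: roughly one cluster per class).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(np.vstack([X_unlabelled, X_labelled]))

# Decide which class each cluster corresponds to by majority vote
# among the labelled examples that fall into it.
labelled_clusters = kmeans.predict(X_labelled)
cluster_to_class = {}
for c in range(kmeans.n_clusters):
    members = y_labelled[labelled_clusters == c]
    if len(members):
        cluster_to_class[c] = np.bincount(members).argmax()

# Propagate the class labels to the unlabelled data via their cluster
# (-1 marks clusters that contained no labelled example at all).
pseudo_labels = np.array([cluster_to_class.get(c, -1)
                          for c in kmeans.predict(X_unlabelled)])
```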
The next trick in the box is transfer learning, also known as the fine art of not reinventing the wheel. Here, an existing, pre-trained model that can already perform a similar task is used as a starting point. Again, many pre-trained models have been released as open source. If the task at hand is image classification, this approach can work well, because it leverages the massive image datasets used to train the pre-existing models. The basic features in images, the textures, simple shapes, colour blobs and patterns, have already been captured and are much the same in all images. How these image elements are combined to build up more complex features is captured in the higher layers of the neural network. These high-level layers can then be re-trained with new data for a specific application. The beauty is that most of the hard work has already been done, and training only those last few layers takes far less time, data and computational resources than training a new model from scratch.
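The sketch below shows one common way to do this in Keras, assuming TensorFlow is installed. MobileNetV2 pre-trained on ImageNet stands in for "an existing model", and num_classes and the commented-out training call are placeholders for your own task and data.

```python
# Sketch: transfer learning by freezing a pre-trained network and
# training only a small new classification head.
import tensorflow as tf

num_classes = 5  # hypothetical number of classes in the new task

# Load a network pre-trained on ImageNet, without its final classifier.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the low-level feature extractor

# Add a small new "head" that will be trained on the new, smaller dataset.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(train_images, train_labels, epochs=5)  # your own labelled data
```

Because only the final layers are trainable, each training pass is cheap, and a few thousand labelled examples can be enough where training from scratch would need millions.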
Data for training a new model can also be created out of thin air. One example of artificial data generation is image augmentation, where an original image is distorted, transformed or scaled in various ways. Humans with good visual acuity are mostly insensitive to these tricks; achieving the same indifference in an artificial network requires training it on many different presentations of the same content. After that, the network will faithfully map the important features and ignore the way in which the images are presented. Images are perhaps the easiest type of data to synthesise, because there are so many ways a basic image can be transformed or distorted while still retaining the objects of interest. The image set below, for instance, shows how scaling, shearing, mirroring, rotation and perspective distortions turn a single image of a cat into a new, larger set of synthetic images.
Some changes are invalid for training purposes of course. The network should never be expected to see an image of a cat hanging upside down off the world. Cats in images also come in different sizes, so the image can be scaled relative to the background environment. But there are limits to how small a cat can be. Also, adding multiple images together realistically is difficult, as real images contain reflections and shadows due to the presence of mirror-like surfaces and lighting sources.
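A small augmentation pipeline along these lines might look like the sketch below, again assuming a recent TensorFlow/Keras. The file name cat.jpg is hypothetical, and the transformation ranges are deliberately conservative so that no upside-down cats are produced.

```python
# Sketch: generate several synthetic variants of one original image.
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),      # mirror, but never vertical
    tf.keras.layers.RandomRotation(0.05),          # small rotations only
    tf.keras.layers.RandomZoom(0.2),               # modest scale changes
    tf.keras.layers.RandomTranslation(0.1, 0.1),   # shift within the frame
])

image = tf.io.decode_jpeg(tf.io.read_file("cat.jpg"))   # hypothetical file
image = tf.image.resize(image, (224, 224))
batch = tf.expand_dims(image, 0)

# Each pass through the pipeline yields a slightly different variant.
variants = [augment(batch, training=True) for _ in range(8)]
```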
Synthetic data can also be created for data other than text, audio, images or video. An example of creating synthetic medical data for training a neural network is shown below. An artificial electrocardiogram, a single signal containing both the maternal and a foetal heartbeat, can be created by adding together the two basic signals, with the foetal one scaled down in amplitude; this is relatively easy. In this instance, creating synthetic signals on top of a real data background is valid because the two electrical sources are independent and can be added directly. A more complex signal, such as an ultrasound recording used to monitor the foetal heart, would be more troublesome to synthesise: the many reflections and absorbers would have to be accounted for carefully, and they preclude the straight addition of two signals, which are no longer independent.
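The toy sketch below illustrates the principle with NumPy. It is not a physiological ECG model: the pulse shapes, heart rates and amplitudes are illustrative assumptions, chosen only to show that two independent sources can simply be added, with the foetal one scaled down.

```python
# Toy sketch: a synthetic "maternal plus foetal" trace built by adding
# two independent periodic signals plus some measurement noise.
import numpy as np

fs = 500                              # sampling rate in Hz (assumed)
t = np.arange(0, 10, 1 / fs)          # 10 seconds of signal

def heartbeat(t, rate_hz, width=0.03):
    """Crude train of Gaussian pulses standing in for QRS complexes."""
    phase = (t * rate_hz) % 1.0
    return np.exp(-((phase - 0.5) ** 2) / (2 * width ** 2))

maternal = heartbeat(t, rate_hz=1.2)        # roughly 72 bpm
foetal = 0.3 * heartbeat(t, rate_hz=2.3)    # roughly 138 bpm, smaller amplitude
noise = 0.05 * np.random.default_rng(0).normal(size=t.shape)

# Because the two electrical sources are independent, direct addition
# is a reasonable way to create a synthetic training signal.
synthetic_ecg = maternal + foetal + noise
```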
Whatever task you wish to automate with a machine learning-based system, at present you will need a lot of data to achieve reasonable accuracy. Sometimes data is difficult or expensive to obtain; sometimes it is noisy, erroneous or incomplete. So the most must be made of what is at hand. How the data is pre-processed (a topic not discussed in this post) and presented will determine how well the neural network generalises to the important features and how much bias is unintentionally introduced. By leveraging both pre-existing data and pre-trained models, however, one can quickly get to a trained system that is good enough.