It’s been a few years since I last wrote about the idea of using synthetic data to train machine learning models. After having three recent discussions on the topic, I figured it’s time to revisit the technology, especially as it seems to be gaining ground in mainstream adoption.
Back in 2018, at Microsoft Build, I saw a demonstration of a drone flying over a pipeline, inspecting it for leaks and other damage. Notably, the drone’s visual inspection model was trained on both actual and simulated data. The synthetic data taught the machine learning model about outliers and novel conditions it couldn’t encounter through traditional training. It also allowed Microsoft researchers to train the model more quickly, without as many expensive data-gathering flights as would otherwise have been required.
The technology is finally starting to gain ground. In April, a startup called Anyverse raised €3 million ($3.37 million) for its synthetic sensor data, while another startup, AI.Reverie, published a paper about how it used simulated data to train a model to identify planes on airport runways.
After writing that initial story, I heard very little about synthetic data until my conversation earlier this month with Dan Jeavons, chief data scientist at Shell. When I asked him about Shell’s machine learning projects, using simulated data was one that he was incredibly excited about because it helps build models that can detect problems that occur only rarely.
“I think it’s a really interesting way to get info on the edge cases that we’re trying to solve,” he said. “Even though we have a lot of data, the big problem that we have is that, actually, we often only had a very few examples of what we’re looking for.”
In the oil business, corrosion in factories and pipelines is a big challenge, and one that can lead to catastrophic failures. That’s why companies are careful about not letting anything corrode to the point where it poses a risk. But that also means the machine learning models can’t be trained on real-world examples of corrosion. So Shell uses synthetic data to help.
As Jeavons explained, Shell is also using synthetic data to try to solve the problem of people smoking at gas stations. Shell doesn’t have many examples because the cameras don’t always catch the smokers; in other cases, the smokers are too far away or aren’t facing the camera. So the company is working hard on combining simulated synthetic data with real data to build computer vision models.
“Almost always the things we’re interested in are the ‘edge cases’ rather than the general norm,” said Jeavons. “And it’s quite easy to detect the edge [deviating] from the standard pattern, but it’s quite hard to detect the specific thing that you want.”
In the meantime, startup AI.Reverie endeavored to learn more about the accuracy of synthetic data. The paper it published, “RarePlanes: Synthetic Data Takes Flight,” lays out how its researchers combined human-annotated and validated satellite imagery of planes parked at airports with machine-generated synthetic data.
When using just synthetic data, the model was only about 55% accurate, whereas when it used only real-world data, that number jumped to 73%. But by making real-world data 10% of the training sample and using synthetic data for the rest, the model’s accuracy came in at 69%.
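To make the experiment concrete, here is a minimal sketch of what blending a mostly synthetic training set with a small slice of real data might look like. Everything in it (the `blend_datasets` helper, the toy image lists) is hypothetical illustration, not AI.Reverie’s actual pipeline:

```python
import random

def blend_datasets(real, synthetic, real_fraction, total_size, seed=0):
    """Draw a training set where `real_fraction` of the samples are real
    and the remainder are synthetic. Hypothetical helper for illustration."""
    rng = random.Random(seed)
    n_real = round(total_size * real_fraction)
    n_synth = total_size - n_real
    blended = rng.sample(real, n_real) + rng.sample(synthetic, n_synth)
    rng.shuffle(blended)  # mix the two sources so batches aren't segregated
    return blended

# Toy stand-ins for annotated images: a scarce real set, a plentiful synthetic one.
real_images = [("real", i) for i in range(100)]
synthetic_images = [("synthetic", i) for i in range(1000)]

# Mirror the 10% real / 90% synthetic mix described in the RarePlanes experiment.
train_set = blend_datasets(real_images, synthetic_images,
                           real_fraction=0.1, total_size=500)
n_real = sum(1 for source, _ in train_set if source == "real")
print(n_real, len(train_set))  # 50 real samples out of 500
```

The point of the sketch is the economics: the scarce, expensive real annotations are stretched across a much larger training set padded out with cheap synthetic samples.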
Paul Walborsky, the CEO of AI.Reverie (and the former CEO at GigaOM; in other words, my former boss), says that synthetic data is going to be a big business. Companies using such data need to account for ways that their fake data can skew the model, but if they can do that, they can achieve robust models faster and at a lower cost than if they relied on real-world data.
So even though IoT sensors are throwing off petabytes of data, it would be impossible to annotate all of it and use it for training models. And as Jeavons points out, those petabytes of data may not have the situation you actually want the computer to look for. In other words, expect the wave of synthetic and simulated data to keep on coming.
“We’re convinced that, actually, this is going to be the future in terms of making things work well,” said Jeavons, “both in the cloud and at the edge for some of these complex use cases.”