The internet of things is going to generate a lot of data. IoT is all about data: cheap computing, ubiquitous connectivity, and low-cost sensors change the depth with which we can see the world around us and the processes that drive it. But the promise of all that data confuses people.
At every conference I attend, someone inevitably gets up on stage to talk about how their company has transformed its business with better data. One of the essential steps in that process, they note, involves gathering tons of data from sensors and then sending it to the cloud, to something called a data lake. It’s the IT equivalent of hoarding. And it is not necessary.
The idea is that a company doesn’t know what data it needs or what might become important, so it tries to keep everything. Many companies that make connected products start out by taking this approach, but in most cases they should avoid data lakes entirely. Like a poorly organized closet, a data lake can breed confusion over what data matters and how a company can use it. It’s also expensive.
And as more and more data is digitized, keeping it all is an exercise in futility. For example, Mimi Spier, vice president of the internet of things business at VMware, says that in some cases the cloud providers and telcos can’t handle the volume of data sent by connected devices. In one example, it took only 48 hours of receiving constant data before both the cloud provider and the telco gave up. “Companies are not going to be able to send everything to the cloud,” she says. “They’ll have to decide what is necessary and important.”
Even if they are physically able to send all of their data to the cloud, most don’t want to because of intellectual property theft concerns, cost, and regulations requiring that user data stay inside specific locales. So a company has to start thinking about its data in terms of the stuff it needs to send to the cloud and the stuff it needs to handle on the ground.
The most popular reason to send large chunks of data to the cloud is that companies hope they can one day run it through some magical neural network that will generate an algorithm designed to improve their business processes. But figuring out how to train a neural network, and what data to train it on, requires knowledge of the business itself as well as deep data science expertise.
Plus, a lot of this can be done in gateway devices or a cluster of servers sitting on the factory floor or in an office. Jason Mann, vice president of IoT for SAS, says that detecting variations in machinery is something that can generally be done where the data lives. Algorithms running locally can detect abnormalities and raise alerts.
For example, a sensor can take product measurements and trigger alerts when results deviate from a baseline measurement. Other use cases include tracking when sound or vibration data changes, or when a process takes noticeably more or less time than average.
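The kind of local check Mann describes can be surprisingly simple. Here’s a minimal sketch in Python — the baseline sample and the three-standard-deviation threshold are illustrative assumptions, not SAS’s actual method:

```python
from statistics import mean, stdev

def deviation_alerts(readings, baseline, tolerance=3.0):
    """Flag any reading more than `tolerance` standard deviations
    away from the mean of a known-good baseline sample."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [(i, v) for i, v in enumerate(readings)
            if abs(v - mu) > tolerance * sigma]

# Baseline: measurements taken while the machine ran normally.
baseline = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1]
print(deviation_alerts([10.0, 10.1, 12.5, 9.9], baseline))
# → [(2, 12.5)]
```

A few lines like this, running on a gateway, can decide locally which readings are worth acting on — no cloud round-trip required.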
SAS has an obvious interest and business here, but there are several startups trying to apply this sort of analytics to data before it hits a cloud. Falkonry, which just raised $4.6 million in funding, has built software that analyzes data from a machine or process and builds a model of how it should behave. Falkonry parses data on gateways at the edge and even trains the models it uses to detect problems there.
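To make “building a model of how a process should behave” concrete, here is one simple way a gateway could learn a baseline on the fly — an exponentially weighted mean and variance, updated in place as readings arrive. This is purely an illustrative sketch, not Falkonry’s actual algorithm, and the smoothing factor and threshold are assumptions:

```python
class EdgeBaseline:
    """Online estimate of a signal's normal range, maintained on the
    gateway itself. Illustrative only -- not Falkonry's algorithm."""

    def __init__(self, alpha=0.1, tolerance=4.0):
        self.alpha = alpha          # smoothing factor for the updates
        self.tolerance = tolerance  # alert threshold, in std deviations
        self.mean = None
        self.var = 0.0

    def update(self, value):
        """Feed one reading; return True if it looks anomalous."""
        if self.mean is None:       # first reading seeds the model
            self.mean = value
            return False
        diff = value - self.mean
        std = self.var ** 0.5
        anomalous = std > 0 and abs(diff) > self.tolerance * std
        # Fold only normal readings into the model, so a fault
        # doesn't gradually become the new "normal".
        if not anomalous:
            self.mean += self.alpha * diff
            self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return anomalous
```

Because anomalous readings are excluded from the update, a developing fault can’t quietly redefine what the model considers normal.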
Nikunj Mehta, the CEO and founder of Falkonry, says that one of its customers was trying to decide if it was going to build a data lake or rely on some kind of edge analytics. It chose to halt its data lake plans once it discovered the capabilities of Falkonry’s software. This is a self-serving tale, but it is an example of how quickly machine learning deployed at the edge can eliminate the need for a company to send massive sets of data to a cloud.
Falkonry isn’t alone. Swim.ai is also trying to provide the smarts to use data in situ without having to employ an army of data scientists. But for companies looking to take advantage of their technologies, Mehta has some advice.
You have to already know what data is relevant and whether it is currently being generated. For some companies, knowing either of those things is a tall order. But once you figure that out, you may be on your way to dumping the data lake and taking advantage of learning at the edge.