I was in Seattle this week at Microsoft’s Build conference, where video was one of the stars of the show. The software company demoed a camera-equipped drone that flew over oil pipelines and used machine learning to spot anomalies in the pipes. When it detected one, the drone sent an image of the problem back to the engineers monitoring the pipeline’s integrity.
Ensuring pipeline integrity is a common drone scenario, although there is already a robust business in using pressure, temperature, and flow sensors, along with specialized software, to predict and track leaks. But that sensor-based approach has been called into question. Meanwhile, Microsoft’s head of IoT efforts, Sam George, points out that video analysis is cheaper than deploying sensors over hundreds of miles of pipeline.
Pipeline monitoring isn’t the only area where video may soon play a role. Smart cities are already using video to analyze traffic patterns at intersections, help people find parking spots, and track pedestrian activity. Other companies are pushing for cameras that detect defects in manufacturing lines or capture what goes wrong when a machine fails.
Video has a lot going for it. Cameras are cheap, high resolution, and capable of capturing a lot of data that we can train computers to see and react to. So the focus on video is logical, but inefficient. Humans are excellent visual processors. We don’t experience the world, we see it. Our thoughts on issues are our viewpoints. We are visual creatures, much like dogs are scent-oriented.
Computers are not. Computers, even those running the most advanced neural networks, take binary on/off signals and translate them into data. Their “brains” are geared toward processing massive quantities of data. But for a computer to see, it first has to translate a physical or represented object into a binary representation and then compare that set of on/off patterns to tens of thousands or even millions of other patterns. Only then can it “recognize” what’s in front of it.
From that point of recognition, a new set of computer instructions then tells the computer how to handle a particular object. Or perhaps it’s programmed to pattern match and look for anomalies. Either way, the process requires a tremendous amount of data, a tremendous amount of processing, and—depending on the sophistication of the image recognition model—a good bit of time. A machine trained to recognize a dog, a cat, or a person will be much faster than one trying to perform general purpose object recognition, for example.
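To make the compare-the-patterns idea concrete, here is a toy sketch in Python. It is my own illustration, not how any product mentioned here works, and real neural networks learn features instead of storing whole reference images. The function names, the 3x3 images, and the two stored patterns are all made up for the example.

```python
# Toy illustration only: every name and pattern below is hypothetical.

def to_bits(image, threshold=128):
    """Flatten a grayscale image (rows of 0-255 values) into on/off bits."""
    return [1 if pixel >= threshold else 0 for row in image for pixel in row]

def hamming(a, b):
    """Count the positions where two equal-length bit patterns differ."""
    return sum(x != y for x, y in zip(a, b))

def recognize(image, known_patterns):
    """Return the label of the stored pattern closest to the image."""
    bits = to_bits(image)
    return min(known_patterns, key=lambda label: hamming(bits, known_patterns[label]))

# Two stored 3x3 reference patterns, flattened row by row.
KNOWN = {
    "vertical bar":   [0, 1, 0, 0, 1, 0, 0, 1, 0],
    "horizontal bar": [0, 0, 0, 1, 1, 1, 0, 0, 0],
}

# A vertical bar with one stray bright pixel in the middle row.
noisy_vertical = [
    [10, 240, 12],
    [200, 250, 8],
    [5, 230, 20],
]

print(recognize(noisy_vertical, KNOWN))
```

Even in this tiny case, the work grows with both the number of pixels and the number of stored patterns, which hints at why narrower recognition tasks run faster than general-purpose ones.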
However inefficient it may be in terms of bandwidth or computing resources, using computer vision and video may prove to be the best solution, because it enables humans to control computers in ways that are familiar to us. A company called Lighthouse offers a powerful example with its smart camera, which users control using natural voice commands.
Lighthouse has embedded time-of-flight sensors and a natural language processing engine into its product. That means the camera understands spoken commands and can translate them into actions the computer can carry out. So, for example, a coffee shop manager concerned about late employees might set up the camera in the store and instruct it to “notify me if you don’t see anyone in the shop by 5 a.m.”
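I don’t know how Lighthouse represents such a request internally, so treat the following as a guess at the kind of rule a spoken command could boil down to once the language processing is done. The function name, its parameters, and the idea of a list of sighting timestamps are all my own assumptions.

```python
from datetime import datetime, time

def should_notify(person_sightings, deadline, now):
    """Hypothetical rule: alert once the deadline has passed with no sighting that day.

    person_sightings: datetimes at which the camera saw a person (assumed input)
    deadline: time of day by which someone should have been seen
    now: the current datetime
    """
    if now.time() < deadline:
        return False  # too early to decide
    seen_today = any(
        s.date() == now.date() and s.time() <= deadline
        for s in person_sightings
    )
    return not seen_today

# "Notify me if you don't see anyone in the shop by 5 a.m."
five_am = time(5, 0)
sightings = [datetime(2018, 5, 9, 4, 50)]  # someone showed up yesterday only
print(should_notify(sightings, five_am, datetime(2018, 5, 10, 5, 1)))
```

The rule itself is trivial. The hard parts are turning speech into it and deciding reliably that a person is in the frame.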
The inefficiency that remains is creating a new set of opportunities for technology firms. Take Microsoft and its Project Brainwave FPGA computing efforts. FPGAs, or field-programmable gate arrays, are semiconductors that can be reconfigured to handle specific types of computing jobs. Microsoft has been using them in its data centers for quite some time. At Build, it said it plans to push those FPGAs to the edge.
Doug Burger, a senior CPU architect at Microsoft, says that the edge in this case isn’t battery-powered devices, but servers that run on factory floors, and perhaps one day inside an autonomous car. The FPGAs come on boards designed to slot into the servers and handle machine learning jobs. One of those jobs is processing video. Locating the video in a box on the factory floor (or perhaps on a gateway attached to a street lamp) cuts down on the cost of moving huge chunks of data to the cloud. It can also allow faster reaction times once the data is analyzed.
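Here is a minimal sketch of why analyzing video locally saves bandwidth. It assumes a deliberately crude anomaly test (average pixel difference from a baseline frame) in place of a real machine learning model, and the frame sizes, threshold, and function names are invented for illustration. Nothing here reflects Microsoft’s actual software.

```python
def frame_difference(frame, baseline):
    """Average absolute per-pixel difference between a frame and the baseline."""
    return sum(abs(a - b) for a, b in zip(frame, baseline)) / len(frame)

def filter_at_edge(frames, baseline, threshold):
    """Keep only the frames that differ enough from the baseline to be worth sending."""
    return [f for f in frames if frame_difference(f, baseline) > threshold]

# Hypothetical 16-pixel grayscale frames; real frames are millions of pixels.
baseline = [100] * 16
frames = [
    [100] * 16,             # nothing happening
    [101] * 16,             # sensor noise
    [100] * 8 + [220] * 8,  # something changed in half the frame
    [99] * 16,              # sensor noise
]

to_upload = filter_at_edge(frames, baseline, threshold=10.0)
raw_values = sum(len(f) for f in frames)
sent_values = sum(len(f) for f in to_upload)
print(f"sent {sent_values} of {raw_values} pixel values to the cloud")
```

In this made-up run only one frame in four leaves the building. The same logic is what lets a box on the factory floor react right away instead of waiting on a round trip to a data center.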
Burger says that because the algorithms and needs of these edge computers change constantly, it makes sense to have flexible silicon that can be reconfigured for new jobs. That explains why Microsoft is working with Intel on these FPGAs. Meanwhile, Nvidia has been pushing its graphics processors for machine learning at the edge. Intel has also purchased a company called Movidius to help deliver machine learning capabilities for battery-powered computers.
And startups such as the recently launched Swim.ai and Zededa are building software that can help speed up the process of analysis and share it in a distributed fashion between connected devices such as cameras, sensors, and more. The companies aren’t betting solely on video, but the platforms are perfect for handling the computing needs of real-time (or close-to-real-time) video analytics. I expect we’ll see even more in the next few months.
What I’m wondering is, how far can we take video at the edge? Can it replace existing sensors, or take the place of new sensors as IoT expands into new environments? There are so many ways to understand what’s happening in a room using sound signatures, pressure sensors, and even RF interference. And some of these approaches generate a lot less data for a computer to handle. Are we gravitating toward cameras because of our innate bias toward visual data? How do we determine when we really need video in an IoT project and when it makes sense to use something else?
Those decisions fundamentally alter the types of computing architecture and investment a project needs, so I’m hoping we see some best practices and good commentary emerge from those who are building out various IoT pilots. Let me know.