Machine learning on high-sample-rate sensor data is different. For a lot of reasons. The outcomes can be very powerful – just look at the proliferation of "smart" devices and the things they can do. But the process that creates the "smarts" is fundamentally different from the way most engineers are used to working. It's by necessity more iterative, and it requires different analytical techniques than either traditional engineering methods or the methods that work on machine logs and slower time series.
What is high-sample-rate data?
But let's start by being clear about what we're talking about: high-sample-rate sensor data includes things like sound (8kHz – 44kHz), accelerometry (25Hz and up), vibration (100Hz on up to MHz), voltage and current, biometrics, and any other kind of physical-world data that you might think of as a waveform. With this kind of data, you are generally out of the realm of the statistician, and firmly in the territory of the signal processing engineer.
Machine logs and slower time series (e.g., pressure and temperature once per minute) can be analyzed effectively using both statistical and machine learning methods intended for time series data. But these higher-sample-rate datasets are much more complex, and these basic tools just won't work. One second of sound captured at 14.4kHz contains 14,400 data points, and the information it contains is more than just a statistical time series of pressure readings. It's a physical wave, with all of the properties that come along with physical waves, including oscillations, envelopes, phase, jitter, transients, and so on.
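To make that concrete, here is a rough sketch – assuming NumPy and SciPy, with a synthetic signal standing in for a real recording – of what one second at 14.4kHz looks like, and of the kind of wave properties (envelope, instantaneous phase) that come from signal processing rather than summary statistics:

```python
import numpy as np
from scipy.signal import hilbert

fs = 14_400                       # sample rate in Hz, matching the example above
t = np.arange(fs) / fs            # one second of timestamps -> 14,400 samples
# Synthetic stand-in for a real recording: a decaying 440 Hz tone plus a little noise
x = np.exp(-3 * t) * np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(fs)
print(x.shape)                    # (14400,)

# Wave-specific properties come from signal processing, not summary statistics
analytic = hilbert(x)                     # analytic signal via the Hilbert transform
envelope = np.abs(analytic)               # amplitude envelope
phase = np.unwrap(np.angle(analytic))     # instantaneous phase
```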
It's all about the features
For machine learning, this kind of data also presents another problem – high dimensionality. That one second of sound with 14,400 points, if used raw, is treated by most machine learning methods as a single vector with 14,400 columns. With thousands, let alone tens of thousands, of observations that wide, most machine learning algorithms will choke. Deep Learning (DL) methods offer a way of dealing with this high dimensionality, but the need to stream real-time, high-sample-rate data to a deep learning cloud service makes DL impractical for many applications.
So to apply machine learning, we compute "features" from the incoming data that reduce the large number of incoming data points to something more tractable. For data like sound or vibration, most engineers would probably start by selecting peaks from a Fast Fourier Transform (FFT) – a process that reduces raw waveform data to a set of coefficients, each representing the amount of energy contained in a slice of the frequency spectrum. But there is a wide array of options available for feature selection, each effective in different circumstances. For more on features, see our blog post called "It's all about the features".
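As a rough illustration of what that reduction looks like – the band count and the random stand-in signal below are arbitrary choices, not a recommendation – here is a minimal sketch of FFT-based band-energy features:

```python
import numpy as np

def fft_band_features(x, n_bands=32):
    """Reduce a raw waveform to coarse frequency-band energies.

    The FFT turns time-domain samples into spectral coefficients; summing
    the power in a handful of bands shrinks the dimensionality further.
    """
    power = np.abs(np.fft.rfft(x)) ** 2              # power spectrum
    bands = np.array_split(power, n_bands)           # equal-width frequency bands
    return np.array([band.sum() for band in bands])  # one energy value per band

x = np.random.randn(14_400)           # stand-in for one second of real sensor data
features = fft_band_features(x)
print(x.shape, "->", features.shape)  # (14400,) -> (32,)
```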
But this is about collecting data
But this post is really about collecting data – in particular about collecting data from high-sample-rate sensors for use with machine learning. In our experience, data collection is the most expensive, most time-consuming part of any project. So, it makes sense to do it right, right from the beginning.
Here are our top five suggestions for data collection to make your project successful:
1. Collect rich data
Though it may be difficult to work with directly, raw, fully-detailed, time-domain input collected by your sensor is extremely valuable. Don't discard it after you've computed an FFT and RMS – keep the original sampled signal. The best machine learning tools available can make sense out of it and extract the maximum information content. For more on why this is important, see our blog post on "Rich Data, Poor Data".
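One way to put this into practice – the file layout and field names below are just an illustration, not a prescription – is to store the raw samples in the same record as the derived summaries:

```python
import numpy as np

def save_observation(path, raw, fs, label):
    """Store the raw waveform alongside its derived summaries, not instead of them."""
    rms = np.sqrt(np.mean(raw ** 2))        # cheap summary, handy for quick checks
    spectrum = np.abs(np.fft.rfft(raw))     # derived view of the same data
    np.savez_compressed(path, raw=raw, fs=fs, label=label,
                        rms=rms, spectrum=spectrum)

# One second of (synthetic) data at 14.4kHz; the raw samples travel with the record
save_observation("obs_0001.npz", np.random.randn(14_400), 14_400, "bearing_ok")
```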
2. Use the maximum sample rate available, at least at first
It takes more bandwidth to transmit and more space to store, but it's much easier to downsample in software than to go back and re-collect data to find out whether a higher sample rate improves accuracy. Really great tools for working with sensor data will let you downsample repeatedly in software and explore the relationship between sample rate and model accuracy. If you do this with your early data, then once you have a preliminary model in place you can optimize your rig and design the most cost-effective solution for broader deployment, knowing that you're making the right call.
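Here is a rough sketch of that kind of sample-rate sweep, assuming SciPy and scikit-learn and a placeholder feature extractor; your own features and classifier would slot in instead:

```python
import numpy as np
from scipy.signal import decimate
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def band_energies(x, n_bands=16):
    power = np.abs(np.fft.rfft(x)) ** 2
    return np.array([b.sum() for b in np.array_split(power, n_bands)])

def accuracy_at_factor(X_raw, y, factor):
    """Downsample every waveform by `factor`, re-extract features, re-score."""
    if factor > 1:
        X_raw = [decimate(x, factor) for x in X_raw]
    feats = np.array([band_energies(x) for x in X_raw])
    return cross_val_score(RandomForestClassifier(), feats, y, cv=5).mean()

# X_raw: list of waveforms collected at the maximum rate; y: their labels
# for factor in (1, 2, 4, 8):
#     print(factor, accuracy_at_factor(X_raw, y, factor))
```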
3. Don't over-engineer your rig
Do what's easiest first, and save the more expensive work for when you know it's worth it. If one is available to support your use case, try a prototyping device for your early data collects to explore both project feasibility and the real requirements for your data collection rig before you commit.
4. Plan your data collection to cover all sources of variation
Successful real-world machine learning is an exercise in overcoming variation with data. Variation comes from the target (what you are trying to detect), from the background (noise, different environments and conditions), and from the collection equipment (different sensors, placement, variations in mounting). Minimize any unnecessary variation – equipment variation is usually the easiest to eliminate or control – and make sure you capture data that covers as much of the likely real-world target variation, in as many different backgrounds, as possible. The better you cover the gamut of background, target and equipment variation, the more successful your machine learning project will be – meaning the better it will be able to make accurate predictions in the real world.
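A simple way to keep track of that coverage – the field names here are hypothetical, so tailor them to your project – is to log the target, background and equipment factors alongside every recording:

```python
import csv
from datetime import datetime, timezone

# Hypothetical collection log: one row per recording, covering the target,
# background and equipment factors you plan to vary (or hold constant)
FIELDS = ["file", "timestamp", "target", "background",
          "sensor_model", "mount", "placement", "notes"]

def log_recording(log_path, **info):
    """Append one recording's metadata so coverage gaps are easy to spot later."""
    info.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    with open(log_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:             # brand-new file: write the header once
            writer.writeheader()
        writer.writerow(info)

log_recording("collection_log.csv", file="obs_0001.npz", target="bearing_fault",
              background="factory_floor_idle", sensor_model="accel_A",
              mount="magnetic", placement="motor_housing")
```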
5. Collect iteratively
Machine learning works best as an iterative process. Start by collecting just enough data to build a bare-bones model that proves the effectiveness of the technique, even if not yet across the full range of variation expected in the real world, and then use those results to fine-tune your approach. Engage with the analytical tools early – right from the beginning – and use them to judge your progress.

Take the next data you get from the field and test your bare-bones model against it to get an accuracy benchmark. Take note of the specific areas where it performs well and where it performs poorly. Retrain using the new data and test again. Use this to chart your progress and also to guide your data collection – circumstances where the model performs poorly are circumstances where you'll want to collect more data. When you get to the point where you're getting acceptable accuracy on new data coming in – what we call "generalizing" – you're just about done. Now you can focus on model optimization and tweaking to get the best possible performance.
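To make that benchmark-and-retrain loop concrete, here is a minimal sketch, assuming scikit-learn and features already extracted by your own pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def benchmark_iteration(model, X_train, y_train, X_new, y_new):
    """Train on everything collected so far, then score on the newest field data.

    The per-class breakdown points at the circumstances that need more data.
    """
    model.fit(X_train, y_train)
    preds = model.predict(X_new)
    print("accuracy on new field data:", accuracy_score(y_new, preds))
    for label in np.unique(y_new):
        mask = y_new == label
        print(f"  {label}: {accuracy_score(y_new[mask], preds[mask]):.2f}")
    # Next iteration: fold the new data into the training set and repeat
    return np.vstack([X_train, X_new]), np.concatenate([y_train, y_new])

# X_* / y_* come from your own feature pipeline; for example:
# X_train, y_train = benchmark_iteration(RandomForestClassifier(),
#                                        X_train, y_train, X_new, y_new)
```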