Massive data versus relevant data – simply a case of quantity over quality?

1/16/2022

Andrew Ng, a co-founder of Google Brain, commented that “massive data sets aren’t essential for AI innovation.” Some years ago, I spoke with a person from a tech giant that also wanted to get into the data business. I asked him what data they wanted to collect and how it would be used. His answer was to get all possible data and then find a way to utilize it. His response says a lot about the data business.

Many companies want to start their data journey with a massive IT project to collect and store lots of data. The discussion then easily turns to IT architecture, tool selection and how to build all the integrations. These projects take up large amounts of time and resources.

What is rarely considered is the real value we want from the data. And even if we have a plan for that, it can be forgotten during months or years of IT architecture, integration and piping projects. These projects are not run by people who want to utilize data; they are often run by IT bureaucrats.

Mr Ng also commented that people often believe you need massive data to develop machine learning or AI. There seems to be a belief that quantity can compensate for quality in data analytics and AI. I remember a discussion with a wearable device company in which their spokesperson claimed you needed data from millions of people to find anything useful for building models.

There are use cases where big data is valuable. Still, the reality is that in many use cases, you can extract considerable value from small data sets, especially if the data is relevant. We can also think of horizontal and vertical data sets: do we want to analyze one data point from millions of people, or numerous data points from a small number of people? By horizontal and vertical, I don’t mean how the data is organized in a table. The horizontal approach collects something from many objects, e.g. heart rate from millions of people, while the vertical approach collects more data from fewer objects, e.g. a lot of wellness data from a smaller sample group.

But does it help us understand an individual’s wellness, sleep and health better? Wellness data illustrates the question well (no pun intended). A wearable device collects steps, heart rate and sleeping time from millions of people. We can then analyze this data to determine whether more steps and a higher heart rate during the day predict that a person sleeps longer that night. Then we can build a model that predicts similar outcomes for other people.

We can take another approach to building analytical models. One individual uses several wearable devices to collect not only the usual exercise, heart and sleep data, but also blood pressure, blood glucose, body temperature, weight and some disease data. Now we might get different results about the relationship between heart rate, steps and sleep. We might see that the relationship depends on other variables, e.g. high blood glucose or blood pressure changes the pattern that works for healthy people. These findings can be made from a small number of people.
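To make the difference concrete, here is a minimal sketch in Python. The numbers are entirely synthetic and invented for illustration; they are not real health data and say nothing about actual physiology. The sketch assumes that more steps lengthen sleep for people with normal blood glucose and shorten it for people with high blood glucose, and then compares a model that only sees steps and sleep with one that also knows the glucose status. The names `ols_slope`, `slope_for` and `records` are my own.

```python
from statistics import mean


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic, noise-free records (invented for illustration only):
# (daily steps in thousands, high blood glucose flag, hours of sleep).
# Assumed pattern: +0.15 h of sleep per 1,000 steps with normal glucose,
# -0.10 h per 1,000 steps with high glucose.
records = []
for steps in range(2, 15):
    records.append((steps, False, 7.0 + 0.15 * (steps - 8)))
    records.append((steps, True, 7.0 - 0.10 * (steps - 8)))


def slope_for(rows):
    """Slope of sleep on steps for the given subset of records."""
    return ols_slope([r[0] for r in rows], [r[2] for r in rows])


pooled = slope_for(records)                           # steps and sleep only
normal = slope_for([r for r in records if not r[1]])  # normal glucose
high = slope_for([r for r in records if r[1]])        # high glucose

print(f"steps and sleep only: {pooled:+.3f} h per 1,000 steps")
print(f"normal blood glucose: {normal:+.3f} h per 1,000 steps")
print(f"high blood glucose:   {high:+.3f} h per 1,000 steps")
```

In this toy data the pooled slope is close to zero, so a model that only sees steps and sleep would report a weak relationship, even though the relationship is clear and points in opposite directions once blood glucose is taken into account. Adding more people with the same three variables would not reveal that; adding one relevant variable does.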

The examples above are not intended to draw any conclusions about what is relevant for analyzing health. Those conclusions must be drawn from the data itself, but the examples illustrate how it is possible to take different approaches and get quite different results. Wearable data at the moment is a good example of big data thinking; the target has been to collect a few data points from millions and millions of people and then just train models to conclude something from them, although we don’t know how relevant those data points are. It is also possible to build models from rich data of a few individuals, and actually, it can be an exciting and valuable AI modelling task.

Of course, there are also cases where models can be built from a massive amount of data even though we don’t know if it is relevant. For example, this podcast talks about hedge funds that try to collect all kinds of data and then build models to see if they can predict stock market movements. This includes much more than traditional finance data for investments: for example, how people buy different kinds of food, watch streaming content and spend their free time. The aim is to find ‘weak signals’ that predict trends and their impact on the investment market. This differs from many other data analytics cases because it doesn’t focus on analyzing a particular detail. Instead, it collects all kinds of data more or less at random, hoping to find new variables that could give a competitive advantage.

In most use cases, utilizing data and building AI should start from understanding the need and the target. Relevant data can then be chosen based on actual needs and by testing which data matters. Small but relevant data can produce a useful AI model. This typically requires taking the context into account, not just collecting a lot of random data points and building a model on top of them. Whatever data you have, you can always build a model, but that doesn’t guarantee the model makes any sense. Companies and developers should focus more on relevant data than big data.
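The point that you can always build a model can be shown with a small sketch, assuming nothing but random numbers. It generates a random target and 500 random candidate variables that by construction have no relationship to it, picks the variable that correlates best with the target in a small sample of 20 observations, and then checks the same variable on 1,000 fresh observations. The sample sizes, the seed and all names are arbitrary choices of mine.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable
N_TRAIN, N_TEST, N_FEATURES = 20, 1000, 500
n = N_TRAIN + N_TEST

# Pure noise: the target and every candidate variable are independent.
target = [rng.gauss(0, 1) for _ in range(n)]
features = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(N_FEATURES)]


def train_corr(feature):
    """Correlation with the target on the small training sample only."""
    return pearson(feature[:N_TRAIN], target[:N_TRAIN])


# "Model building": keep whichever variable looks best on the small sample.
best = max(features, key=lambda f: abs(train_corr(f)))

in_sample = train_corr(best)
out_of_sample = pearson(best[N_TRAIN:], target[N_TRAIN:])

print(f"correlation on the 20 training rows: {in_sample:+.2f}")
print(f"correlation on 1,000 fresh rows:     {out_of_sample:+.2f}")
```

The best variable looks convincingly related to the target on the training rows, and the relationship all but disappears on fresh data. With enough random variables something will always appear to fit, which is why relevance and context have to come from understanding the problem rather than from the volume of data.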
