Grow VC Group
  • Home
  • Group
  • Team
  • FAQ
  • Join Us
  • Trainee Program
  • Contact
  • News

Massive data versus relevant data – simply a case of quantity over quality?

1/16/2022

Andrew Ng, the co-founder of Google Brain, has commented that “massive data sets aren’t essential for AI innovation.” Some years ago, I spoke with a person from a tech giant that also wanted to get into the data business. I asked him what data they wanted to collect and how it would be used. His answer was to collect all possible data and then find a way to utilize it. That response says a lot about the data business.

Many companies want to start their data journey with a massive IT project to collect and store lots of data. The discussion then quickly turns to IT architecture, tool selection and how to build all the integrations. These projects consume large amounts of time and resources.

What is rarely considered is the real value we want from the data. And even when there is a plan for that, it can be forgotten during months or years of IT architecture, integration and piping projects. These projects are often run not by the people who want to utilize the data but by IT bureaucrats.

Mr Ng also commented that people often believe you need massive data to develop machine learning or AI. There seems to be a belief that quantity can compensate for quality in data analytics and AI. I remember a discussion with a wearable device company in which their spokesperson claimed you need data from millions of people to find anything useful for building models.

There are use cases where big data is valuable. Still, in many use cases you can extract considerable value from small data sets, especially if the data is relevant. We can also think in terms of horizontal and vertical data sets: do we want to analyze one data point from millions of people, or numerous data points from a small number of people? By horizontal and vertical I don’t mean how the data is organized in a table. The horizontal approach collects one thing from many subjects, e.g. heart rate from millions of people; the vertical approach collects more data from fewer subjects, e.g. a lot of wellness data from a smaller sample group.
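The difference between the two shapes of data set can be sketched in a few lines of Python. This is only an illustration: the `Reading` record, the person IDs, the metric names and all values are hypothetical and not taken from any real device or study.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One measurement for one person (hypothetical record layout)."""
    person_id: str
    metric: str
    value: float


# Horizontal: a single metric collected from many people (made-up values).
horizontal = [
    Reading("p001", "heart_rate", 62.0),
    Reading("p002", "heart_rate", 71.0),
    Reading("p003", "heart_rate", 58.0),
    Reading("p004", "heart_rate", 80.0),
]

# Vertical: many metrics collected from very few people (made-up values).
vertical = [
    Reading("p001", "heart_rate", 62.0),
    Reading("p001", "steps", 8400.0),
    Reading("p001", "sleep_hours", 7.5),
    Reading("p001", "blood_glucose", 5.4),
    Reading("p001", "blood_pressure_systolic", 118.0),
]


def shape(readings):
    """Return (number of distinct people, number of distinct metrics)."""
    people = {r.person_id for r in readings}
    metrics = {r.metric for r in readings}
    return len(people), len(metrics)


print(shape(horizontal))  # (4, 1): many people, one metric
print(shape(vertical))    # (1, 5): one person, many metrics
```

Both lists use the same record layout, so the distinction is not about how the data is stored but about what was collected and from whom.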

Wellness data illustrates the question well (no pun intended). A wearable device collects steps, heart rate and sleeping time from millions of people. We can then analyze this data to determine whether more steps and a higher heart rate during the day predict that a person sleeps longer that night, and build a model that predicts similar outcomes for other people. But does that help us understand an individual’s wellness, sleep and health any better?

We can also take another approach to building analytical models. One individual uses several wearable devices to collect the usual exercise, heart and sleep data, but also blood pressure, blood glucose, body temperature, weight and some disease data. Now we might get different results about the relationship between heart rate, steps and sleep. We might see that the relationship depends on other variables, e.g. that high blood glucose or blood pressure changes the pattern that holds for healthy people. Findings like these can come from a small number of people.
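A toy calculation shows how an extra variable can change the picture. The records below are invented purely for illustration and say nothing about real health relationships. The sketch fits an ordinary least-squares slope of sleep on steps, first for everyone pooled together and then separately by a hypothetical `high_glucose` flag.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Invented records: (thousand steps per day, sleep hours, high_glucose flag).
records = [
    (4, 6.5, False), (6, 7.0, False), (8, 7.5, False), (10, 8.0, False),
    (4, 7.5, True), (6, 7.0, True), (8, 6.5, True), (10, 6.0, True),
]


def slope_for(rows):
    """Slope of sleep hours on steps for the given rows."""
    return slope([r[0] for r in rows], [r[1] for r in rows])


pooled = slope_for(records)
normal = slope_for([r for r in records if not r[2]])
high = slope_for([r for r in records if r[2]])

print(pooled)  # 0.0: no visible relationship when everyone is pooled
print(normal)  # 0.25: more steps go with more sleep in this group
print(high)    # -0.25: the pattern is reversed in this group
```

In this made-up data set, a model that sees only steps and sleep finds nothing, while the same eight records split by one additional variable show two clear and opposite patterns.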

The examples above are not intended to draw any conclusions about what is relevant when analyzing health. Those conclusions must be drawn from the data itself, but the examples illustrate how it is possible to take different approaches and get quite different results. Wearable data is currently a good example of big data thinking: the target has been to collect a few data points from millions and millions of people and then simply train models to conclude something from them, although we don’t know how relevant those data points are. It is also possible to build models from rich data on a few individuals, and that can actually be an exciting and valuable AI modelling task.

Of course, there are also cases where models can be built from a massive amount of data even though we don’t know whether it is relevant. For example, this podcast talks about hedge funds that try to collect all kinds of data and then build models to see if they can predict stock market movements. This includes much more than the traditional finance data used for investments, for example how people buy different kinds of food, watch streaming content and spend their free time. The funds then look for ‘weak signals’ to predict trends and their impact on the investment market. This differs from many other data analytics cases: it doesn’t focus on analyzing a particular detail but collects all kinds of data indiscriminately, hoping to find new variables that could give a competitive advantage.

In most use cases, utilizing data and building AI should start with understanding the need and the target. Relevant data can then be chosen based on actual needs and on testing which data matters. Small but relevant data can produce a useful AI model. This typically requires taking the context into account, not just collecting a lot of random data points and building a model on them. Whatever data you have, you can always build a model, but that doesn’t guarantee the model makes any sense. Companies and developers should focus more on relevant data than on big data.
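The point that a model can always be built, whether or not it makes sense, can be demonstrated with pure noise. In the sketch below, both the target and all the candidate features are random numbers with no relationship to each other. The sizes (10 people, 200 features) and the seed are arbitrary assumptions chosen for the demonstration.

```python
import random


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_people, n_features = 10, 200

# The target and every feature are independent random noise.
target = [rng.gauss(0, 1) for _ in range(n_people)]
features = [
    [rng.gauss(0, 1) for _ in range(n_people)]
    for _ in range(n_features)
]

# Pick the random feature that happens to track the target most closely.
best = max(abs(correlation(f, target)) for f in features)
print(f"strongest |correlation| among {n_features} random features: {best:.2f}")
```

With this many irrelevant features and so few people, at least one feature will almost always look strongly correlated with the target by chance alone. A model built on it would fit the collected data and still mean nothing, which is why relevance has to be judged from the need and the context rather than from the fit.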
​

    About

    Est. 2009, Grow VC Group is building truly global digital businesses. The focus is especially on digitization, data and fintech services. We take a very hands-on approach to building businesses, and we always want them to be global, to scale up and to have a real entrepreneurial spirit.


Digital Intelligence Globally
© 2009-2023 Grow VC Operations Ltd. All Rights Reserved.