January 18, 2024

No professional AI without a valid data foundation

Data preparation is anything but trivial

Artificial intelligence can do nothing without data. This principle is clearly illustrated by the broad spectrum of current and future AI application scenarios. In the B2B environment, one thing is particularly important: machine-generated data that is valid and available in large quantities. Obtaining it is an undertaking with many things to consider.

Of course, the comparison between AI and the human brain is a bit of a stretch, but at this point it makes it clear what it is all about: the brain of a newborn baby is structured in exactly the same way and is just as complex as that of an adult. But it is only through sensory impressions and external perception – i.e. through data – that the synapses grow and network so that the newborn learns. And at enormous speed. As a result, the weight of the brain doubles in the first year of a person’s life.

Nothing works without training

Most AI algorithms can only perform their tasks reliably if they are fed with a lot of data. This is because in the field of machine learning (ML), every algorithm strives to establish the closest possible correlation between input, i.e. the data, and output. In other words, an algorithm that has analyzed millions of different cat pictures is highly likely to recognize a new picture of a cat as such, even in unusual environments or postures. The more different cat breeds it has seen in its “training phase”, the more reliably it will be able to distinguish and name the breeds when it sees new cat pictures, provided it has received correct and correctly labeled data as input. An algorithm that has not received many images of, say, Angora cats correctly assigned to this breed can never identify a picture of an Angora cat as such. It is therefore crucial that it is trained appropriately and correctly.

Only validated data produces correct results …

This applies to many algorithms. In recent months, for example, it has been reported that AI can now detect skin cancer more reliably than doctors themselves. “The artificial intelligence was significantly better than the average performance of the doctors,” reports Holger Hänßle, Professor of Dermatology at Freiburg University Hospital, in the daily newspaper “Die Welt”. For the experiment, an artificial neural network was trained with 100,000 photos showing either correctly diagnosed malignant melanoma (black skin cancer) or harmless moles, and the algorithm was trained to recognize the difference. It did so significantly better than highly respected specialists worldwide: of 58 such top specialists, only 13 performed better than the AI. Another example is speech recognition. Here, too, the algorithms have to be trained on correct words and sounds so that voice assistants such as Siri or Alexa understand the user’s words correctly.

… so remember: those who learn the wrong things draw the wrong conclusions

This means that AI algorithms must not only be fed with a lot of data, but above all with correct data in order to achieve good results. Here, again, the analogy to humans holds: if we learn something incorrectly, we come to incorrect conclusions.

Obtaining this correct training data in sufficient quantities is not entirely trivial. The data is often stored in individual silos, and often nobody knows exactly whether and how it was changed in the past, or exactly who it came from. This applies in a very special way to data that comes from outside, and it does not only concern highly sensitive health or diagnostic data. Simple information on the sales of a product or the operating status of a machine may have been standardized retrospectively or, in the case of turnover figures for a particular statistic, corrected slightly upwards to make the graph look better. The result: the data is no longer correct and can no longer be used for analysis. If it is used anyway, the results, for example sales forecasts, are wrong. Remember: every AI algorithm wants to correlate its output as closely as possible with its input.

Ways to improve the data foundation for AI

Of course, data quality is not a new challenge. Incorrect master data, for example, has always led to errors in bookkeeping and accounting. Incorrect transaction data also continues to cause various problems, from dissatisfied customers and incorrectly configured machines to faulty analyses and forecasts. If we now imagine what erroneous data combined with AI can do in the worst case, it is a little frightening. After all, AI applications are increasingly automating parts of human decision-making.

To enable companies to feed their algorithms with correct data, aggregated in such a way that it can flow into the relevant AI systems, a growing number of providers offer professional support with data preparation.

Nine steps to a valid data foundation

It is generally advisable to adhere to the following nine steps:

  1. Eliminate or correct incorrect data and duplicates: There is an increasing amount of dirty data in the corporate environment: records that are incorrect, incomplete or stored multiple times. This dirty data contaminates the results of AI models if it is not corrected or removed (see the first sketch after this list).
  2. Standardize and format data: How many different ways are there to store names, addresses and other information in databases? Under how many designations and in which tables are they stored? Metadata repositories and data catalogs must be introduced and maintained accordingly. Image and sound data are also stored in a wide variety of formats and qualities; to make them accessible to machine learning algorithms, they too need to be standardized (the second sketch after this list illustrates this for tabular data).
  3. Update outdated information: Even if data has been saved correctly and in the correct format, it may no longer be up to date. ML algorithms cannot be trained correctly if relevant data is mixed with irrelevant or outdated data.
  4. Improve and enrich data: Sometimes the data available in the company is not sufficient to adequately feed machine learning models. Additional data is then required, for example from calculated fields or external sources.
  5. Reduce noise: Images, text and data can contain “noise”, i.e. extraneous information, such as irrelevant pixels, that does not help the algorithm. Powerful data preparation tools reduce this noise.
  6. Anonymize and neutralize data: All personal and personally identifiable data must be removed from the data records, anonymized or pseudonymized (see the pseudonymization sketch after this list). Records that bias the algorithms in one direction must also be eliminated. Example: if an ML algorithm searches for the ideal executive using real data that includes the executives’ gender, it will primarily pick men, simply because in most companies the upper management levels are predominantly occupied by men. The data must therefore either be adjusted to remove the gender attribute, or the female attribute must be weighted more heavily.
  7. Normalize data: If you normalize the data in your databases, you get redundancy-free data records, free them from anomalies and structure them clearly. This helps an AI to achieve better results.
  8. Provide a representative selection of data: With very large data sets, those responsible have to select subsets for training the AI. It must be ensured that these subsets represent the entirety of the data as accurately as possible (see the stratified sampling sketch after this list).
  9. Reinforce features: ML algorithms are trained on certain “features” in the data. To stay with the skin cancer screening example: attention must be paid, for instance, to irregularities in the shape of melanomas. Data preparation tools can emphasize such features in the data and improve their visibility for the algorithms being trained (see the contrast stretching sketch after this list).
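
The sketches below illustrate a few of these steps in Python. For step 1, this is a minimal sketch of removing duplicates and obviously invalid records with pandas; the file name and the columns "customer_id", "email" and "annual_revenue" are purely hypothetical.

    import pandas as pd

    # Hypothetical input: customer records exported from a CRM system.
    df = pd.read_csv("customers.csv")

    # Drop exact duplicate rows, then rows that repeat the same business key.
    df = df.drop_duplicates()
    df = df.drop_duplicates(subset=["customer_id"], keep="last")

    # Remove records with missing mandatory fields or implausible values.
    df = df.dropna(subset=["customer_id", "email"])
    df = df[df["annual_revenue"] >= 0]

    df.to_csv("customers_clean.csv", index=False)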
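
For step 2, a minimal sketch of harmonizing spellings and date formats in tabular data; the columns and formats are again hypothetical, and image or sound data would need their own conversion pipelines.

    import pandas as pd

    # Hypothetical records in which the same customer and date appear in
    # several spellings and formats.
    df = pd.DataFrame({
        "customer": ["  müller, anna ", "Müller, Anna", "SCHMIDT, JÖRG"],
        "order_date": ["18.01.2024", "18.01.2024", "05.02.2024"],
    })

    # Trim whitespace and harmonize casing so identical names compare as equal.
    df["customer"] = df["customer"].str.strip().str.title()

    # Parse date strings written in one known format into real datetime values.
    df["order_date"] = pd.to_datetime(df["order_date"], format="%d.%m.%Y")

    # After standardization, the first two rows collapse into one record.
    print(df.drop_duplicates())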
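
For step 6, a minimal sketch of pseudonymization: direct identifiers are replaced by salted, non-reversible hashes and then dropped. The column names and the salt handling are assumptions; a real project would also require proper key management and a data protection review.

    import hashlib

    import pandas as pd

    # Hypothetical HR records; name and e-mail address are direct identifiers.
    df = pd.DataFrame({
        "employee_name": ["Anna Müller", "Jörg Schmidt"],
        "email": ["anna@example.com", "joerg@example.com"],
        "performance_score": [4.2, 3.8],
    })

    SALT = "replace-with-a-secret-value"  # assumption: stored outside the data set

    def pseudonymize(value: str) -> str:
        # Replace a personal identifier with a salted, non-reversible hash.
        return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

    df["person_key"] = df["employee_name"].map(pseudonymize)
    df = df.drop(columns=["employee_name", "email"])  # drop direct identifiers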
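
For step 8, a minimal sketch of drawing a training sample that preserves the class balance of the full data set, using scikit-learn's stratified split on placeholder data.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder data: 10,000 records with a rare positive class (10 percent).
    rng = np.random.default_rng(42)
    X = rng.random((10_000, 20))
    y = rng.choice([0, 1], size=10_000, p=[0.9, 0.1])

    # stratify=y keeps the 90/10 class ratio in both the sample and the rest.
    X_sample, X_rest, y_sample, y_rest = train_test_split(
        X, y, train_size=0.2, stratify=y, random_state=42
    )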
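
For step 9, a minimal illustration of reinforcing features: simple contrast stretching makes low-contrast structures, such as lesion borders, easier for a vision model to pick up. This is a generic technique assumed for illustration only, not the method used in the skin cancer study cited above.

    import numpy as np

    def stretch_contrast(image: np.ndarray) -> np.ndarray:
        # Rescale pixel intensities to the full 0..1 range so that subtle
        # structures become easier to distinguish.
        lo, hi = float(image.min()), float(image.max())
        if hi == lo:
            return np.zeros_like(image, dtype=np.float32)
        return ((image - lo) / (hi - lo)).astype(np.float32)

    # Synthetic grayscale image with deliberately low contrast.
    img = np.random.randint(100, 140, size=(64, 64)).astype(np.float32)
    enhanced = stretch_contrast(img)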

Do not underestimate data preparation

These recommendations make it clear that going through the nine steps is very time-consuming for companies. They are nevertheless indispensable if company managers want to obtain the necessary amount of valid data and lay a solid foundation for their AI projects. The American analyst firm Cognilytica states in a corresponding study that 80 percent of the time in AI projects is spent on preparing the data. This figure illustrates how important the quality of the data foundation is for the reliability and accuracy of such a project.

Have we sparked your interest? Please feel free to contact us directly:

Dr. Frank Gredel

Head of Business Development
