How to deal with a lot of data?

In the whole CRC Hybrid Societies, a lot of data is recorded and needs to be processed. Therefore, mathematicians are needed to find elegant ways to analyze the data and extract information that can be given back to draw conclusions.

Our project E01 gathers three different subprojects from three very different mathematical fields. We try to extract information from high-dimensional time-series. Sometimes it is very difficult to reconcile three different views, but on the other hand it can be very beneficial to consider a problem from multiple perspectives.

Many researchers like psychologists, engineers and sports scientists perform experiments, for example by conducting tests with participants. Often, many variables are recorded, like sensory data of physical properties of the person or the environment. Then the researchers may want to predict some objective values, for instance the feeling of comfort of a person, coded as a number between 0 and 100. As a mathematician we interpret that as a function which gets all measured variables and gives back the objective value. At the beginning nobody has an idea how this function looks like for some concrete application. But with the experiments, the researchers sample the function at some points. The aim is to model the function such that we can insert some new values for the variables, for which we do not have to do another experiment, instead the learned function can predict the objective value. Depending on the number of measured values this can be more or less precise.

One possibility is to use standard machine learning algorithms which predict the objective values out of a black box. But in many applications, it is necessary to know the origin of the results, or which variables should be changed to maximize or minimize the objective function. For instance, if some machine learning algorithm has to predict whether a person gets a credit or not, no one knows why the algorithm rejects some credit. Maybe, the person does not get a credit because of his even house number, that would not be very useful. That is why we search for explainable algorithms such that we can for instance distinguish between unnecessary and necessary variables. Out of the measured data we want to figure out that variables like the weather do not play a role for the prediction of the objective function. In this case this maybe sounds trivial, but if numerous variables were measured it is not a priori clear, which are important.

It is a natural characteristic of mathematicians that we often ask for the reason of some experiences. The mathematical language is clear and free of contradictions, which is something that you learn in a mathematics study. You can not only give some assertion which you expect to be true. Either you can prove it, or you are supposed to find some counterexample. Sometimes this needs much perseverance, but as a PhD student in mathematics you should get along with this, just as well as the fact that maybe some good ideas do not succeed, and you have to take another try.

Before we could apply algorithms to the given data, we have to do some mathematics and therefore have to analyze the assumptions exactly to show theoretical results. In my case the aim is mostly to approximate a high-dimensional function using only a few given function values. This could, roughly spoken, be like „the higher the regularity of a high-dimensional function, the better we can approximate it“, „How many parameters are necessary for certain required approximation error?“ or “Can we improve our algorithm if we know about the sparsity of the coefficients of the function?”. Possible questions that arise on the way are for instance, „How many sample points are needed to get a good approximation?“ or „Which functions can we recover after choosing good basis functions?“. Therefore, it is natural to use synthetic test data to check if the theory works. Despite of doing a lot of coding to develop some algorithms, sometimes the stereotype of mathematicians scribbling on a blackboard is pretty much fulfilled. Some really good ideas for theoretical proofs arise from such sessions.

In the past months, probably every researcher had to manage his studies despite the restriction of the pandemic. Luckily, computers do not occur in all the corona protection ordinances and laptops belong to the own household, so I can use it as often as I want. However, I learned that no Zoom Meeting could replace a face-to-face-discussion in front of a blackboard and, I guess every mathematician would agree, online-conferences are by far not the same as on-site conferences. There are some advantages of the online variant, but it is the way that sitting alone at home in front of the computer is not the same as getting the chance of good exchange with some experts. Therefore, let’s hope for the best to get our pre-pandemic lives back as soon as possible.

Laura Lippert (E01)

Photo Credit: Hyperbolic Wavelet Regression, photogenically arranged (blackboard: Laura Lippert)