Exploring fits

Exercise: learning gravity

We start with simulated data similar to the example that opened our course.

We have a ball that is thrown up in the air, and we measure its height at different times.

We assume that we do not know the physics of the problem, and our first attempt is to explore a polynomial fit to the data.

Task 1 Use numpy polyfit to fit a polynomial of arbitrary degree n to the data. How does the fit look for each value of n in the integer range 1 to 6 (included)? Use matplotlib to compare.

Note It can be convenient to use numpy.polyval to calculate the predicted values from the fitted polynomial coefficients. Use the numpy documentation to find out how to use it.
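A minimal sketch of Task 1 might look like the following. The arrays t (times) and h (heights) are placeholders: here they are simulated with an assumed quadratic trajectory plus noise, since the actual variable names in your data may differ.

```python
import numpy as np
import matplotlib.pyplot as plt

# placeholder data: a ball thrown upwards, with measurement noise
rng = np.random.default_rng(seed=0)
t = np.linspace(0, 2, 50)
h = 1.0 + 10.0 * t - 4.9 * t**2 + rng.normal(0, 0.5, t.size)

fig, ax = plt.subplots()
ax.scatter(t, h, label="data")
t_fine = np.linspace(t.min(), t.max(), 200)  # dense grid for smooth curves
for n in range(1, 7):
    coeffs = np.polyfit(t, h, deg=n)             # fit a degree-n polynomial
    ax.plot(t_fine, np.polyval(coeffs, t_fine), label=f"n = {n}")
ax.set_xlabel("time")
ax.set_ylabel("height")
ax.legend()
plt.show()
```

Evaluating the fitted polynomial on a dense grid (t_fine) rather than on the data points themselves makes the curves look smooth in the plot.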

Task 2 Which model fits best? To answer this question we can calculate a metric such as the mean squared error (MSE) between the data and the fitted curve.

The mean squared error is calculated as follows:

\[MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\]

where \(N\) is the number of data points, \(y_i\) is the actual value of the data point, and \(\hat{y}_i\) is the predicted value from the fitted curve.

Which value of n gives the best fit according to this metric?
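The MSE formula above is a one-liner in numpy. A sketch, using the same placeholder arrays t and h as before:

```python
import numpy as np

# placeholder data (same assumed quadratic trajectory plus noise as before)
rng = np.random.default_rng(seed=0)
t = np.linspace(0, 2, 50)
h = 1.0 + 10.0 * t - 4.9 * t**2 + rng.normal(0, 0.5, t.size)

for n in range(1, 7):
    coeffs = np.polyfit(t, h, deg=n)
    h_pred = np.polyval(coeffs, t)
    mse = np.mean((h - h_pred) ** 2)   # MSE = (1/N) * sum of squared residuals
    print(f"n = {n}: MSE = {mse:.4f}")
```

Note that, on the full dataset, a degree-n fit can never have a larger MSE than a lower-degree fit, since the lower-degree polynomials are a subset of the higher-degree ones; this is exactly the point raised in Task 3.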

Task 3 The overall MSE simply tends to go down as we increase the degree of the polynomial, but this does not necessarily mean that the fit is better. In fact, a very high degree polynomial can lead to overfitting, where the curve fits the noise in the data rather than the underlying trend.

A way to measure the quality of the fit is to split the data into two sets:

  • a first subset that we use to perform the fit (we call this the training set)
  • a second subset that we use to evaluate the fit (we call this the test set)

This splitting is easily done by selecting a random subset for training and using the rest for testing. Take 80% of the data for training and 20% for testing.

Then, perform the polynomial fit on the training set, and calculate the MSE on the test set. Which value of n gives the best fit according to this metric?

Note. Use numpy’s pseudo-random number generator to select a random subset of the data. To avoid repetitions, you can use a permutation of the indices, e.g.:

import numpy as np

rng = np.random.default_rng(seed=42)
n_samples = len(data[0])
indices = rng.permutation(n_samples)
# use these scrambled indices to select the training and test sets
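Using the scrambled indices, the 80/20 split and the test-set evaluation might look like the following sketch (the arrays t and h are placeholders standing in for the actual data, simulated here so the block is self-contained):

```python
import numpy as np

# placeholder data: assumed quadratic trajectory plus noise
rng = np.random.default_rng(seed=42)
t = np.linspace(0, 2, 50)
h = 1.0 + 10.0 * t - 4.9 * t**2 + rng.normal(0, 0.5, t.size)

n_samples = len(t)
indices = rng.permutation(n_samples)
n_train = int(0.8 * n_samples)                          # 80% for training
train_idx, test_idx = indices[:n_train], indices[n_train:]

for n in range(1, 7):
    coeffs = np.polyfit(t[train_idx], h[train_idx], deg=n)  # fit on training set
    h_pred = np.polyval(coeffs, t[test_idx])                # predict on test set
    test_mse = np.mean((h[test_idx] - h_pred) ** 2)
    print(f"n = {n}: test MSE = {test_mse:.4f}")
```

Because the permutation is random, a single split can be unlucky; Task 4 addresses this by averaging over many splits.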

Task 4 Now put your training and test splitting into one function that takes the data and the polynomial degree as input, and returns the test-set MSE as output.

For every degree n in the integer range 1 to 6 (included), call this function and repeat the process a number of times that is at least the number of samples. Collect all the MSE values in a list, and calculate the average test MSE for every polynomial degree.

Which value of n gives the best fit according to this averaged test MSE?

If the exercise worked as intended (i.e. if you have averaged over enough random splits), you should see that the fit with the lowest average MSE is given by the second-degree polynomial: the most robust model is the one that actually corresponds to the true nature of the simulated data!

In a way, we have rediscovered the physics of the problem by using a data-driven approach, under the simple assumption that the underlying model is a polynomial.

Extended Exercise: Not just polynomials

We now have a look at some real data for temperature variations in London, and we will try to fit a curve to it.

The goal here is not to get an interpolation of the data, but to fit a simple model to the much more complex dynamics of real data.

In particular, this kind of data presents two characteristic features:

  • an overall trend, which we can see as corresponding to a linear contribution in time
  • oscillatory patterns, with (at least) one characteristic time scale

Let us use the NASA Power API to extract the data and do some pandas processing to transform it into a suitable form.

Given the data above, what we want to do is to approximate the data with the following model

\[ y = at+b+ c \cos(2\pi t/P)+d \sin(2\pi t/P)\]

where \(P\) is the period of the oscillations. This is clearly a non-linear model (non-linear in \(P\)) that includes trigonometric functions.

However, it is quite simple, as the two trigonometric functions share the same parameter.

There is also another interesting fact: for a specific value of \(P\), the model is linear in the parameters \(a,b,c,d\) and we can use a simple linear regression to find the best fit. Indeed (for a fixed \(P\)!) we can rewrite the model in matrix form as

\[ y = X \theta\]

where \(\theta = (a,b,c,d)\) and \(X\) is now a matrix of four columns, comprising the time \(t\), a column of ones, and the two trigonometric functions.
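For a fixed \(P\), the design matrix \(X\) can be assembled with np.column_stack. A minimal sketch, using synthetic placeholder data with an assumed 24-hour period in place of the London temperatures (the names t, y and the chosen numbers are illustrative only):

```python
import numpy as np

# synthetic stand-in for the temperature data: linear trend + daily oscillation
rng = np.random.default_rng(seed=1)
t = np.arange(0.0, 24 * 30)              # 30 days of hourly samples, in hours
y = 0.01 * t + 10 + 3 * np.cos(2 * np.pi * t / 24) + rng.normal(0, 0.5, t.size)

P = 24.0  # fixed period, in hours
X = np.column_stack([
    t,                           # linear-trend column (coefficient a)
    np.ones_like(t),             # constant column (coefficient b)
    np.cos(2 * np.pi * t / P),   # cosine column (coefficient c)
    np.sin(2 * np.pi * t / P),   # sine column (coefficient d)
])
```

With this \(X\), the normal-equations solve described in the next step gives \(\theta = (a, b, c, d)\) in one line.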

Task. Your task is to find the best fit for the data, by finding the optimal value of \(P\) and then performing a linear regression to find the best values of \(a,b,c,d\).

You can break the exercise into smaller steps:

  • first you want to fix \(P\) to some value and then construct the matrix \(X\) and perform a linear regression to find the best values of \(a,b,c,d\). For this, remember that you can use the normal equations
    \[ X^T X \theta = X^T \vec{y} \]

and solve for \(\theta\) using np.linalg.solve()

theta = np.linalg.solve(X.T @ X, X.T @ y)
  • then improve your code to include a brute-force loop over a plausible range of values for the period \(P\) and find the optimal one by calculating the sum of squared errors (SSE) for each value of \(P\) and then picking the one that minimizes it. The SSE is defined as \[SSE = \sum_i (y_i - \hat{y}_i)^2\]

where \(\hat{y}_i\) is the value predicted by the model for the \(i\)-th data point. You can simply use np.sum() and np.argmin() for this. It is recommended to plot the SSE as a function of \(P\) to see how it behaves and to check that you are finding a clear minimum.

  • Finally, plot the original data and the best curve together to see how good the fit is. You can use matplotlib’s plt.plot() for this, and remember to label the axes and add a legend.
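Putting the steps above together, a brute-force search over \(P\) might look like the following sketch. Synthetic data with a known 24-hour period stands in for the real temperatures, and the function name fit_for_period and the search range are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic stand-in for the temperature data: known 24-hour period
rng = np.random.default_rng(seed=1)
t = np.arange(0.0, 24 * 30)              # 30 days of hourly samples, in hours
y = 0.01 * t + 10 + 3 * np.cos(2 * np.pi * t / 24) + rng.normal(0, 0.5, t.size)

def fit_for_period(t, y, P):
    """Linear regression for a fixed P; return (theta, SSE)."""
    X = np.column_stack([t, np.ones_like(t),
                         np.cos(2 * np.pi * t / P),
                         np.sin(2 * np.pi * t / P)])
    theta = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
    sse = np.sum((y - X @ theta) ** 2)
    return theta, sse

periods = np.linspace(12, 48, 361)              # plausible range for P, in hours
sses = np.array([fit_for_period(t, y, P)[1] for P in periods])
best_P = periods[np.argmin(sses)]               # P that minimises the SSE
theta, _ = fit_for_period(t, y, best_P)

plt.plot(periods, sses)
plt.xlabel("period P (hours)")
plt.ylabel("SSE")
plt.show()
```

Plotting the SSE against \(P\) is worth the extra line: it shows whether the minimum is sharp and unique, or whether the landscape has several competing valleys.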

In the code below, some pre-processing is already done for you (conversion from date to hours etc.). You just need to fill in the blanks and complete the code to find the best fit and plot it.

Clearly, this fitting procedure is a bit convoluted (and Python has more sophisticated tools and libraries to do this, such as scipy), but it is a good exercise to understand that when we move to non-linear models, some optimisation search is often needed.

It should also make you appreciate that the space of parameters can become very complex and that our benchmarks for goodness of fit (such as the SSE) can have multiple local minima. You may want to think of these minima and maxima as a landscape with hills and valleys, or points of high and low energy. This analogy brings together computing, physics and the chemistry of transition states, and is a very powerful way to think about optimisation problems.