Simple Curve Fitting in Python

In previous lectures, we learned how to visualize data using matplotlib and pandas, and how to perform numerical computations with numpy. A natural next step is to ask: can we find a mathematical model that describes the relationship between variables in our data?

Curve fitting is the process of constructing a mathematical function that passes through (or near) a set of data points. This is one of the fundamental techniques in data analysis and has applications across science, engineering, and business.

We will build understanding progressively, starting with the simplest case and building up to more complex models.

In the following, we will make some use of basic notions of linear algebra, such as matrices, vectors, and matrix-vector products.

This means that we will be working with matrices and vectors, which in numpy are easily represented as 2D and 1D arrays, respectively.

Consider refreshing these concepts (see for example these comprehensive notes).

Solving a system of linear equations

Fitting data fundamentally involves finding the numerical values of parameters that describe the mathematical model between the variables in our data.

Before actually fitting data, let us try and solve a simpler problem: suppose we simply have a pair of unknowns, x and y, and we have two equations that relate them:

\[ \begin{aligned} 2x + 3y &= 5 \\ 4x - y &= 3 \end{aligned} \]

This is a system of linear equations, and we can solve it using various methods such as substitution, elimination, or matrix operations.

The matrix representation is the most convenient when the purpose is to develop a numerical scheme.

The matrix form of the above system is:

\[ A \vec{p} = \vec{b} \]

Where:

  • \(A\) is the coefficient matrix: \[ A = \begin{bmatrix} 2 & 3 \\ 4 & -1 \end{bmatrix} \]
  • \(\vec{p}\) is the vector of unknowns: \[ \vec{p} = \begin{bmatrix}x \\ y \end{bmatrix} \]
  • \(\vec{b}\) is the constant vector: \[ \vec{b} = \begin{bmatrix}5 \\ 3 \end{bmatrix} \]

We normally solve this by substitution and elimination, leading to the solution \(x = 1\) and \(y = 1\).

The algebraic, matrix formulation is easily interpreted by the computer, which uses well-established matrix factorization techniques (e.g. LU decomposition) to solve for the unknowns.

In numpy, this is implemented in the numpy.linalg.solve function, which efficiently solves the system of linear equations.
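For instance, the system above can be solved in a few lines (a minimal sketch):

```python
import numpy as np

# Coefficient matrix and constant vector for the system
#   2x + 3y = 5
#   4x -  y = 3
A = np.array([[2.0, 3.0],
              [4.0, -1.0]])
b = np.array([5.0, 3.0])

# Solve A p = b for the unknown vector p = [x, y]
p = np.linalg.solve(A, b)
print(p)  # [1. 1.], i.e. x = 1 and y = 1
```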

Constructing a linear model with data

Now, let’s imagine that we have a dataset of \(N\) points that we want to fit a line to. Let’s call them \(x_i=\{x_1, x_2, \dots, x_N\}\) and \(y_i=\{y_1, y_2, \dots, y_N\}\).

These could be, for example, measurements of the height of a particle at different times, or the price of a stock at different days.

The idea is that there is an underlying relationship between \(x_i\) and \(y_i\) but that the data is noisy for various reasons (e.g. measurement errors, natural variability, etc.).

This means that we are thinking of the data as being generated by a model of the form:

\[ y_i = m x_i + b + \epsilon_i \]

where \(m\) is the slope of the line, \(b\) is the intercept, and \(\epsilon_i\) is the error term that captures the noise in the data.

We want to find the best values of \(m\) and \(b\) that describe the relationship between \(x_i\) and \(y_i\) in a purely linear model.

We can rewrite this linear model in matrix form as:

\[\boxed{\vec{y} = X \vec{\theta}}\]

where

\[ \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \]

\[ X = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix} \]

and

\[ \vec{\theta} = \begin{bmatrix} m \\ b \end{bmatrix}\]

What we want to do is therefore to find the values of \(\vec{\theta}\) that minimize the distance between the predicted values \(X \vec{\theta}\) and the actual values \(\vec{y}\), which is equivalent to finding the values of \(m\) and \(b\) that minimize the sum of squared errors:

\[ \text{SSE} = \sum_{i=1}^N (y_i - (m x_i + b))^2 \]

It turns out that this problem can be solved analytically by taking the derivatives of the SSE with respect to \(m\) and \(b\), setting them to zero, and solving for \(m\) and \(b\). This leads to the so-called normal equations for the determination of the parameters:

\[ \vec{\theta} = (X^T X)^{-1} X^T \vec{y} \]

where \(X^T\) stands for the transpose of the matrix \(X\) and \((X^T X)^{-1}\) is the inverse of the matrix \(X^T X\). This method is called ordinary least squares and is the most common method for fitting linear models to data.

The normal equations can be re-written as:

\[ X^T X \vec{\theta} = X^T \vec{y} \]

This is simply a system of linear equations of the form \(A \vec{p} = \vec{b}\), where \(A = X^T X\), \(\vec{p} = \vec{\theta}\), and \(\vec{b} = X^T \vec{y}\).

We know how to solve this from the previous section, and this is exactly what numpy.linalg.solve does when we call it with \(A\) and \(\vec{b}\).

We can see an example in action in numpy by using artificial data. In the numpy implementation, notice that we make use of a few handy tools:

  • np.ones_like(x) creates an array of ones with the same shape as x, which we use to create the intercept term in our design matrix.
  • np.column_stack((x, np.ones_like(x))) stacks the x array and the intercept array column-wise to create the design matrix X.
  • np.linalg.solve, which solves the resulting normal-equations system for the parameters.
  • the operator @, which performs matrix multiplication in a more readable way than np.dot or np.matmul.
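Putting these tools together, here is a minimal sketch of an ordinary least squares fit on artificial data (the true slope, intercept, noise level, and seed below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Artificial data: y = 2x + 1 plus Gaussian noise (true values chosen arbitrarily)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.size)

# Design matrix: one column for x (slope) and one column of ones (intercept)
X = np.column_stack((x, np.ones_like(x)))

# Normal equations: (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
m, b = theta
print(f"slope = {m:.3f}, intercept = {b:.3f}")
```

The recovered slope and intercept should be close to the true values 2 and 1, up to the noise.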

Beyond linear models: polynomial fitting

The machinery that we have constructed for fitting with straight lines can be easily extended to fit more complex models, such as polynomials.

This is thanks to the following observation: take, for example, a quadratic model:

\[ y_i = a x_i^2 + b x_i + c \]

Now we have an extra parameter, \(a\), that captures the curvature of the data. However, we can again rewrite the entire model in matrix form by expanding the matrix \(X\) to include the new term:

\[ X = \begin{bmatrix} x_1^2 & x_1 & 1 \\ x_2^2 & x_2 & 1 \\ \vdots & \vdots & \vdots \\ x_N^2 & x_N & 1 \end{bmatrix} \]

and then the parameter vector becomes: \[ \vec{\theta} = \begin{bmatrix} a \\ b \\ c \end{bmatrix} \]

so that again the model can be written as

\[\vec{y} = X \vec{\theta}\]

This means that the solution is given by the same normal equations as before!

Let’s see this in action again with numpy and some artificial data.

To plot the data, it is important to note that the fitted curve typically needs to be evaluated at many more points than the original \(x_i\) data, so that it looks smooth.
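As an illustration, here is a minimal sketch of a quadratic fit on artificial data (the true coefficients, noise level, and seed are arbitrary choices); note the dense grid used to evaluate the fitted curve for plotting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Artificial quadratic data: y = 0.5 x^2 - x + 2 plus noise (true values chosen arbitrarily)
x = np.linspace(-3, 3, 40)
y = 0.5 * x**2 - 1.0 * x + 2.0 + rng.normal(0, 0.5, size=x.size)

# Design matrix with columns x^2, x, 1
X = np.column_stack((x**2, x, np.ones_like(x)))
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # estimates of a, b, c

# Dense grid so the fitted curve looks smooth when plotted
x_fine = np.linspace(x.min(), x.max(), 500)
y_fit = theta[0] * x_fine**2 + theta[1] * x_fine + theta[2]

# For plotting (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.scatter(x, y); plt.plot(x_fine, y_fit); plt.show()
```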

Practical Implementation: NumPy’s polyfit

Rather than implementing the matrix math ourselves every time, numpy provides the polyfit function, which fits a polynomial of specified degree to data. For linear regression, we use degree 1.

The syntax is straightforward:

coefficients = np.polyfit(x, y, degree)

returns the coefficients of the polynomial, ordered from the highest power to the lowest (so for degree 1 the slope comes first, then the intercept).
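A minimal sketch with artificial linear data (the true slope, intercept, noise level, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 30)
y = 3.0 * x - 2.0 + rng.normal(0, 0.5, size=x.size)

# Degree-1 fit: coefficients come back highest power first, i.e. [slope, intercept]
coefficients = np.polyfit(x, y, 1)
m, b = coefficients
print(f"slope = {m:.3f}, intercept = {b:.3f}")

# np.polyval evaluates the fitted polynomial at new points
y_pred = np.polyval(coefficients, x)
```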

Errors on the fitting parameters

The fitting parameters we obtain are estimates under specific assumptions:

  • that the data has a random scatter (typically thought of as due to normally distributed errors)
  • that such errors have constant variance across all observations and zero mean
  • that the various observations are not correlated with each other

The resulting estimates are associated with an uncertainty which can be quantified.

We can use the Sum of Squared Errors to construct a benchmark:

\[\sigma^2 = \dfrac{SSE}{N-p}\]

where \(N\) is the number of data points and \(p\) is the number of parameters in the model (e.g. \(p=2\) for a linear model).

We can then use this to obtain the covariance matrix of the parameters, which is given by:

\[\text{Cov}(\vec{\theta}) = \sigma^2 (X^T X)^{-1}\]

where \(X\) is the design matrix we constructed for the fitting. This is a \(p \times p\) matrix that contains the variances of the parameters on the diagonal and the covariances between parameters on the off-diagonal.

The expression above can be thought of as a ratio between the performance of the fit (as measured by \(\sigma^2\)) and the mutual independence of the parameters (as measured by the inverse of \(X^T X\)): the more spread out the data points are in the \(x\) space, the more independent the parameters and the smaller the entries of the covariance matrix. Conversely, if the data points are clustered together, the parameters become more correlated and the entries of the covariance matrix grow.
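To make the two formulas concrete, here is a minimal sketch computing \(\sigma^2\) and the covariance matrix by hand for the linear model, using artificial data (the true parameters, noise level, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.size)

# Fit via the normal equations, as before
X = np.column_stack((x, np.ones_like(x)))
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Residual variance estimate: sigma^2 = SSE / (N - p)
residuals = y - X @ theta
sigma2 = residuals @ residuals / (len(y) - X.shape[1])

# Covariance matrix of the parameters; the diagonal holds the variances
cov = sigma2 * np.linalg.inv(X.T @ X)
param_errors = np.sqrt(np.diag(cov))
print(param_errors)  # standard errors of slope and intercept
```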

In numpy we can obtain the covariance matrix of the parameters by using the cov argument of the polyfit function:
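For instance, a minimal sketch with artificial data (the true slope, intercept, noise level, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.size)

# cov=True makes polyfit return the covariance matrix alongside the coefficients
coeffs, cov = np.polyfit(x, y, 1, cov=True)
errors = np.sqrt(np.diag(cov))  # standard errors of slope and intercept
print(coeffs, errors)
```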

In practice, estimating the parameter errors directly from the covariance matrix tends to underestimate the true errors, for example when the model is misspecified or when the data is not well-behaved.

A conceptually simple and robust alternative is to resort to resampling: this method is called the bootstrap.

Errors from bootstrapping

Instead of computing the covariance matrix, we can resample our dataset with replacement many times, and then fit the model to each resampled dataset.

The variability of the estimates across resamples immediately yields a variance for the parameters, which can be used to quantify the uncertainty of the estimates.

The procedure goes as follows:

  • we resample the data to produce a series of the same size as the original dataset, but with replacement (i.e. some data points may be repeated and some may be left out).
  • we fit the model to each bootstrap sample to obtain a distribution of parameter estimates.
  • we compute the standard deviation of the parameter estimates across the bootstrap samples to obtain an estimate of the uncertainty of the parameters.

The main tunable parameter is the number of bootstrap samples, which is typically in the range of 1000-10000.
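The steps above can be sketched as follows, again on artificial linear data (the true parameters, noise level, seed, and number of resamples are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.size)

n_boot = 2000  # number of bootstrap samples
boot_params = np.empty((n_boot, 2))

for i in range(n_boot):
    # Resample indices with replacement (same size as the original dataset)
    idx = rng.integers(0, len(x), size=len(x))
    # Fit the model to the bootstrap sample
    boot_params[i] = np.polyfit(x[idx], y[idx], 1)

# The spread of the estimates across resamples gives the parameter uncertainties
m_err, b_err = boot_params.std(axis=0)
print(f"slope error = {m_err:.3f}, intercept error = {b_err:.3f}")
```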

Reassuringly, the errors are compatible with the ones obtained from the covariance matrix, but they are more robust to violations of the assumptions of the model and to outliers in the data.