Skip to main content

Chapter 10 * Residual Sum of Squares: Improving a Model’s Fit to Data

***This chapter is under construction.***
So far, we have estimated \(\beta\) in two ways: (1) visually, making adjustments so that data and graphs are closer together, starting in Chapter 3, and (2) using a formula involving \(\mathcal{R}_0\) and parameters from our model, such as in (5.1) and (7.3), to substitute in values we could find from epidemiological information and solve for \(\beta\text{,}\) which is typically the only unknown value in the formula. In this chapter, we learn a new approach.
Begin in Exploration 10.1 by examining data points as well as values determined by an SI model. In this exploration, think about mathematical approaches to improve the model’s fit to data.

Exploration 10.1. Visually Demonstrating Differences Between a Model and Data.

This example focuses on an SI model, similar to what we worked with in Chapter 2 and Chapter 3. Use the code shown, and its output, to answer the questions below.

(a)

Copy the graph from the Python output, either drawing it by hand or creating a screenshot.
  1. For the values \(I(0)\text{,}\) \(I(1)\text{,}\) \(I(2)\text{,}\) \(I(3)\text{,}\) \(I(4)\text{,}\) \(I(5)\text{,}\) and \(I(6)\) that are printed out in the Python output above the graph, label on your copy of the graph where each value is shown.
  2. Then write out in words what the values \(I(0)\text{,}\) \(I(1)\text{,}\) \(I(2)\text{,}\) \(I(3)\text{,}\) \(I(4)\text{,}\) \(I(5)\text{,}\) and \(I(6)\) represent: are these data points, or values generated by a mathematical model?

(b)

Next, read lines 9 and 10 in the Python code.
  1. These values combine to form several points on the output graph. On your copy of the graph, label the points determined by lines 9 and 10.
  2. Then write out in words what the values from lines 9 and 10 represent: are these data points, or values generated by a mathematical model?

(c)

We now put parts (a) and (b) together.
  1. Use your labeled graph to compare the values in part (a) with the values in part (b). Can you draw a visual way of comparing these values at each of the times \(t=0\text{,}\) \(t=1\text{,}\) \(t=2\text{,}\) \(t=3\text{,}\) \(t=4\text{,}\) \(t=5\text{,}\) and \(t=6\text{?}\)
  2. Starting with the visual comparison you just described: What would change visually if the values from the model, and the values at each data point, were closer together?
  3. Turn your visual comparison ideas into some mathematical suggestions for identifying when the model and the data are closer together.

Section 10.1 Residual Sum of Squares: An Introduction

One way to compare data with the output of a model is by computing the Residual Sum of Squares, often abbreviated to RSS. When we work with a single model to try to improve the fit of the model to a data set, RSS is a good option. Activity 10.2 shows the process of computing RSS for a model with one choice for a value of \(\beta\text{.}\) Later, we show how to compare RSS for different values of \(\beta\) used in the same model.

Activity 10.2.

In Exploration 10.1 we worked with both a data set, and model output, for times \(t=0, 1, 2, 3, 4, 5, 6\text{.}\) The values appear in Table 10.1, with model values rounded to two decimal places.
Table 10.1. SI Model: Data Values and Model Output
Time Data Value Model Output
0 1 1.0
1 2 4.19
2 4 14.54
3 10 32.38
4 22 44.58
5 40 48.68
6 50 49.70
RSS compares each Data Value from Table 10.1 with the Model Output from the same time by subtracting one from the other and squaring the result.
  1. Try at least two of these comparisons on your own: for times \(t=1\) and \(t=2\text{,}\) subtract the Model Output in Table 10.1 from the corresponding Data Value and square the result. Later in this activity, we will compare these hand-computed values with the corresponding values computed via Python.
  2. Next, read the code block below to identify where in the code, and how, we subtract each Model Output from its corresponding Data Value. Also identify where and how the code adds up all the squared differences to produce a total value for RSS.
  3. Then run the code block. Compare its results with your results for \(t=1\) and \(t=2\) in part (1) of this Activity. Describe how the results are similar and how they are different, and provide an explanation for why they are not identical.
    Then examine all the results from the code, including the total value for RSS for this model with this data set.
  4. Once you understand how the code works, experiment with different values of \(\beta\text{.}\) Your goal is to improve the fit between the model and the data set. While adjusting \(\beta\text{,}\) consider including a plot of the graph showing both the solution curve of the model and the data points. (You can do this by copying plotting commands from the Python block in this chapter’s Exploration and pasting them into the Python block in this activity.)
Answer.
  1. At time \(t=1\text{,}\) compute \((4.19-2)^2\) to get the value \(4.7961\text{.}\) At time \(t=2\text{,}\) compute \((14.54-4)^2\) to get the value \(111.0916\text{.}\)
  2. In line 25, the code I_data[x] - sol[10*x, 1] starts with the data value given by I_data[x] and subtracts the model value given by sol[10*x, 1].
    The x in I_data[x] is determined by the range of x values shown in line 24: range(0,7) tells Python to allow x to equal 0, 1, 2, 3, 4, 5, and 6. Python ranges do not include the last number shown, so 7 is not included in range(0,7).
    When x equals 0, I_data[0] equals the initial value of the I_data list in line 8, that is, I_data[0] equals 1. Then I_data[1] equals 2, I_data[2] equals 4, and so on, continuing until I_data[6] equals 50.
    The code sol[10*x, 1] examines Python’s list of model solutions, which we have named sol. To see the full list sol, you can add a new line of code at the end of the Python block, and in that line, type print(sol). Running the Python block with the print(sol) command shows that the list sol is arranged into rows and columns. Each row shows the value of both \(S(t)\) and \(I(t)\) at a specific time value. The initial column, which Python labels column 0, shows the values of \(S(t)\) at all times we computed. The next column, which Python labels column 1, shows the values of \(I(t)\) at all times we computed.
    The times for which we compute solutions are determined by the command t_range = np.arange(0.0, 8.0, 0.1) in line 18. This command says to compute solutions starting at time 0.0, and ending at time 8.0, moving in time increments of 0.1. This means we compute solutions at times 0.0, 0.1, 0.2, and so on, with the last solutions computed at time 7.9. To see the values in t_range, you can add a new line of code: print(t_range). Running the Python block with the command print(t_range) shows the list of all t_range values.
  3. The code produces the value \(4.795784246749833\) for the squared distance at time \(t=1\text{.}\) Our hand-computed value from part (1) of this Activity was \(4.7961\text{.}\) We see that these values are very close to each other. The difference between them is due to our rounding the values in Table 10.1 to two decimal points.
    Similarly, the code produces the value \(111.02234652523701\) for the squared distance at time \(t=2\text{,}\) whereas our hand-computed value from part (1) of this Activity was \(111.0916\text{.}\) Again, the difference in values is due to rounding.
    From here, make additional observations. This is open-ended. Observations may include noticing that the squared distances are notably larger when the data point is further from the value on the solution curve of the model, compared with when the data point is close to the value on the solution curve of the model. There may also be observations about the overall number computed for RSS. This overall RSS number is helpful when we reach part (4) and try to improve the fit of the model to data
    Many other observations are possible.
  4. This is open-ended. Values of \(\beta\) that are closer to \(0.02\) tend to produce a better fit of model to data, compared with the starting value of \(\beta = 0.03\text{.}\) You may wish to change additional decimal places of \(\beta\) to improve the fit even further.
    It is also appropriate to give visual (graph-based) and verbal (word-based) descriptions of changes in the graph of model and data, as the value of \(\beta\) changes.

Section 10.2 Residual Sum of Squares: Quickly Comparing Multiple Parameter Values

As we saw in Section 10.1, we can compare a model with data by trying different parameter values, such as by allowing \(\beta\) to take on different values. In the Python code block within Activity 10.2, for every value of \(\beta\) we wanted to try, we had to change the value of \(\beta\) within the code and run the code again. Next, we wrap another for loop around the RSS for loop, so that we can test multiple values of \(\beta\) with one click of the Evaluate button.

Activity 10.3.

This Activity focuses on coding, with the goal of comparing RSS values for multiple values of a parameter without having to run Python code separately for every value of that parameter. Instead, read the code in the Python block below, and use what you read to answer the questions that follow.
  1. (Questions to appear)

Section 10.3 Influenza Data and RSS

Now try computing RSS using the influenza prevalence data. The data are already written into the code block below. Time 0 is the same as October 6.

For Further Thought 10.4 For Further Thought

1.

As described starting on page 356 of Mathematical Models in Population Biology and Epidemiology, second edition, by Fred Brauer and Carlos Castillo-Chavez, the village of Eyam, in England, was affected by bubonic plague in 1665-1666. There were two separate plague outbreaks, and Table 10.2 shows data for the second of these outbreaks.
Table 10.2. Plague Data from Eyam, 1666
Date (Day) Susceptibles Infectious
July 3/4 (Day 0) 235 14.5
July 19 (Day 15.5) 201 22
August 3/4 (Day 31) 153.5 29
August 19 (Day 46.5) 121 21
September 3/4 (Day 62) 108 8
September 19 (Day 77.5) 97 8
October 4/5 (Day 93) Unknown Unknown
October 20 (Day 108.5) 83 0
  1. Use Python to plot these data.
  2. Then, use the information that we start with a population of 250 people, the infectious period was 11 days, and \(\mathcal{R}_0\) has been estimated to be between 1.6 and 1.9, to construct an SIR model for this outbreak.
  3. Then use an RSS approach, along with the data and your SIR model, to select a preferred value of \(\beta\) for this SIR model. You may choose to compute \((\mbox{Data Value} - \mbox{Model Output})^2\) separately for each day on which population values are available, or you can use a Python for loop for the six consecutive data points, and then add in \((\mbox{Data Value} - \mbox{Model Output})^2\) for the October 20 data, in order to complete each total RSS value. Additionally, be sure that Python selects the correct value from its model output: this will require changing the 10 in the sol[10*x, 1] part of the code.
    (NOTE for the Bates Mathematical Models in Biology course in Fall 2024: you may compute RSS for comparing the model with just the first six data points, rather than needing to include the last data point, and rather than needing to figure out how to skip over the "Unknown" data point. However, if you can find a way to include the last data point, go ahead and include it!)
    In your homework, show your Python code, and show at least three different values of \(\beta\) that you tried, along with the corresponding RSS value for each value of \(\beta\text{.}\) Which value of \(\beta\) leads to the smallest RSS value? Confirm that this value of \(\beta\) makes biological sense, based on the numbers or ranges of numbers we know for \(N\text{,}\) \(\mathcal{R}_0\text{,}\) and \(\gamma\text{.}\)