In this series, we look at interesting data sets to uncover hidden insights.
As we adjust to the realities of lockdown, it is only natural to look for indicators of how the economy is faring and what might be in store. But herein lies a challenge: the vast majority of economic indicators run with a lag. With the exception of some truly terrifying jobless numbers coming out of the US, there has been little so far to analyse that gives a sense of how the real economy (not the stock market!) is adjusting.
What, then, can we look at? One option is traffic data. As people self-isolate and work from home, and as many businesses shut, we would expect to see significant falls in the number of people on the road. Traffic data also has the added benefit of being readily available for many cities with open data policies.
Apart from the fact that I live here, I chose to analyse traffic in Hong Kong because, unlike many places, it has managed to control the spread of the virus without the need for a mandatory lockdown (so far...). This is in large part because the population began to self-isolate early in the crisis, driven by the experience of SARS in the early 2000s.
The code used for this project can be found here. The analysis uses data on average traffic speed in the cross-harbour tunnel linking Hong Kong Island and Kowloon, which is often highly congested, and data on coronavirus cases from Johns Hopkins.
The chart shows how a rolling 7-day average of traffic speeds at rush hour (9 a.m.) has changed over the past 3 months. In early January, speeds averaged around 64 km/h. As awareness of the virus spread, people began to self-isolate and many businesses encouraged staff to work from home. Over the next 3 weeks, traffic speeds rose rapidly to almost 74 km/h. However, as the situation in China was brought under control and the city avoided a significant spike in cases, many people began to relax and return to work. As more vehicles crowded onto the roads, traffic speeds fell.
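For readers who want to reproduce the smoothing step, here is a minimal sketch of a rolling 7-day average of 9 a.m. speeds using pandas. It is not the project's actual code: the column names `timestamp` and `speed_kmh`, the helper `rush_hour_rolling_mean` and the readings themselves are made up for illustration.

```python
import pandas as pd


def rush_hour_rolling_mean(readings: pd.DataFrame, hour: int = 9, window: int = 7) -> pd.Series:
    """Average the readings taken during `hour` for each day, then smooth with a rolling mean."""
    at_hour = readings[readings["timestamp"].dt.hour == hour]
    daily = at_hour.set_index("timestamp")["speed_kmh"].resample("D").mean()
    # min_periods=window leaves the first window-1 days empty rather than
    # averaging over fewer than 7 days.
    return daily.rolling(window=window, min_periods=window).mean()


# Made-up readings (not the tunnel data): one 9 a.m. and one 6 p.m. value per day.
days = pd.date_range("2020-01-01", periods=10, freq="D")
readings = pd.DataFrame({
    "timestamp": list(days + pd.Timedelta(hours=9)) + list(days + pd.Timedelta(hours=18)),
    "speed_kmh": [60.0 + i for i in range(10)] + [40.0] * 10,
})

smoothed = rush_hour_rolling_mean(readings)
print(smoothed)
```

The 6 p.m. readings are there only to show that the filter drops everything outside the chosen hour before averaging.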
By March 9th the situation in Italy had worsened considerably, and in the days following the Italian lockdown it became clearer that the situation in Europe and the US was on a dangerously exponential trajectory. On March 19th the Hong Kong government imposed mandatory quarantine requirements for all new arrivals. As the threat increased, traffic speeds began to rise significantly again. (Note: an interactive version of the chart is available here.)
Let us now see how well this corresponds to the number of coronavirus cases. The chart below shows the rolling 7-day average traffic speed alongside the global number of confirmed cases, plotted on a log scale so that we can see the rate of change rather than the now-familiar exponential curve.
The chart can be split into three parts. The first runs up to mid-February, during which both the rate of infection and traffic speeds increase rapidly. This is followed by a period in which the infection rate flattens out and traffic speeds fall. Finally, from mid-March, the rate of infection picks up again, as does the speed of traffic.
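To see why a log scale exposes the rate of change, consider this small sketch. The case counts are invented (a steady 10% rise per day), not the Johns Hopkins figures. On a log scale steady growth becomes a straight line, and the slope of that line gives the growth rate.

```python
import numpy as np

# Made-up case counts growing by 10% a day (illustrative, not real data).
cases = 100 * 1.10 ** np.arange(14)

log_cases = np.log(cases)
daily_log_change = np.diff(log_cases)              # slope of the line on a log scale
daily_growth_rate = np.exp(daily_log_change) - 1   # converted back to a daily rate

print(daily_growth_rate.round(3))
```

A flattening of the log-scale curve therefore means the growth rate is falling, even while the raw count keeps climbing.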
Interactive version here.
Can we use this relationship to build a model?
Let's start by running a simple linear regression between traffic speed and the log of coronavirus cases.
As could be expected, this does a poor job of describing the relationship (R² = 0.15) because the curve is not linear. If we filter the data and look at the three periods separately, the simple model explains 91.8% of the variation in period 1, 48% in period 2 and 83% in period 3. The fact that there are linear relationships between traffic and the rate of infection within each period, but not across the whole series, suggests that we need to look at a polynomial regression model.
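The contrast between a weak overall fit and strong fits within each period can be reproduced with a short sketch. The data here is synthetic and noise-free: a speed series that rises, falls and rises again against steadily increasing log cases. The function `linear_r2` and the period boundaries are illustrative assumptions, not taken from the project code.

```python
import numpy as np


def linear_r2(x: np.ndarray, y: np.ndarray) -> float:
    """Fit y = slope * x + intercept by least squares and return R squared."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


# Synthetic series (not the real data): speed rises, falls, then rises again.
log_cases = np.linspace(0.0, 9.0, 90)
speed = np.where(
    log_cases < 3,
    64 + 3 * log_cases,
    np.where(log_cases < 6, 73 - 2 * (log_cases - 3), 67 + 2 * (log_cases - 6)),
)

periods = {
    "period 1": log_cases < 3,
    "period 2": (log_cases >= 3) & (log_cases < 6),
    "period 3": log_cases >= 6,
}

r2_all = linear_r2(log_cases, speed)
r2_by_period = {name: linear_r2(log_cases[mask], speed[mask]) for name, mask in periods.items()}

print(f"whole series: {r2_all:.2f}")
for name, value in r2_by_period.items():
    print(f"{name}: {value:.2f}")
```

Because each synthetic segment is exactly linear, the per-period fits are near perfect while the single line across all three periods fits poorly. Real data is noisier, but the pattern is the same.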
When using polynomial models we have to be careful in choosing the degree (degree = 2 is a quadratic, degree = 3 is a cubic and so on). Choosing a degree that is too low will underfit the data; choosing one that is too high will overfit it, fitting noise rather than signal, and the model will not be useful for making predictions. In this case the fit is probably best at a degree of around 4.
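One common way to choose the degree is to hold back part of the data and compare errors on the held-back points. The sketch below does this on synthetic data (a noisy cubic), not on the traffic series, and the alternating train/validation split is an assumption made for simplicity rather than the method used in the project.

```python
import numpy as np

# Synthetic data: a cubic curve plus noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 60)
y = 2 * x**3 - x + rng.normal(scale=0.1, size=x.size)

# Alternate points between a training set and a validation set.
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]


def mse(coeffs: np.ndarray, xs: np.ndarray, ys: np.ndarray) -> float:
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


train_mse, val_mse = {}, {}
for degree in range(1, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(coeffs, x_train, y_train)
    val_mse[degree] = mse(coeffs, x_val, y_val)

best_degree = min(val_mse, key=val_mse.get)
for degree in train_mse:
    print(degree, round(train_mse[degree], 4), round(val_mse[degree], 4))
print("degree with lowest validation error:", best_degree)
```

The training error can only fall as the degree rises, which is why it is a poor guide on its own. The validation error is the one that shows where extra flexibility stops helping.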
However, all of this highlights an important point: just because you can construct a model (at degree = 7, the model explains 69% of the variation), it does not mean the model has any predictive power. The relationship between traffic speeds and the rate of infection is a false one. There are linear relationships in the data but they are not continuous. When an external shock occurs, such as news of a new virus, people respond. As the perceived threat wanes, people return to normal. If the threat returns, a new response is required.
There are ways the model could be improved conceptually, but they all run into problems when applied to the data: each requires so much treatment of the data that the results become extremely difficult to interpret. Ultimately, the challenge lies with the quality of the input data. Case numbers do not capture human sentiment, nor the extent to which people's fears, sense of responsibility and government messaging drive their actions.