How Much Data Do You Need to Measure Trends in Footfall?
In a world where data can be noisy, it’s sometimes useful to step away from the nitty-gritty and recover some perspective by contemplating the ideal scenario. This helps us to understand the limits of what could be inferred from a dataset, and provides insight into the minimum amount of data needed to answer a question.
Of course, in a perfect world the modelled data and the real data would match exactly; in practice they do not. When building a model to help analyse or predict behaviour, we start with the real data and work towards our synthesis. When handling the data directly, however, one quickly encounters a number of common issues that require consideration.
Common challenges
- we may only have a limited number of samples or history
- thematic events produce behavioural anomalies (e.g. holidays, weather events)
- changes over time to the technical environment in which the data is collected
- nonadjacency of data due to the churn of users in the underlying panel sample
For the purpose of this exercise, however, to understand the minimum data requirement to support a model, let’s assume a perfect world. Let’s imagine we have a real population of 50 million people. Let’s also say we know that a proportion of those people on a given day exhibit some (binary) trait, for example that they visited a Starbucks outlet. Let’s call this proportion p.
Our approach
We might then go on to ask: if p changed, what’s the probability that we’d even notice? To answer our question, ‘how much data is enough data?’, we’ll run multiple iterations of the same test with increasing volumes of data and measure how the p value – the proportion of the panel that did [the thing] – varies. Our expectation is that the larger the panel we use, the less variance we’ll see. The added dimension to this question is time: to what extent does modifying the number of days in the experiment change the results that we can observe with confidence?
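As a rough illustration of that expectation – a minimal sketch with assumed values, separate from the full experiment that follows – the spread of the observed daily proportion narrows as the panel grows:

```python
# Minimal sketch (assumed values): the spread of the observed daily proportion
# shrinks as the panel size increases, which is why bigger panels give steadier estimates.
import numpy as np

rng = np.random.default_rng(0)
p = 0.1  # assumed true proportion of the trait

for panel in (1000, 10000, 100000, 1000000):
    # Simulate 1,000 days of observed proportions from a panel of this size
    daily_p = rng.binomial(panel, p, size=1000) / panel
    print(f"panel={panel:>8}  std of observed p ~ {daily_p.std():.5f}")
```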
Setting up our variables
Let’s assume we have a single data point for each user every day. The number of daily points will be equivalent to the panel size that we are using in this increment of the test (sample_size). We’ll call the total available population of 50 million people n_people. As we increase the sample_size and run multiple versions of each test, we’ll be looking at how the variation in those results decreases, in order to reach an acceptable level of consistency.
To handle the dimension of time, we’ll run this exercise over multiple days (n_sample_days), and see what happens when the p value – the proportion of the panel exhibiting the trait – changes to a new value. At that stage we’ll sample the same number of people from the daily population, and measure how many of them express the trait with the new value of p.
Using the mean of the first sample (with original proportion p) and the mean of the second sample (where proportion p has changed), we can take the difference between these means and repeat the process many times to produce a distribution. Using that series, we can find the area of the resulting distribution that sits closer to the real change (a confidence interval), as shown in the graph below. In this chart example the difference between the proportions is 0.01.
Our goal will be to find the shaded area for our different parameters (sample sizes and so on), as the shaded area is the probability of finding an estimate for p that is better than the original value of p. We will implement this in two ways: first using a simulation (Monte Carlo), and second using a more analytical approach.
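To make the mechanics concrete before the full parameter sweep below, here is a minimal single-configuration sketch. The panel size and proportions are assumed purely for illustration, and the panel is kept deliberately small so the two distributions overlap:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
panel, days, p_orig, p_new = 2000, 1, 0.10, 0.11   # assumed illustrative values

# Repeatedly draw one sample at the original proportion and one at the new
# proportion, and record the difference in their mean daily proportions
mean_diffs = []
for _ in range(20000):
    obs_orig = rng.binomial(panel, p_orig, size=days).mean() / panel
    obs_new = rng.binomial(panel, p_new, size=days).mean() / panel
    mean_diffs.append(obs_new - obs_orig)

# Percentile of zero in the distribution of differences, then the symmetric
# interval that excludes 'no change' -- the shaded area described above
zero_percentile = stats.percentileofscore(mean_diffs, 0)
prob_closer = 100. - min(zero_percentile * 2, 100)
print(f"Probability of an estimate closer to the new value: {prob_closer:.1f}%")
```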
The implementation using Python
The following code snippets describe the implementation of this experiment in Python, using the two different approaches (simulation versus analytical). As per our variable design above, let’s go ahead and define these in the code.
```python
# Import Python modules
import numpy as np
import pandas as pd
from scipy import stats

# Assumed number of people in full population
n_people = 50000000

# The below variables are lists in order to be able to run for multiple values
# Number of people we sample each day
sample_sizes = [200000]

# Proportion of people with some trait (between 0 and 1), 0.1=10%
p_originals = [0.1]

# Number of samples obtained, can be viewed as the number of consecutive days we get data
n_sample_days = [1, 5, 10, 30, 100]

# Create new proportion relative to the original proportion
# 1.1 gives a 10% increase
p_new_proportions = [1.1, 1.01, 1.001]
```
Next, let’s create a function that returns the sampled p values for our tests, according to the arguments that we pass in during the experiment.
```python
def sample(sample_size, sample_days, p):
    """
    Return a sample_days-sized list where each element is the proportion of
    points that are true (with probability p) from a sample_size-sized sample.
    Note: this samples from an infinite population.
    """
    samples = []
    for _ in range(sample_days):
        sample = np.random.binomial(sample_size, p)
        samples.append(sample / sample_size)
    return samples
```
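For example, calling the function for three days of 200,000-person samples at p = 0.1 returns a list of three observed daily proportions (the figures shown in the comment are illustrative only; each run will differ):

```python
sample(200000, sample_days=3, p=0.1)
# e.g. [0.100145, 0.09986, 0.10021]
```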
Finally, let’s run the end-to-end experiment using the Monte Carlo method to simulate the p value outcomes for each interval.
```python
# To collect the results
results = []

# The results are based on random samples, so we need to average over many
# runs to get a stable answer from the simulation
repeats = 50000

for sample_size in sample_sizes:
    for p_orig in p_originals:
        for p_new_proportion in p_new_proportions:
            p_new = round(p_orig * p_new_proportion, 7)
            for sample_days in n_sample_days:
                print(f'Simulating: Sample size: {sample_size} p: {p_orig} New p: {p_new} Sample days: {sample_days}')
                mean_diffs = []
                for _ in range(repeats):
                    sample_p_orig = sample(sample_size, sample_days=sample_days, p=p_orig)
                    sample_p_new = sample(sample_size, sample_days=sample_days, p=p_new)
                    if p_new > p_orig:
                        mean_diff = np.mean(sample_p_new) - np.mean(sample_p_orig)
                    else:
                        mean_diff = np.mean(sample_p_orig) - np.mean(sample_p_new)
                    mean_diffs.append(mean_diff)
                zero_percentile = stats.percentileofscore(mean_diffs, 0)
                results.append({'Sample size': sample_size,
                                'Original expected count': p_orig * sample_size,
                                'New expected count': p_new * sample_size,
                                'p': p_orig,
                                'New p': p_new,
                                '% change in p': 100. / p_orig * p_new - 100,
                                'Sample days': sample_days,
                                'No change percentile': zero_percentile})

# Format results
results_df = pd.DataFrame(results)
results_df = results_df.set_index(['Sample size', 'Original expected count', 'New expected count',
                                   'p', 'New p', '% change in p', 'Sample days']).sort_index()

# Take the symmetric interval that excludes 'no change' between the means,
# i.e. the shaded area of the graph above
results_df['Probability of estimate closer to new value'] = 100. - (results_df['No change percentile'] * 2).clip(upper=100)
```
And let’s display the results:
```python
results_df  # display the results
```
| Sample size | Original expected count | New expected count | p | New p | % change in p | Sample days | No change percentile | Probability of estimate closer to new value |
|---|---|---|---|---|---|---|---|---|
| 200000 | 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 1 | 45.5 | 9 |
| | | | | | | 5 | 40.499 | 19.002 |
| | | | | | | 10 | 36.679 | 26.642 |
| | | | | | | 30 | 28.253 | 43.494 |
| | | | | | | 100 | 14.688 | 70.624 |
| | | 20200 | 0.1 | 0.101 | 1 | 1 | 14.643 | 70.714 |
| | | | | | | 5 | 0.922 | 98.156 |
| | | | | | | 10 | 0.046 | 99.908 |
| | | | | | | 30 | 0 | 100 |
| | | | | | | 100 | 0 | 100 |
| | | 22000 | 0.1 | 0.11 | 10 | 1 | 0 | 100 |
| | | | | | | 5 | 0 | 100 |
| | | | | | | 10 | 0 | 100 |
| | | | | | | 30 | 0 | 100 |
| | | | | | | 100 | 0 | 100 |
It’s also possible to arrive at a similar set of results using a more analytical, statistics-based implementation. This both verifies the simulation above and runs much faster.
```python
# Assuming the same imports and variable declarations as above
results = []

for sample_size in sample_sizes:
    for p_orig in p_originals:
        for p_new_proportion in p_new_proportions:
            p_new = round(p_orig * p_new_proportion, 7)
            for sample_days in n_sample_days:
                # Some of the assumptions only hold as the number of points increases,
                # but for simplicity assume they are true
                n1 = sample_size * sample_days
                binomial_mu1 = p_orig
                binomial_var1 = p_orig * (1 - p_orig) / n1
                binomial_sd1 = np.sqrt(binomial_var1)

                n2 = sample_size * sample_days
                binomial_mu2 = p_new
                binomial_var2 = p_new * (1 - p_new) / n2
                binomial_sd2 = np.sqrt(binomial_var2)

                # The sample proportions above are approximately normal, so their
                # difference is also approximately normal.
                # For simplicity, subtract the (expected) smaller mean from the larger one
                if p_new > p_orig:
                    mu3 = binomial_mu2 - binomial_mu1
                else:
                    mu3 = binomial_mu1 - binomial_mu2

                # Note that the standard deviation of the difference shrinks in proportion
                # to 1/sqrt(sample_size * sample_days), as per the Central Limit Theorem
                sd3 = np.sqrt(binomial_var1 + binomial_var2)
                zero_percentile = 100. * stats.norm.cdf(0, mu3, sd3)

                results.append({'Sample size': sample_size,
                                'Original expected count': p_orig * sample_size,
                                'New expected count': p_new * sample_size,
                                'p': p_orig,
                                'New p': p_new,
                                '% change in p': 100. / p_orig * p_new - 100,
                                'Sample days': sample_days,
                                'No change percentile': round(zero_percentile, 3)})

results_df = pd.DataFrame(results)
results_df = results_df.set_index(['Sample size', 'Original expected count', 'New expected count',
                                   'p', 'New p', '% change in p', 'Sample days']).sort_index()
results_df['Probability of estimate closer to new value'] = 100. - (results_df['No change percentile'] * 2).clip(upper=100)
```
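As a quick spot check – a minimal standalone sketch using one parameter combination that appears in both sets of results – the analytical no-change percentile for a 0.1 → 0.1001 change over 30 days of 200,000-person samples lands very close to the simulated figure of roughly 28.25 in the first table:

```python
import numpy as np
from scipy import stats

sample_size, sample_days, p_orig, p_new = 200000, 30, 0.1, 0.1001

# Normal approximation to the difference of the two mean proportions
n = sample_size * sample_days
mu = p_new - p_orig
sd = np.sqrt(p_orig * (1 - p_orig) / n + p_new * (1 - p_new) / n)

print(round(100. * stats.norm.cdf(0, mu, sd), 3))  # ~28.2, versus ~28.25 simulated
```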
Now let’s display the results (this run also covers a smaller original proportion of 0.01 and a 365-day horizon):
```python
results_df.loc[200000]  # Results when sample size is 200000
```
| Original expected count | New expected count | p | New p | % change in p | Sample days | No change percentile | Probability of estimate closer to new value |
|---|---|---|---|---|---|---|---|
| 2000 | 2002 | 0.01 | 0.01001 | 0.1 | 1 | 48.733 | 2.534 |
| | | | | | 5 | 47.168 | 5.664 |
| | | | | | 10 | 45.998 | 8.004 |
| | | | | | 30 | 43.092 | 13.816 |
| | | | | | 100 | 37.534 | 24.932 |
| | | | | | 365 | 27.191 | 45.618 |
| | 2020 | 0.01 | 0.0101 | 1 | 1 | 37.561 | 24.878 |
| | | | | | 5 | 23.919 | 52.162 |
| | | | | | 10 | 15.804 | 68.392 |
| | | | | | 30 | 4.124 | 91.752 |
| | | | | | 100 | 0.076 | 99.848 |
| | | | | | 365 | 0 | 100 |
| | 2200 | 0.01 | 0.011 | 10 | 1 | 0.096 | 99.808 |
| | | | | | 5 | 0 | 100 |
| | | | | | 10 | 0 | 100 |
| | | | | | 30 | 0 | 100 |
| | | | | | 100 | 0 | 100 |
| | | | | | 365 | 0 | 100 |
| 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 1 | 45.803 | 8.394 |
| | | | | | 5 | 40.685 | 18.63 |
| | | | | | 10 | 36.947 | 26.106 |
| | | | | | 30 | 28.189 | 43.622 |
| | | | | | 100 | 14.597 | 70.806 |
| | | | | | 365 | 2.204 | 95.592 |
| | 20200 | 0.1 | 0.101 | 1 | 1 | 14.645 | 70.71 |
| | | | | | 5 | 0.934 | 98.132 |
| | | | | | 10 | 0.044 | 99.912 |
| | | | | | 30 | 0 | 100 |
| | | | | | 100 | 0 | 100 |
| | | | | | 365 | 0 | 100 |
| | 22000 | 0.1 | 0.11 | 10 | 1 | 0 | 100 |
| | | | | | 5 | 0 | 100 |
| | | | | | 10 | 0 | 100 |
| | | | | | 30 | 0 | 100 |
| | | | | | 100 | 0 | 100 |
| | | | | | 365 | 0 | 100 |
Interpreting the results
If, for example, we had a sample_size of 200,000 each day from an n_people value of 50,000,000, and there was a sudden real change in the expression of a trait (visiting Starbucks) from 2,000 to 2,020 people, then after 30 days we’d be 91.75% sure of getting a better estimate than the original 2,000 value. That would be marginal, but potentially still enough to infer a trend. For a smaller change from 2,000 to 2,002, however, this certainty drops to just 14%, so a change of that size could not be observed with confidence.
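One convenient way to read these tables is to ask, for each change size, how many days of data are needed before the probability crosses a chosen threshold. Here is a small sketch against the results_df produced above (the 90% threshold is an assumption for illustration, not a fixed rule):

```python
# For each parameter combination, find the fewest sample days at which the
# probability of an estimate closer to the new value first reaches the threshold
threshold = 90.0

flat = results_df.reset_index()
passing = flat[flat['Probability of estimate closer to new value'] >= threshold]
min_days_needed = passing.groupby(['Sample size', 'p', 'New p'])['Sample days'].min()
print(min_days_needed)
```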
This dataset differs from the real, or natural, dataset in the following ways:
- The sample_size is constant
- The trait is always measurable for each individual with 100% accuracy
- Measurement accuracy does not vary over time
- The trait p value is the same over the time period (i.e. no seasonality)
- The trait proportion changes abruptly as a step change
- The daily panel sample is unrelated to the adjacent day's sample
- The total population size is fixed
- There are no geographical behavioural biases
In conclusion
We have demonstrated a simulation of perfect-world data, and obtained some guidance as to the point at which we can infer that a change is real. While this is a long way from the reality of the natural dataset, it offers a sense of where the boundaries lie and the approximate signal strength that could be found in an ideal scenario. In a subsequent article we will extend the simulation to examine more natural behavioural data.