# How Much Data Do You Need to Measure Trends in Footfall?

In a world where data can be noisy, it’s sometimes useful to step away from the nitty-gritty and recover some perspective by contemplating the ideal scenario. This helps us to understand the limits of what could be inferred from a dataset, and provides insight into the minimum amount of data needed to answer a question.

Of course, in a perfect world the modelled and the real data would match perfectly, and naturally they do not. When building a model to help analyse or predict behaviours, we start with the real data and work towards our synthesis. When handling the data directly, however, one quickly encounters a number of common issues that require consideration.

**Common challenges**

- we may only have a limited number of samples or history
- thematic events produce behavioural anomalies (e.g. holidays, weather events)
- changes over time to the technical environment in which the data is collected
- non-adjacency of data due to the churn of users in the underlying panel sample

For the purpose of this exercise, however, to understand the minimum data requirement to support a model, let’s assume a perfect world. Let’s imagine we have a real population of 50 million people. Let’s also say we know that a proportion of those people on a given day exhibit some (binary) trait, for example that they visited a Starbucks outlet. Let’s call this proportion `p`. If `p` were 0.1, for example, 5 million of those 50 million people would visit a Starbucks on a given day.

**Our approach**

We then might go on to ask: if `p` changed, what’s the probability that we’d even notice? To answer our question, *‘how much data is enough data?’*, we’ll run multiple iterations of the same test with increasing volumes of data, and measure how the `p` value (the proportion of the panel that did [the thing]) varies. Our expectation is that the larger the panel we use, the less variance we’ll see. The added dimension to this question is time: to what extent does modifying the number of days in the experiment change the results that we observe with confidence?

**Setting up our variables**

Let’s assume we have a single data point for each user every day. The number of daily points will be equivalent to the panel size that we are using in this increment of the test (`sample_size`). We’ll call the total available population of 50 million people `n_people`. As we increase the `sample_size` and run multiple versions of each test, we’ll be looking to see how the variation in those results decreases, until we reach an acceptable level of consistency.

To handle the dimension of time, we’ll run this exercise over multiple days (`n_sample_days`), and see what happens when the `p` value (the proportion of the panel exhibiting the trait) changes to a new value. At that stage we’ll sample the same number of people from the daily population, and measure the number of those expressing the trait with the new value of `p`.

Using the (mean) average of the first sample (with the original proportion `p`) and the mean of the second sample (where the proportion has changed), we can take the difference between these means, and repeat this many times to produce a distribution of differences. Using that distribution, we can find the area that lies closer to the real change than to no change at all (a confidence interval), as shown in the graph below. In this chart example the difference between the proportions is 0.01.

Our goal will be to find the shaded area for our different parameters (sample sizes etc.), as the shaded area is the probability of finding an estimate for `p` that is closer to the new value than to the original value of `p`. We will implement this in two ways: first using a simulation (Monte Carlo), and second using a more analytical approach.
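
To make that concrete, here is a minimal sketch of how such a distribution of mean differences can be generated. The specific parameter values are illustrative assumptions on our part, chosen to match the 0.01 difference in proportions described for the chart:

```python
import numpy as np

# Illustrative assumptions: p moves from 0.10 to 0.11, i.e. a 0.01
# difference in proportions, as in the chart described above
sample_size, p_orig, p_new, repeats = 200000, 0.10, 0.11, 10000

rng = np.random.default_rng(0)

# For each repeat, draw one day's sample under each proportion and take
# the difference between the two observed proportions
diffs = (rng.binomial(sample_size, p_new, size=repeats)
         - rng.binomial(sample_size, p_orig, size=repeats)) / sample_size

# The distribution of differences is centred near the real change of 0.01
print(diffs.mean(), diffs.std())
```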

**The implementation using Python**

The following code snippets describe the implementation of this experiment in Python, using the two different approaches (simulation versus analytical). As per our variable design above, let’s go ahead and define these in the code.

```python
# Import Python modules
import numpy as np
import pandas as pd
from scipy import stats

# Assumed number of people in the full population
n_people = 50000000

# The below variables are lists in order to be able to run for multiple values

# Number of people we sample each day
sample_sizes = [200000]

# Proportion of people with some trait (between 0 and 1), 0.1 = 10%
p_originals = [0.1]

# Number of samples obtained; can be viewed as the number of consecutive days we get data
n_sample_days = [1, 5, 10, 30, 100]

# Create new proportions relative to the original proportion
# 1.1 gives a 10% increase
p_new_proportions = [1.1, 1.01, 1.001]
```

Next, let’s create a function that returns the sampled proportions (the observed `p` values) to use in our tests, according to the arguments that we pass in during our experiment.

```python
def sample(sample_size, sample_days, p):
    """
    Return a sample_days-sized list where each element is the proportion
    of points that are true (with probability p) from a sample_size-sized
    sample.

    Note: this samples from an infinite population.
    """
    samples = []
    for _ in range(sample_days):
        sample = np.random.binomial(sample_size, p)
        samples.append(sample / sample_size)
    return samples
```
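
As a quick sanity check of this function (the values below are illustrative, not from the original analysis), each element it returns should sit close to `p`:

```python
# Three days of samples at p = 0.1; at this sample size each observed
# proportion should typically land within about 0.002 of 0.1
print(sample(sample_size=200000, sample_days=3, p=0.1))
```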

Finally, let’s run the end-to-end experiment, using the Monte Carlo method to simulate the `p` value outcomes for each interval.

```python
# To collect the results
results = []

# The results are based on random samples, so we need to average over
# many runs to get a stable answer from the simulation
repeats = 50000

for sample_size in sample_sizes:
    for p_orig in p_originals:
        for p_new_proportion in p_new_proportions:
            p_new = round(p_orig * p_new_proportion, 7)
            for sample_days in n_sample_days:
                print(f'Simulating: Sample size: {sample_size} p: {p_orig} '
                      f'New p: {p_new} Sample days: {sample_days}')
                mean_diffs = []
                for _ in range(repeats):
                    sample_p_orig = sample(sample_size, sample_days=sample_days, p=p_orig)
                    sample_p_new = sample(sample_size, sample_days=sample_days, p=p_new)
                    # Subtract the (expected) smaller mean from the larger one
                    if p_new > p_orig:
                        mean_diff = np.mean(sample_p_new) - np.mean(sample_p_orig)
                    else:
                        mean_diff = np.mean(sample_p_orig) - np.mean(sample_p_new)
                    mean_diffs.append(mean_diff)
                # Percentage of simulated differences that fall at or below zero
                zero_percentile = stats.percentileofscore(mean_diffs, 0)
                results.append({'Sample size': sample_size,
                                'Original expected count': p_orig * sample_size,
                                'New expected count': p_new * sample_size,
                                'p': p_orig,
                                'New p': p_new,
                                '% change in p': 100. / p_orig * p_new - 100,
                                'Sample days': sample_days,
                                'No change percentile': zero_percentile})

# Format results
results_df = pd.DataFrame(results)
results_df = results_df.set_index(['Sample size', 'Original expected count',
                                   'New expected count', 'p', 'New p',
                                   '% change in p', 'Sample days']).sort_index()

# Get the interval (assuming symmetry) that does not contain the case where
# the means are the same, i.e. the shaded area of the graph above
results_df['Probability of estimate closer to new value'] = \
    100. - (results_df['No change percentile'] * 2).clip(upper=100)
```

And let’s display the results:

```python
results_df  # display the results
```

| Sample size | Original expected count | New expected count | p | New p | % change in p | Sample days | No change percentile | Probability of estimate closer to new value |
|---|---|---|---|---|---|---|---|---|
| 200000 | 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 1 | 45.5 | 9 |
| 200000 | 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 5 | 40.499 | 19.002 |
| 200000 | 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 10 | 36.679 | 26.642 |
| 200000 | 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 30 | 28.253 | 43.494 |
| 200000 | 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 100 | 14.688 | 70.624 |
| 200000 | 20000 | 20200 | 0.1 | 0.101 | 1 | 1 | 14.643 | 70.714 |
| 200000 | 20000 | 20200 | 0.1 | 0.101 | 1 | 5 | 0.922 | 98.156 |
| 200000 | 20000 | 20200 | 0.1 | 0.101 | 1 | 10 | 0.046 | 99.908 |
| 200000 | 20000 | 20200 | 0.1 | 0.101 | 1 | 30 | 0 | 100 |
| 200000 | 20000 | 20200 | 0.1 | 0.101 | 1 | 100 | 0 | 100 |
| 200000 | 20000 | 22000 | 0.1 | 0.11 | 10 | 1 | 0 | 100 |
| 200000 | 20000 | 22000 | 0.1 | 0.11 | 10 | 5 | 0 | 100 |
| 200000 | 20000 | 22000 | 0.1 | 0.11 | 10 | 10 | 0 | 100 |
| 200000 | 20000 | 22000 | 0.1 | 0.11 | 10 | 30 | 0 | 100 |
| 200000 | 20000 | 22000 | 0.1 | 0.11 | 10 | 100 | 0 | 100 |

It’s also possible to arrive at a similar set of results using a more analytical, statistics-based implementation. This both verifies the simulation above, and runs much faster.
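
Concretely, the analytical approach models each observed proportion with the normal approximation to the binomial, so the difference of the two sample means is itself approximately normal. Sketching the quantities the code computes (the symbols $\mu_3$ and $\sigma_3$ mirror the variable names in the code below; $\Phi$ is the standard normal CDF and $n$ is `sample_size` multiplied by `sample_days`):

$$
\hat{p}_i \sim \mathcal{N}\!\left(p_i,\ \frac{p_i(1-p_i)}{n}\right),
\qquad
\mu_3 = \lvert p_\text{new} - p_\text{orig} \rvert,
\qquad
\sigma_3 = \sqrt{\frac{p_\text{orig}(1-p_\text{orig}) + p_\text{new}(1-p_\text{new})}{n}}
$$

The ‘no change percentile’ is then $100 \cdot \Phi\!\left(\tfrac{0-\mu_3}{\sigma_3}\right)$, and the probability of an estimate closer to the new value is $100 - 2 \cdot 100 \cdot \Phi\!\left(\tfrac{0-\mu_3}{\sigma_3}\right)$, floored at zero (the `.clip` in the code).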

```python
# Assuming the same imports and variable declarations as above
results = []

for sample_size in sample_sizes:
    for p_orig in p_originals:
        for p_new_proportion in p_new_proportions:
            p_new = round(p_orig * p_new_proportion, 7)
            for sample_days in n_sample_days:
                # Some of the assumptions only hold as the number of points
                # increases, but for simplicity assume they are true
                n1 = sample_size * sample_days
                binomial_mu1 = p_orig
                binomial_var1 = p_orig * (1 - p_orig) / n1
                binomial_sd1 = np.sqrt(binomial_var1)

                n2 = sample_size * sample_days
                binomial_mu2 = p_new
                binomial_var2 = p_new * (1 - p_new) / n2
                binomial_sd2 = np.sqrt(binomial_var2)

                # The above are approximately normal, so their difference is
                # also approximately normal.
                # For simplicity, subtract the (expected) smaller from the larger
                if p_new > p_orig:
                    mu3 = binomial_mu2 - binomial_mu1
                else:
                    mu3 = binomial_mu1 - binomial_mu2

                # Note that the standard deviation shrinks in proportion to
                # 1/sqrt(sample_size * sample_days), as per the Central Limit Theorem
                sd3 = np.sqrt(binomial_var1 + binomial_var2)

                # Probability mass of the difference at or below zero
                zero_percentile = 100. * stats.norm.cdf(0, mu3, sd3)
                results.append({'Sample size': sample_size,
                                'Original expected count': p_orig * sample_size,
                                'New expected count': p_new * sample_size,
                                'p': p_orig,
                                'New p': p_new,
                                '% change in p': 100. / p_orig * p_new - 100,
                                'Sample days': sample_days,
                                'No change percentile': round(zero_percentile, 3)})

results_df = pd.DataFrame(results)
results_df = results_df.set_index(['Sample size', 'Original expected count',
                                   'New expected count', 'p', 'New p',
                                   '% change in p', 'Sample days']).sort_index()
results_df['Probability of estimate closer to new value'] = \
    100. - (results_df['No change percentile'] * 2).clip(upper=100)
```
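
As a spot check of the formula, here is a self-contained sketch of a single cell of the table below (sample size 200,000, `p` moving from 0.1 to 0.101, one day of data), which reproduces the roughly 70.7% figure that the simulation also found:

```python
import numpy as np
from scipy import stats

sample_size, sample_days = 200000, 1
p_orig, p_new = 0.1, 0.101

# Normal approximation to the difference of the two observed proportions
n = sample_size * sample_days
mu3 = p_new - p_orig
sd3 = np.sqrt(p_orig * (1 - p_orig) / n + p_new * (1 - p_new) / n)

zero_percentile = 100. * stats.norm.cdf(0, mu3, sd3)
print(100. - 2 * zero_percentile)  # approximately 70.7
```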

Now let’s display the results (this run also includes `p` = 0.01 and a 365-day horizon, extending the value lists defined earlier):

```python
results_df.loc[200000]  # results when sample size is 200000
```

| Original expected count | New expected count | p | New p | % change in p | Sample days | No change percentile | Probability of estimate closer to new value |
|---|---|---|---|---|---|---|---|
| 2000 | 2002 | 0.01 | 0.01001 | 0.1 | 1 | 48.733 | 2.534 |
| 2000 | 2002 | 0.01 | 0.01001 | 0.1 | 5 | 47.168 | 5.664 |
| 2000 | 2002 | 0.01 | 0.01001 | 0.1 | 10 | 45.998 | 8.004 |
| 2000 | 2002 | 0.01 | 0.01001 | 0.1 | 30 | 43.092 | 13.816 |
| 2000 | 2002 | 0.01 | 0.01001 | 0.1 | 100 | 37.534 | 24.932 |
| 2000 | 2002 | 0.01 | 0.01001 | 0.1 | 365 | 27.191 | 45.618 |
| 2000 | 2020 | 0.01 | 0.0101 | 1 | 1 | 37.561 | 24.878 |
| 2000 | 2020 | 0.01 | 0.0101 | 1 | 5 | 23.919 | 52.162 |
| 2000 | 2020 | 0.01 | 0.0101 | 1 | 10 | 15.804 | 68.392 |
| 2000 | 2020 | 0.01 | 0.0101 | 1 | 30 | 4.124 | 91.752 |
| 2000 | 2020 | 0.01 | 0.0101 | 1 | 100 | 0.076 | 99.848 |
| 2000 | 2020 | 0.01 | 0.0101 | 1 | 365 | 0 | 100 |
| 2000 | 2200 | 0.01 | 0.011 | 10 | 1 | 0.096 | 99.808 |
| 2000 | 2200 | 0.01 | 0.011 | 10 | 5 | 0 | 100 |
| 2000 | 2200 | 0.01 | 0.011 | 10 | 10 | 0 | 100 |
| 2000 | 2200 | 0.01 | 0.011 | 10 | 30 | 0 | 100 |
| 2000 | 2200 | 0.01 | 0.011 | 10 | 100 | 0 | 100 |
| 2000 | 2200 | 0.01 | 0.011 | 10 | 365 | 0 | 100 |
| 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 1 | 45.803 | 8.394 |
| 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 5 | 40.685 | 18.63 |
| 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 10 | 36.947 | 26.106 |
| 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 30 | 28.189 | 43.622 |
| 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 100 | 14.597 | 70.806 |
| 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 365 | 2.204 | 95.592 |
| 20000 | 20200 | 0.1 | 0.101 | 1 | 1 | 14.645 | 70.71 |
| 20000 | 20200 | 0.1 | 0.101 | 1 | 5 | 0.934 | 98.132 |
| 20000 | 20200 | 0.1 | 0.101 | 1 | 10 | 0.044 | 99.912 |
| 20000 | 20200 | 0.1 | 0.101 | 1 | 30 | 0 | 100 |
| 20000 | 20200 | 0.1 | 0.101 | 1 | 100 | 0 | 100 |
| 20000 | 20200 | 0.1 | 0.101 | 1 | 365 | 0 | 100 |
| 20000 | 22000 | 0.1 | 0.11 | 10 | 1 | 0 | 100 |
| 20000 | 22000 | 0.1 | 0.11 | 10 | 5 | 0 | 100 |
| 20000 | 22000 | 0.1 | 0.11 | 10 | 10 | 0 | 100 |
| 20000 | 22000 | 0.1 | 0.11 | 10 | 30 | 0 | 100 |
| 20000 | 22000 | 0.1 | 0.11 | 10 | 100 | 0 | 100 |
| 20000 | 22000 | 0.1 | 0.11 | 10 | 365 | 0 | 100 |

**Interpreting the results**

If, for example, we had a daily `sample_size` of 200,000 from an `n_people` value of 50,000,000, and there was a sudden real difference in the expression of a trait (visiting Starbucks) from 2,000 to 2,020 people, then after 30 days we’d be 91.75% sure of getting a better estimate than the original 2,000 value. This might be difficult, but still possible, to use to infer a trend. For a smaller change from 2,000 to 2,002, however, this certainty drops to just under 14%, so a change of that size could not be observed with confidence.
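
The same normal approximation can be inverted to ask how many days of data are needed before a change of a given size can be seen with a chosen level of certainty. The helper below is a sketch under those assumptions; the function name and the 95% default target are ours, not part of the original analysis:

```python
import math
from scipy import stats

def days_needed(sample_size, p_orig, p_new, target=0.95):
    """Smallest number of days for which the probability of getting an
    estimate closer to the new value exceeds `target`, under the normal
    approximation used in the analytical implementation above."""
    # Solve 1 - 2 * Phi(-mu3 / sd3) >= target for the total number of points
    z = stats.norm.ppf((1 + target) / 2)
    variance_sum = p_orig * (1 - p_orig) + p_new * (1 - p_new)
    n_total = variance_sum * (z / abs(p_new - p_orig)) ** 2
    return math.ceil(n_total / sample_size)

# A change from p=0.1 to p=0.101 with 200,000 daily samples needs ~4 days
# for 95% certainty, consistent with the table above (70.7% at 1 day,
# 98.1% at 5 days)
print(days_needed(200000, 0.1, 0.101))
```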

This dataset differs from the real, or natural, dataset in the following ways:

- The `sample_size` is constant
- The trait is always measurable for each individual with 100% accuracy
- Measurement accuracy does not vary over time
- The trait’s `p` value is the same over the time period (i.e. no seasonality)
- The trait proportion changes abruptly as a step change
- The daily panel sample is unrelated to the adjacent day’s group
- The total population size is fixed
- There are no geographical behavioural biases

**In conclusion**

We have demonstrated a simulation of perfect-world data, and obtained some guidance as to the point at which we can infer that a change is real. While this is a long way from the reality of the natural dataset, it offers a sense of where the boundaries lie and of the approximate signal strength that could be found in an ideal scenario. In a subsequent article we will extend the simulation to examine more natural behavioural data.
