How Much Data Do You Need to Measure Trends in Footfall?

In a world where data can be noisy, it’s sometimes useful to step away from the nitty-gritty and recover some perspective by contemplating the ideal scenario. Doing so helps us understand the limits of what can be inferred from a dataset, and gives a sense of the minimum amount of data needed to answer a question.

Of course, in a perfect world the modelled data and the real data would match exactly; naturally, they do not. When building a model to help analyse or predict behaviour, we start with the real data and work towards our synthesis. When handling the data directly, however, we quickly encounter a number of common issues that require consideration.

Common challenges

  • we may only have a limited number of samples, or a limited history
  • thematic events produce behavioural anomalies (e.g. holidays, weather events)
  • the technical environment in which the data is collected changes over time
  • non-adjacency of data due to churn among users in the underlying panel sample

For the purposes of this exercise, however, let’s assume a perfect world so that we can focus on the minimum data requirement to support a model. Let’s imagine we have a real population of 50 million people. Let’s also say we know that a proportion of those people on a given day exhibit some (binary) trait, for example that they visited a Starbucks outlet. Let’s call this proportion p.

Our approach

We might then go on to ask: if p changed, what is the probability that we’d even notice? To answer our question, ‘how much data is enough data?’, we’ll run multiple iterations of the same test with increasing volumes of data and measure how the observed value of p – the proportion of the panel that did [the thing] – varies. Our expectation is that the larger the panel we use, the less variance we’ll see. The added dimension to this question is time: to what extent does changing the number of days in the experiment change the results that we can observe with confidence?
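
To make that expectation concrete, here is a quick illustrative sketch (not part of the experiment itself) of how the standard error of an estimated proportion, sqrt(p × (1 − p) / n), shrinks as the panel size n grows:

import numpy as np

# Standard error of an estimated daily proportion for p = 0.1 at different panel sizes
p = 0.1
for n in [1_000, 10_000, 100_000, 1_000_000]:
    se = np.sqrt(p * (1 - p) / n)
    print(f'panel size {n:>9,}: standard error {se:.5f}')

Quadrupling the panel only halves the noise, which is why the number of days of data will matter as much as the daily sample size.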

Setting up our variables

Let’s assume we have a single data point for each user every day. The number of daily points will therefore be equal to the panel size that we are using in this increment of the test (sample_size). We’ll call the total available population of 50 million people n_people. As we increase sample_size and run multiple versions of each test, we’ll look at how the variation in those results decreases, and at what point it reaches an acceptable level of consistency.

To handle the dimension of time, we’ll run this exercise over multiple days (n_days), and see what happens when the p value – the proportion of panel exhibiting the trait – changes to a new value. At that stage we’ll sample the same number of people from the daily population, and measure the number of those expressing this trait with the new value of p.

Using the mean of the first sample (with the original proportion p) and the mean of the second sample (where the proportion has changed), we can take the difference between these means. Repeating this many times produces a distribution of mean differences. From that distribution we can measure the area that sits closer to the real change than to ‘no change’ (a confidence interval), as shown in the graph below. In this chart example the difference between the proportions is 0.01.

Our goal will be to find this shaded area for our different parameters (sample sizes, number of days, and so on), as the shaded area is the probability of obtaining an estimate for p that is closer to the new value than to the original one. We will implement this in two ways – firstly using a simulation (Monte Carlo), and secondly using a more analytical approach.
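
Put another way, if we write D̄ for the difference between the two sample means over the experiment, and assume its distribution is symmetric about the true change in p, the shaded area works out as

    shaded area ≈ 1 − 2 × P(D̄ ≤ 0)

which is exactly how the ‘Probability of estimate closer to new value’ column is computed in the code that follows.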

The implementation using Python

The following code snippets describe the implementation of this experiment in Python, using the two different approaches (simulation versus analytical). As per our variable design above, let’s go ahead and define these in the code.

# Import Python modules
 
import numpy as np
import pandas as pd
from scipy import stats
 
# Assumed number of people in the full population (for context only – the
# sampling below effectively draws from an infinite population)
n_people = 50_000_000

# The variables below are lists so that the experiment can be run for multiple values
 
# Number of people we sample each day
sample_sizes = [200000]
 
# Proportion of people with some trait (between 0 and 1), 0.1=10%
p_originals = [0.1]
 
# Number of samples obtained, can be viewed as the number of consecutive days we get data
n_sample_days = [1, 5, 10, 30, 100]
 
# Create new proportion relative to the original proportion
# 1.1 gives a 10% increase
p_new_proportions = [1.1, 1.01, 1.001]

Next, let’s create a function that returns, for each day of the test, the observed proportion of the sample exhibiting the trait, given the sample size, number of days and probability p that we pass in.

def sample(sample_size, sample_days, p):
    """Return a list of length sample_days, where each element is the observed
    proportion of positive points (each true with probability p) in a sample of
    sample_size points.
    Note: this samples from an effectively infinite population.
    """
    samples = []
    for _ in range(sample_days):
        count = np.random.binomial(sample_size, p)  # number of positives on that day
        samples.append(count / sample_size)         # convert the count to a proportion
    return samples
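
As a quick sanity check (purely illustrative), calling the function for a few days of data returns daily proportions that hover around p:

# Three simulated days for a 200,000-person panel with p = 0.1
# Prints something like [0.09985, 0.100105, 0.10004] – values will vary from run to run
print(sample(200000, sample_days=3, p=0.1))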

Finally, let’s run the end-to-end experiment using the Monte Carlo method to simulate the p value outcomes for each interval.

# To collect the results
results = []
 
# The results are based on random samples, so we need to average over many runs to get a stable answer from the simulation
repeats = 50000
 
for sample_size in sample_sizes:
    for p_orig in p_originals:
        for p_new_proportion in p_new_proportions:
            p_new = round(p_orig * p_new_proportion, 7)
            for sample_days in n_sample_days:
                print (f'Simulating:  Sample size: {sample_size}  p: {p_orig}  New p:{p_new}  Sample days: {sample_days}')
 
                mean_diffs = []
                for _ in range(repeats):
                    sample_p_orig = sample(sample_size, sample_days=sample_days, p=p_orig)
                    sample_p_new = sample(sample_size, sample_days=sample_days, p=p_new)
 
                    if p_new > p_orig:
                        mean_diff = np.mean(sample_p_new) - np.mean(sample_p_orig)
                    else:
                        mean_diff = np.mean(sample_p_orig) - np.mean(sample_p_new)
                    mean_diffs.append(mean_diff)
                zero_percentile = stats.percentileofscore(mean_diffs, 0)
 
                results.append({'Sample size':sample_size,
                                'Original expected count': p_orig*sample_size,
                                'New expected count': p_new*sample_size,
                                'p': p_orig,
                                'New p': p_new,
                                '% change in p': 100./p_orig * p_new - 100,
                                'Sample days': sample_days,
                                'No change percentile': zero_percentile
                               })
 
# Format results
results_df = pd.DataFrame(results)
results_df = results_df.set_index(['Sample size', 'Original expected count', 'New expected count',
                                   'p','New p', '% change in p', 'Sample days']).sort_index()
 
# Get the area (assuming symmetry) that excludes the case where the means are the same, i.e. the shaded area of the graph above
results_df['Probability of estimate closer to new value'] = 100. - (results_df['No change percentile']*2).clip(upper=100)
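
With repeats set to 50,000 the simulation above can take a while to run. As an optional speed-up (a sketch rather than part of the original implementation), NumPy can draw all of the repeats for one configuration in a single call, replacing the inner for _ in range(repeats) loop:

# Vectorised alternative to the inner repeats loop
# (assumes p_new > p_orig; swap the two terms otherwise, as in the loop above)
counts_orig = np.random.binomial(sample_size, p_orig, size=(repeats, sample_days))
counts_new = np.random.binomial(sample_size, p_new, size=(repeats, sample_days))
mean_diffs = (counts_new.mean(axis=1) - counts_orig.mean(axis=1)) / sample_size
zero_percentile = stats.percentileofscore(mean_diffs, 0)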

And let’s display the results:

results_df  # display the results
Sample size | Original expected count | New expected count | p | New p | % change in p | Sample days | No change percentile | Probability of estimate closer to new value
200000 | 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 1 | 45.5 | 9
200000 | 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 5 | 40.499 | 19.002
200000 | 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 10 | 36.679 | 26.642
200000 | 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 30 | 28.253 | 43.494
200000 | 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 100 | 14.688 | 70.624
200000 | 20000 | 20200 | 0.1 | 0.101 | 1 | 1 | 14.643 | 70.714
200000 | 20000 | 20200 | 0.1 | 0.101 | 1 | 5 | 0.922 | 98.156
200000 | 20000 | 20200 | 0.1 | 0.101 | 1 | 10 | 0.046 | 99.908
200000 | 20000 | 20200 | 0.1 | 0.101 | 1 | 30 | 0 | 100
200000 | 20000 | 20200 | 0.1 | 0.101 | 1 | 100 | 0 | 100
200000 | 20000 | 22000 | 0.1 | 0.11 | 10 | 1 | 0 | 100
200000 | 20000 | 22000 | 0.1 | 0.11 | 10 | 5 | 0 | 100
200000 | 20000 | 22000 | 0.1 | 0.11 | 10 | 10 | 0 | 100
200000 | 20000 | 22000 | 0.1 | 0.11 | 10 | 30 | 0 | 100
200000 | 20000 | 22000 | 0.1 | 0.11 | 10 | 100 | 0 | 100

It’s also possible to arrive at a similar set of results to the above using a more analytical, statistics-based implementation. This both verifies the above simulation, and runs much faster.
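
The idea, restated from the code that follows, is that over n = sample_size × sample_days points each observed proportion is approximately normal, so the difference between the two proportions is approximately normal as well:

    D̄ ~ Normal( mean = |p_new − p_orig|, variance = p_orig(1 − p_orig)/n + p_new(1 − p_new)/n )

The ‘No change percentile’ is then simply this normal distribution’s CDF evaluated at zero.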

# Assuming the same imports and variable declarations as above; for the results shown below,
# the lists were extended to p_originals = [0.01, 0.1] and n_sample_days = [1, 5, 10, 30, 100, 365]
 
results = []
for sample_size in sample_sizes:
    for p_orig in p_originals:
        for p_new_proportion in p_new_proportions:
            p_new = round(p_orig * p_new_proportion, 7)
            for sample_days in n_sample_days:
                # The normal approximations below only become accurate as the number of points increases, but for simplicity assume they hold
 
                n1 = sample_size * sample_days
                binomial_mu1 = p_orig
                binomial_var1 = p_orig * (1-p_orig) / n1
                binomial_sd1 = np.sqrt(binomial_var1)
 
                n2 = sample_size * sample_days
                binomial_mu2 = p_new
                binomial_var2 = p_new * (1-p_new)/ n2
                binomial_sd2 = np.sqrt(binomial_var2)
 
                # Each of the above is approximately normal, so their difference is also approximately normal
                # For simplicity, subtract the (expected) smaller mean from the larger so the expected difference is positive
                if p_new > p_orig:
                    mu3 = binomial_mu2 - binomial_mu1
                else:
                    mu3 = binomial_mu1 - binomial_mu2
                # Note that the standard deviation shrinks in proportion to 1 / sqrt(sample_size * sample_days), in line with the Central Limit Theorem
                sd3 = np.sqrt(binomial_var1 + binomial_var2)
 
                zero_percentile = 100. * stats.norm.cdf(0, mu3, sd3)
 
                results.append({'Sample size': sample_size,
                                'Original expected count': p_orig*sample_size,
                                'New expected count': p_new*sample_size,
                                'p': p_orig,
                                'New p': p_new,
                                '% change in p': 100./p_orig * p_new - 100,
                                'Sample days': sample_days,
                                'No change percentile': round(zero_percentile, 3)
                               })
 
results_df = pd.DataFrame(results)
results_df = results_df.set_index(['Sample size', 'Original expected count', 'New expected count',
                      'p','New p', '% change in p', 'Sample days']).sort_index()
results_df['Probability of estimate closer to new value'] = 100. - (results_df['No change percentile']*2).clip(upper=100)

Now let’s display the results:

results_df.loc[200000]  # Results when sample size is 200000
Original expected count | New expected count | p | New p | % change in p | Sample days | No change percentile | Probability of estimate closer to new value
2000 | 2002 | 0.01 | 0.01001 | 0.1 | 1 | 48.733 | 2.534
2000 | 2002 | 0.01 | 0.01001 | 0.1 | 5 | 47.168 | 5.664
2000 | 2002 | 0.01 | 0.01001 | 0.1 | 10 | 45.998 | 8.004
2000 | 2002 | 0.01 | 0.01001 | 0.1 | 30 | 43.092 | 13.816
2000 | 2002 | 0.01 | 0.01001 | 0.1 | 100 | 37.534 | 24.932
2000 | 2002 | 0.01 | 0.01001 | 0.1 | 365 | 27.191 | 45.618
2000 | 2020 | 0.01 | 0.0101 | 1 | 1 | 37.561 | 24.878
2000 | 2020 | 0.01 | 0.0101 | 1 | 5 | 23.919 | 52.162
2000 | 2020 | 0.01 | 0.0101 | 1 | 10 | 15.804 | 68.392
2000 | 2020 | 0.01 | 0.0101 | 1 | 30 | 4.124 | 91.752
2000 | 2020 | 0.01 | 0.0101 | 1 | 100 | 0.076 | 99.848
2000 | 2020 | 0.01 | 0.0101 | 1 | 365 | 0 | 100
2000 | 2200 | 0.01 | 0.011 | 10 | 1 | 0.096 | 99.808
2000 | 2200 | 0.01 | 0.011 | 10 | 5 | 0 | 100
2000 | 2200 | 0.01 | 0.011 | 10 | 10 | 0 | 100
2000 | 2200 | 0.01 | 0.011 | 10 | 30 | 0 | 100
2000 | 2200 | 0.01 | 0.011 | 10 | 100 | 0 | 100
2000 | 2200 | 0.01 | 0.011 | 10 | 365 | 0 | 100
20000 | 20020 | 0.1 | 0.1001 | 0.1 | 1 | 45.803 | 8.394
20000 | 20020 | 0.1 | 0.1001 | 0.1 | 5 | 40.685 | 18.63
20000 | 20020 | 0.1 | 0.1001 | 0.1 | 10 | 36.947 | 26.106
20000 | 20020 | 0.1 | 0.1001 | 0.1 | 30 | 28.189 | 43.622
20000 | 20020 | 0.1 | 0.1001 | 0.1 | 100 | 14.597 | 70.806
20000 | 20020 | 0.1 | 0.1001 | 0.1 | 365 | 2.204 | 95.592
20000 | 20200 | 0.1 | 0.101 | 1 | 1 | 14.645 | 70.71
20000 | 20200 | 0.1 | 0.101 | 1 | 5 | 0.934 | 98.132
20000 | 20200 | 0.1 | 0.101 | 1 | 10 | 0.044 | 99.912
20000 | 20200 | 0.1 | 0.101 | 1 | 30 | 0 | 100
20000 | 20200 | 0.1 | 0.101 | 1 | 100 | 0 | 100
20000 | 20200 | 0.1 | 0.101 | 1 | 365 | 0 | 100
20000 | 22000 | 0.1 | 0.11 | 10 | 1 | 0 | 100
20000 | 22000 | 0.1 | 0.11 | 10 | 5 | 0 | 100
20000 | 22000 | 0.1 | 0.11 | 10 | 10 | 0 | 100
20000 | 22000 | 0.1 | 0.11 | 10 | 30 | 0 | 100
20000 | 22000 | 0.1 | 0.11 | 10 | 100 | 0 | 100
20000 | 22000 | 0.1 | 0.11 | 10 | 365 | 0 | 100

Interpreting the results

If each day – for example – we had a sample_size of 200,000 drawn from an n_people population of 50,000,000, and there was a sudden real change in the expression of a trait (visiting Starbucks) such that the expected daily count in our sample rose from 2,000 to 2,020 people, then after 30 days we’d be 91.75% sure of obtaining an estimate closer to the new value than to the original 2,000. That would be marginal, but potentially still enough to infer a trend. For a smaller change, from 2,000 to 2,002, that certainty falls to around 14% after 30 days – a change of that size could not be observed with confidence.
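
For reference, one way to pull figures like these out of the analytical results (assuming the results_df built above) is to take a cross-section at 30 sample days:

# Illustrative lookup: probability of an estimate closer to the new value after
# 30 days of data, for every combination of sample size and change in p
thirty_days = results_df.xs(30, level='Sample days')
print(thirty_days['Probability of estimate closer to new value'])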

This dataset differs from the real, or natural dataset in the following ways:

  • The sample_size is constant
  • The trait is always measurable for each individual with 100% accuracy
  • Measurement accuracy does not vary over time
  • The trait p value is the same over the whole time period (i.e. no seasonality)
  • The trait proportion changes abruptly as a step change
  • Each day’s panel sample is drawn independently of the adjacent day’s (no panel continuity)
  • The total population size is fixed
  • There are no geographical behavioural biases

In conclusion

We have demonstrated a simulation using perfect-world data, and obtained some guidance as to the point at which we can infer that a change is real. While this is a long way from the reality of a natural dataset, it offers a sense of where the boundaries lie and of the approximate signal strength that could be detected in an ideal scenario. In a subsequent article we will extend the simulation to examine more natural behavioural data.