# How Much Data Do You Need to Measure Trends in Footfall?

In a world where data can be noisy, it’s sometimes useful to step away from the nitty-gritty and recover some perspective by contemplating the ideal scenario. This helps us to understand the limits of what could be inferred from a dataset, and provides insight into what the minimum amount of data needed may be to answer a question.

Of course, in a perfect world the modelled and the real data would match perfectly – and naturally they do not. When building a model to help analyse or predict behaviours, we start with the real data and work towards our synthesis. When handling the data directly however, one quickly encounters a number of common issues that require consideration.

**Common challenges**

- we may only have a limited number of samples or history
- thematic events produce behavioural anomalies (ie. holidays, weather events)
- changes to the technical environment in which the data is collected over time
- nonadjacency of data due to the churn of users in the underlying panel sample

For the purpose of this exercise however, to understand the minimum data requirement to support a model, let’s assume a perfect world. Let’s imagine we have a real population of 50 million people. Let’s also say we know that a proportion of those people on a given day exhibit some (binary) trait, for example that they visited a Starbucks outlet. Let’s call this proportion `p`

.

**Our approach**

We then might go on to ask that if `p`

changed, what’s the probability that we’d even notice? To answer our question, *‘how much data is enough data’*, we’ll run multiple iterations of the same test with increasing volumes of data and measure how the `p`

value – the proportion of the panel that did [the thing] varies. Our expectation is that the larger the panel we use, the less variance we’ll see. The added dimension to this question is time; to what extent does modifying the number of days in the experiment produce a change in the results that we observe with confidence?

**Setting up our variables**

Let’s assume we have a single data point for each user every day. The number of daily points will be equivalent to the panel size that we are using in this increment of the test (`sample_size`

). We’ll call the total available population of 50 million people `n_people`

. As we increase the `sample_size`

and run multiple versions of each test, we’ll be looking to see how the variation in those results decreases as we increase our `sample_size`

in order to reach an acceptable level of consistency.

To handle the dimension of time, we’ll run this exercise over multiple days (`n_days`

), and see what happens when the `p`

value – the proportion of panel exhibiting the trait – changes to a new value. At that stage we’ll sample the same number of people from the daily population, and measure the number of those expressing this trait with the new value of `p`

.

Using the (mean) average of the first sample (with original proportion `p`

), and the mean of the second sample (where proportion `p`

has changed), we can then take the difference between these means and repeat it many times to produce a distribution. Using that series, we can find the area of the resulting distribution and see that it is closer to the real change (a confidence interval), as shown in the graph below. In this chart example the difference between the proportions is 0.01.

Our goal will be to find the shaded area for our different parameters (sample sizes etc) as the shaded area is the probability of finding an estimate for `p`

that is better than the original value of `p`

. We will implement this in two ways – firstly using a simulation (Monte Carlo), and secondly using a more analytical approach.

**The implementation using Python**

The following code snippets describe the implementation of this experiment in Python, using the two different approaches (simulation versus analytical). As per our variable design above, let’s go ahead and define these in the code.

# Import Python modules import numpy as np import pandas as pd from scipy import stats # Assumed number of people in full population n_people = 50000000 # The below variables are lists in order to be able to run for multiple values # Number of people we sample each day sample_sizes = [200000] # Proportion of people with some trait (between 0 and 1), 0.1=10% p_originals = [0.1] # Number of samples obtained, can be viewed as the number of consecutive days we get data n_sample_days = [1, 5, 10, 30, 100] # Create new proportion relative to the original proportion # 1.1 gives a 10% increase p_new_proportions = [1.1, 1.01, 1.001]

Next, let’s create a function that will return the `p`

values to use in our tests according to the arguments that we’ll pass in during our experiment.

def sample(sample_size, sample_days, p): """ Return sample_days-sized list where each element of this list is the proportion of points that are true (with probability p) from a sample_size-sized sample Note: this samples from an infinite population """ samples = [] for _ in range(sample_days): sample = np.random.binomial(sample_size, p) samples.append(sample/sample_size) return samples

Finally, let’s run the end-to-end experiment using the Monte Carlo method to simulate the `p`

value outcomes for each interval.

# To collect the results results = [] # The results are based on random samples, we need to average over many runs to get a stable answer to the simulation repeats = 50000 for sample_size in sample_sizes: for p_orig in p_originals: for p_new_proportion in p_new_proportions: p_new = round(p_orig * p_new_proportion, 7) for sample_days in n_sample_days: print (f'Simulating: Sample size: {sample_size} p: {p_orig} New p:{p_new} Sample days: {sample_days}') mean_diffs = [] for _ in range(repeats): sample_p_orig = sample(sample_size, sample_days=sample_days, p=p_orig) sample_p_new = sample(sample_size, sample_days=sample_days, p=p_new) if p_new > p_orig: mean_diff = np.mean(sample_p_new) - np.mean(sample_p_orig) else: mean_diff = np.mean(sample_p_orig) - np.mean(sample_p_new) mean_diffs.append(mean_diff) zero_percentile = stats.percentileofscore(mean_diffs, 0) results.append({'Sample size':sample_size, 'Original expected count': p_orig*sample_size, 'New expected count': p_new*sample_size, 'p': p_orig, 'New p': p_new, '% change in p': 100./p_orig * p_new - 100, 'Sample days': sample_days, 'No change percentile': zero_percentile }) # Format results results_df = pd.DataFrame(results) results_df = results_df.set_index(['Sample size', 'Original expected count', 'New expected count', 'p','New p', '% change in p', 'Sample days']).sort_index() # Get the interval (assumed symmetry) that does not contain that the means are the same, i.e. The shaded area of the graph above results_df['Probability of estimate closer to new value'] = 100. - (results_df['No change percentile']*2).clip(upper=100)

And let’s display the results:

`results_df # display the results`

Sample size | Original expected count | New expected count | p | New p | % change in p | Sample days | No change percentile | Probability of estimate closer to new value |
---|---|---|---|---|---|---|---|---|

200000 | 20000 | 20020 | 0.1 | 0.1001 | 0.1 | 1 | 45.5 | 9 |

5 | 40.499 | 19.002 | ||||||

10 | 36.679 | 26.642 | ||||||

30 | 28.253 | 43.494 | ||||||

100 | 14.688 | 70.624 | ||||||

20200 | 0.1 | 0.101 | 1 | 1 | 14.643 | 70.714 | ||

5 | 0.922 | 98.156 | ||||||

10 | 0.046 | 99.908 | ||||||

30 | 0 | 100 | ||||||

100 | 0 | 100 | ||||||

22000 | 0.1 | 0.11 | 10 | 1 | 0 | 100 | ||

5 | 0 | 100 | ||||||

10 | 0 | 100 | ||||||

30 | 0 | 100 | ||||||

100 | 0 | 100 |

It’s also possible to arrive at a similar set of results to the above using a more analytical, statistics-based implementation. This both verifies the above simulation, and runs much faster.

# Assuming the same import and variable declarations as above results = [] for sample_size in sample_sizes: for p_orig in p_originals: for p_new_proportion in p_new_proportions: p_new = round(p_orig * p_new_proportion, 7) for sample_days in n_sample_days: # Some of the assumptions only hold as the number of points increases, but for simplicity assume true n1 = sample_size * sample_days binomial_mu1 = p_orig binomial_var1 = p_orig * (1-p_orig) / n1 binomial_sd1 = np.sqrt(binomial_var1) n2 = sample_size * sample_days binomial_mu2 = p_new binomial_var2 = p_new * (1-p_new)/ n2 binomial_sd2 = np.sqrt(binomial_var2) # The above are approximately normal, subtraction is also approximately normal # For simplicity, subtract (expected) biggest from smallest if p_new > p_orig: mu3 = binomial_mu2 - binomial_mu1 else: mu3 = binomial_mu1 - binomial_mu2 # Note that the variance is proportional to sqrt(sample_size * sample_days) as stated by Central Limit Theorem sd3 = np.sqrt(binomial_var1 + binomial_var2) zero_percentile = 100. * stats.norm.cdf(0, mu3, sd3) results.append({'Sample size': sample_size, 'Original expected count': p_orig*sample_size, 'New expected count': p_new*sample_size, 'p': p_orig, 'New p': p_new, '% change in p': 100./p_orig * p_new - 100, 'Sample days': sample_days, 'No change percentile': round(zero_percentile, 3) }) results_df = pd.DataFrame(results) results_df = results_df.set_index(['Sample size', 'Original expected count', 'New expected count', 'p','New p', '% change in p', 'Sample days']).sort_index() results_df['Probability of estimate closer to new value'] = 100. - (results_df['No change percentile']*2).clip(upper=100)

Now let’s display the results:

results_df.loc[200000] # Results when sample size is 200000

Original expected count | New expected count | p | New p | % change in p | Sample days | No change percentile | Probability of estimate closer to new value |
---|---|---|---|---|---|---|---|

2000 | 2002 | 0.01 | 0.01001 | 0.1 | 1 | 48.733 | 2.534 |

5 | 47.168 | 5.664 | |||||

10 | 45.998 | 8.004 | |||||

30 | 43.092 | 13.816 | |||||

100 | 37.534 | 24.932 | |||||

365 | 27.191 | 45.618 | |||||

2020 | 0.01 | 0.0101 | 1 | 1 | 37.561 | 24.878 | |

5 | 23.919 | 52.162 | |||||

10 | 15.804 | 68.392 | |||||

30 | 4.124 | 91.752 | |||||

100 | 0.076 | 99.848 | |||||

365 | 0 | 100 | |||||

2200 | 0.01 | 0.011 | 10 | 1 | 0.096 | 99.808 | |

5 | 0 | 100 | |||||

10 | 0 | 100 | |||||

30 | 0 | 100 | |||||

100 | 0 | 100 | |||||

365 | 0 | 100 | |||||

20000 | 20020 | 0.1 | 0.1001 | 0.1 | 1 | 45.803 | 8.394 |

5 | 40.685 | 18.63 | |||||

10 | 36.947 | 26.106 | |||||

30 | 28.189 | 43.622 | |||||

100 | 14.597 | 70.806 | |||||

365 | 2.204 | 95.592 | |||||

20200 | 0.1 | 0.101 | 1 | 1 | 14.645 | 70.71 | |

5 | 0.934 | 98.132 | |||||

10 | 0.044 | 99.912 | |||||

30 | 0 | 100 | |||||

100 | 0 | 100 | |||||

365 | 0 | 100 | |||||

22000 | 0.1 | 0.11 | 10 | 1 | 0 | 100 | |

5 | 0 | 100 | |||||

10 | 0 | 100 | |||||

30 | 0 | 100 | |||||

100 | 0 | 100 | |||||

365 | 0 | 100 |

**Interpreting the results**

If each day – for example – we had a `sample_size`

of 200,000 from an `n_people`

value of 50,000,000 and there was a sudden real difference in the expression of a trait (visiting Starbucks) from 2,000 to 2,020 people, after 30 days we’d be 91.75% sure of getting a better estimate than using the original 2,000 value. This might be difficult but still possible to use to infer trend. For a small change from 2,000 to 2,002 however, this certainty is down to just 14% – so the size of that change would not be enough to observe with confidence.

This dataset differs from the real, or natural dataset in the following ways:

- The
`sample_size`

is constant - The trait is always measurable for each individual with 100% accuracy
- Measurement accuracy does not vary over time
- The trait
`p`

value is the same over the time period (ie. no seasonality) - The trait proportion changes abruptly as a step change
- The daily panel sample is unrelated to the adjacent day group
- The total population size is fixed
- There are no geographical behavioural biases

**In conclusion**

We have demonstrated a simulation of perfect world data, and obtained some guidance as to at what point we can infer that a change is true. While this is a long way from the reality of the natural dataset, it offers a sense of where the boundaries lie and the approximate signal strength that could be found in an ideal scenario. In a subsequent article we will extend the simulation to examine that more natural behavioural data.

### Related Posts

### Sainsbury’s Share of Grocery Visits Decline Continues into Q1 2020

Analysis of Huq's GB groceries sector visit-share shows the downward trend for Sainsbury’s continues into 2020 and becomes more acute. Will…

### Chinatown Footfall Down 9.1% on 3-year Mean in January 2020

In this analysis we measure the visitor trend to London’s Chinatown and observe a significant decrease in footfall following the start of…

### How Attitudes Towards Ad-Targeting Vary by Time and Country

Only 10% globally have chosen to take steps to withhold their MOAD. In the context of all we hear about privacy, does this figure surprise…

### A Real-time Refresh on 2011 Census Stats using Huq’s Geo-data

Leeds University BA student Ellie Marfleet takes on the 2011 Census using Huq's geo-data to gain more timely and up-to-date insights on…

### Huq Debuts Geo-data Products on BattleFin Ensemble

We are pleased to announce the launch of Huq’s Geo-data Daily Rollup product on BattleFin’s Ensemble platform, a one-stop shop for sources…

### Managing Massive Tables In BigQuery With Wildcards

BigData engineer Alessandro Puccetti lifts the lid on strategies for managing large datasets in BigQuery, in this case looking closely at…

### Five Essential Qualities for Geo-data Used in Stocks Prediction

We asked Isambard Poulson, Huq CTO, what are some of the key qualities that customers look for when evaluating sources of geo-data for…

### 100x Performance Gains with BigQuery Geography Clustering

Google BigQuery has introduced table clustering using geography-type columns, making spatial queries not only lightning fast but also…

### Normalisation in Mobile Geo-data Time-series Analysis

Mobile geo-data sheds a lot of light on how consumers interact with the physical world, but there are new challenges to face when making…