Normalisation in Mobile Geo-data Time-series Analysis

Mobile geo-data – that is, location data derived from mobile apps – offers many advantages for analysts who wish to understand how consumers interact with the real world around them. Business use-cases range from learning how footfall across retailers’ assets changes over time, to how retailers share customers between them and how (or why) those customers move around. Please note that some chart metrics in this post have been redacted due to commercial sensitivity.

Mobile versus other geo-data sources

Conventional footfall measurement methodologies – such as the use of sensors – compare favourably to mobile data in that the measurement environment can be managed, and data can be collected deterministically. However, their measurement coverage is fixed to a sample of places, which can limit their scope and application. It also takes time to design and deploy a new sensor measurement campaign, making the approach less suited to serving tactical or reactive market-intelligence needs.

Mobile geo-data comprises geo-spatial events – at their most basic level a timestamp, a coordinate pair and a device identifier. This allows analysts and users of the data to map the world from the point of view of the consumer – in effect travelling with them as they visit different places and do the different things they do. This longitudinal quality offers new analytical advantages over conventional footfall measurement sources, although there is often less control over panel characteristics and how measurements are taken.
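To make that structure concrete, here is a minimal illustrative sketch of a single event in Python – the field names are our own shorthand for the purposes of this post, not a documented schema:

```python
from dataclasses import dataclass

@dataclass
class GeoEvent:
    """One geo-spatial event: the atomic unit of mobile geo-data."""
    device_id: str   # pseudonymous device identifier
    timestamp: str   # event time, e.g. "2019-11-05T09:41:00Z"
    lat: float       # latitude of the coordinate pair (WGS84)
    lon: float       # longitude of the coordinate pair (WGS84)
```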

Two challenges posed by mobile geo-data

All this is to say that mobile geo-data sheds a lot of (new) light on how consumers interact with the physical world – but there are new challenges to face, particularly in respect of (a) measurement consistency and (b) changes in panel size and population churn over time. This month, UCL’s Terje Trasberg shared the results of her excellent work that compared Huq’s mobile geo-data to the LDC-CDRC’s SmartStreetSensor data – a deterministic but fixed sensor-led methodology – to reveal a ‘significant positive correlation’ between the two.

In this post we’ll review some key areas associated with the latter challenge (b): as app partnerships come and go, app users cycle and other (often unknown) factors come into play, mobile geo-datasets from vendors across the marketplace will show changes in panel size, churn and demographic composition over time. We’ll also share certain analytical remedies that have helped us to address them – though these certainly form only part of the many possible solutions.

Changes in panel population size

Consider the chart above. The Y axis is the number of visits to John Lewis outlets – the eponymous GB department store retailer – on a daily basis since our data begins in mid-2016. Should we take that linear rise over the timeframe to mean that visits to John Lewis stores have grown ~2,000% since the start of 2017? No, we should not. Instead, the upward slope reflects growth in both panel size and measurement effectiveness, which naturally carries with it an increase in recorded visits to John Lewis assets. It does not mean that more customers are visiting their stores in real terms.

So, how can we tell whether (in suitable terms) more or fewer consumers are visiting John Lewis? All other things remaining equal, it might be enough to divide the number of John Lewis visits by the total number of events recorded on the platform on the same days to produce a relative metric expressed in percentage terms – i.e. on Tuesday 5th November 2019, 0.21% of our GB panel visited a John Lewis store. This approach, however, can be liable to over-compensation from unrelated behaviours, geographic or otherwise, so in practice there are further variables to consider, as outlined below.
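As a minimal sketch of that first-pass calculation – the tables, column names and counts below are hypothetical, assuming we already hold daily totals of store visits and panel events:

```python
import pandas as pd

# Hypothetical daily counts: visits to the subject stores, and total
# events recorded across the whole panel on the same days.
store_visits = pd.DataFrame({
    "date": ["2019-11-04", "2019-11-05"],
    "visits": [1850, 2100],
})
panel_events = pd.DataFrame({
    "date": ["2019-11-04", "2019-11-05"],
    "total_events": [910_000, 1_000_000],
})

daily = store_visits.merge(panel_events, on="date")
# Express visits as a share of all panel activity on that day.
daily["visit_share_pct"] = 100 * daily["visits"] / daily["total_events"]
print(daily)
```

With these made-up figures, the 5th November row reproduces the 0.21% style of metric quoted above.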

Geographic normalisation

One reason not to simply divide the number of visits made to the stores in question by the total panel size on that day – or even just for that country – is relevance. If we were to incorporate activity from one part of the world into a normalisation factor (a denominator value) where the asset that we are measuring visits to has no footprint, we would be suggesting that what people do in [Japan] is somehow related to consumers visiting John Lewis in the UK. It is not, and adding that data into the mix can do more harm than good. The same would be true if we were to factor in activity from Scotland for London Underground visits – which only happen in and around London. Using a national panel in that case could introduce irrelevance into our output metrics.

Our experience reveals that a scalable, dynamic way to normalise geographically is to measure visits to the subject [John Lewis] locations, and separately build a buffer (a radius) around each point. That buffer could be fixed at – say – ‘n’ km around each point if the asset locations fell roughly in line with population. If not, it could equally take a dynamic value that encloses a fixed population size, using the asset’s location as the centroid. By measuring the number of events generated within each of these buffers, and using those values to inform the normalisation value for each case, we ensure that both sides of the fraction are related to one another – and therefore relevant.
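A minimal sketch of the fixed-radius variant follows – the haversine helper, column names and 5 km default are illustrative assumptions, and a production version would use a spatial index rather than a full scan:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def buffer_event_count(events, store_lat, store_lon, radius_km=5.0):
    """Count panel events falling inside a fixed buffer around one store.

    `events` is assumed to be a DataFrame with `lat` and `lon` columns;
    the returned count feeds the denominator for that store's fraction.
    """
    distances = haversine_km(events["lat"].to_numpy(),
                             events["lon"].to_numpy(),
                             store_lat, store_lon)
    return int((distances <= radius_km).sum())
```

The visit-rate fraction for each store and day then becomes its visit count divided by this buffer count, so numerator and denominator stay geographically related.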

And, not all apps are equal

Huq’s mobile geo-data is sourced through a network of mobile app publishers, which in turn corresponds to apps numbering in their thousands. They all do different things, have different users and are resourced in different ways by the mobile OS. We invest heavily in recruiting apps that present low user churn rates, reflect a fair and balanced cross-section of society and are distributed geographically according to population.

Nobody – not even the app owners themselves – is in total control of this, however, and each app’s audience characteristics will change over time. It is therefore worth treating each app as an additional dimension in our analysis, alongside time and geography as discussed. This can be achieved by grouping our spatial fractions by time period and also by app identifier before combining the outputs for each group using simple averaging.

In effect this lends equal weight to each app and takes the composition of each app’s audience into equal consideration, reducing the effect of possible app bias on the outcome of our query. We might also filter out apps with low daily event counts in order to avoid overstating the significance of behaviours from smaller, less significant contributors – as sketched below.
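Putting the grouping and filtering steps together, here is a minimal sketch – the column names and the 100-event threshold are illustrative assumptions rather than fixed parameters:

```python
import pandas as pd

def normalised_daily_rate(df: pd.DataFrame,
                          min_daily_events: int = 100) -> pd.Series:
    """Equal-weight the per-app visit rates within each day.

    Assumes one row per (date, app_id) with columns `visits` (visits
    to the subject stores) and `buffer_events` (panel events within
    the relevant geographic buffers).
    """
    # Drop (date, app) cells with too few events so that small, noisy
    # contributors cannot distort the average.
    df = df[df["buffer_events"] >= min_daily_events].copy()
    df["rate"] = df["visits"] / df["buffer_events"]
    # A simple mean across apps gives each app equal weight per day.
    return df.groupby("date")["rate"].mean()
```

Because each app contributes one rate per day regardless of its size, a single very large app cannot dominate the daily average.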

Wrapping it all up

The chart below shows what happens to the original sparkline once we start to apply some of these normalisation techniques. As you can see, the trend has lost its linear climb over the timeframe, and we can clearly see seasonality in the output where peak visitation periods map tightly to annual shopping events such as Christmas, back-to-school and bank holidays.

Doubtless there are many opportunities to finesse and extend these approaches, and there are myriad other ways to address these challenges. This is by no means a recipe for everything we do for customers who wish to analyse our data longitudinally – or that they do for themselves – but it highlights some of the key areas to consider, and demonstrates that these challenges are soluble.

Of course, all of this relies on place-visits being accurately interpreted as visits to those places, and not false positives. The accuracy of place-visit measurement and high panel consistency are where Huq’s geo-data really leads the way, so do get in touch with us if you’d like to evaluate the data and see for yourself.