Todd Perry

London, August 09, 2019

Analysing Road Speeds in South Africa

Analysis of Road Speeds in South Africa Using BigQuery, Python and Kepler

For this task we wanted to estimate and then visualise the average speed of movement on every road in South Africa. To do this we combined the freely available OpenStreetMap dataset with our own Huq dataset, which has a pre-calculated speed dimension. To visualise our results we used Uber’s Kepler visualisation tool, which runs on top of Mapbox and can generate beautiful visualisations from geographic data.

Data Sources

As mentioned, the following two data sources were used:

Huq Industries Geo-behavioural Data
OpenStreetMap Polygons

Huq’s unique dataset describes both the geographic and behavioural activity of anonymised consumers. For this task we used a two-month extract of our South African data. OpenStreetMap (OSM) is a massive open-source mapping project; the dataset comprises geographical objects that represent roads, train lines, buildings, cities and so on. At Huq we maintain a copy of the OSM dataset in BigQuery.

Development Process

Google BigQuery was used to do the initial join of the Huq data and the OSM data. First both datasets were filtered for South Africa, and then the OSM dataset was further filtered so that only roads were present. The following query was used to combine the two datasets. It matches each Huq point to every road within 50m of it, which is equivalent to creating a 50m buffer around each road in the OSM dataset and checking whether the Huq point falls inside the resulting polygon.

select h.*, o.* from HUQ_ZA  as h
join OSM_ZA_ROADS  as o
on st_dwithin(h.impression_point, o.polygon, 50);
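
To make the matching rule concrete, here is a minimal Python sketch of the same test on a single road and two points, using Shapely. It is an illustration only and not the query we ran: the road, the points and the helper name within_distance are hypothetical, and the coordinates are assumed to be in a projected system measured in metres, whereas BigQuery works on geographic coordinates directly.

```python
from shapely.geometry import LineString, Point

# Hypothetical road segment; coordinates are assumed to be in metres.
road = LineString([(0, 0), (1000, 0)])


def within_distance(point, line, metres=50):
    """Return True if the point lies within `metres` of the line."""
    return line.distance(point) <= metres


near = Point(500, 30)  # 30 m from the road
far = Point(500, 80)   # 80 m from the road

print(within_distance(near, road))  # True
print(within_distance(far, road))   # False

# The buffer formulation gives the same answer for this point.
print(road.buffer(50).contains(near))  # True
```

A distance test avoids building a polygon for every road, which is one reason a within-distance predicate is usually preferred over an explicit buffer for joins like this.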

An average speed was then calculated for each road from the Huq data, and averages based on fewer than 10 data points were discarded because these speeds would be unreliable. This was as far as we were going to go with BigQuery for now, so we exported our results to GCS. Python 3 was used for the remainder of the analysis, with the following libraries:

  • jupyter
  • pandas
  • numpy
  • shapely
  • keplergl
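
The per-road averaging described above was done in BigQuery. For readers who prefer to follow along in Python, the same logic might look roughly like this in pandas. The column names road_id and speed and the sample rows are assumptions made for illustration; only the 10-point threshold comes from our analysis.

```python
import pandas as pd

# Hypothetical joined rows: one row per Huq point matched to a road.
joined = pd.DataFrame({
    "road_id": [1] * 12 + [2] * 3,
    "speed": [60.0] * 12 + [30.0] * 3,
})

MIN_POINTS = 10  # averages built from fewer points are treated as unreliable

# Average speed and number of contributing points for each road.
per_road = (
    joined.groupby("road_id")["speed"]
    .agg(avg_speed="mean", n_points="count")
    .reset_index()
)

# Keep only roads with enough data points.
reliable = per_road[per_road["n_points"] >= MIN_POINTS]
print(reliable)
```

In this sample, road 2 has only three points and is dropped, while road 1 is kept.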

Our first task was to load the data into a Python environment and get the geo data (at this point a stringified WKT LINESTRING) into something that Kepler could interpret. Pandas was used to load the data, and Shapely was used to convert the geometry into a GeoJSON-friendly format.

import pandas as pd
import shapely.wkt
import shapely.geometry
df = pd.read_csv('za_roads.csv')
df['polygon'] = df['polygon'].apply(lambda x: shapely.geometry.mapping(shapely.wkt.loads(x)))
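
To show what that conversion produces, here is a small standalone example on a single made-up road. The WKT string is hypothetical; the two Shapely calls are the same ones used in the snippet above.

```python
import shapely.geometry
import shapely.wkt

# Hypothetical two-point road, written as longitude latitude pairs.
wkt = "LINESTRING (28.0 -26.2, 28.1 -26.3)"

# Parse the WKT string, then convert the geometry to a GeoJSON-style dict.
geometry = shapely.geometry.mapping(shapely.wkt.loads(wkt))

print(geometry["type"])  # LineString
print(geometry["coordinates"])
```

The result is a plain dictionary with type and coordinates keys, which is the shape GeoJSON expects for a feature's geometry.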

Next, we wanted to remove any outliers from the data. We were primarily interested in speed, so we used numpy to calculate the mean and standard deviation of this attribute. We deemed any value more than 3 standard deviations from the mean an outlier, and deleted these rows from the dataset before rendering. Our dataframe was finally ready for visualisation in Kepler. This final snippet converts the dataframe into GeoJSON and loads it into Kepler:

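from keplergl import KeplerGl
# this converts the dataframe into GeoJSON, one Feature per road
gjson = {
  "type": "FeatureCollection",
  "features": [
      {
          'type': 'Feature',
          'properties': { 'speed': row[1], 'name': row[2]},
          'geometry': row[-1]
      }
      for row in df.values ]
}
_map = KeplerGl(height=600)
# load the GeoJSON into the map; the dataset name 'za_roads' is an arbitrary label
_map.add_data(data=gjson, name='za_roads')
_map  # display the map in the notebook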
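
The outlier filter applied before building the GeoJSON is not shown above. A minimal sketch of it, assuming the speed lives in a column called speed and using made-up values, could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical per-road speeds; the last value is an obvious outlier.
df = pd.DataFrame({"speed": [40.0] * 10 + [50.0] * 10 + [500.0]})

speeds = df["speed"].to_numpy()
mean = np.mean(speeds)
std = np.std(speeds)

# Keep only rows within 3 standard deviations of the mean.
mask = np.abs(speeds - mean) <= 3 * std
df_clean = df[mask]

print(len(df), len(df_clean))  # 21 20
```

Note that a single extreme value inflates the standard deviation itself, so on small samples this rule can be more forgiving than it looks.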


The result of this work was a great-looking Kepler map. Roads with high average speeds are shown in light purple or white, and roads with low average speeds are dark purple. It’s possible to see bright highways cutting through the darker suburban roads in Johannesburg and Cape Town. This suggests that the average speed travelled on these highways is higher than on the suburban roads, as you would expect. This analysis could be used to identify areas that require improved infrastructure in urban environments, or to identify places to put speed cameras.


This project showed that the Huq dataset can be joined to third-party data sources, such as OpenStreetMap, to aid analysis. Some cleaning did have to be carried out on the data, for example removing polygons in areas where the Huq data didn’t provide sufficient coverage. Despite this, the work still resulted in some pretty stunning and, more importantly, meaningful visualisations.