100x Performance Gains with BigQuery Geography Clustering

The BigQuery team introduced support for the geography type some time ago, and has kept enriching its functions and improving its performance ever since. BigQuery lets users run complex geospatial analytics directly in the warehouse – harnessing all of its power, simplicity, and reliability.

In November 2019, Google announced a significant performance improvement for GIS-related queries: you can now cluster tables on a geography column. This is a game-changer for anyone working heavily with geo-data. When a table is clustered on a geography column, BigQuery can prune the data it needs to read to answer queries that filter on that column, making those queries both faster and cheaper.
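If you want to build a geography-clustered table of your own, the clustering is declared in ordinary CREATE TABLE DDL. Here's a minimal sketch – the my_dataset.measurements table and its point_gis GEOGRAPHY column are hypothetical placeholders, not objects from the public dataset used below:

/* create a copy of a table, clustered on its geography column */
CREATE TABLE my_dataset.measurements_clustered
CLUSTER BY point_gis  -- point_gis must be of type GEOGRAPHY
AS
SELECT *
FROM my_dataset.measurements;

Any query that filters point_gis with a spatial predicate such as ST_WITHIN or ST_DWITHIN can then take advantage of that clustering.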

Let’s evaluate the benefits of table clustering using a geography column. So that you can test this yourself, we’ll use one of the great public datasets curated by Felipe Hoffa, Google Developer Advocate for BigQuery: the weather_gsod dataset, and specifically its two tables named all and all_geoclustered. To test the gains from geography clustering, we need to filter the data on a geographic property.
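If you want to confirm which column a table is clustered on, BigQuery exposes that in INFORMATION_SCHEMA. A quick sketch – this assumes you are able to query the public dataset’s column metadata from your own project:

/* list the clustering columns of the geoclustered table */
SELECT table_name, column_name, clustering_ordinal_position
FROM `fh-bigquery.weather_gsod.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'all_geoclustered'
  AND clustering_ordinal_position IS NOT NULL;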

For the sake of example, let’s imagine that we want to find the minimum and maximum temperature values for weather stations located in London. We’ll use our handy polygon-tracing tool, GisMap, to draw a rough outline of the city and return the coordinates as WKT (Well-Known Text) to use directly in our query.

Putting it together, our query will look like this:

/* specify what to return */
SELECT
  name,                  -- name of weather station
  MIN(temp) AS min_temp, -- min temperature over all time
  MAX(temp) AS max_temp  -- max temperature over all time

/* the un-clustered noaa weather measurements table */
FROM `fh-bigquery.weather_gsod.all`

/* our geography condition, i.e. where the station locations fall within our polygon */
WHERE ST_WITHIN(point_gis, ST_GEOGFROMTEXT('POLYGON((-0.5191087546253357 51.69057940987663,-0.5850267233753357 51.530249275252345,-0.5685472311878357 51.33504597185959,-0.19226549290658568 51.2285324250183,0.20049573756216432 51.271511154557565,0.3185987649059143 51.55074848767514,0.2856397805309143 51.64628942375177,0.04943372584341432 51.72461916884517,-0.5191087546253357 51.69057940987663))'))

/* aggregate */
GROUP BY 1

The results – in degrees F – are as follows:

name                 min_temp   max_temp
ST JAMES PARK        26.3       83.2
KENLEY AIRFIELD      18         82.8
HEATHROW             18.6       83.4
CITY                 22.3       92.2
BLACKWALL            37         62.2
NORTHOLT             18.4       84.1
BIGGIN HILL          15.5       90.3
PURLEY OAKS          23         81.7
LEAVESDEN            22.3       89.1
LONDON WEA CENTER    20         85.2
KEW-IN-LONDON        23.3       82.4

This query reads 9.02 GB of data – a lot to scan for something as geographically narrow as London versus the whole world, so there’s a lot of wastage here.

Now let’s see how the same query performs on the clustered table:

/* specify what to return */
SELECT
  name,
  MIN(temp) AS min_temp,
  MAX(temp) AS max_temp

/* the clustered noaa weather measurements table */
FROM `fh-bigquery.weather_gsod.all_geoclustered`

/* our geography condition */
WHERE ST_WITHIN(point_gis, ST_GEOGFROMTEXT('POLYGON((-0.5191087546253357 51.69057940987663,-0.5850267233753357 51.530249275252345,-0.5685472311878357 51.33504597185959,-0.19226549290658568 51.2285324250183,0.20049573756216432 51.271511154557565,0.3185987649059143 51.55074848767514,0.2856397805309143 51.64628942375177,0.04943372584341432 51.72461916884517,-0.5191087546253357 51.69057940987663))'))

/* aggregate */
GROUP BY 1

The result is, of course, the same – but this time BigQuery reads just 98.97 MB of data, so switching to the clustered table made the same query almost 100x more efficient, and correspondingly cheaper to run. Want to try other locations and test the difference for yourself? Use our GisMap tool to quickly draw and export polygons in WKT or GeoJSON format.
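And if you’d rather verify the bytes scanned by your own runs than rely on the console’s estimate, job statistics are also available in INFORMATION_SCHEMA. A rough sketch, assuming your jobs run in the US multi-region:

/* bytes processed by your recent queries */
SELECT job_id, total_bytes_processed, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC;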