Managing Massive Tables In BigQuery With Wildcards

BigQuery supports the * wildcard to reference multiple tables or files. You can leverage this feature to load, extract, and query data across multiple sources, destinations, and tables. Let’s see what you can do with wildcards with some examples.

The first thing is definitely loading the data into BigQuery. If you deal with a very large amount of data you will have, most likely, tens of thousands of files coming from data pipelines that you want to load into BigQuery. Using wildcards, you can easily load data from different files into a single table.

bq load project_id:dataset_name.table_name gs://my_data/input/prefix/* ./schema.json

Also, this is not limited to only one prefix but you can specify multiple ones for example:

bq load project_id:dataset_name.table_name gs://my_data/input/prefix_1/* gs://my_data/input/prefix_5/* gs://my_data/input/prefix_25/* ./schema.json

The command above will load all the files matching all the prefixes into the specified table.

Wildcards can be used in the other direction too. Namely, they can be used to export data from BigQuery to GCS. This is very useful especially because BigQuery limits exports to a single file only to tables smaller than 1GB.

bq extract project_id:dataset_name.table_name gs://my_data/extract/prefix/file_prefix_*.json

The previous command will result in multiple files exported into the my_data bucket within the prefix extract/prefix/ and all file names will be:

file_prefix_000000000000.json
file_prefix_000000000001.json
file_prefix_000000000002.json
...
file_prefix_000000003792.json
file_prefix_000000003793.json

The other very useful use of wildcards is evident in queries. In fact, you can reference multiple tables in a single query by using * to match all the tables in a dataset with the same prefix. For example, consider you have a collection of tables called events_US, events_GB, events_IT, … with all the ISO2 code of each country in our dataset. The following query will return the count per day per country of events of type ‘submit’ in our dataset.

select
    _table_suffix as country,
    day,
    count(*)
from events_*
where type = 'submit'
group by 1,2

You can also filter out matched tables using _table_suffix in the where clause. For example, if you are only interested in Germany, France, and Japan just run the following:

select
    _table_suffix as country,
    day,
    count(*)
from events_*
where type = 'submit'
and _table_suffix in ('DE', 'FR', 'JP')
group by 1,2

What I personally like the most of using wildcards is that it enables me to design better, simpler, and more generic analytics queries as well as ETL jobs. The aim of this post was to help you improve your code quality and your productivity.

Happy querying with BigQuery 🙂