When managing or analyzing web analytics data, you often need to exclude test URLs or specific patterns from your dataset. SQL’s regexp_contains
function combined with a LEFT JOIN
provides an efficient way to handle this task. This guide demonstrates how to implement this approach in BigQuery with a practical example.
1. Why Filtering Specific URLs is Necessary
Suppose you are analyzing website traffic data, and your dataset contains test URLs like:
- Traffic from the
localhost
environment (https://localhost/admin
) - Traffic from the staging environment (
https://staging.example.com
)
These test data entries can skew your analysis of real user behavior. Removing them ensures cleaner and more reliable insights. Let’s explore how to achieve this in BigQuery.
2. What is regexp_contains
?
The regexp_contains
function checks whether a string matches a given regular expression. Its syntax is as follows:
regexp_contains(input_string, regex_pattern)
- input_string: The string to check (e.g., a column value).
- regex_pattern: The regular expression to match.
This function returns TRUE
or FALSE
. For example:
SELECT regexp_contains("https://localhost/admin", ".*localhost.*") AS is_match;
The result is TRUE
because the string contains localhost
.
3. Managing URL Patterns in a Table
To keep things organized, store the URL patterns you want to exclude in a CSV file or a table. For example, create a CSV file (traffic_patterns_to_exclude.csv
) like this:
regex |
---|
.localhost. |
./staging. |
Upload this file to BigQuery as a table (traffic_patterns_to_exclude
) for easy access and maintenance.
4. Excluding Specific URLs with LEFT JOIN
Using LEFT JOIN
and regexp_contains
, you can filter out specific URL patterns. Here’s an example query for filtering Google Analytics data:
Example Query
WITH test_traffic_removed AS ( SELECT t.* FROM `project.dataset.ga_table` AS t LEFT JOIN `project.dataset.traffic_patterns_to_exclude` AS patterns ON regexp_contains(t.page_path, patterns.regex) WHERE patterns.regex IS NULL ) SELECT * FROM test_traffic_removed;
How It Works
- Fetch all data from
ga_table
. - Use
regexp_contains
to join withtraffic_patterns_to_exclude
based on thepage_path
column. - Apply the condition
patterns.regex IS NULL
to filter out rows that match any pattern.
5. Detailed Example: Before and After Filtering
(1) Table Before Filtering (ga_table
)
(2) Table After Filtering (test_traffic_removed
)
As shown, rows with localhost
or staging
in their page_path
are excluded.
6. Summary and Benefits
- Flexibility: Use regular expressions to detect a variety of URL patterns.
- Scalability: Storing patterns in a table makes it easy to manage and update.
- Efficiency: Combining
LEFT JOIN
andregexp_contains
allows fast filtering even with large datasets.
By following this approach, even beginners can efficiently clean their data and exclude irrelevant test traffic. Mastering regular expressions and structured filtering ensures more accurate and actionable analytics.