Skip to content

How to Filter Specific URLs in SQL: Using regexp_contains

When managing or analyzing web analytics data, you often need to exclude test URLs or specific patterns from your dataset. SQL’s regexp_contains function combined with a LEFT JOIN provides an efficient way to handle this task. This guide demonstrates how to implement this approach in BigQuery with a practical example.


1. Why Filtering Specific URLs is Necessary

Suppose you are analyzing website traffic data, and your dataset contains test URLs like:

  • Traffic from the localhost environment (https://localhost/admin)
  • Traffic from the staging environment (https://staging.example.com)

These test data entries can skew your analysis of real user behavior. Removing them ensures cleaner and more reliable insights. Let’s explore how to achieve this in BigQuery.


2. What is regexp_contains?

The regexp_contains function checks whether a string matches a given regular expression. Its syntax is as follows:

regexp_contains(input_string, regex_pattern)
  • input_string: The string to check (e.g., a column value).
  • regex_pattern: The regular expression to match.

This function returns TRUE or FALSE. For example:

SELECT regexp_contains("https://localhost/admin", ".*localhost.*") AS is_match;

The result is TRUE because the string contains localhost.


3. Managing URL Patterns in a Table

To keep things organized, store the URL patterns you want to exclude in a CSV file or a table. For example, create a CSV file (traffic_patterns_to_exclude.csv) like this:

regex
.localhost.
./staging.

Upload this file to BigQuery as a table (traffic_patterns_to_exclude) for easy access and maintenance.


4. Excluding Specific URLs with LEFT JOIN

Using LEFT JOIN and regexp_contains, you can filter out specific URL patterns. Here’s an example query for filtering Google Analytics data:

Example Query

WITH test_traffic_removed AS (
    SELECT t.*
    FROM `project.dataset.ga_table` AS t
    LEFT JOIN `project.dataset.traffic_patterns_to_exclude` AS patterns
        ON regexp_contains(t.page_path, patterns.regex)
    WHERE patterns.regex IS NULL
)
SELECT *
FROM test_traffic_removed;

How It Works

  1. Fetch all data from ga_table.
  2. Use regexp_contains to join with traffic_patterns_to_exclude based on the page_path column.
  3. Apply the condition patterns.regex IS NULL to filter out rows that match any pattern.

5. Detailed Example: Before and After Filtering

(1) Table Before Filtering (ga_table)

page_path
https://localhost/admin
https://www.example.com/home
https://www.production.com/offer

(2) Table After Filtering (test_traffic_removed)

page_path
https://www.example.com/home
https://www.production.com/offer

As shown, rows with localhost or staging in their page_path are excluded.


6. Summary and Benefits

  • Flexibility: Use regular expressions to detect a variety of URL patterns.
  • Scalability: Storing patterns in a table makes it easy to manage and update.
  • Efficiency: Combining LEFT JOIN and regexp_contains allows fast filtering even with large datasets.

By following this approach, even beginners can efficiently clean their data and exclude irrelevant test traffic. Mastering regular expressions and structured filtering ensures more accurate and actionable analytics.

Tags: