Skip to content

SQL LAG Function Example: Analyzing Cancellation Rates in BigQuery

Introduction

In this article, we will use the LAG function in BigQuery to analyze commission data. BigQuery follows the SQL standard, so most SQL functions can be used similarly.

Example: Analyzing Cancellation Rates Using Order Data

The company processes various orders, and this order data is stored in BigQuery’s orders and order_status tables. As a data analyst, you need to analyze the cancellation rate of each order. In particular, we will focus on orders that are ‘cancelled’ after being ‘pending’.

Sample Table: Orders

The orders table contains basic information about each order.

order_idcustomer_idorder_datetotal_amount
110012024-01-01150.00
210022024-01-02200.00
310032024-01-03350.00
410042024-01-04120.00

Sample Table: Order Status

The order_status table contains the status change history of each order.

order_idstatusstatus_date
1pending
2024-01-01 10:00:00
1confirmed
2024-01-02 12:00:00
2pending2024-01-02 11:00:00
2cancelled2024-01-05 14:00:00
3cancelled2024-01-04 15:00:00
4pending2024-01-04 09:00:00
4cancelled2024-01-05 14:00:00

Introduction to the LAG Function

The LAG function is a window function that allows you to access data from a previous row in the same result set. It is particularly useful for tracking changes in time-series data or status changes. The basic syntax of the LAG function is as follows:

LAG(column_name, offset, default_value) OVER (PARTITION BY partition_column ORDER BY order_column)
  • column_name: The column to retrieve the previous value from.
  • offset: The number of rows back to look (default is 1).
  • default_value: The value to return if there is no previous row (optional).
  • PARTITION BY: The column to partition the data by.
  • ORDER BY: The column to order the data by.

Introduction to the ROW_NUMBER Function

The ROW_NUMBER function is a window function that assigns a unique number to each row within a partition of the result set. It is very useful for ordering rows and assigning row numbers based on specific criteria. The basic syntax of the ROW_NUMBER function is as follows:

ROW_NUMBER() OVER (PARTITION BY partition_column ORDER BY order_column)
  • PARTITION BY: The column to partition the data by.
  • ORDER BY: The column to order the data by.

Using the LAG Function to Analyze Cancellation Rates

Now, let’s look at how to use the LAG function to track status changes for each order and calculate the cancellation rate.

Tracking the First Cancellation Status

First, we use the LAG and ROW_NUMBER functions to find the first cancellation status for each order after a pending status.

WITH first_cancelled AS (
    SELECT
        order_id,
        status,
        status_date,
        LAG(status) OVER (PARTITION BY order_id ORDER BY status_date) AS prev_status,
        ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY status_date) AS row_num
    FROM
        `project.dataset.order_status`
)
SELECT
    order_id,
    status,
    status_date,
    prev_status,
    row_num
FROM
    first_cancelled
WHERE
    status = 'cancelled' AND prev_status = 'pending' AND row_num = 2;

This query identifies records where the status changed from pending to cancelled. The LAG function tracks the previous status, and the ROW_NUMBER function assigns a sequence number to each status change within an order.

Sample Output Table from first_cancelled CTE

order_idstatusstatus_dateprev_statusrow_num
2pending
2024-01-01 10:00:00
pending2
4confirmed
2024-01-02 12:00:00
pending2

This table shows records where the order status changed from pending to cancelled. Each row includes the order ID, current status, status change date, previous status, and row number.

Calculating the Cancellation Rate

Next, we calculate the total amount and the cancelled amount for each order to determine the cancellation rate.

WITH first_cancelled AS (
    SELECT
        order_id,
        status,
        status_date,
        LAG(status) OVER (PARTITION BY order_id ORDER BY status_date) AS prev_status,
        ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY status_date) AS row_num
    FROM
        `project.dataset.order_status`
),
valid_cancelled AS (
    SELECT
        order_id,
        status,
        status_date
    FROM
        first_cancelled
    WHERE
        status = 'cancelled' AND prev_status = 'pending' AND row_num = 2
),
total_order_amount AS (
    SELECT
        order_id,
        SUM(total_amount) AS total_amount
    FROM
        `project.dataset.orders`
    GROUP BY
        order_id
),
cancelled_order_amount AS (
    SELECT
        o.order_id,
        o.total_amount
    FROM
        `project.dataset.orders` o
    JOIN
        valid_cancelled vc
    ON
        o.order_id = vc.order_id
)
SELECT
    t.order_id,
    t.total_amount AS total_order_amount,
    COALESCE(c.total_amount, 0) AS cancelled_order_amount,
    COALESCE(c.total_amount, 0) / t.total_amount AS cancellation_rate
FROM
    total_order_amount t
LEFT JOIN
    cancelled_order_amount c
ON
    t.order_id = c.order_id;

This query calculates the total amount and cancelled amount for each order and computes the cancellation rate. By tracking the first cancellation status after a pending status using the LAG function, we can accurately calculate the cancellation rate.

Common Issues and Solutions

  • Performance Issues: The LAG function uses window functions, which can lead to performance degradation when dealing with large datasets. Optimizing performance can be achieved through proper indexing and partitioning.
  • Handling NULL Values: When there is no previous row, the LAG function returns NULL by default. The default_value parameter can be used to set a default value in such cases.
LAG(status, 1, 'unknown') OVER (PARTITION BY order_id ORDER BY status_date) AS prev_status

Conclusion

In this article, we used the LAG and ROW_NUMBER functions to track status changes in order data and calculate cancellation rates. These functions are powerful tools in data analysis, allowing for more precise business insights. Understanding and using these functions can greatly enhance your ability to analyze data in SQL.

I hope this article helps you understand how to use the LAG and ROW_NUMBER functions effectively. If you have any questions or need further assistance, please feel free to reach out!