Introduction
In this article, we will use the LAG function in BigQuery to analyze commission data. BigQuery follows the SQL standard, so most SQL functions can be used similarly.
Example: Analyzing Cancellation Rates Using Order Data
The company processes various orders, and this order data is stored in BigQuery’s orders
and order_status
tables. As a data analyst, you need to analyze the cancellation rate of each order. In particular, we will focus on orders that are ‘cancelled’ after being ‘pending’.
Sample Table: Orders
The orders
table contains basic information about each order.
order_id | customer_id | order_date | total_amount |
---|---|---|---|
1 | 1001 | 2024-01-01 | 150.00 |
2 | 1002 | 2024-01-02 | 200.00 |
3 | 1003 | 2024-01-03 | 350.00 |
4 | 1004 | 2024-01-04 | 120.00 |
Sample Table: Order Status
The order_status
table contains the status change history of each order.
order_id | status | status_date |
---|---|---|
1 | pending | 2024-01-01 10:00:00 |
1 | confirmed | 2024-01-02 12:00:00 |
2 | pending | 2024-01-02 11:00:00 |
2 | cancelled | 2024-01-05 14:00:00 |
3 | cancelled | 2024-01-04 15:00:00 |
4 | pending | 2024-01-04 09:00:00 |
4 | cancelled | 2024-01-05 14:00:00 |
Introduction to the LAG Function
The LAG
function is a window function that allows you to access data from a previous row in the same result set. It is particularly useful for tracking changes in time-series data or status changes. The basic syntax of the LAG
function is as follows:
LAG(column_name, offset, default_value) OVER (PARTITION BY partition_column ORDER BY order_column)
column_name
: The column to retrieve the previous value from.offset
: The number of rows back to look (default is 1).default_value
: The value to return if there is no previous row (optional).PARTITION BY
: The column to partition the data by.ORDER BY
: The column to order the data by.
Introduction to the ROW_NUMBER Function
The ROW_NUMBER
function is a window function that assigns a unique number to each row within a partition of the result set. It is very useful for ordering rows and assigning row numbers based on specific criteria. The basic syntax of the ROW_NUMBER
function is as follows:
ROW_NUMBER() OVER (PARTITION BY partition_column ORDER BY order_column)
PARTITION BY
: The column to partition the data by.ORDER BY
: The column to order the data by.
Using the LAG Function to Analyze Cancellation Rates
Now, let’s look at how to use the LAG
function to track status changes for each order and calculate the cancellation rate.
Tracking the First Cancellation Status
First, we use the LAG
and ROW_NUMBER
functions to find the first cancellation status for each order after a pending status.
WITH first_cancelled AS ( SELECT order_id, status, status_date, LAG(status) OVER (PARTITION BY order_id ORDER BY status_date) AS prev_status, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY status_date) AS row_num FROM `project.dataset.order_status` ) SELECT order_id, status, status_date, prev_status, row_num FROM first_cancelled WHERE status = 'cancelled' AND prev_status = 'pending' AND row_num = 2;
This query identifies records where the status changed from pending
to cancelled
. The LAG
function tracks the previous status, and the ROW_NUMBER
function assigns a sequence number to each status change within an order.
Sample Output Table from first_cancelled CTE
order_id | status | status_date | prev_status | row_num |
---|---|---|---|---|
2 | pending | 2024-01-01 10:00:00 | pending | 2 |
4 | confirmed | 2024-01-02 12:00:00 | pending | 2 |
This table shows records where the order status changed from pending
to cancelled
. Each row includes the order ID, current status, status change date, previous status, and row number.
Calculating the Cancellation Rate
Next, we calculate the total amount and the cancelled amount for each order to determine the cancellation rate.
WITH first_cancelled AS ( SELECT order_id, status, status_date, LAG(status) OVER (PARTITION BY order_id ORDER BY status_date) AS prev_status, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY status_date) AS row_num FROM `project.dataset.order_status` ), valid_cancelled AS ( SELECT order_id, status, status_date FROM first_cancelled WHERE status = 'cancelled' AND prev_status = 'pending' AND row_num = 2 ), total_order_amount AS ( SELECT order_id, SUM(total_amount) AS total_amount FROM `project.dataset.orders` GROUP BY order_id ), cancelled_order_amount AS ( SELECT o.order_id, o.total_amount FROM `project.dataset.orders` o JOIN valid_cancelled vc ON o.order_id = vc.order_id ) SELECT t.order_id, t.total_amount AS total_order_amount, COALESCE(c.total_amount, 0) AS cancelled_order_amount, COALESCE(c.total_amount, 0) / t.total_amount AS cancellation_rate FROM total_order_amount t LEFT JOIN cancelled_order_amount c ON t.order_id = c.order_id;
This query calculates the total amount and cancelled amount for each order and computes the cancellation rate. By tracking the first cancellation status after a pending status using the LAG
function, we can accurately calculate the cancellation rate.
Common Issues and Solutions
- Performance Issues: The
LAG
function uses window functions, which can lead to performance degradation when dealing with large datasets. Optimizing performance can be achieved through proper indexing and partitioning. - Handling NULL Values: When there is no previous row, the
LAG
function returns NULL by default. Thedefault_value
parameter can be used to set a default value in such cases.
LAG(status, 1, 'unknown') OVER (PARTITION BY order_id ORDER BY status_date) AS prev_status
Conclusion
In this article, we used the LAG
and ROW_NUMBER
functions to track status changes in order data and calculate cancellation rates. These functions are powerful tools in data analysis, allowing for more precise business insights. Understanding and using these functions can greatly enhance your ability to analyze data in SQL.
I hope this article helps you understand how to use the LAG
and ROW_NUMBER
functions effectively. If you have any questions or need further assistance, please feel free to reach out!