Skip to content

[BigQuery] Mastering SQL for JSON Data Handling: A Practical Guide

Introduction: The Significance of JSON in Data Analysis

In today’s digital age, JSON (JavaScript Object Notation) plays a pivotal role in data storage and transmission across various platforms, from web APIs to cloud services. This guide aims to explore the efficient handling of JSON data within BigQuery, focusing on the JSON_EXTRACT_SCALAR and UNNEST functions through practical examples.

Understanding JSON

JSON is a lightweight data-interchange format that is easy for humans to read and write, and simple for machines to parse and generate. It organizes data into key-value pairs, making it an ideal candidate for complex data structures often encountered in big data analytics.

Case Study: Analyzing Sales Performance with Promotion and Coupon Data

Let’s say our company runs various promotional campaigns, issuing unique coupons that provide customers with various discounts. These coupons are stored in BigQuery’s promotions table in JSON format. Below is a sample structure of the promotions table:

  • campaign_id: Unique identifier for the campaign
  • campaign_name: Name of the campaign
  • start_date: Start date of the campaign
  • end_date: End date of the campaign
  • coupon_data: JSON formatted string containing details such as the coupon’s ID, discount, validity, and code.

Example of coupon_data in JSON format:

[
  {"id": "C123", "discount": "20%", "valid_from": "2024-05-05", "valid_to": "2024-05-10", "code": "MOTHER20"},
  {"id": "C456", "discount": "15%", "valid_from": "2024-05-01", "valid_to": "2024-05-31", "code": "MAY15"}
]

This is a preview of the promotions table:

campaign_idcampaign_namestart_dateend_datecoupon_data
1MAY promotion2024-05-012024-05-31{ “id”: “C123”, “discount”: “20%”, “valid_from”: “2024-05-05”, “valid_to”: “2024-05-10”, “code”: “MOTHER20” }, { “id”: “C456”, “discount”: “15%”, “valid_from”: “2024-05-01”, “valid_to”: “2024-05-31”, “code”: “MAY15” }

Extracting Data with JSON_EXTRACT_SCALAR

The JSON_EXTRACT_SCALAR function is designed to retrieve scalar values (such as strings or numbers) from a JSON object based on a specified key. This function facilitates the direct querying of values within JSON fields embedded in SQL queries.

For instance, to extract the coupon ID from the coupon_data field stored in JSON format, you might attempt the following SQL query:

SELECT
  JSON_EXTRACT_SCALAR(coupon_data, '$.id') AS coupon_id
FROM
  promotions

However, this query results in an error:

No matching signature for function JSON_EXTRACT_SCALAR for argument types: ARRAY<STRING>, STRING. Supported signatures: JSON_EXTRACT_SCALAR(STRING, [STRING]); JSON_EXTRACT_SCALAR(JSON, [STRING]) at [5:3]

This error occurs because the coupon_data field is stored as an array of strings, not as individual JSON objects. The JSON_EXTRACT_SCALAR function cannot directly process array inputs.

Resolving Issues with UNNEST

The UNNEST function is crucial for dealing with arrays in BigQuery. It transforms each element of an array into a separate row, enabling individual processing:

SELECT
  JSON_EXTRACT_SCALAR(c, '$.id') AS coupon_id
FROM
  promotions,
  UNNEST(coupon_data) AS c

By using UNNEST, the coupon_data array is broken down into individual JSON strings, and JSON_EXTRACT_SCALAR is then applied to extract the ID from each string, effectively retrieving each coupon’s ID.

Practical Analysis: Calculating Usage and Discounts

Now, let’s use this data to calculate the number of times each coupon was used and the total discount offered per campaign:

SELECT
  campaign_id,
  campaign_name,
  COUNT(JSON_EXTRACT_SCALAR(c, '$.id')) AS total_coupons_used,
  SUM(CAST(JSON_EXTRACT_SCALAR(c, '$.discount') AS INT)) AS total_discount_offered
FROM
  promotions,
  UNNEST(coupon_data) AS c
GROUP BY
  campaign_id, campaign_name

This query demonstrates how combining UNNEST and JSON_EXTRACT_SCALAR can yield valuable insights into promotional effectiveness and guide more informed marketing strategies.

Troubleshooting Common Issues

When processing JSON data, it’s common to encounter path errors if the JSON path is not correctly specified. Accurately defining the JSON path is crucial for correct data extraction. Additionally, performance considerations are paramount when handling large datasets; queries should be optimized to minimize data extraction and maintain simplicity.

This guide offers a foundational understanding of handling JSON data in BigQuery, equipping you with the knowledge to perform sophisticated data analyses and optimize your SQL queries for better performance.