In this article, we will explain two commonly used SQL query features: SELECT DISTINCT
and EXCEPT
. When working with data analysis, there are many cases where we need to remove duplicate data or exclude unnecessary columns. Let’s explore how to handle this efficiently.
■ SELECT Statement – DISTINCT: Removing Duplicates
When data contains duplicates, what we often want is to extract only unique values. This is where DISTINCT
comes into play to remove duplicates.
SELECT DISTINCT(column_name) FROM database_name;
DISTINCT
displays unique values from a specified column. For instance, let’s say we have the following table:
Link | playerID | birthCountry | birthMonth |
---|---|---|---|
1 | HGD | Korea | 12 |
2 | KCS | Korea | 2 |
3 | Sebastian | Germany | 3 |
SELECT DISTINCT(birthCountry) FROM People;
The above query retrieves the birthCountry
from the People
DB without duplicates, returning only “Korea” and “Germany”.
For example, imagine you are analyzing transaction data from various online stores. If you want to find the unique order numbers from a Christmas event, you can use the following query:
SELECT DISTINCT(order_id) FROM Transactions WHERE event = 'christmas_sale';
■ SELECT Statement – EXCEPT: Excluding Specific Columns
SELECT *
is used to retrieve all columns from a table. However, there are times when you may want to exclude certain columns. In such cases, EXCEPT
is a convenient option. The syntax is as follows:
SELECT * EXCEPT (column1, column2) FROM database_name;
For example, when analyzing transaction logs, you might want to exclude metadata columns that contain system-generated information used for tracking and managing data. Metadata refers to additional information added to the database to help manage or track the data. In other words, it’s not core business data (e.g., order numbers or customer details) but supplemental information used to manage or track the data.
This metadata is usually automatically generated or recorded by the system (e.g., DBT – Data Build Tool) and is used to track changes to the data. Some examples of metadata include:
dbt_scd_id
: A unique identifier used to track data changes. This column can be used to create a new version each time the data changes.dbt_updated_at
: A timestamp that records when the data was last modified. This helps track when the data was changed.dbt_valid_from
,dbt_valid_to
: Columns that indicate the period during which the data is valid. For instance, they can be used to track the validity of a transaction over time.
While this metadata is crucial for managing and auditing data, it is often unnecessary for analysis or reporting purposes, and you may want to exclude it. In such cases, EXCEPT
can be used.
SELECT * EXCEPT (dbt_scd_id, dbt_updated_at) FROM Transactions;