Skip to content

SQL SELECT – Removing Duplicates and Excluding Specific Columns

In this article, we will explain two commonly used SQL query features: SELECT DISTINCT and EXCEPT. When working with data analysis, there are many cases where we need to remove duplicate data or exclude unnecessary columns. Let’s explore how to handle this efficiently.

■ SELECT Statement – DISTINCT: Removing Duplicates

When data contains duplicates, what we often want is to extract only unique values. This is where DISTINCT comes into play to remove duplicates.

SELECT DISTINCT(column_name) FROM database_name;

DISTINCT displays unique values from a specified column. For instance, let’s say we have the following table:

LinkplayerIDbirthCountrybirthMonth
1HGDKorea12
2KCSKorea2
3SebastianGermany3
SELECT DISTINCT(birthCountry) FROM People;

The above query retrieves the birthCountry from the People DB without duplicates, returning only “Korea” and “Germany”.

For example, imagine you are analyzing transaction data from various online stores. If you want to find the unique order numbers from a Christmas event, you can use the following query:

SELECT DISTINCT(order_id) FROM Transactions WHERE event = 'christmas_sale';

■ SELECT Statement – EXCEPT: Excluding Specific Columns

SELECT * is used to retrieve all columns from a table. However, there are times when you may want to exclude certain columns. In such cases, EXCEPT is a convenient option. The syntax is as follows:

SELECT * EXCEPT (column1, column2) FROM database_name;

For example, when analyzing transaction logs, you might want to exclude metadata columns that contain system-generated information used for tracking and managing data. Metadata refers to additional information added to the database to help manage or track the data. In other words, it’s not core business data (e.g., order numbers or customer details) but supplemental information used to manage or track the data.

This metadata is usually automatically generated or recorded by the system (e.g., DBT – Data Build Tool) and is used to track changes to the data. Some examples of metadata include:

  • dbt_scd_id: A unique identifier used to track data changes. This column can be used to create a new version each time the data changes.
  • dbt_updated_at: A timestamp that records when the data was last modified. This helps track when the data was changed.
  • dbt_valid_from, dbt_valid_to: Columns that indicate the period during which the data is valid. For instance, they can be used to track the validity of a transaction over time.

While this metadata is crucial for managing and auditing data, it is often unnecessary for analysis or reporting purposes, and you may want to exclude it. In such cases, EXCEPT can be used.

SELECT * EXCEPT (dbt_scd_id, dbt_updated_at) FROM Transactions;

Tags:

Leave a Reply

Your email address will not be published. Required fields are marked *