9 phút đọc 260 lượt xem 1 thích 0 bình luận

Basic Data Quering With SQL And Pandas

Nguyễn Xuân Tiến , Phạm Cát Tân , Tô Tấn Tài , Ly Duc Hoang , Ngọc Khôi Nguyễn Lê

Tác giả chính • 4 đồng tác giả

Xuất bản: 04/04/2026

Cập nhật: 06/07/2026

I. INTRODUCTION

In the world of Data, SQL (Structured Query Language) and Pandas are two "go-to" tools that operate on fundamentally different paradigms.

This article focuses on hands-on coding, helping you directly map familiar SQL clauses to their Pandas equivalents using the Superstore dataset.

Figure 1. Query with SQL and Pandas. (source: gemini)

II. DATA STRUCTURE

2.1. Getting Familiar with Basic SQL Syntax

Figure 2. Basic SQL syntax. (source: AIO2026-Pandas&SQL Data Retrieval-Slide 14)

The basic syntax of the SELECT statement, where [...] denotes optional clauses that the user may or may not specify:

SELECT select_list
[ FROM table_source ]
[ WHERE search_condition ]
[ GROUP BY group_by_expression ]
[ HAVING search_condition ]
[ ORDER BY order_expression [ ASC | DESC ] ]

SELECT: Lists the data fields (columns) or aggregate functions on those fields to be returned by the query. Note that these fields must exist in the tables specified in the FROM clause.
FROM: Lists the data tables and/or combines them with JOIN operations for the tables being queried.
WHERE: Defines the data filtering condition.
GROUP BY: Groups the data fields to be processed.
HAVING: Defines the filtering condition applied after grouping and performing group-level calculations, if any.
ORDER BY: Specifies which data fields to sort by, and whether in ascending (ASC) or descending (DESC) order.

Other important notes:

In SQL, the basic processing order of a query statement from left to right is as follows: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY.
Operators and functions may vary across different Relational Database Management Systems (RDBMS). For example, the || operator is used for string concatenation in Oracle, whereas + is used in MS SQL Server.
Use the special wildcard character * in the SELECT clause to retrieve all data fields.
Use IS NULL or IS NOT NULL to compare against null (empty/unknown) values.
Use the LIKE operator to match string values against a search pattern.
The AS clause can be used to rename a displayed field in the SELECT clause or to assign a shorter alias to a table in the FROM clause.
Use DISTINCT to eliminate duplicate values.
Use LIMIT to restrict the number of rows returned.

2.2. Getting Familiar with Basic Pandas Syntax

How Pandas implements the corresponding SQL clauses:

import pandas as pd

# Assuming table_source is a DataFrame (df)
result = (
    table_source[where_condition]                      # WHERE
    .groupby(group_by_expression)[select_list]         # GROUP BY & SELECT
    .sum()                                  # Aggregate function (SUM, MEAN, COUNT...)
    .loc[having_condition]                  # HAVING 
    .sort_values(by=order_expression, 
                 ascending=True)            # ORDER BY (True: ASC, False: DESC)
    .reset_index()                          # Restore index back into columns
)

df[where_condition] (Boolean Indexing): Acts as the WHERE clause, using boolean masks to filter rows satisfying a condition before any computation.
.groupby(group_by_expression): Equivalent to GROUP BY, but unlike SQL, Pandas produces a DataFrameGroupBy object that must be immediately followed by an aggregate function (such as .sum(), .mean(), .agg()) to produce a result.
[select_list]: In Pandas, column selection (SELECT) typically occurs simultaneously with the aggregate function call, or is filtered upfront to optimize memory usage.
.loc[having_condition]: Acts as HAVING. Since Pandas applies operations sequentially from top to bottom, .loc placed after the aggregate function will filter on the already-grouped result.
.sort_values(): Directly equivalent to ORDER BY.
.reset_index(): A mandatory step in Pandas to make the returned result resemble SQL's tabular structure. Because .groupby() by default converts grouped columns into an Index, .reset_index() pushes those indices back into regular data columns.

III. Hands-On Practice with the Superstore Dataset

3.1. Data Collection

In this hands-on section, we use the Superstore dataset:

You can download the dataset at: Google Drive, Kaggle.
Column descriptions for the dataset:

Column Name	Detailed Description
row_id	Unique identifier for each data row.
order_id	Unique order ID for each customer.
order_date	Date the product was ordered.
ship_date	Date the product was shipped.
ship_mode	Shipping method specified by the customer.
customer_id	Unique identifier to distinguish each customer.
customer_name	Name of the customer.
segment	Customer segment.
country	Country where the customer resides.
city	City where the customer resides.
state	State/Province where the customer resides.
postal_code	Customer's postal code.
region	Customer's geographic region.
product_id	Unique identifier for the product.
category	Primary category of the ordered product.
sub_category	Sub-category of the ordered product.
product_name	Product name.
sales	Product sales amount.
quantity	Product quantity.
discount	Discount rate applied.
profit	Profit generated or loss incurred.

3.2. Loading Data

To practice SQL queries on an online platform: Visit runsql
Switch the database engine to MySQL 8.4 and add your data.
View data details under section 2 Define Data.

Figure 3. Table creation and data import interface on RunSQL.

To practice Pandas queries on Google Colab:
Import the Pandas library, read the data, and store it in a DataFrame:

import pandas as pd
data = pd.read_csv('/content/superstore.csv')
df = pd.DataFrame(data)
df.shape   # Output: (9994, 21)

3.3. Exploring and Filtering Data

Query 1: Getting Familiar with the Data

Objective: Display the first 5 rows of data to inspect the table structure.
SQL:

SELECT *
FROM superstore
LIMIT 5

Pandas:

df.head()

Result:

Query 2: Filtering Data and Selecting Specific Columns

Objective: Preview 5 orders from the West region, showing only the Region, Product Name, and Sales of those orders.
SQL:

SELECT region, product_name, sales 
FROM superstore
WHERE region = 'West'
LIMIT 5

Pandas:

result = df.loc[df['region'] == 'West', ['region', 'product_name', 'sales']].head()

Result:

3.4 Grouping and Aggregation

Query 3: Calculating Overall KPIs

Objective: Calculate Total Revenue, Total Number of Orders (unique order_id), and Total Number of Unique Customers across the entire store.
SQL:

SELECT SUM(sales) AS revenue,
       COUNT(DISTINCT order_id) AS total_orders,
       COUNT(DISTINCT customer_id) AS unique_customers
FROM superstore;

Pandas:

pd.DataFrame({
    'revenue': [df['sales'].sum()],
    'total_orders': [df['order_id'].nunique()],
    'unique_customers': [df['customer_id'].nunique()]
})

Result:

Query 4: Total Revenue by Region

Objective: Identify which region generates the highest sales, sorted from highest to lowest.
SQL:

SELECT region, SUM(sales) AS revenue
FROM superstore
GROUP BY region
ORDER BY revenue DESC

Pandas:

(df.groupby('region')['sales']
   .sum()
   .reset_index(name='revenue')
   .sort_values(by='revenue', ascending=False)
)

Result:

Query 5: Profit by Product Category

Objective: Identify which category generates the most profit.
SQL:

SELECT category, SUM(profit) AS total_profit
FROM superstore
GROUP BY category
ORDER BY total_profit DESC

Pandas:

(df.groupby('category')['profit']
   .sum()
   .reset_index(name='total_profit')
   .sort_values(by='total_profit', ascending=False)
)

Result:

Query 6: Top 5 "Star" Products by Revenue

Objective: Find the 5 products (product_name) with the highest total sales.
SQL:

SELECT product_name, SUM(sales) AS revenue
FROM superstore
GROUP BY product_name
ORDER BY revenue DESC
LIMIT 5

Pandas:

(df.groupby('product_name')['sales']
   .sum()
   .reset_index(name='revenue')
   .sort_values(by='revenue', ascending=False)
   .head(5)
)

Result:

3.5. Multi-Dimensional Analysis and Time Series Processing

Query 7: Multi-Level Analysis (Category and Sub-Category)

Objective: Group by both category and sub_category to see which segment generates the highest revenue in detail.
SQL:

SELECT category, sub_category, SUM(sales) AS revenue
FROM superstore
GROUP BY category, sub_category
ORDER BY revenue DESC

Pandas:

(df.groupby(['category', 'sub_category'])['sales']
   .sum()
   .reset_index(name='revenue')
   .sort_values(by='revenue', ascending=False)
)

Result:

Query 8: Annual Revenue Trend Analysis

Objective: Extract the year from the order date column (order_date) and calculate total revenue per year to observe growth momentum.
SQL:

SELECT YEAR(order_date) AS order_year, SUM(sales) AS revenue
FROM superstore
GROUP BY order_year
ORDER BY order_year

Pandas:

# Convert order_date to datetime format first (if not already done)
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce') 

(df.groupby(df['order_date'].dt.year)['sales']
   .sum()
   .reset_index(name='revenue')
)

Result:

Query 9: Seasonality Analysis (Revenue by Month-Year)

Objective: Calculate total revenue broken down by individual month, but limited to data from the year 2018.
SQL: We use the DATE_FORMAT() function to reformat dates into a 'Year-Month' string. The %Y symbol represents the 4-digit year, and %m the 2-digit month.

SELECT DATE_FORMAT(order_date, '%Y-%m') AS order_month, SUM(sales) AS total_sales
FROM superstore
WHERE YEAR(order_date) = 2018
GROUP BY order_month
ORDER BY order_month

Pandas:

df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')

result = (
    df[df['order_date'].dt.year == 2018]                
    .assign(order_month=df['order_date'].dt.to_period('M'))
    .groupby('order_month')['sales']                    
    .sum()                                              
    .reset_index(name='revenue')                    
)
result

Result:

Query 10: Identifying "VIP Customers"

Objective: The Marketing team wants to send appreciation gifts to VIP customers. Find the list of customers (customer_name) whose total lifetime sales (sales) are greater than 14,400, sorted from highest to lowest.
SQL:

SELECT customer_name, SUM(sales) AS revenue
FROM superstore
GROUP BY customer_name
HAVING revenue > 14400
ORDER BY revenue DESC

Pandas:

result = (
    df.groupby('customer_name')['sales']
    .sum()
    .reset_index(name='revenue')
    .loc[lambda x: x['revenue'] > 14400] 
    .sort_values(by='revenue', ascending=False)
)

result

Result:

IV. CONCLUSION: SQL OR PANDAS? THE PRACTICAL DECISION

In Sections 2 and 3, you have likely seen that both SQL and Pandas can solve the same analytical problems. This raises the question: "In practice, which tool should we prioritize?"

The answer depends on the following 3 factors.

4.1. Hardware Constraints and Data Size (The Decisive Factor)

Pandas (RAM-bound): Pandas loads the entire dataset into the local memory (RAM) of your machine. For a data file of 100MB or 1GB, Pandas handles it very well. However, if you attempt to load a 2-Terabyte (2TB) database, your computer may freeze or throw an out-of-memory error.
SQL (Server-powered): SQL databases reside on powerful servers. They are engineered to scan, compute, and filter Terabytes of data (e.g., the entire call history of a large telecommunications company) without needing to download that data to a personal machine.

4.2. Data Storage Source and Format

Figure 4. Flat files vs. Databases

If the data already resides in a Database Management System, SQL is the best tool for interfacing with it.
If the data is provided as flat files (such as .csv, .excel), or pre-prepared datasets (such as those on Kaggle), Pandas offers an excellent ecosystem for reading and analyzing them immediately.

4.3. Purpose by Stage

Figure 5. The workflow from SQL to Pandas to Machine Learning models. (source: gemini)

SQL (Extraction Stage): The primary role is raw filtering and extraction. You use SQL to select only the data relevant to your needs from an enormous "sea" of data.
Pandas (Refinement Stage): The primary role is detailed processing. Once you have a smaller dataset in hand, Pandas excels at preprocessing, feature engineering, and preparing data for Machine Learning models.

REFERENCES

R. Elmasri and S. B. Navathe, Fundamentals of Database Systems, 7th ed. Hoboken, NJ, USA: Pearson, 2016.
The pandas development team, "pandas documentation (Version 1.4)," Feb. 2022. [Online]. Available: https://pandas.pydata.org/pandas-docs/version/1.4/pandas.pdf.

SOURCE CODE

Link: ⭐ Source Code

Tags: #data-analyst #pandas #english #python #sql

Chia sẻ: