Understand Yourself – Understand the Field: How to Learn Data Science Effectively?

Phạm Tiến , kaley kim , Thao Trinh , Lê Quang , Thái Khắc Anh Tuấn

Lead author • 4 co-authors

Published: 29/03/2026

Updated: 29/06/2026

Introduction

Most Data Science roadmaps start with “learn Python, Pandas, ML…”, which means jumping straight into solutions while ignoring the big picture. Learners follow along but do not understand where the data flows or where the model fits, leading to fragmented knowledge. When facing real-world problems, they easily lose direction and become discouraged because they do not understand what problem they are actually solving. This blog takes the opposite approach: starting from the big picture, providing clear direction, and then diving into each specific component.

Figure 1: Illustration of Data Science as the intersection of Computer Science, Machine Learning, and Statistical Mathematics in a Venn diagram.
Source: Adapted from CodeLearn

1. How are the roles connected?

In a company, the work and the data grow large and complex enough that no single person can handle everything, so collaboration across several roles is essential. Let’s walk through each Data Science role using one business problem that companies such as Spotify, Shopee, or any SaaS provider face: “Predicting customer churn”.

1.1 Data Engineer - The “Data Pipeline” Builder

The first issues appear immediately:

Data is scattered across multiple sources (app logs, CRM, marketing tools)
Data is dirty, inconsistent, missing, or incorrectly formatted ({02/01/2025, 2025-01-02}, NaN values)
Data updates and flows inefficiently
Data is not ready for business use

This is where Data Engineers come in. They will:

Consolidate data from multiple sources into a centralized system
Clean and standardize data
Build automated pipelines (e.g., Kafka), manage data flow, and upload to the cloud (e.g., AWS services)
Create “analytics-ready datasets”
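To make "clean and standardize" concrete, here is a minimal sketch that normalizes the two date formats mentioned above and labels missing values explicitly. The rows, the field names, and the assumption that slash-separated dates are day/month/year are all hypothetical; a real pipeline would confirm the format with the owner of each source.

```python
from datetime import datetime

# Hypothetical raw rows merged from two sources that disagree on date format.
RAW_ROWS = [
    {"user_id": "u1", "signup_date": "02/01/2025", "plan": "premium"},
    {"user_id": "u2", "signup_date": "2025-01-02", "plan": None},
    {"user_id": "u3", "signup_date": "", "plan": "free"},
]

# Assumption: slash-separated dates are day/month/year.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Return an ISO date string, or None when the value cannot be parsed."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except (ValueError, TypeError):
            continue
    return None


def clean_row(row):
    """Standardize one row: ISO dates and an explicit label for missing plans."""
    return {
        "user_id": row["user_id"],
        "signup_date": parse_date(row["signup_date"]),
        "plan": row["plan"] or "unknown",
    }


cleaned = [clean_row(r) for r in RAW_ROWS]
for row in cleaned:
    print(row)
```

Unparseable dates become None rather than a guessed value, so later steps can decide whether to drop or repair them.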

1.2 Data Analyst - The “Data Storyteller”

Once the data is clean and accessible, a second problem appears:
No one knows what it is saying.

What are the trends in revenue and traffic?

Which customer segments are churning?

What did the latest marketing campaign achieve?

These are business questions that need to be answered using data, and require someone who can read, analyze, and explain the results clearly to decision-makers. That is the role of the Data Analyst.

1.3 Data Scientist - The Future Predictor

However, analyzing the past has its limits. Knowing that 30% of customers churn every quarter is useful, but being able to predict who is about to churn and intervene in time is far more valuable.

This is where we go further—moving into analysis, prediction, and automation using machine learning models. That is the role of the Data Scientist.

1.4 Machine Learning Engineer - Turning AI into Products

This is where the final problem appears: the model has been built and works well on a Data Scientist’s machine, but cannot be deployed into real systems. A good model that cannot be used is essentially useless.

This is where ML Engineers come in and take responsibility for:

Building APIs and integrating models with external tools and services (e.g., GPT, Whisper, Google Calendar)
Scaling systems (AWS)
Monitoring system performance and detecting failures

1.5 Overall Flow

Data Engineer → Data Analyst → Data Scientist → ML Engineer
      ↓              ↓               ↓                ↓
Clean Data  →   Insight    →     Model     → Real Product

2. The Right Roadmap: Foundations First, Specialization Later

Figure 2: Career paths in Data Science after building the foundation.
Source: AI-generated illustration.

2.1 Common Mistakes

People often choose a specific role too early and jump straight into Machine Learning without mastering foundational skills such as SQL or data processing. This leads to a misaligned learning process and gaps in core knowledge. In the long run, you may be able to build models but not truly understand the data, making it difficult to grow sustainably.

2.2 Start with Core Foundations

Basic Python (the common language across AI-related fields)
Basic SQL (a fundamental data language, suitable for beginners)
Git (track progress, experiment with models, and enable CI/CD for teamwork)
Data processing mindset (to apply the right steps based on your dataset and stage)

2.3 Overview of Each Role

After mastering the common foundation, the learning path in data begins to clearly branch out. Each role has a different focus, and choosing the right direction helps you avoid scattered learning.

Data Analyst (DA) focuses on advanced SQL, data visualization, and storytelling to turn raw numbers into meaningful insights.
Data Scientist (DS) goes deeper into statistics, machine learning, model evaluation, and feature engineering to solve predictive problems.
Data Engineer (DE) focuses on building data pipelines, working with Spark, Airflow, cloud systems, and data warehouses to ensure smooth data operations.
Machine Learning Engineer (MLE) focuses on deployment, Docker, APIs, MLOps, and monitoring to bring models into stable production environments.

Choosing the right path not only helps you learn more efficiently but also clearly defines the value you bring within a data system.

3. Choosing a Path: Which Role Fits You?

The next question is: which path should you choose? This is an important decision because it affects your entire learning journey and career direction from this point onward. Do not choose based on salary or because something is “trendy.” Choose based on your personality, genuine interests, and the long-term potential you can develop in that role (career growth, international exposure, market stability).

3.1 Key Questions

Answer the following four questions:

Question 1: Do you prefer working closer to the business side (reporting, presenting, explaining to stakeholders) or closer to technical systems (building pipelines, writing automated code)?
Question 2: Do you prefer quick results (dashboards, same-day reports) or are you willing to invest more time to build more complex products (models, systems)?
Question 3: Do you have a background in mathematics—probability, statistics, linear algebra? And do you enjoy it?
Question 4: Do you prefer working with the meaning of data (what the numbers represent) or the flow of data (where data comes from, where it goes, how fast or complete it is)?

3.2 Suggested Direction

Figure 3: Decision tree to help choose a suitable role.
Source: AI-generated illustration.

Not sure yet? That’s fine. Start with Data Analyst—it is the easiest entry point, has the most overlap with other roles, and will help you understand what you enjoy after 2–3 months of real practice.

4. Detailed Roadmap for Each Role

4.1 Data Analyst

4.1.1 What does a Data Analyst do daily?

A Data Analyst does not spend the entire day processing data or building dashboards. While data is central, they constantly communicate with other teams to understand business problems and real-world requirements. From there, they work with data and present results to stakeholders to support decision-making.

4.1.2 Learning roadmap (in order):

Stage 1: Excel

When starting with data analysis, Excel is a suitable entry point for most people. With its ability to process and visualize basic data through simple operations, Excel is widely used across many organizations. Key things to learn:

Important functions (SUM, MATCH, VLOOKUP, etc.)
Power Query
Pivot Tables

Figure 4: Example of a report created using Excel.
Source: AI VIET NAM.

Stage 2: Advanced SQL

In practice, data is rarely neatly organized. Basic SQL is only sufficient for the beginning. When handling more complex problems, you need more advanced SQL techniques such as:

Technique Group	Commands / Components
Window Functions (Analytic)	ROW_NUMBER, RANK, LAG, LEAD, OVER
Aggregation (Group Functions)	GROUP BY, ROLLUP, CUBE, GROUPING SETS
Subqueries & Table Expressions	Subquery, CTE (WITH), EXISTS
Joins (Multi-table Queries)	INNER JOIN, LEFT JOIN, CROSS JOIN, SELF JOIN
Data Transformation	PIVOT, UNPIVOT, CASE WHEN
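As an illustration of the first and third rows of the table, the sketch below combines a CTE with the LAG and RANK window functions. It runs the SQL through Python's built-in sqlite3 module so it is self-contained; the table name and values are made up, and it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (month TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?)",
    [("2025-01", 100), ("2025-02", 120), ("2025-03", 90)],
)

QUERY = """
WITH with_prev AS (
    SELECT
        month,
        revenue,
        -- LAG looks one row back in the ordered window
        LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM monthly_revenue
)
SELECT
    month,
    revenue,
    revenue - prev_revenue AS change,
    RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
FROM with_prev
ORDER BY month
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
conn.close()
```

The first month has no previous row, so its change is NULL (None in Python) rather than zero, which is usually the honest answer for a month-over-month report.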

Stage 3: BI Tools

“Reports are not for you.”

This is a key mindset every Data Analyst should have. A good report is not for the analyst to admire, but for others to understand and make decisions. Tools like Power BI and Tableau are designed for this purpose.

Criteria	Power BI	Tableau
Cost	Affordable, accessible	More expensive
Modeling	Strong (DAX, structured model)	More limited
Data Processing	Includes Power Query (light ETL)	Flexible, fast interaction
Visualization	Sufficient	Very powerful, highly customizable
Best Use	Internal BI, Microsoft ecosystem	Advanced dashboards, storytelling

Note: Focus on mastering one tool first (Power BI is more common in Vietnam). More importantly, learn to choose the right chart for the right question and present insights as a story rather than raw numbers.

Stage 4: Advanced Analytics

Applied Statistics

Descriptive statistics such as the mean, median, and standard deviation are the foundation for understanding data and detecting patterns. Building on them, hypothesis tests such as the t-test and chi-square test, A/B testing, and the correct interpretation of p-values move a Data Analyst from “reading data” to drawing data-driven conclusions and making decisions.

A/B Testing

A/B Testing is a method for optimizing products by comparing a variation (B) against the original version (A) through experimentation and statistical analysis.
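A minimal sketch of the statistical step in an A/B test, assuming the outcome is a yes/no conversion: a Pearson chi-square test on the 2x2 table of converted and non-converted users. The conversion counts are invented, and the function name is only illustrative. With one degree of freedom the p-value has a closed form, so no external library is needed.

```python
import math


def chi_square_2x2(conv_a, total_a, conv_b, total_b):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value). With 1 degree of freedom the p-value
    equals erfc(sqrt(statistic / 2)).
    """
    observed = [
        [conv_a, total_a - conv_a],
        [conv_b, total_b - conv_b],
    ]
    row_totals = [total_a, total_b]
    col_totals = [conv_a + conv_b, (total_a - conv_a) + (total_b - conv_b)]
    grand_total = total_a + total_b

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if conversion were independent of the variant.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up numbers: 100 of 1000 users convert on A, 130 of 1000 on B.
stat, p = chi_square_2x2(100, 1000, 130, 1000)
print(f"chi2={stat:.3f}, p={p:.4f}")
```

A p-value below the threshold chosen before the experiment (0.05 is a common convention) suggests the difference between A and B is unlikely to be due to chance alone.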

Stage 5: Business Analytics

A strong Data Analyst does not just build reports or dashboards, but can translate data into the business language of an organization. Business metrics and behavioral analysis techniques are the foundation for this.

Behavioral Analysis Techniques

In practice, Data Analysts often use techniques such as Cohort Analysis, Funnel Analysis, and Segmentation to understand users and support decision-making.

Figure 6: Heatmap illustrating Retention and Cumulative LTV across user cohorts over time.
Source: AI-generated illustration
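The retention half of a cohort table like the one in Figure 6 can be computed with very little code. The sketch below is one possible approach under simple assumptions: activity is logged as (user, month index) pairs, and a user's cohort is the first month in which they appear. The event data is hypothetical.

```python
from collections import defaultdict

# Hypothetical activity log: (user_id, month index in which the user was active).
EVENTS = [
    ("u1", 0), ("u1", 1), ("u1", 2),
    ("u2", 0), ("u2", 2),
    ("u3", 1), ("u3", 2),
    ("u4", 1),
]


def retention_table(events):
    """Return {cohort_month: {months_since_first_activity: retained_share}}."""
    # A user's cohort is the earliest month in which they were active.
    first_month = {}
    for user, month in events:
        first_month[user] = min(month, first_month.get(user, month))

    active = defaultdict(set)  # (cohort, offset) -> users active at that offset
    for user, month in events:
        cohort = first_month[user]
        active[(cohort, month - cohort)].add(user)

    table = defaultdict(dict)
    for (cohort, offset), users in sorted(active.items()):
        cohort_size = len(active[(cohort, 0)])
        table[cohort][offset] = len(users) / cohort_size
    return dict(table)


table = retention_table(EVENTS)
for cohort, row in table.items():
    print(cohort, row)
```

Each row of the result is one cohort and each key is the number of months since that cohort's first activity, which is the layout a retention heatmap expects.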

Business Metrics

Metrics: Measurable indicators of a company’s performance
KPI: The most important metrics for the company
OKR (Objectives and Key Results): Objectives paired with measurable key results, often KPIs, used to track progress

Figure 7: Pyramid diagram illustrating the relationship between Metrics, KPI, and OKR.
Source: Compiled by the author

4.1.3 Real-world Projects

Project 1: Analyze the Superstore dataset on Kaggle — identify 3 meaningful business insights, build a Power BI dashboard, and write a one-page report with actionable recommendations.
Project 2: Design and analyze an A/B test — simulate data in Python or use an existing dataset, perform a chi-square test, and write conclusions for a Product Manager.

4.2 Data Scientist

4.2.1 What does a Data Scientist do?

A Data Scientist typically works closely with Data Analysts and other Data Scientists. They take the insights produced by DAs, translate them into machine learning problems, build models, and report the results.

4.2.2 Learning roadmap (in order):

Stage 1: Probability and Statistics Foundations

Probability and statistics form the foundation of data science. Every machine learning algorithm and predictive model relies on them. Many people skip this step and jump directly into ML without understanding why models work.

Stage 2: Basic Machine Learning

Figure 8: Basic Machine Learning roadmap, starting from Linear Regression, Logistic Regression to Decision Trees and Random Forest.
Source: AI-generated illustration

For each algorithm, you need to understand three things: what problem it solves, how it works intuitively, and when to use it compared to other algorithms.

Stage 3: Model Evaluation and Improvement

A model that runs is not necessarily a good model. Therefore, Data Scientists use various techniques to evaluate and improve models, as even small mistakes can have significant real-world consequences.

Topic	Concept	Key Idea
Evaluation	Cross-validation	Evaluate model by splitting data into multiple folds
Model Behavior	Overfitting / Underfitting	Model memorizes or fails to learn
Metrics	Accuracy / Precision / Recall / F1 / AUC-ROC	Performance metrics depending on the problem
Optimization	Hyperparameter tuning (Grid, Random, Bayesian)	Optimize parameters to improve performance
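Two rows of the table can be sketched in plain Python: how k-fold cross-validation splits the data, and how precision, recall, and F1 are computed from predictions. This is a simplified illustration (contiguous folds, no shuffling, binary labels only), and the labels are made up; in practice a library such as scikit-learn would normally handle both.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def precision_recall_f1(y_true, y_pred):
    """Binary metrics with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


folds = list(k_fold_indices(n_samples=7, k=3))
for train, test in folds:
    print("train:", train, "test:", test)

# Made-up churn labels: 1 = churned, 0 = stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Every sample lands in exactly one test fold, so each one is used for evaluation once and for training k-1 times.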

Stage 4: Boosting and Feature Engineering

Boosting

Boosting is a technique that combines multiple models, where each subsequent model corrects errors from the previous one. Models like XGBoost and LightGBM often perform very well on tabular and time-series data.
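The idea that "each subsequent model corrects errors from the previous one" can be shown with a toy version of gradient boosting for squared error. This is a teaching sketch, not how XGBoost or LightGBM are implemented: the weak learner is a one-split decision stump on a single feature, and the data, function names, and hyperparameters are all invented for illustration.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals left by the previous rounds."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict_boosted(model, x):
    correction = sum(predict_stump(s, x) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * correction


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data: a step-like relationship that a single mean cannot capture.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

model = fit_boosted(xs, ys)
baseline_error = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_error = mse(ys, [predict_boosted(model, x) for x in xs])
print(baseline_error, boosted_error)
```

The training error of the boosted model falls below that of the constant baseline because every stump is fitted to what the earlier stumps still get wrong. Training error alone says nothing about overfitting, which is why the evaluation techniques from Stage 3 still apply.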

Feature Engineering

Feature engineering is the process of designing and selecting features that help machine learning models perform better.

Common techniques include:

Feature Transformation
Feature Creation
Feature Selection
Feature Extraction
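A short sketch of the first two techniques, feature transformation and feature creation, applied to the churn example, plus a one-hot encoding of a categorical column. The customer record, field names, and plan list are hypothetical; feature selection and extraction are not shown here.

```python
import math
from datetime import date

# Hypothetical customer record for the churn example.
customer = {
    "signup_date": date(2024, 11, 15),
    "last_active": date(2025, 1, 10),
    "total_spend": 1200.0,
    "n_orders": 8,
    "plan": "premium",
}
PLANS = ["free", "basic", "premium"]


def build_features(row, today):
    features = {}
    # Feature transformation: compress a long-tailed amount with log1p.
    features["log_total_spend"] = math.log1p(row["total_spend"])
    # Feature creation: combine raw columns into more informative signals.
    features["avg_order_value"] = row["total_spend"] / max(row["n_orders"], 1)
    features["days_since_active"] = (today - row["last_active"]).days
    features["tenure_days"] = (today - row["signup_date"]).days
    # Encode the categorical plan as one-hot indicator features.
    for plan in PLANS:
        features[f"plan_{plan}"] = 1 if row["plan"] == plan else 0
    return features


features = build_features(customer, today=date(2025, 1, 31))
for name, value in features.items():
    print(name, value)
```

Passing `today` as an argument instead of calling `date.today()` inside the function keeps the features reproducible, which matters when the same code has to run in both training and serving.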

Stage 5: Model Interpretability

Explainable AI

LIME and SHAP are methods used to answer the question: “Why did the model produce this result?”. They attribute a prediction to the input features that drove it, which helps teams check a model's reliability and fairness, both critical for real-world ML applications.

Additionally, learning causal inference is recommended to determine whether observed results are causal or coincidental.

4.2.3 Real-world Projects

Project 1: Build a house price prediction model with a full pipeline, compare at least 3 algorithms using cross-validation, use SHAP for interpretation, and write a README like a technical report.
Project 2: Participate in a Kaggle competition (winning is not required), analyze and understand the top 10% solutions after the competition, and write a comparison between their approach and yours.

4.3 Data Engineer

4.3.1 What does a Data Engineer do daily?

On a daily basis, Data Engineers design and build systems to collect, process, and store large volumes of data. They also work with DAs and DSs to understand what data those teams need and prepare it for them.

4.3.2 Learning roadmap (in order)

Stage 1: Advanced SQL and databases

For Data Engineers, SQL is not only for writing queries but also for building and optimizing databases. Learning advanced and system-level SQL techniques is mandatory for DEs.

Technique	Purpose	Explanation
Indexing, Query Optimization	Speed up queries	Optimize data structures and query statements.
Data Modeling (Star / Snowflake)	Make queries easier and faster	Organize data in a structured way.
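Transactions	Ensure consistency	Group multiple steps into a single unit that either fully succeeds or is fully rolled back.
ACID	Ensure data correctness	Atomicity, Consistency, Isolation, Durability: the guarantees that keep the database consistent and reliable.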
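The transaction row of the table is easiest to understand by watching a failed operation roll back. The sketch below uses Python's built-in sqlite3 module with a hypothetical accounts table; the CHECK constraint and the order of the two updates are chosen deliberately so that the second statement fails after the first has already run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # violates the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer, the credit to the receiver is undone when the debit is rejected, so no money appears out of nowhere. That all-or-nothing behavior is the "A" (atomicity) in ACID.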

Stage 2: Python for Data Engineering

Unlike Data Scientists, who use Python mainly for modeling, DEs use it to build data processing systems. To do that, DEs need to master the following:

Working with files and handling large files
Working with APIs – requests library (GET, POST, PUT, DELETE, etc.)
Logging for debugging pipelines
Config management (env variables, YAML)

Note: A DE's code should not only run, but also be well-structured, stable, easy to debug, and easy to operate.
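Three of the skills listed above (handling large files, logging, and config from environment variables) fit into one small sketch. The `BATCH_SIZE` variable name, the CSV columns, and the sample data are assumptions made for illustration; an in-memory file stands in for a large file on disk.

```python
import csv
import io
import logging
import os

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("ingest")

# Config comes from the environment, with a default for local runs.
# BATCH_SIZE is a hypothetical variable name used only for this sketch.
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "2"))


def read_in_batches(file_obj, batch_size):
    """Yield lists of rows so a large file never has to fit in memory."""
    batch = []
    for row in csv.DictReader(file_obj):
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# An in-memory file stands in for a large CSV on disk.
sample = io.StringIO("user_id,amount\nu1,10\nu2,oops\nu3,7\n")

total = 0
skipped = 0
for batch_number, batch in enumerate(read_in_batches(sample, BATCH_SIZE), start=1):
    for row in batch:
        try:
            total += int(row["amount"])
        except ValueError:
            # Log and skip bad rows instead of crashing the whole run.
            skipped += 1
            logger.warning("bad amount %r for user %s", row["amount"], row["user_id"])
    logger.info("processed batch %d (%d rows)", batch_number, len(batch))

print(total, skipped)
```

The log lines record which row was bad and which batch it was in, which is the information needed when a pipeline fails at 3 a.m. and someone has to find out why.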

Stage 3: Data pipeline and Orchestration

Figure 11: Data processing workflow from ingestion to storage in a Data Warehouse.
Source: Adapted from AI VIET NAM

A pipeline is a process that is consistent, repeatable, and reliable. Key knowledge in this stage includes:

ETL and ELT pipelines

Figure 12: Illustration of ETL and ELT processes.
Source: AI VIET NAM

Transform data using SQL in the warehouse (use dbt)
DAG (Directed Acyclic Graph), Scheduling, Idempotency, orchestration, alerting automation (use Apache Airflow)
How to write data quality checks
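The concepts in this list (DAG, idempotency, data quality checks) can be tried out without installing an orchestrator. The sketch below is not Airflow or dbt code: it uses the standard library's graphlib to order tasks and SQLite as a stand-in warehouse. The task names, table, and data are hypothetical.

```python
import sqlite3
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (a DAG).
DAG = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"quality_check"},
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT PRIMARY KEY, revenue INTEGER)")
state = {}


def extract():
    state["raw"] = [("2025-01-01", "100"), ("2025-01-02", "250")]


def transform():
    state["rows"] = [(day, int(revenue)) for day, revenue in state["raw"]]


def quality_check():
    rows = state["rows"]
    if not rows:
        raise ValueError("no rows extracted")
    if any(revenue < 0 for _, revenue in rows):
        raise ValueError("negative revenue")
    if len({day for day, _ in rows}) != len(rows):
        raise ValueError("duplicate days")


def load():
    # Idempotent write: re-running replaces rows for the same day
    # instead of inserting duplicates.
    with conn:
        conn.executemany("INSERT OR REPLACE INTO daily_sales VALUES (?, ?)", state["rows"])


TASKS = {"extract": extract, "transform": transform, "quality_check": quality_check, "load": load}


def run_pipeline():
    order = list(TopologicalSorter(DAG).static_order())
    for name in order:
        TASKS[name]()
    return order


order = run_pipeline()
run_pipeline()  # a second run must not change the result
row_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(order, row_count)
```

Running the pipeline twice leaves the same two rows in the table. That is what idempotency buys: a scheduler can safely retry a failed or interrupted run.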

Stage 4: Cloud and distributed computing

Figure 13: Illustration of some popular cloud platforms.
Source: AI VIET NAM

Companies increasingly move their data systems to the cloud. DEs are therefore expected not only to "build pipelines" but also to design and operate end-to-end data systems on cloud platforms such as AWS, GCP, or Azure.

Suggestion: Choose one platform (AWS, GCP, or Azure) and go deep on it. Core concepts include storage (S3/GCS), compute (EC2/Compute Engine), and the data warehouse (Redshift/BigQuery).

4.3.3 Real-world projects

Project 1: Build an end-to-end pipeline — get data from a public API (e.g., weather API or GitHub API), clean it, transform using dbt, store in PostgreSQL or BigQuery, schedule with Airflow to run daily, and include alerting when it fails.
Project 2: DataTalks.Club Data Engineering Zoomcamp has a very practical capstone project — recommended because of community support and review.

4.4 ML Engineer

4.4.1 What does an ML Engineer do daily?

On a daily basis, ML Engineers are responsible for bringing models into production:

Packaging models into APIs
Building CI/CD pipelines
Adjusting and optimizing data workflows

At the same time, they collaborate with Data Scientists to understand model requirements and with Data Engineers to ensure input data is correctly formatted.

4.4.2 Learning roadmap (in order)

Stage 1: Software engineering fundamentals for ML

MLEs need to write better code than Data Scientists — not just “working code” but code that is readable, maintainable, and testable. Specifically:

Skill	Purpose
OOP	Clean, reusable code
Testing	Prevent bugs when changing code
Virtual environment	Avoid library conflicts
Project structure	Easier to manage and scale
Logging	Easier debugging and monitoring
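As a small illustration of the OOP, testing, and logging rows, here is a hypothetical preprocessing class with two tests written in the style pytest would collect from the tests/ folder. The class name `RangeScaler` and its methods are invented for this sketch and do not refer to any library.

```python
import logging

logger = logging.getLogger(__name__)


class RangeScaler:
    """Scale numbers to [0, 1] using the range seen during fit()."""

    def __init__(self):
        self.minimum = None
        self.maximum = None

    def fit(self, values):
        if not values:
            raise ValueError("cannot fit on an empty list")
        self.minimum = min(values)
        self.maximum = max(values)
        logger.info("fitted range: %s to %s", self.minimum, self.maximum)
        return self

    def transform(self, values):
        if self.minimum is None:
            raise RuntimeError("call fit() before transform()")
        span = self.maximum - self.minimum
        if span == 0:
            # All training values were equal; avoid dividing by zero.
            return [0.0 for _ in values]
        return [(v - self.minimum) / span for v in values]


# Tests in the style pytest would collect from tests/test_scaler.py.
def test_scales_to_unit_range():
    scaler = RangeScaler().fit([10, 20, 30])
    assert scaler.transform([10, 20, 30]) == [0.0, 0.5, 1.0]


def test_transform_before_fit_fails():
    try:
        RangeScaler().transform([1])
    except RuntimeError:
        return
    raise AssertionError("expected RuntimeError")


test_scales_to_unit_range()
test_transform_before_fit_fails()
print("all tests passed")
```

The tests pin down both the normal path and a misuse path, so a later refactor that breaks either one is caught before it reaches production.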

Typical project structure:

project/
├── README.md
├── requirements.txt
├── src/
│   ├── ingestion/
│   ├── processing/
│   ├── models/
│   └── utils/
├── data/
│   ├── raw/
│   └── processed/
├── tests/
└── scripts/

In short, MLE work is closer to Software Engineering, so code quality is extremely important.

Stage 2: Machine learning

MLEs do not need deep theoretical knowledge like Data Scientists, but must understand:

What the model takes as input and produces as output
How the model behaves with real-world data (missing values, distribution shift, edge cases)

MLEs also need to recognize common deployment issues:

Issue	Description
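Training–serving skew	Features are computed or distributed differently in training and in production
Wrong feature version	Serving uses a different version of the features than the model was trained on
Data leakage	Information unavailable at prediction time leaks into the training data
High latency	Slow model inference
Serialize/deserialize	Errors when saving or loading the model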
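Two of these issues can be checked with very little code. The sketch below shows a simple heuristic for spotting a distribution shift between training and serving data, and a serialization round trip that confirms a saved model reloads unchanged. The threshold of two standard deviations, the feature values, and the dictionary standing in for a model are all assumptions; production systems typically use more robust drift tests.

```python
import pickle
import statistics


def drift_report(train_values, serving_values, max_shift=2.0):
    """Flag a feature whose serving mean moves more than max_shift training
    standard deviations away from the training mean (a simple heuristic)."""
    train_mean = statistics.mean(train_values)
    train_std = statistics.stdev(train_values)
    shift = abs(statistics.mean(serving_values) - train_mean) / train_std
    return {"shift_in_std": shift, "drifted": shift > max_shift}


# Made-up values for one feature, e.g. sessions per week.
train = [4, 5, 6, 5, 4, 6, 5, 5]
serving_ok = [5, 4, 6, 5]
serving_shifted = [11, 12, 10, 13]

report_ok = drift_report(train, serving_ok)
report_shifted = drift_report(train, serving_shifted)
print(report_ok, report_shifted)

# Serialization round trip: the reloaded object should equal the original.
model = {"weights": [0.4, -1.2], "bias": 0.1, "feature_names": ["sessions", "tickets"]}
restored = pickle.loads(pickle.dumps(model))
print(restored == model)
```

A check like this is cheap enough to run on every serving batch. Note that it assumes the training values are not all identical, since a standard deviation of zero would make the ratio undefined.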

Stage 3: MLOps

Figure 14: Pyramid showing that most work in ML projects lies in MLOps.
Source: AI-generated illustration

MLOps combines DevOps principles with machine learning workflows, bridging research models and real-world systems. As a rough rule of thumb, MLOps accounts for about 90% of a project's success and the model itself for only about 10%.

Figure 15: Overview of an end-to-end MLOps system.
Source: AI-generated illustration

Suggestion: MLOps is broad and requires significant time. Learn it step by step.

Stage 4: Cloud deployment

As data grows and models become more complex, companies often move to the cloud for compute, storage, and scalability. Cloud knowledge is therefore essential for MLEs.

Category	Knowledge	Purpose	Example
Compute	VM (CPU/GPU)	Train model, run inference	EC2, GCE
Storage	Object storage	Store datasets, models, logs	S3, GCS
Database	SQL / data warehouse	Store processed data, features	BigQuery, RDS
Networking	Public IP - Port	Connect services	Expose API via VM + open port
IAM (Security)	User roles	Basic permission control	AWS IAM
Monitoring	Logs & metrics	Track errors and performance	CloudWatch, Cloud Monitoring (formerly Stackdriver)

4.4.3 Real-world project

Project: Take a model built by a Data Scientist (e.g., house price prediction), wrap it with FastAPI, containerize it with Docker, deploy it on Render, and set up GitHub Actions for automatic deployment on push. The result is a public URL that anyone can call.
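Before adding a web framework, it helps to get the request-handling logic right on its own. The sketch below deliberately leaves FastAPI out and shows only the validation and prediction step that a route function would call. The linear price formula, feature names, and status codes are hypothetical stand-ins for a real trained model and API contract.

```python
import json

# Stand-in for a trained model: a hypothetical linear price formula.
FEATURES = ("area_m2", "bedrooms")
WEIGHTS = {"area_m2": 2000.0, "bedrooms": 15000.0}
BIAS = 50000.0


def predict_price(features):
    return BIAS + sum(WEIGHTS[name] * features[name] for name in FEATURES)


def handle_request(body):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, {"error": "body is not valid JSON"}
    if not isinstance(payload, dict):
        return 422, {"error": "body must be a JSON object"}
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return 422, {"error": f"missing fields: {missing}"}
    try:
        features = {name: float(payload[name]) for name in FEATURES}
    except (TypeError, ValueError):
        return 422, {"error": "all fields must be numeric"}
    return 200, {"predicted_price": predict_price(features)}


good = handle_request('{"area_m2": 80, "bedrooms": 2}')
incomplete = handle_request('{"area_m2": 80}')
broken = handle_request("not json")
print(good, incomplete, broken)
```

Keeping this logic in a plain function makes it easy to unit test, and the web layer then only has to pass the request body in and return the status code and response.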

5. Real-world observations

5.1 Blurring skill boundaries

The general trend is that roles are becoming more hybrid, and companies expect more from each individual. Continuous learning and a widening skill set are essential.

5.2 Soft skills are increasingly important

“Communication is the foundation of everything”

In the workplace, you will frequently interact with product, business, data, and engineering teams. Asking the right questions, clarifying requirements, and presenting results clearly are as important as building models.

5.3 System thinking

AI can now write code very well, so the advantage is no longer in coding speed or syntax knowledge, but in understanding the big picture: the problem, the data flow, and how the system operates. Instead of learning tools or models superficially, focus on system thinking — breaking down problems and designing clear pipelines.

Conclusion

Learning AI/Data is not a short-term race but a long-term accumulation process. Instead of chasing multiple trends, focusing on a clear direction, building strong fundamentals, and continuously going deeper will bring better long-term results. The difference lies not in how much you know, but in how deeply you understand and apply it.

References

Cherian, M. (2026, February 27). A/B testing guide. VWO. https://vwo.com/ab-testing/

roadmap.sh. (n.d.). AI Engineer roadmap: Step-by-step guide to becoming an AI engineer. https://roadmap.sh/ai-engineer

Nguyễn, T. H. M., Trần, T. K., Nguyễn, Q. H., Đinh, N. K., & Nguyễn, T. M. (2026). Overview of MLOps: Technologies, tools, and processes for effective ML deployment. https://aioconquer.aivietnam.edu.vn/posts/toan-canh-mlops-cong-nghe-cong-cu-va-quy-trinh-trien-khai-ml-hieu-qua

AI Vietnam. (2025). AIO 2025 – Module 3, 4 & 5 course materials.
