1. Introduction to the Loss Function

1.1. Definition

In Artificial Intelligence and Machine Learning, the loss function is a central concept used to measure the degree of discrepancy between a model’s prediction and the actual target value. Put simply, the loss function tells us how ‘wrong’ the model is: the closer the model’s prediction is to the true label, the smaller the loss value; conversely, the further the prediction deviates from reality, the larger the loss. Therefore, the loss function serves as an immediate measure of the model’s quality during the learning process.

In general, the loss function can be defined via the following mapping:

$$ \begin{matrix} \mathcal{L} : \mathcal{Y} \times \widehat{\mathcal{Y}} \rightarrow \mathbb{R}_{\geq 0} \\[1ex] (y, \hat{y}) \mapsto \mathcal{L}(y, \hat{y}) \end{matrix} $$
Where:
- $\mathcal{L}$: The notation for the loss function.
- $y \in \mathcal{Y}$: The true label (ground truth), and $\mathcal{Y}$ is the set of all true labels.
- $\hat{y} \in \widehat{\mathcal{Y}}$: The value predicted by the model, and $\widehat{\mathcal{Y}}$ is the set of all values the model can predict.

Put simply, a loss function is a function that maps pairs of actual values and their corresponding model predictions to a real number that is non-negative.

In fact, when training a model, the objective of the learning algorithm is to find a set of parameters $\theta$ such that the loss value over the entire dataset is minimised. If we consider the $N$ observations in the dataset, we are faced with a general optimisation problem, which is typically formulated as follows:
$$ \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(y_i, f_\theta(x_i)) $$

Here, $f_{\theta}(x_i)$ is the model’s prediction for the input $x_i$. The above expression also forms the basis of the concept of Empirical Risk Minimisation in Machine Learning, meaning that we enhance the model’s learning ability by minimising the average error across the entire observed dataset.
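As a toy illustration of Empirical Risk Minimisation, the following NumPy sketch fits a one-parameter linear model $f_\theta(x) = \theta x$ by gradient descent on the mean squared error; the data and learning rate are made up purely for the example:

```python
import numpy as np

# Toy 1-D dataset (made-up values): roughly y = 2x plus small noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + np.array([0.1, -0.1, 0.05, -0.05])

theta = 0.0   # the single model parameter
lr = 0.05     # learning rate

for _ in range(200):
    y_hat = theta * x                                 # model prediction f_theta(x_i)
    grad = (2.0 / len(x)) * np.sum((y_hat - y) * x)   # d/dtheta of the mean squared error
    theta -= lr * grad                                # gradient descent step

empirical_risk = np.mean((theta * x - y) ** 2)
print(theta, empirical_risk)  # theta approaches 2, the risk approaches the noise floor
```

The loop is exactly the optimisation problem above with $\mathcal{L}(y, \hat{y}) = (y - \hat{y})^2$ and $N = 4$.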

1.2. Significance

The significance of the loss function in AI/ML is of paramount importance, as it does not merely serve to measure error, but also guides the model’s learning process. During training, the model cannot distinguish between what constitutes a ‘good’ prediction and what constitutes a ‘bad’ one; at this stage, the loss function provides that distinction to the model.

By utilising the gradient of the loss function, optimisation algorithms (such as Gradient Descent or Adam) continuously update the weights, steering the model towards ever-smaller error. If the wrong loss function is chosen, the model may converge well mathematically yet prove useless in practice, because it was trained from the outset on an objective that does not match the core goal of the problem.

1.3. How to determine the appropriate loss function

Essentially, determining or selecting a loss function is not simply a matter of mechanically choosing a common formula tailored to the problem; rather, it involves determining what the model needs to learn, which types of error should be penalised ‘severely’, and which evaluation metric accurately reflects the objective of the problem. In other words, a suitable loss function must be constructed or selected in such a way that it accurately encodes the nature of the machine learning task at hand.

Typically, the process of defining a loss function can be approached by asking the following questions:

i. What is the output of the problem?

The loss function must first and foremost be fully compatible with the mathematical structure of the output space.

  • Regression: When the output consists of continuous values, we are concerned with numerical distances. Loss functions such as Mean Squared Error (MSE) or Mean Absolute Error (MAE) are the standard choices.

  • Classification: When the output consists of discrete labels, we are concerned with the probability distribution over the classes. The most prominent measure in this category is the Cross-Entropy function.

ii. Which type of error warrants the most severe penalty?

Not all errors are equally serious. In some problems, certain kinds of mistakes barely affect the outcome, whilst others carry severe consequences; for example, in biomedical applications, failing to detect a few small damaged pixels (False Negatives) is far more serious than incorrectly flagging a few irrelevant background pixels (False Positives).

iii. Is the loss function compatible with the evaluation metrics?

In certain tasks such as classification, the metrics used to evaluate the model almost never coincide with the loss function. For this reason, when selecting a loss function for any given task, we must ensure that there is a strict monotonic relationship between the loss function and the chosen metrics (i.e. as the loss decreases, the model’s performance according to the metric must improve).

For example, if the quality of a segmentation task is assessed using the Dice Score, directly using Dice Loss during training helps the model align most closely with the evaluation objective: as the loss decreases, the evaluation metric improves correspondingly.
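Dice Loss is not covered further in this post, but as a point of reference, a common soft (differentiable) relaxation of the Dice coefficient can be sketched as follows; the function and variable names are our own, and this is the binary case only:

```python
import numpy as np

def soft_dice_loss(y_true, y_pred, eps=1e-7):
    """Soft Dice loss for a binary mask.

    y_true: binary ground-truth mask; y_pred: predicted probabilities,
    both arrays of the same shape. Returns 1 - Dice coefficient.
    """
    intersection = np.sum(y_true * y_pred)
    dice = (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)
    return 1.0 - dice

# A perfect prediction gives a loss near 0; a disjoint one gives a loss near 1.
mask = np.array([[1.0, 1.0], [0.0, 0.0]])
print(soft_dice_loss(mask, mask))        # ~0.0
print(soft_dice_loss(mask, 1.0 - mask))  # ~1.0
```

Because the loss is $1 - \text{Dice}$, minimising it during training directly pushes the Dice Score upward.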

iv. Can the loss function be optimised stably?

In addition to aspects related to the problem at hand, the chosen loss function must also be mathematically stable. A good loss function typically requires a useful gradient, sufficiently smooth derivatives, or at least sufficient stability to allow optimisation algorithms to learn effectively. In practice, there are loss functions that are intuitively sound but difficult to optimise mathematically or computationally, particularly when the data is heavily imbalanced, or when the output has a complex structure.

For this reason, choosing a loss function always involves striking a balance between two requirements:
- Accurately reflecting the problem’s objective.
- Enabling efficient optimisation during training.

2. Common Loss Functions for Image Segmentation

2.1. Introduction to Image Segmentation

Image segmentation is the task of assigning labels to each pixel in an image, such that pixels with the same semantic meaning are grouped into the same region. Unlike standard image classification tasks, where the model only needs to predict a single label for the entire image, segmentation requires the model to produce a dense prediction map, meaning that for every pixel location, a decision must be made as to which class that pixel belongs to. It can be said that this is one of the foundational problems of modern computer vision, due to its high applicability and widespread use across many fields such as: biomedical science, autonomous vehicles, surveillance, satellite imagery, etc.


Figure - Two subcategories of image segmentation (Semantic vs. Instance Segmentation) – Anurag Arnab, Shuai Zheng et al. 2018 (source: IEEE)

Given an input image of size $H \times W$, in a multi-class segmentation problem with $C$ classes, the ground-truth values can be represented as:
$$ y \in \{0, 1, \dots, C - 1\}^{H \times W} $$
in which each pixel carries a class label. Accordingly, the model’s output is typically a probability tensor:
$$ \hat{y} \in [0, 1]^{H \times W \times C} $$
with the constraint:
$$ \sum_{c=0}^{C-1} \hat{y}_{ijc} = 1 \quad \forall (i, j) $$
In other words, for each pixel $(i, j)$, the model will produce a probability distribution across $C$ classes.
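These shapes and the per-pixel constraint can be checked with a small NumPy sketch; the spatial size and class count below are illustrative, not taken from any particular model:

```python
import numpy as np

H, W, C = 4, 4, 3                      # illustrative image size and class count
rng = np.random.default_rng(0)
logits = rng.normal(size=(H, W, C))    # raw per-pixel network outputs

# Softmax over the class axis, computed in a numerically stable way.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# Each pixel now carries a probability distribution over the C classes.
print(probs.shape)                           # (4, 4, 3)
print(np.allclose(probs.sum(axis=-1), 1.0))  # True: the constraint holds per pixel

# The hard segmentation map is obtained with an argmax over classes.
y_hard = probs.argmax(axis=-1)               # shape (4, 4), values in {0, 1, 2}
```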

However, the segmentation problem is not merely a matter of correctly classifying each pixel; in fact, a good segmentation result must also ensure that the predicted regions have a reasonable shape, closely follow the object, and accurately reproduce the boundaries of each object. This is why the choice of loss function in segmentation is far more important than in many other machine learning tasks. It can be said that the effectiveness of a segmentation model depends not only on the architecture of the network but also heavily on the objective function—that is, the loss function chosen for training. The importance of the loss function in the image segmentation problem is even more evident when challenges such as class imbalance, small objects, complex boundaries, or label noise frequently arise in real-world problems (particularly in biomedical applications).

2.2. Loss Functions for Image Segmentation

In Semantic Segmentation, the loss function can be understood as the link between the predicted mask and the ground-truth mask, as it indicates how far off the model is, whilst providing gradient signals to adjust parameters during training. However, not all loss functions ‘view’ the segmentation problem in the same way. Some loss functions focus on pixel-level accuracy; others are concerned with the degree of overlap between the entire predicted region and the ground truth region.

According to Azad et al., loss functions in image segmentation are classified into the following three common types:
- Pixel-level loss functions: calculated on a pixel-wise basis, i.e. measuring the independent difference between the prediction and the ground truth at each pixel.
- Region-level loss functions: focus on the overall match between the predicted mask and the ground-truth mask, emphasising the degree of overlap rather than individual pixels.
- Boundary-level loss functions: specifically address the accuracy of object boundaries.

In this blog post, we will focus solely on the following three common loss functions:
- Cross-Entropy Loss and Focal Loss as two representative loss functions of the pixel-level group,
- IoU Loss as a representative loss function of the region-level group.

2.2.1. Pixel-level Loss Function

According to Azad et al., the Pixel-level Loss group operates directly at the level of individual pixels. Their core objective is to ensure accurate classification for each pixel within the segmented regions.

Specifically, this class of loss functions calculates the discrepancy between the model’s predicted value and the ground-truth label entirely independently for each pixel. Essentially, Pixel-level Loss treats the image segmentation problem as a dense classification problem. Thanks to its strong focus on fine-grained pixel-wise accuracy, it excels in tasks requiring detailed object recognition (e.g., detecting tumours with very low coverage relative to the background in biomedical applications).

Key advantages:
- Direct and stable signal processing: Calculating the error directly at the pixel level provides the model with extremely clear gradient values. The model accurately identifies which pixels are misclassified to adjust weights in a timely manner, ensuring a smooth convergence process. Thanks to their ease of implementation and optimisation, they are commonly used as the standard baseline for most segmentation tasks.
- Flexible handling of class imbalance: Although calculations are performed independently on each pixel, this group offers highly precise priority-based intervention variants. A typical example is Weighted Cross-Entropy Loss, which allows weights to be allocated inversely to the class’s occurrence frequency, ensuring that minority classes (such as small tumours) receive higher weights so the model cannot ignore them. Conversely, Focal Loss acts as a filter, actively reducing the influence of easily predictable patterns and forcing the model to concentrate all its resources on learning the difficult patterns.

Limitations and solutions:
- Ignores overall geometric structure: As it essentially accumulates local errors, this class of loss functions fails to reflect the geometric quality or integrity of the entire object region. Optimisation based on the aggregate statistics of individual pixels may result in softer segmentation boundaries. Therefore, rather than being used in isolation, Pixel-level Loss is typically integrated into a combined approach (Combo Loss), blended with Region-level or Boundary-level loss to achieve a perfect balance between pixel-level accuracy and the quality of the entire object.

2.2.1.1. Cross-Entropy Loss

Cross-Entropy Loss is one of the most fundamental and widely used loss functions in classification tasks, including image segmentation. According to Azad et al., Cross-Entropy measures the difference between two probability distributions: the true label distribution and the model’s predicted probability distribution. In segmentation, after applying the softmax function, the model generates a pixel-wise probability map, meaning that for each pixel, the model predicts the probability that the pixel belongs to each class. The loss function is then calculated by taking the negative logarithm of the probability corresponding to the true class at each pixel. As the probability of the true class approaches $1$, the Cross-Entropy approaches $0$.

Let $t_n$ denote the one-hot vector representing the true label of the $n$th pixel, and $y_n$ the predicted probability vector at that pixel; the Cross-Entropy Loss formula is written as follows:
$$ \mathcal{L}_{CE}(y, t) = -\sum_{n=1}^{N} \log(t_n \cdot y_n) $$

Since $t_n$ is a one-hot vector, in practice only the probability of the correct class affects the loss. This means that the model will be heavily penalised if it assigns a low probability to the correct class, and will be ‘rewarded’ when it assigns a high probability to the correct class.

The intuition behind Cross-Entropy is quite clear: it forces the model to become more confident when it predicts correctly, whilst being heavily penalised when it is confident but predicts incorrectly. Therefore, Cross-Entropy is a very natural choice when we view the Image Segmentation problem as a label classification problem for each pixel. This is also why Cross-Entropy is often regarded as a strong, stable and highly generalisable baseline.

In cases of class-imbalanced data, we can use Weighted Cross-Entropy, in which each class is assigned a different weight to balance their influence on the overall loss. The formula for this variant is:
$$ \mathcal{L}_{WCE}(y, t, w) = -\sum_{n=1}^{N} t_n \cdot w \log(t_n \cdot y_n) $$

Here, $w$ is the class weight vector. If all weights are equal to 1, we obtain standard cross-entropy. The significance of this variant is that rarer classes can be assigned higher weights so that the model is not biased towards any dominant class in the dataset. In practical segmentation problems, particularly in medical or satellite imagery, this is a crucial technique for mitigating the impact of class imbalance.
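Both formulas above translate almost directly into NumPy on flattened one-hot labels; the function and variable names, and the toy values, are our own:

```python
import numpy as np

def cross_entropy(t, y, w=None, eps=1e-12):
    """Pixel-wise (weighted) cross-entropy, summed over pixels.

    t: one-hot ground truth, shape (N, C)
    y: predicted probabilities, shape (N, C)
    w: optional per-class weight vector, shape (C,)
    """
    p_true = np.sum(t * y, axis=1)           # t_n . y_n: probability of the true class
    loss = -np.log(p_true + eps)             # standard cross-entropy per pixel
    if w is not None:
        loss = loss * np.sum(t * w, axis=1)  # per-pixel weight = weight of its true class
    return loss.sum()

# Two pixels, three classes: the first predicted confidently, the second poorly.
t = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y = np.array([[0.9, 0.05, 0.05], [0.2, 0.3, 0.5]])
print(cross_entropy(t, y))                               # most loss comes from pixel 2
print(cross_entropy(t, y, w=np.array([1.0, 5.0, 1.0])))  # up-weights the rarer class 1
```

With all weights equal to 1, the weighted call reproduces the unweighted one, as noted above.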

In summary, Cross-Entropy has the advantages of simplicity, stability and good optimisation capabilities. However, as it still operates at the level of individual pixels, it does not directly optimise the degree of alignment between the two masks at the regional level. This is precisely the motivation behind the development of variants such as Focal Loss.

2.2.1.2. Focal Loss

Focal Loss was proposed as an extension of Cross-Entropy to better address data imbalance and the dominance of easy samples. In essence, Focal Loss is a modified version of Cross-Entropy, in which easy samples — that is, pixels already correctly classified with high probability — are given lower weights, whilst difficult samples or those currently misclassified are emphasised. Thanks to this mechanism, the model does not need to expend computational resources on pixels that are already easy to classify, but instead focuses more on areas that are genuinely difficult to segment.

The formula for Focal Loss is:
$$ \mathcal{L}_{Focal}(y, t, \gamma) = -\sum_{n=1}^{N} (1 - t_n \cdot y_n)^\gamma \log(t_n \cdot y_n) $$

In this context, $\gamma \ge 0$ is an adjustable hyperparameter, commonly referred to as the focusing parameter. The term $(1 - t_n \cdot y_n)^\gamma$ is precisely what causes the Focal Loss to behave differently from standard Cross-Entropy.


Figure - Focal Loss curves compared with Cross-Entropy Loss ($\gamma = 0$)

The intuition behind Focal Loss is as follows: if a pixel has been predicted by the model with a high degree of accuracy—that is, if the probability assigned to the correct class is high—then $(1 - t_n \cdot y_n)$ will be small, resulting in a correspondingly small adjustment factor. Consequently, the contribution of this pixel to the total loss is reduced. Conversely, if a pixel is predicted incorrectly or the model is uncertain about its prediction, the probability of the correct class will be lower, causing the adjustment factor to be larger and resulting in a heavier penalty for that pixel. This causes Focal Loss to tend to focus more strongly on difficult pixels, rare regions, or small objects that are hard to separate from the background.

An important property is that when $\gamma = 0$, Focal Loss reduces exactly to Cross-Entropy. For this reason, Focal Loss can be viewed as a generalised form of Cross-Entropy, in which the coefficient $\gamma$ determines the degree of priority given to difficult samples.
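A short NumPy sketch of the Focal Loss formula, including a check of the $\gamma = 0$ property on toy values (the function and variable names are our own):

```python
import numpy as np

def focal_loss(t, y, gamma, eps=1e-12):
    """Focal loss on one-hot labels t and probabilities y, both shaped (N, C)."""
    p_true = np.sum(t * y, axis=1)  # t_n . y_n: probability of the true class
    return np.sum(-((1.0 - p_true) ** gamma) * np.log(p_true + eps))

# One easy pixel (p = 0.95) and one harder pixel (p = 0.6).
t = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([[0.95, 0.05], [0.4, 0.6]])

ce = focal_loss(t, y, gamma=0.0)  # identical to plain cross-entropy
fl = focal_loss(t, y, gamma=2.0)
print(ce, fl)  # the focal value is smaller: easy pixels are down-weighted
```

With $\gamma = 2$, the easy pixel's contribution shrinks by a factor of $(1 - 0.95)^2 = 0.0025$, while the hard pixel keeps a much larger share of the total loss.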

In image segmentation, Focal Loss is particularly useful when the data exhibits severe class imbalance, or when there are too many easily classifiable background pixels, causing the model to quickly ‘settle’ for the easy regions whilst neglecting the more challenging ones. For datasets such as medical images, satellite imagery, or problems involving small objects, Focal Loss often significantly improves prediction quality in difficult regions and for minority classes. However, its effectiveness depends heavily on the choice of the $\gamma$ value, so this is a loss function that requires careful tuning during experimentation.

2.2.2. Region-level Loss Function

Unlike Pixel-level Loss, which focuses solely on evaluating local errors at individual pixels, Region-level Loss approaches the problem from a more macro perspective. According to research by Azad et al., this class of loss functions prioritises the accuracy of the entire object region (object completeness) by maximising the connection and overlap between the prediction mask and the ground-truth mask.

In other words, rather than counting how many pixels are correct or incorrect, they assess the quality and integrity of the entire segmented region as a whole.

This approach offers significant advantages for the image segmentation task:
- Accurately reflects the nature of real-world evaluation: In practice, segmentation tasks are often evaluated using overlap-based metrics. Using a Region-level Loss directly ensures that the training process aligns closely with the final evaluation objective.
- Naturally handles class imbalance: A pixel-level model is easily ‘tricked’ into correctly predicting a vast background region whilst misclassifying the main object region. Region-level Loss addresses this by treating a set of pixels as a unified block, forcing the model not to ignore minority classes or extremely small objects.
- Sharper segmentation boundaries: By focusing on the agreement between the predicted and ground-truth regions, many loss functions in this group produce sharper and more detailed boundaries than pixel-level methods.

However, every strength comes with its own challenges:
- Difficulty in convergence and poor stability: The greatest limitation of the Region-level group is that the optimisation process becomes more difficult. Particularly in the early stages of training, if the model’s predicted region and the ground-truth region have minimal overlap, the vanishing gradient problem is highly likely to occur, causing the model to lose its learning direction.

To gain a better understanding of how this overlap is mathematically modelled, the most representative example of the Region-level group that we will examine in depth is the IoU Loss (or Jaccard Loss).

2.2.2.1. IoU Loss

IoU Loss is derived from the Intersection over Union (IoU) metric, a widely used measure also known as the Jaccard Index. In the task of Semantic Segmentation, IoU is one of the most important and stringent metrics because it directly assesses the actual degree of overlap between the prediction mask and the ground-truth mask.

Mathematically, the IoU metric is calculated as the ratio of the size of the intersection to the size of the union of these two regions. The basic formula is:
$$ IoU = \frac{|Y \cap T|}{|Y \cup T|} $$
where:
- $Y$ is the predicted mask
- $T$ is the ground-truth mask

In multi-class segmentation, IoU is typically calculated separately for each class and then averaged to produce the mean IoU ($mIoU$). As this is a widely used metric for evaluating segmentation quality, constructing a loss function based directly on IoU is a very natural approach.

However, the $mIoU$ metric alone cannot be used directly as a loss function to train a model, because it is computed on hard binary masks ($0$ or $1$) and is therefore non-differentiable with respect to the model’s outputs. To address this issue, IoU Loss (proposed by Rahman et al.) was introduced as a relaxed, differentiable version of $mIoU$.

Its formula is:
$$ \mathcal{L}_{IoU} = 1 - \frac{1}{C} \sum_{c=0}^{C-1} \frac{\sum_{n=1}^{N} t_n^c y_n^c}{\sum_{n=1}^{N} (t_n^c + y_n^c - t_n^c y_n^c)} $$
where:
- $C$ is the total number of classes the model needs to classify.
- $N$ is the total number of pixels in the image under consideration.
- $t_n^c$: Ground-truth label. This is a binary value: it is 1 if the $n$th pixel actually belongs to class $c$, and 0 otherwise.
- $y_n^c$: Prediction. This is the probability output by the model (typically ranging from 0 to 1) indicating the level of confidence that the $n$th pixel belongs to class $c$.

Rather than asking the micro-level question “Is this pixel correct or incorrect?”, IoU Loss approaches the problem with a more macro-level perspective: “Does the entire predicted region match the ground-truth region?”. This approach is entirely consistent with how humans assess the quality of a segmentation mask with the naked eye. The mechanism is very intuitive: if the predicted region and the ground-truth region overlap well, the IoU score will be high and the loss value will be low. Conversely, if the two regions are significantly misaligned, resulting in the intersection being too small compared to the union, the loss will increase to penalise the model.
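The relaxed formula above translates almost directly into NumPy: a soft IoU over one-hot targets and probability outputs, averaged across classes. The function and variable names are our own, and the shapes and values are illustrative:

```python
import numpy as np

def iou_loss(t, y, eps=1e-7):
    """Soft IoU (Jaccard) loss.

    t: one-hot ground truth, shape (N, C)
    y: predicted probabilities, shape (N, C)
    Returns 1 - mean soft IoU over the C classes, as in the formula above.
    """
    intersection = np.sum(t * y, axis=0)      # per-class sum of t_n^c * y_n^c
    union = np.sum(t + y - t * y, axis=0)     # per-class soft union
    iou_per_class = (intersection + eps) / (union + eps)
    return 1.0 - iou_per_class.mean()

# Four pixels, two classes: a perfect prediction yields a loss near 0,
# a completely misaligned prediction yields a loss near 1.
t = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
print(iou_loss(t, t))        # ~0.0
print(iou_loss(t, 1.0 - t))  # ~1.0
```

Because `y` may contain soft probabilities rather than hard 0/1 values, the expression is differentiable everywhere, which is what makes it usable for gradient-based training.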

3. Demonstration of Loss Functions

3.1. Introduction to the dataset: Semantic Segmentation of Aerial Imagery (Dubai Aerial Dataset)

The dataset selected for this demonstration is the Semantic Segmentation of Aerial Imagery, referred to as the Dubai Aerial Dataset. This is a semantic segmentation dataset based on satellite/aerial imagery of the city of Dubai, published by Humans in the Loop in collaboration with the Mohammed Bin Rashid Space Centre (MBRSC).

The dataset comprises 72 high-quality satellite images, divided into 8 large tiles. Each image is accompanied by a segmentation label (segmentation mask) meticulously assigned at the pixel level.

Semantic Classes

The most notable feature of the Dubai Aerial Dataset is its multi-class labelling. Each pixel in the image is classified into one of six different semantic classes, including:
- Building
- Land (unpaved area)
- Road
- Vegetation
- Water
- Unlabelled

Below is an illustration of the input and ground-truth:

Figure - Input image (left) and ground-truth image (right)

3.2. Comparison of prediction results across loss functions

In this demonstration, we will compare the prediction results of each loss function presented in this blog. The model used will be UNet, with N_EPOCHS = 100 and BATCH_SIZE = 16. The link to the demo notebook can be found here.

When directly comparing the segmentation results of the three models using Cross-Entropy Loss, Focal Loss and IoU Loss, it can be seen that each loss function produces a different type of prediction, accurately reflecting their respective optimisation principles. Of the three methods, IoU Loss yields the best visual results. The predicted masks from the model using IoU Loss have higher coverage, capture the object regions more completely, and the segmented regions appear smoother and more seamless compared to the other two loss functions. This is reasonable because IoU Loss is a region-level loss, meaning it directly optimises the degree of overlap between the predicted mask and the ground-truth. Precisely because it is designed to focus on the quality of regions at the overall level, IoU Loss tends to produce segmented regions that are fuller, more stable, and have a more natural shape.

Figure - Prediction results using IoU Loss with the input image (far left), ground-truth image (centre), and predicted image (far right).

For Focal Loss with the parameter $\gamma = 0.75$, the model demonstrates fairly good recognition performance on difficult pixels, whilst the boundaries between predicted regions are also rendered more clearly than with IoU Loss. In other words, Focal Loss appears to be more sensitive to details that are difficult to classify and tends to highlight object edges more effectively. However, a limitation is that although the boundaries are clearer, the model still exhibits mislabelling in some regions. This suggests that Focal Loss does indeed help the model focus on difficult pixels and reduce the influence of overly easy regions, but that focus does not necessarily translate entirely into better overall segmentation quality. As a result, the Focal-generated prediction mask looks quite good, particularly at the local detail level, but it still lacks semantic stability compared to IoU Loss. This is entirely understandable given the high label imbalance in each image, whilst the dataset’s sample size remains limited (only 72 images).

Figure - Prediction results using Focal Loss with the input image (far left), ground-truth image (centre), and predicted image (far right).

Meanwhile, Cross-Entropy Loss yields less convincing results than the other two loss functions. Models using Cross-Entropy can still detect the presence of difficult regions or complex pixels in an image, but they do not perform well in distinguishing boundaries between classes. Consequently, the predicted regions are often less sharp, prone to blurred edges and lack geometric consistency. This is also a fairly typical characteristic of Cross-Entropy in segmentation: as a pixel-level loss, it optimises each pixel independently; so whilst it can learn local classification information, it does not directly encourage the model to produce segmentation regions with good coverage or smooth, clear structures at the global level.

Figure - Prediction results using Cross-Entropy Loss with the input image (far left), ground-truth image (centre), and predicted image (far right).

From the above observations, it can be seen that both IoU Loss and Focal Loss produce significantly better prediction results than Cross-Entropy Loss, but in two different ways. IoU Loss excels at generating segmentation regions with better coverage, greater continuity and a more natural appearance, whilst Focal Loss excels at emphasising difficult pixels and clarifying boundaries. Overall, both loss functions produce visually good results and can be considered relatively comparable in terms of the aesthetic quality of the predictions; however, IoU Loss has the advantage in terms of coverage and the smoothness of the predicted regions. In contrast, Cross-Entropy Loss appears to be at a disadvantage because, although it can still identify difficult regions, it is not robust enough to produce masks with clear edges and good segmentation structure like the other two loss functions.

In summary, in this experiment, it can be concluded that IoU Loss and Focal Loss are both more suitable for the segmentation task than Cross-Entropy Loss. If the priority is a segmentation result with good coverage, a complete and smooth prediction region, then IoU Loss is the more prominent choice. If greater emphasis is placed on clarifying difficult pixels and object boundaries, Focal Loss also demonstrates a notable advantage. Meanwhile, Cross-Entropy Loss primarily serves as a reasonable baseline, but does not achieve the same level of visually high-quality segmentation as the other two loss functions.

References

[1] VinBigData, ‘Image Segmentation: Useful Open-Source Algorithms and Databases,’ vinbigdata.com, 2021. [Online]. Available: https://vinbigdata.com/camera-ai/phan-vung-anh-cac-thuat-toan-va-co-so-du-lieu-ma-nguon-mo-huu-ich.html. [Accessed: 15-Mar-2024].

[2] R. Azad et al., "Medical Image Segmentation Review: The Success of U-Net," arXiv preprint arXiv:2312.05391, 2023. [Online]. Available: https://arxiv.org/abs/2312.05391.

[3] S. S. Maicas, G. Carneiro, A. P. Bradley, J. C. Nascimento, and J. Belagiannis, "Deep learning for medical image segmentation: A review," Medical Image Analysis, vol. 70, p. 102035, May 2021. doi: 10.1016/j.media.2021.102035.

[4] A. Heydarian, "U-Net for Semantic Segmentation on Unbalanced Aerial Imagery," Towards Data Science, Oct. 3, 2021. [Online]. Available: https://towardsdatascience.com/u-net-for-semantic-segmentation-on-unbalanced-aerial-imagery-3474fa1d3e56/. [Accessed: 15-Mar-2024].