Less-supervised learning with knowledge distillation for sperm morphology analysis


ABSTRACT

Sperm Morphology Analysis (SMA) is pivotal in diagnosing male infertility. However, manual analysis is subjective and time-intensive. Artificial intelligence presents automated alternatives, but hurdles like limited data and image quality constraints hinder its efficacy. These challenges impede Deep Learning (DL) models from grasping crucial sperm features. A solution enabling DL models to learn sample nuances, even with limited data, would be invaluable. This study proposes a Knowledge Distillation (KD) method to distinguish normal from abnormal sperm cells, leveraging the Modified Human Sperm Morphology Analysis dataset. Despite low-resolution, blurry images, our method yields relevant results. We exclusively utilize normal samples to train the model for anomaly detection, which is crucial in scenarios lacking abnormal data, a common issue in medical tasks. Our aim is to train an Anomaly Detection model using a dataset comprising unclear images and limited samples, without direct exposure to abnormal data. Our method achieves Receiver Operating Characteristic Area Under the Curve (ROC/AUC) scores of 70.4%, 87.6%, and 71.1% for head, vacuole, and acrosome, respectively, and matches traditional DL model performance with less than 70% of the data. This less-supervised approach shows promise in advancing SMA despite data scarcity. Furthermore, KD enables model adaptability to edge devices in fertility clinics, requiring less processing power.


1. Introduction

The high rate of male factor infertility has drawn considerable attention from scientists. Globally, around 70 million couples experience fertility-related issues, of which 30% involve male factors caused by environmental pollution, unhealthy diet, and obesity (Skoracka et al. 2020). One approach to this problem is In Vitro Fertilisation (IVF), one of several techniques available to help people with fertility problems. In this technique, an egg is removed from the woman’s ovaries and fertilised with sperm in a laboratory. The fertilised egg, called an embryo, is then placed into the woman’s womb to grow and develop. The quality and health of the sperm are crucial for the success of fertilisation, so an extra step, Intracytoplasmic Sperm Injection (ICSI), is required when a male cause of infertility exists.

In ICSI, sperms are inspected by an expert. A normal cell is then selected using a tiny needle called a micropipette and injected into the centre of the egg. Given the importance of the health of the chosen sperm, there has been much research on sperm selection for artificial insemination. Sperm Morphology Analysis (SMA) is the study of sperm characteristics. Manual SMA performed by an expert is subjective, hard to learn, time-consuming, and unreliable, which has drawn attention to automatic methods such as those of Ghasemian et al. (2015) and Abbasi et al. (2021).

In these methods, semen samples are gathered from patients, and images are taken using microscopes with high magnification. In automatic SMA, the computer labels each sperm as normal or abnormal, where normal sperms have a higher chance of fertility. Properties including, but not limited to, the shape, size, length, width, area, perimeter, and volume of the head, acrosome, vacuole, and tail must be considered to diagnose a sperm as normal or abnormal.

In recent years, neural networks have demonstrated remarkable performance in solving intricate problems across various domains (Ghasemian et al. 2015; Javadi and Mirroshandel 2019; Abbasi et al. 2021). However, the state-of-the-art models often comprise millions of parameters, rendering them computationally intensive and impractical for deployment on resource-constrained devices used in daily life. Knowledge Distillation (KD) refers to the idea of compressing a model by training a small network, called the student, instead of a massive teacher network. The student network is trained to learn the behaviours of the teacher network. Certain KD methods rely on training data that encompasses both normal and abnormal samples, enabling the student to learn the intricate decision boundaries that delineate the respective classes. Alternatively, some approaches have explored learning the latent features of only normal samples, such as autoencoder-based methods for anomaly detection. These methods aim to reconstruct normal instances accurately while yielding higher reconstruction errors for abnormal samples, thereby facilitating their detection without the explicit need for abnormal training data (Salehi et al. 2021).

The proposed method is based on Anomaly Detection (AD), a task for identifying the items that seem abnormal to the model. In this approach, the model can be trained on both normal and abnormal samples, or in some cases, only on one type of samples. During the training phase, the model learns the latent features of the data it is exposed to. During the test time, the model can distinguish abnormal samples from normal ones based on the learned patterns, regardless of whether it has seen abnormal samples during training or not.

In our proposed KD method, we aimed to train a student network with the help of a pre-trained teacher network using only normal dataset samples. Doing so resulted in the student having close outputs to the teacher when given normal samples, without requiring abnormal samples during training. In this study, we acknowledged the significant challenges associated with acquiring comprehensive data on sperm morphology. The collection of such images was an intricate and labour-intensive process, often resulting in limited dataset sizes. Consequently, we proposed our less-supervised method that could achieve optimal performance while working with a relatively small amount of data and without relying on abnormal samples for training. Our approach involved training a model exclusively on normal samples to distinguish between normal and abnormal instances during inference. To mitigate the effects of data scarcity and enhance model robustness, we employed data augmentation techniques and adversarial attacks to generate synthetic samples. These techniques introduced controlled perturbations and variations to the training data, enabling the model to learn more transferable representations and improving its resilience to diverse input conditions. Our aim is to identify abnormal samples, preventing the inadvertent use of incorrect samples in assisted reproductive methods. As far as we are aware, our work represents the pioneering application of such a minimally supervised approach to SMA.

We also considered the environments in which the method would be used. Heavy Deep Learning (DL) models have outstanding performance, but their use in clinical environments is difficult and expensive. For this reason, we considered simple models for this work so that their implementation and training in these environments can be done at the lowest cost. In our research, KD is used to train our model on the Modified Human Sperm Morphology Analysis (MHSMA) dataset (described in Section 3). In summary, the significant aspects of our proposed method are the following:

The proposed method is trained without using abnormal sperm images.
Less than 70% of the data used in other methods is used.
Our model is trained in a less-supervised manner.
The final model is simple and small compared to other methods, which makes it possible to use it even on cell phones.
Data augmentation techniques and adversarial attacks were employed to generate synthetic samples, enhancing the model’s robustness.
We used the AD method, which works better with the imbalanced samples in our dataset than classification methods.

The rest of this paper is organised as follows. Section 2 reviews the latest and essential research in SMA and AD. Section 3 describes the MHSMA dataset and provides some examples from it. Section 4 briefly explains our KD method, the model we used for training, and several enhancements to the model training process. Section 5 presents the results of our experiments and compares our method’s performance with other state-of-the-art studies. Finally, we conclude the paper and give a vision for the future of this field and applicable methods.

2. Related works

In this section, we first review the conventional and contemporary machine learning methods for SMA systems in Section 2.1. Then, in Section 2.2, we examine the research that has demonstrated superior performance using deep learning methods compared to traditional machine learning approaches. Finally, considering that the scope of this article is less supervision, we review the semi-supervised learning approaches for SMA systems in Section 2.3.

2.1. Machine learning-based model

Numerous published studies on SMA have attempted to automate the categorisation of sperm based on different characteristics, such as fertilising ability or morphology.

Older studies relied heavily on hand-crafted features. Li et al. (2014) used Principal Component Analysis (PCA) and Scale-invariant Feature Transform (SIFT) to extract features such as colour, texture, and shape from the sperm images, followed by a K-Nearest Neighbor (KNN) classifier. Shaker et al. (2017) used a set of previously proposed features and a new set of features called elliptic features to classify the images into four classes of shapes (Normal, Tapered, Pyriform, and Amorphous) using a Linear Discriminant Analysis (LDA) classifier.

Earlier studies adopted Machine Learning (ML) approaches. These studies specified two steps of the analysis: automatic segmentation of sperm shapes and classification of sperms. Tseng et al. (2013) presented a fused Support Vector Machine (SVM) using a one-dimensional feature extracted from the sperm’s contour with grey-level information. Shaker et al. (2017) employed the Dictionary Learning technique to construct a dictionary of sperm head shapes, which is used to classify sperm heads into four different classes. They also provided a dataset for SMA, namely Human Sperm Head Morphology (HuSHeM). That study aimed to determine the types of existing abnormalities for diagnosis or research purposes. According to the World Health Organization (WHO), there are 11 types of abnormalities related to the head of the sperm (WHO 2021).

In another study (Chang et al. 2017), a gold-standard dataset for morphological sperm analysis (SCIAN-MorphoSpermGS) was introduced. This dataset consists of sperm head images with expert-classification labels in one of the following classes: normal, tapered, pyriform, small, or amorphous. The authors compared four supervised learning methods and three shape-based descriptors, and showed that the combination of the Fourier descriptor and SVM results in the best accuracy.

Ghasemian et al. (2015) suggested an automatic algorithm primarily based on low-resolution image manipulation. Low-resolution image assessments are essential due to the limitation of using high-resolution microscopy in clinical settings. First, they performed a pre-processing step, including denoising and sperm region detection on images. Then they adopted hand-crafted heuristics to detect malformation in the head, vacuole, and acrosome of sperm images. Furthermore, a new dataset called Human Sperm Morphology Analysis Dataset (HSMA-DS) was introduced in this study (Mirroshandel and Ghasemian 2018).

2.2. Deep learning-based models

DL methods have recently received much attention due to their remarkable capability to extract information directly from images. These methods can learn directly from the raw image, so they do not require separate steps. Javadi and Mirroshandel (2019) proposed a convolutional deep neural network to detect abnormal samples. They achieved better results compared to previous classic ML techniques. A more modern method called Genetic Neural Architecture Search (GeNAS) (Miahi et al. 2022) automatically finds the best Convolutional Neural Network (CNN) architecture and outperforms the hand-designed CNN architecture previously used. Javadi and Mirroshandel (2019) and Miahi et al. (2022) used the MHSMA dataset, which is employed as a standard benchmark in SMA. This dataset consists of images labelled by experts as normal or abnormal and is introduced in detail in Section 3.

Modern segmentation methods have also found their way into SMA. Recently, Revollo et al. (2022) showed that their segmentation method preserves shape without losing key morphological characteristics and that a classification model based on morphological features yields better discrimination than traditional shape descriptors. Specialised CNNs such as U-Net (Ronneberger et al. 2015) were used for the segmentation of the human sperm head (Melendez et al. 2021). Lv et al. (2022) introduced an improved architecture by integrating dilated convolution and demonstrated that the Hybrid Dilated Convolution (HDC) module noticeably improves segmentation outcomes.

Ilhan et al. (2020) also achieved better results using MobileNet, a network well suited to smartphones, than with classification based on domain-specific features extracted by wavelet transform and descriptors (Ilhan et al. 2018a, 2018b). The full version of the Sperm Morphology Image Data Set (SMIDS) (Ilhan et al. 2022) was first introduced and used in Ilhan et al. (2020).

Transfer learning has been widely adopted in the SMA field. This method shows promising results when addressing small datasets in SMA. Riordon et al. (2019) demonstrated a DL method to classify sperm images from two public datasets (HuSHeM and SCIAN-MorphoSpermGS) into four different categories according to the WHO standards. Abbasi et al. (2021) showed that a multi-task transfer learning approach could outperform the previous DL methods. They used the Visual Geometry Group (VGG) network (Simonyan and Zisserman 2014) pre-trained on the ImageNet (Deng et al. 2009) dataset and fine-tuned it on the MHSMA dataset. They also presented Deep Multi-Task Learning (DMTL) for the first time in the field of SMA. This DMTL technique can classify all three parts of sperm (i.e. head, vacuole, and acrosome) in a single prediction.

Liu et al. (2021) adopted the feature extraction architecture of AlexNet (Krizhevsky et al. 2017) and its pre-trained parameters to automatically classify sperms by analysing their morphology. Their proposed method exceeds the performance of previous algorithms on the freely available HuSHeM dataset (Shaker et al. 2017). Chandra et al. (2022) employed widely used DL models, which have attained good classification accuracy on the ImageNet dataset, to extract the features and classify the samples into normal and abnormal.

2.3. Less-supervised learning-based models

Due to the difficulty of microscopic imaging, large amounts of expert-labelled training data in sperm morphology are out of reach. AD approaches address this problem using unsupervised and semi-supervised techniques.

AD methods have recently received much attention and have obtained acceptable performance compared to fully supervised methods. Andrews et al. (2016) showed that a standard transfer learning approach or a CNN model trained on an auxiliary task could provide viable representations for AD without explicitly imposing a prior on the abnormal data.

Generative Adversarial Network (GAN) methods like f-AnoGAN (Schlegl et al. 2019) or OCGAN (Perera et al. 2019) attempt to find a specific latent space in which the generator’s reconstructions, obtained from samples of this space, are analogous to the normal data. These approaches failed in the one-class setting, where the anomalies represent a broadly different class from the normal samples.

Lately, Salehi et al. (2021) demonstrated a KD approach that addressed two problems found in previous studies: failure in precise anomaly localisation and the need for expensive region-based training. Zhang et al. (2022) addressed the problem of relying on limited and possibly noisy class labels by using an unsupervised approach to extract valuable features without additional labelling cost on the SCIAN-MorphoSpermGS and HuSHeM datasets.

3. Dataset

The dataset used in this research is ‘MHSMA: The Modified Human Sperm Morphology Analysis Dataset’, which was collected from 235 patients with male factor infertility and consists of 1,540 greyscale samples for Acrosome, Head, Vacuole, and Tail, individually. In MHSMA, each sample is available in two different crop sizes of 64 × 64 and 128 × 128 and is labelled by experts as normal or abnormal. The two sizes of one sample image are shown in Figure 1. In Figure 2, we present sample images and corresponding labels from the MHSMA dataset.

Figure 1. An example of the images in the MHSMA dataset. Both images represent the same sperm. One is 128 × 128 pixels, and the other is cropped to 64 × 64 pixels.

Figure 2. Images and labels of MHSMA dataset for different parts of sperm such as vacuole, tail, head, and acrosome.

During the ICSI procedure, images of spermatozoa were captured at ×400 and ×600 magnification using a microscope equipped with a Charge-Coupled Device (CCD) camera with chromatic infinity objective lenses (Ghasemian et al. 2015).

The number of positive (normal) and negative (abnormal) samples for each category is given in Table 1.

Table 1. Number of positive and negative samples in MHSMA dataset. There are 1,540 sperm images in the dataset labelled as normal or abnormal.


These samples were divided into groups of 1000, 240, and 300 for training, validation, and testing, respectively.
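The split sizes above can be checked with a short sketch. The random shuffling, the seed, and the function name `split_indices` are assumptions for illustration only; the text does not state how samples were assigned to each group.

```python
import random

N_SAMPLES = 1540                      # total images in MHSMA, as stated above
N_TRAIN, N_VAL, N_TEST = 1000, 240, 300

def split_indices(n_samples, n_train, n_val, n_test, seed=0):
    """Shuffle sample indices and cut them into three disjoint groups."""
    if n_train + n_val + n_test != n_samples:
        raise ValueError("split sizes must add up to the dataset size")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # the seed is an arbitrary, hypothetical choice
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(N_SAMPLES, N_TRAIN, N_VAL, N_TEST)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the method trains on normal samples only, the training indices would additionally be filtered by label before use; that step is omitted here because it depends on the label files.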

4. Proposed method

In this study, we use the KD technique instead of the standard classification methods previously used in SMA. The images were randomly augmented to enlarge the dataset, and an adversarial attack was applied to a random number of images in each batch to improve the robustness of the model. The augmented and adversarial images were fed to the teacher and the student deep neural networks in our proposed approach. The KD approach was boosted by intermediate learning, which refers to the process of utilising the outputs of different intermediate layers of the teacher network, in addition to the final output, to guide the training of the student network. By matching the student’s representations at these intermediate layers to those of the teacher, the student can learn to capture more general features extracted by the teacher network. This allows the student to go beyond simply replicating the final output and gain a deeper understanding of the underlying representations learned by the teacher. Intermediate learning is particularly beneficial when working with small datasets, as it enables the student to leverage the teacher’s knowledge more effectively, leading to improved performance. As a result, we used a smaller student network to improve the performance of the method. We computed the loss, which was the sum of differences between the layers of the student and the teacher networks, including the intermediate layer outputs. We computed the gradients, applied backpropagation on the student network, and updated its weights. Finally, we detected anomalies based on the loss value and a threshold hyperparameter. This procedure helped us to achieve relevant results using low-resolution and blurry images, which are challenging to analyse with traditional methods. Figure 3 shows the described procedure.

Figure 3. The AD procedure in our proposed method. This flowchart shows how we determine whether a raw image is normal or abnormal.
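The final detection step, comparing a loss value with a threshold, can be sketched in a few lines. The text only says that the threshold is a hyperparameter, so choosing it as a quantile of losses on normal validation samples is an assumption made here for illustration, as are the function names and the loss values.

```python
def choose_threshold(normal_losses, quantile=0.95):
    """Pick the loss value at the given quantile of normal validation losses."""
    if not normal_losses:
        raise ValueError("need at least one loss value")
    ordered = sorted(normal_losses)
    # Index of the chosen quantile, clamped to the last element.
    position = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[position]

def is_abnormal(loss, threshold):
    """Flag a sample as abnormal when its distillation loss exceeds the threshold."""
    return loss > threshold

# Made-up loss values, for illustration only.
validation_losses_normal = [0.10, 0.12, 0.08, 0.15, 0.11, 0.09, 0.13, 0.14, 0.10, 0.12]
threshold = choose_threshold(validation_losses_normal, quantile=0.9)
flags = [is_abnormal(loss, threshold) for loss in (0.11, 0.40)]
print(threshold, flags)
```

A lower threshold flags more samples as abnormal at the cost of more false alarms on normal sperm, which is why it is tuned rather than fixed.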

In the KD approach, there were two neural networks: the student network and the teacher network. The teacher network was first pre-trained on the ImageNet dataset, a large, general dataset that did not contain any samples from our primary sperm image dataset. This pre-training on ImageNet allowed the teacher network to learn generalisable visual representations and gain a broad understanding of image features, without being explicitly trained on our specific task or dataset. After pre-training, the teacher network’s parameters were frozen, ensuring that feeding images during the subsequent training process would not change its learned representations.

There are three types of KD based on distillation schemes: offline, online, and self-distillation (Gou et al. 2021). In the offline method, a teacher model is trained in advance and only guides the student during training (Hinton et al. 2015). In the online method, both teacher and student models are trained end-to-end. The self-distillation method is a type of online distillation where the same network is used for the teacher and the student (Zhang et al. 2019). Because existing pre-trained models have proved effective in prior research, we used the offline method in this study. Figure 4 summarises our proposed approach.

Figure 4. Visualized summary of our method, an offline distillation approach. $LS_{n}$ is the $n$th layer of the student network and $LT_{n}$ is the corresponding layer of the teacher network. The figure also roughly shows which layers we use as critical layers.
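The defining property of offline distillation, a frozen teacher that is only read while the student alone is updated, can be illustrated with a toy example. This is a minimal sketch in plain Python under strong simplifying assumptions: both networks are small linear maps of the same size, the loss is a plain mean squared error on the final output, and all names and values are hypothetical. It is not the architecture or loss used in this work.

```python
import random

def linear(weights, x):
    """Apply a weight matrix (list of rows) to an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mse(a, b):
    """Mean squared error between two equally sized vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def distill_offline(teacher_w, student_w, samples, lr=0.1, epochs=200):
    """Train only the student to reproduce the frozen teacher's outputs."""
    history = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x in samples:
            target = linear(teacher_w, x)      # teacher is read, never updated
            output = linear(student_w, x)
            epoch_loss += mse(output, target)
            for i, row in enumerate(student_w):
                # Gradient of the MSE with respect to student_w[i][j].
                error = 2.0 * (output[i] - target[i]) / len(output)
                for j in range(len(row)):
                    row[j] -= lr * error * x[j]
        history.append(epoch_loss / len(samples))
    return history

rng = random.Random(0)
teacher_w = [[0.5, -1.0], [2.0, 0.3]]          # stands in for pre-trained weights
student_w = [[0.0, 0.0], [0.0, 0.0]]           # student starts untrained
normal_samples = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(20)]

losses = distill_offline(teacher_w, student_w, normal_samples)
print(losses[0], losses[-1])
```

The teacher weights are identical before and after the loop, while the student's average loss on the normal samples shrinks from epoch to epoch.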

After training the student model on the total loss, obtained by summing the losses between the corresponding layers of the student and teacher networks, we use the difference between the teacher and student losses to distinguish between positive and negative samples. For samples in which the sperm is normal, the loss values for the two networks are close to each other because the student network recognises this kind of sample from training. Figure 5 shows the vectors of positive and negative samples.

Figure 5. Distance vectors for normal and abnormal samples. The figure shows the distances for several normal and abnormal sperm samples, on which anomaly detection is based. The graph is obtained from the actual data of this dataset after training the model.
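One way to quantify how well such distances separate normal from abnormal samples is the ROC/AUC metric reported in the abstract. The sketch below uses the rank-based definition of AUC, the probability that a randomly chosen abnormal sample receives a higher score than a randomly chosen normal one. The distance values are made up and the function name is hypothetical.

```python
def roc_auc(scores_abnormal, scores_normal):
    """Probability that a random abnormal sample scores higher than a random normal one."""
    wins = 0.0
    for a in scores_abnormal:
        for n in scores_normal:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5            # ties count as half a win
    return wins / (len(scores_abnormal) * len(scores_normal))

# Made-up teacher-student distances, for illustration only.
normal_distances = [0.10, 0.12, 0.20, 0.15]
abnormal_distances = [0.18, 0.35, 0.40]

auc = roc_auc(abnormal_distances, normal_distances)
print(auc)
```

An AUC of 1.0 means every abnormal sample lies farther from the teacher than every normal one, while 0.5 means the distances carry no information about the label.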

In general, the training process involved the following steps:

A dataset $D_{train} = \{x_{1}, x_{2}, \dots, x_{n}\}$ was created from the available positive samples (normal sperm images).
During the KD phase, the positive samples from $D_{train}$ were fed into both the frozen, pre-trained teacher network and the student network.
The student network was trained to match the outputs and intermediate representations of the teacher network on these positive samples from $D_{train}$ . This way, the student learned to mimic the teacher’s understanding of normal samples without being explicitly trained on abnormal samples.
During inference, when a test sample was given to the student network, its output and intermediate representations would match those of the teacher for normal samples but deviate for abnormal samples. This deviation, measured as the difference between the student and teacher representations, indicated the abnormality of the test sample.
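The inference step above can be sketched as a score that accumulates the teacher-student deviation over the compared layers. The mean squared difference is used here only as a stand-in for the loss defined in Section 4.2, and the activation values are hypothetical. For simplicity the same teacher activations are reused for both test cases.

```python
def mean_squared_difference(a, b):
    """Average squared difference between two flattened activation vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def anomaly_score(teacher_features, student_features):
    """Sum the per-layer deviation between teacher and student representations."""
    if len(teacher_features) != len(student_features):
        raise ValueError("teacher and student must expose the same number of layers")
    return sum(
        mean_squared_difference(t, s)
        for t, s in zip(teacher_features, student_features)
    )

# Hypothetical flattened activations at two compared layers.
teacher_out = [[1.0, 0.0, 2.0], [0.5, 0.5]]
student_on_normal = [[1.0, 0.1, 2.0], [0.5, 0.4]]      # student tracks the teacher
student_on_abnormal = [[0.0, 1.0, 0.5], [1.5, -0.5]]   # student drifts away

score_normal = anomaly_score(teacher_out, student_on_normal)
score_abnormal = anomaly_score(teacher_out, student_on_abnormal)
print(score_normal, score_abnormal)
```

The student was never asked to imitate the teacher on abnormal inputs, so nothing constrains its activations there, and the score tends to be larger.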

In small datasets, training a deep neural network can be challenging due to the limited amount of data available, leading to overfitting and poor generalisation. One promising approach to improve the efficiency and effectiveness of KD in smaller datasets involves leveraging the output of the middle layers of the teacher network, which can provide more detailed and instructive information about the underlying data. Intermediate learning can improve the performance of KD by gradually increasing the task’s difficulty for the student model. Using more layers of knowledge from the teacher network makes the student network learn from basics to advanced features. Also, intermediate learning can help the student model focus on the most critical information, making the distillation process more efficient. Additionally, using intermediate learning can help mitigate the effects of overfitting and reduce the amount of required training data. This is particularly important in our dataset, where the limited amount of data makes it challenging to achieve high levels of accuracy.

Previous KD studies focus on the resemblance of outputs, which can prevent the student from fully inheriting the teacher’s knowledge (Hinton et al. 2015). In a convolutional network, the first and middle layers detect the low-level, basic features of the image. For example, the first layer of a convolutional network extracts the edges of an image, while the last layer extracts high-level, meaningful features by combining the low-level ones.

Romero et al. (2014) were the first to show that a thin student network guided by a hint, an intermediate teacher output whose behaviour the student tries to simulate, can outperform previous methods. In that study, the middle layer’s output from the teacher network, which represents the image, is mapped to a layer in the student using additional parameters. Salehi et al. (2021) pointed out that using multiple layers as hints increases the effectiveness of this method. Chen et al. (2021) also proposed a method that uses an attention mechanism to automatically assign proper target layers of the teacher model to each student layer. In this study, we tried to solve this issue by transferring information from the mid-layers of the teacher to the student. As a result, the model can discriminate better.

In the following, we present a comprehensive study of the KD technique, focusing on improving performance through different network architectures, a robust loss function, and data augmentation techniques. We investigate the effectiveness of different architectures for student networks and explore the impact of varying their depth and width. Additionally, we propose a loss function with all the enhancements for better intermediate learning. Lastly, we investigate the effect of data augmentation and improve the robustness of the model by using adversarial attacks.

4.1. Architecture

The architecture design of the student and teacher models is critical for efficient knowledge acquisition. The student network is a reduced version of the teacher model. Since a large student could learn excessive information focused on non-distinguishing features, the student architecture is smaller and simpler than the teacher network.

4.1.1. The teacher network

We used the VGG (Simonyan and Zisserman 2014) architecture for the teacher network. Transfer learning and VGG showed promising results in classifying anomalous and normal data (Abbasi et al. 2021). For the teacher network in our KD approach, we employed the VGG16 network architecture pre-trained on the ImageNet dataset, with its weights frozen after pre-training. While ImageNet contains a diverse range of natural images, there is a domain shift when applying the pre-trained model to the specific task of SMA. To mitigate this issue and adapt the VGG network to our domain, we employed the aforementioned training steps for the student network. During training, the weights of the teacher VGG16 network were frozen, and only the student network’s weights were updated using the KD process.

We evaluated different variants of the VGG architecture, such as VGG16 and VGG19, and the frozen VGG16 network demonstrated superior accuracy on a held-out validation set of sperm images, making it a suitable teacher model candidate. The VGG architecture, with its sequential convolutional and fully-connected structure, aligned well with our KD method. The progressive feature extraction and hierarchical representations learned by the VGG network facilitated the mimicking of intermediate representations by the student network. In contrast, other models with more complex architectures involving skip connections or parallel branches may have introduced additional complexities in the distillation process, making it harder for the student to effectively replicate the teacher’s internal representations.

Furthermore, the VGG network’s design simplicity and relatively lower computational requirements made it more practical for deployment in resource-constrained clinical settings compared to larger and more computationally intensive models. The KD approach allowed us to leverage the representational power of the pre-trained VGG network while training a smaller and more efficient student model tailored for our task.

We define each block in the network as a set of convolution layers and a pooling layer at the end. As shown in Figure 6, the VGG16 network has five blocks. Each block contains two or three two-dimensional convolution layers with a Rectified Linear Unit (ReLU) activation function. The convolutions have a kernel size of $3 \times 3$ pixels and a padding and stride of one pixel ( $1 \times 1$ ). They all also have a two-dimensional Batch Normalisation.

Figure 6. The architecture of the VGG16 network encoder (without the linear layers). This network is used as a teacher network in our method.
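The effect of these hyperparameters on feature-map size can be worked out with the standard output-size formula. The sketch below traces the two MHSMA crop sizes through the five blocks. The block layout is that of standard VGG16; the $2 \times 2$ pooling with stride 2 is the usual VGG setting and is an assumption here, since the text does not state the pooling parameters.

```python
def conv_output_size(size, kernel=3, padding=1, stride=1):
    """Spatial size after a convolution: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, kernel=2, stride=2):
    """Spatial size after unpadded pooling (2x2 with stride 2 is assumed here)."""
    return (size - kernel) // stride + 1

# (convolutions per block, output channels per block) in standard VGG16.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def trace_shapes(input_size):
    """Return (channels, height/width) after each block's pooling layer."""
    size = input_size
    shapes = []
    for n_convs, channels in VGG16_BLOCKS:
        for _ in range(n_convs):
            size = conv_output_size(size)   # 3x3, padding 1, stride 1 keeps the size
        size = pool_output_size(size)       # pooling halves the size
        shapes.append((channels, size))
    return shapes

shapes_64 = trace_shapes(64)
shapes_128 = trace_shapes(128)
print(shapes_64)
print(shapes_128)
```

With a $3 \times 3$ kernel, padding 1, and stride 1, the convolutions leave the spatial size unchanged, so only the pooling layers reduce it: a 64 × 64 crop ends at 2 × 2 after the fifth block and a 128 × 128 crop at 4 × 4.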

4.1.2. The student network

KD has emerged as a popular technique for compressing large neural networks into smaller and more efficient models in recent years. However, the effectiveness of this technique depends heavily on the choice of the student network architecture. In this paper, we investigate the impact of network depth on KD performance for low-quality image datasets. Our experimental results demonstrate that simpler networks can achieve better results than deeper networks, particularly when dealing with low-quality images. This is because deeper networks may extract features that are not relevant to the underlying data, leading to poor generalisation. To address this issue, we explore different architectures for the student networks and evaluate their performance in terms of simplicity and efficiency. Overall, our findings suggest that a careful balance between simplicity and efficiency is crucial for achieving optimal performance in KD.

One of the goals of KD is to simplify the student network, reducing computational cost, memory usage, and runtime. In many cases, DL models should impose negligible overhead on the system. Previous studies have explored student and teacher networks of the same size (Zhang et al. 2018). Experiments showed that a teacher network can effectively transfer knowledge only to students within a certain size range, neither too small nor too large (Mirzadeh et al. 2020). Deep students close in size to the teacher performed poorly because they concentrated on unsuitable features. Very shallow networks, in turn, did not learn the minimum features required to distinguish anomalous samples from normal ones.

For the student network, we trained different types of networks. To modify the structure of the student network, we varied the number of layers and the number of filters in each layer. We used a similar block-based network for the student with fewer filters and layers than the teacher network. Each block contains one or two convolution layers alongside pooling layers as feature extractors. The last convolution layer before each pooling layer has the same number of filters as the teacher network to simplify the knowledge transfer.
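To give a feel for how much such a design can shrink the network, the sketch below counts convolution parameters for standard VGG16 and for a hypothetical student. The `STUDENT` configuration is invented for illustration and is not one of the models proposed in this work. Three input channels are assumed, and Batch Normalisation parameters are left out.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one square convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def count_conv_params(blocks, in_channels=3):
    """Total convolution parameters for a list of blocks, each a list of filter counts."""
    total = 0
    for block in blocks:
        for out_channels in block:
            total += conv_params(in_channels, out_channels)
            in_channels = out_channels
    return total

def last_filters(blocks):
    """Filter count of the last convolution in each block."""
    return [block[-1] for block in blocks]

# Standard VGG16 convolutional blocks (filters per convolution layer).
TEACHER = [[64, 64], [128, 128], [256, 256, 256], [512, 512, 512], [512, 512, 512]]
# Hypothetical student: one or two convolutions per block, fewer filters overall,
# but the last convolution of each block keeps the teacher's filter count.
STUDENT = [[64], [128], [64, 256], [128, 512], [128, 512]]

teacher_params = count_conv_params(TEACHER)
student_params = count_conv_params(STUDENT)
print(teacher_params, student_params)
print(last_filters(TEACHER) == last_filters(STUDENT))
```

Keeping the last filter count of each block equal to the teacher's means the block outputs have the same number of channels in both networks, so they can be compared directly without an extra mapping layer.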

Overall, we propose three network architectures. Model A, shown in Figure 7, is the simplest model and does not converge well. As mentioned above, the capacity gap that can be bridged between teacher and student is limited, so smaller networks do not guarantee better results. Model B offers the best trade-off between simplicity and performance; its architecture and settings are illustrated in Figure 8. In the student network, the hyperparameters of the convolution, MaxPool, and BatchNorm layers are the same as those of the teacher network described above. Model C has the same architecture as the teacher network explained in subsection 4.1.1. Low-quality data and knowledge distillation both influence the choice of architecture, and simpler models produce better results than a student of the same size as the teacher. The results are provided in Section 5.

Figure 7. Model A: a simple network proposed to analyze the distillation effect. This network is used as a student network in our method.

Figure 8. Model B: the architecture of proposed student network. This network is used as a student network in our method.

Block-based models like VGG facilitate knowledge transfer between the student and teacher layers without needing extra parameters. We used multiple one-to-one hints so that the student network learns from the middle layers of the teacher network. Specifically, we used the last four MaxPool layers of the teacher network to transfer knowledge to the student network. We skipped the convolution layers between the MaxPool layers because each pooling layer already carries the knowledge we want from the block it belongs to. The effect of this method, namely intermediate learning, is shown in Section 5.

4.2. Loss

As mentioned earlier, the student network tries to imitate the teacher network, so the loss function is the difference between the teacher's and the student's outputs at specific layers. The activation outputs of the convolutional layers and of each MaxPool layer form the set of critical points, which hold the knowledge to be transferred to the student network. During training, we choose some of these critical points as hints to improve convergence. We refer to the $i$-th critical layer as $CP_{i}$, to the activation output of that layer in the student network as $a_{s}^{CP_{i}}$, and to the teacher's as $a_{t}^{CP_{i}}$. $N_{CP}$ represents the total number of critical layers. We define two losses that try to bring the outputs together: Cosine Similarity (CS) and Mean Squared Error (MSE). The first one is CS, defined below:

$$F_{CS} = \sum_{i = 1}^{N_{CP}} \frac{(a_{t}^{CP_{i}})^{T} \cdot a_{s}^{CP_{i}}}{\lVert a_{t}^{CP_{i}} \rVert \, \lVert a_{s}^{CP_{i}} \rVert} \tag{1}$$

We treat the output of each model as a vector, so the smaller the angle between the two vectors, the more similar the outputs. This function is more important than the MSE function in ReLU networks, whose neurons are activated only after exceeding a zero threshold. Since a loss should measure the difference between the vectors, we define $L_{CS}$ as follows:

$$L_{CS} = 1 - F_{CS} \tag{2}$$

The second one, $L_{MSE}$, aims to minimise the Euclidean distance between the activation values of the two networks and is defined below.

$$L_{MSE} = \sum_{i = 1}^{N_{CP}} (a_{t}^{CP_{i}} - a_{s}^{CP_{i}})^{2} \tag{3}$$

Using the two functions mentioned above, $L_{TOTAL}$ is formulated as

$$L_{TOTAL} = L_{CS} + λ L_{MSE} \tag{4}$$

$λ$ is a weight used to match the scale of the two losses. By adjusting the $λ$ value, the model can be tuned to emphasise one loss function over the other. This can enhance the model’s performance by addressing specific issues or biases in the training data. There are several ways to determine the value of $λ$ as a scaling factor to balance multiple loss functions, such as trial and error, grid search, and random search. We tested several values between zero and one for this hyperparameter, and the results are given in Section 5.
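As an illustration of Equations 1 to 4, the following minimal sketch computes the combined loss for activations that have already been flattened into plain lists of floats, one list per critical layer. The function names and the default `lam=0.01` are choices made for this example, Equation 3 is read literally as a sum of squared differences, and non-zero activation vectors are assumed; this is not the authors' implementation.

```python
import math


def cosine_similarity(a_t, a_s):
    """Cosine of the angle between two flattened activation vectors."""
    dot = sum(t * s for t, s in zip(a_t, a_s))
    norm_t = math.sqrt(sum(t * t for t in a_t))
    norm_s = math.sqrt(sum(s * s for s in a_s))
    # Assumes neither vector is all zeros.
    return dot / (norm_t * norm_s)


def total_loss(teacher_acts, student_acts, lam=0.01):
    """L_TOTAL = L_CS + lam * L_MSE over the critical layers (Eqs. 1-4)."""
    # Eq. 1: sum of cosine similarities over the critical layers.
    f_cs = sum(cosine_similarity(t, s)
               for t, s in zip(teacher_acts, student_acts))
    # Eq. 2: turn the similarity into a loss.
    l_cs = 1.0 - f_cs
    # Eq. 3: squared differences, summed over elements and layers.
    l_mse = sum((t - s) ** 2
                for a_t, a_s in zip(teacher_acts, student_acts)
                for t, s in zip(a_t, a_s))
    # Eq. 4: weighted combination.
    return l_cs + lam * l_mse


# One critical layer, orthogonal activations: L_CS = 1, L_MSE = 2.
print(total_loss([[1.0, 0.0]], [[0.0, 1.0]]))
```

With a small `lam`, the cosine term dominates unless the activations differ strongly in magnitude, which is the balancing role the text assigns to $λ$.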

In the following subsection, we explain how data augmentation addresses the shortage of training data.

4.3. Improving model robustness

Data augmentation and adversarial attack were employed to improve the performance of KD models in our research. We used techniques such as rotation and flipping to increase the size of the dataset. These techniques assist the model in learning more invariant characteristics and enhancing its ability to generalise to new unseen data. To further improve the model’s robustness, we also employed adversarial attacks. We generated adversarial examples using the Fast Gradient Sign Method (FGSM) and trained the model with varying attack strengths. In general, these techniques aid in improving the efficiency and robustness of the KD model and could be useful for real-world applications.

4.3.1. Data augmentation

We used augmentation techniques to enlarge the dataset virtually. Deep neural networks require a large amount of training data, and the dataset we used is relatively small. Moreover, because we do not use negatively labelled data in training, removing these samples costs us 30% of the usable data. Collecting additional samples is also difficult and time-consuming. For this reason, we used augmentation methods to increase the number of training examples. Each of the following methods is applied randomly to samples in each epoch.

Flip: We flip the image horizontally and vertically with a chance of 0.5, as shown in Figure 9.

Figure 9. Example of a sperm image in the dataset with flip augmentation applied during the training phase.

Rotate: We rotate the images by an angle in the $[0, 360]$ degree range with a chance of 0.5, as shown in Figure 10.

Figure 10. Example of a sperm image in the dataset with rotate augmentation applied during the training phase.
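The flip step can be sketched as below for an image stored as a list of pixel rows. Treating the horizontal and vertical flips as two independent draws, each with probability 0.5, is an assumption of this sketch, since the text does not say whether they are applied jointly. Rotation by an arbitrary angle needs interpolation and is left out here.

```python
import random


def random_flip(image, p=0.5, rng=random):
    """Randomly flip a 2-D image given as a list of rows.

    Each flip is drawn independently with probability p (an assumption).
    """
    if rng.random() < p:
        image = [row[::-1] for row in image]  # horizontal flip
    if rng.random() < p:
        image = image[::-1]                   # vertical flip
    return image


# With p=1.0 both flips are always applied.
print(random_flip([[1, 2], [3, 4]], p=1.0))
```

Calling the function once per sample in every epoch gives each image a different random view over the course of training, matching the per-epoch application described above.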

The dataset has been normalised with std = 0.04 and mean = 0.5, computed directly from the raw image data on the training dataset, and applied to all datasets.

$$\text{normal}(IMG) = \frac{IMG - \text{mean}(IMG)}{\text{std}(IMG)} \tag{5}$$
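As a small worked example of Equation 5 with the reported statistics, the sketch below normalises an image held as a list of rows. It assumes pixel intensities have already been scaled to the range [0, 1], which the mean of 0.5 suggests but the text does not state.

```python
def normalise(image, mean=0.5, std=0.04):
    """Apply (pixel - mean) / std to every pixel of a 2-D image."""
    return [[(px - mean) / std for px in row] for row in image]


# A pixel equal to the mean maps to 0; one std above the mean maps to about 1.
print(normalise([[0.5, 0.54]]))
```

Because the same `mean` and `std` are reused for the validation and test sets, the statistics come from the training data only.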

It should be noted that data augmentation is applied only to the training set; the validation and test sets are not augmented. We also intentionally avoided some augmentation methods so as not to spoil the morphological features in the image.

4.3.2. Adversarial attack

Adversarial attacks add minor, carefully crafted perturbations to input data in order to deceive the model into generating inaccurate predictions; training against them improves a model's resilience. Knowing the vulnerabilities of a model helps improve its robustness and reduces the chance of misclassification, whether an insubstantial change in the input accidentally deceives the model or an attacker intentionally tries to alter its predictions.

An adversarial sample is a specially crafted sample that looks normal to the human eye but causes the model to misclassify it. In other words, it contains the minimum change to a sample needed to make the model misclassify. Training the model on such adversarially crafted samples can sharpen its perception.

The FGSM is a white-box adversarial attack (Goodfellow et al. Citation2014). Its noise is calculated by multiplying the sign of the gradient of the loss with respect to the image we want to perturb by a small constant, epsilon. Increasing epsilon makes the image more likely to fool the model but also makes the noise more visible. Because the noise depends on the model parameters, the noise for the same input may differ from epoch to epoch.

In this attack, the input image is first given to the student and teacher models, and the loss value is calculated according to the equation in subsection 4.2. The gradient of this loss with respect to the input image is then calculated. A sample image, along with its calculated gradient vector, can be seen in Figure 11.

Figure 11. Gradient vector calculated from an image. This vector is used to add an adversarial attack to the image based on the value of gradients.

Next, we will introduce the $sign$ function to add an attack to the image.

$$\text{sign}(vec) = \text{for each } x \text{ in } vec: \begin{cases} -1 & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases} \tag{6}$$

Finally, with an $eps$ factor that in this study is considered to be 0.04, we change the final value of the $Input$ image according to the $sign$ of $gradient$ calculated in the backward procedure.

$$Image = Input + eps \cdot \text{sign}(gradient) \tag{7}$$

This action makes the model learn with higher confidence and, with the help of noisy samples, discourages it from learning unwanted features. In Figure 12, the steps of an FGSM attack can be seen.

Figure 12. The steps of creating an attacked image from a normal image. Summing up the image with the epsilon portion of the sign of the gradients creates an attacked image.
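Equations 6 and 7 translate almost directly into code. The sketch below works on a flattened image and takes the gradient as given; in practice the gradient would come from backpropagating the loss of subsection 4.2 to the input, which is not shown. The function names are illustrative.

```python
def sign(vec):
    """Eq. 6: -1 for negative entries, +1 otherwise (zero maps to +1)."""
    return [-1.0 if x < 0 else 1.0 for x in vec]


def fgsm_perturb(inputs, gradient, eps=0.04):
    """Eq. 7: move every pixel by eps in the direction of the gradient sign."""
    return [x + eps * s for x, s in zip(inputs, sign(gradient))]


# Only the sign of each gradient entry matters, not its magnitude.
print(fgsm_perturb([0.5, 0.5, 0.5], [0.3, -2.0, 0.0]))
```

Every pixel moves by exactly `eps`, so the perturbation is bounded regardless of how large the gradient is.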

In our study, we applied adversarial attacks on a random number of images in each batch during the training phase. We did not apply an attack on the validation and test set. Our experimental results showed that the proposed model achieved improved performance using this approach compared to the baseline model. The results are available in Section 5.

5. Results

In this section, we present the results of our proposed method based on KD for SMA. As discussed in the introduction, manual SMA is a time-consuming and subjective method. The evolution of artificial intelligence has led to the creation of automatic techniques and algorithms, which have their difficulties and constraints, such as limited data and low-resolution images. Our proposed method attempted to overcome these challenges and train a model for AD utilising a dataset with unclear images and a limited number of samples without observing any anomalous data. Even with a limited dataset, our results demonstrate that the KD method may be an effective solution for SMA.

Several metrics are commonly used in this task, including precision, recall, F-score, accuracy, and Receiver Operating Characteristic (ROC)/Area Under Curve (AUC). ROC/AUC is the most common of these and is the most informative for this task, so we used it as our primary evaluation measure, which also lets us compare our work with related studies. The ROC/AUC is a scalar metric that represents the model's overall performance, with a value of 1 indicating a perfect model and a value of 0.5 indicating a model that performs no better than random guessing. A higher ROC/AUC indicates that the model distinguishes better between positive and negative samples. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds, while the AUC summarises the ROC curve by calculating the area under it. The ROC and AUC can be expressed as follows:

$$ROC: \quad TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{TN + FP} \tag{8}$$

$$AUC = \int_{0}^{1} TPR(FPR^{-1}(t)) \, dt \tag{9}$$

Where True Positive (TP) and True Negative (TN) represent correct predictions made by the model, while False Positive (FP) and False Negative (FN) represent incorrect ones. We also report the $F_{0.5}$ , Precision, Recall, and Matthews Correlation Coefficient (MCC) to compare our results to other results on this dataset. Precision is a measure of the proportion of TP predictions in the total number of positive predictions, while recall is a measure of the proportion of TP predictions in the total number of actual positive examples. The $F_{0.5}$ score is a weighted harmonic mean of precision and recall, where precision is given more weight than recall. These metrics can be expressed as follows:

$$Precision = \frac{TP}{TP + FP} \tag{10}$$

$$Recall = \frac{TP}{TP + FN} \tag{11}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{12}$$

$$F_{0.5} = \frac{(1 + 0.5^{2}) \cdot Precision \cdot Recall}{0.5^{2} \cdot Precision + Recall} \tag{13}$$
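Equations 10 to 13 can be evaluated directly from the four confusion-matrix counts, as in the minimal sketch below. The counts in the usage line are invented for illustration and are not results from this study, and the function assumes that none of the denominators is zero.

```python
import math


def classification_metrics(tp, fp, tn, fn, beta=0.5):
    """Precision, recall, F_beta and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)                                  # Eq. 10
    recall = tp / (tp + fn)                                     # Eq. 11
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))          # Eq. 12
    f_beta = ((1 + beta ** 2) * precision * recall
              / (beta ** 2 * precision + recall))               # Eq. 13
    return {"precision": precision, "recall": recall,
            "mcc": mcc, "f_beta": f_beta}


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=8, fp=2, tn=7, fn=3))
```

With `beta=0.5`, the score for these counts lies closer to the precision (0.8) than to the recall (about 0.73), which reflects the extra weight given to precision.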

In addition to evaluating the performance of the models using ROC/AUC as the primary metric, we also analysed the effect of the threshold hyperparameter on the precision and recall metrics. The following equation shows how we mark an image as an anomaly:

$$\text{isAnomaly}(image) = \begin{cases} \text{True} & \text{if } lossValue > threshold \\ \text{False} & \text{otherwise} \end{cases} \tag{14}$$

The $loss Value$ is the output of the loss function described in subsection 4.2.
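To connect the threshold rule of Equation 14 with the ROC/AUC of Equations 8 and 9, the sketch below sweeps the threshold over the observed loss values, records one (FPR, TPR) point per threshold, and integrates the curve with the trapezoidal rule. Labels use 1 for abnormal and 0 for normal, and the scores in the usage lines are invented; both classes are assumed to be present.

```python
def roc_points(loss_values, labels):
    """(FPR, TPR) pairs from sweeping the threshold of Eq. 14.

    labels: 1 = abnormal (positive), 0 = normal (negative).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Highest threshold first; -inf flags every sample as an anomaly.
    thresholds = sorted(set(loss_values), reverse=True) + [float("-inf")]
    points = []
    for thr in thresholds:
        flagged = [lv > thr for lv in loss_values]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a ROC curve given as ordered (FPR, TPR) pairs."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Invented scores: one abnormal sample scores below one normal sample.
scores = [0.9, 0.4, 0.6, 0.1]
labels = [1, 1, 0, 0]
print(auc(roc_points(scores, labels)))
```

The value equals the fraction of (abnormal, normal) pairs in which the abnormal sample has the higher loss, which is why 0.5 corresponds to random guessing.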

The results showed that as the threshold was increased, the precision of the model increased while the recall decreased. On the other hand, as the threshold was decreased, the model recall increased while the precision decreased. This indicates a trade-off between precision and recall, and the optimal threshold will depend on the specific application and the desired balance between these two metrics. High precision is critical in SMA because FPs can lead to misdiagnosis and unnecessary treatment, which can be costly and emotionally distressing for patients. A low recall may not be as critical as it may lead to missing a few abnormal sperm which might not significantly impact the fertilisation process. For this reason, we attempted to optimise the precision by configuring the threshold hyperparameter.

5.1. Settings

Bias terms can easily lead a deep neural network to learn a constant function, and hypersphere collapse results from a constant function that maps every input directly to the hypersphere centre. We therefore did not use bias terms in the convolution layers of our student network.

In all of our experiments, we used the Adam optimiser. We systematically optimised key hyperparameters including learning rate, batch size, and epsilon to identify values that maximise validation performance. A grid search was conducted across learning rates ranging from 1e-4 to 1e-1 on a logarithmic scale, batch sizes from 8 to 512 in powers of 2, and epsilon values from 0 to 0.5 in 0.01 increments. The model was trained for 200 epochs on each hyperparameter combination, and the values yielding the lowest validation loss were selected. The best performance is achieved using learning rate = 0.001, batch size = 128, and the epsilon parameter for the FGSM attack equals 0.04. This combination of learning rate, batch size, and epsilon was used for all subsequent experiments.
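The search space described above can be enumerated as in the following sketch. Using one learning rate per decade is an assumption, since the text gives only the range and the logarithmic scale, and `validation_loss` is a hypothetical callable standing in for a full 200-epoch training and validation run.

```python
from itertools import product

learning_rates = [1e-4, 1e-3, 1e-2, 1e-1]           # one per decade (assumed)
batch_sizes = [2 ** k for k in range(3, 10)]        # 8, 16, ..., 512
epsilons = [round(0.01 * i, 2) for i in range(51)]  # 0.00, 0.01, ..., 0.50

grid = list(product(learning_rates, batch_sizes, epsilons))


def select_best(grid, validation_loss):
    """Return the configuration with the lowest validation loss.

    validation_loss(lr, batch_size, eps) is a hypothetical callable that
    trains one model and returns its loss on the validation set.
    """
    return min(grid, key=lambda cfg: validation_loss(*cfg))


print(len(grid))  # 4 * 7 * 51 = 1428 combinations
```

Under these assumptions the grid contains 1428 combinations, each of which requires a full training run.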

We used the approach of Glorot and Bengio (Citation2010) to initialise the weights of the student network. All the models were trained individually for the head, acrosome, and vacuole for 200 epochs. The threshold hyperparameter is calculated on the validation set after the training phase of each experiment.
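For reference, the uniform variant of Glorot initialisation draws each weight from $[-a, a]$ with $a = \sqrt{6 / (fan_{in} + fan_{out})}$. The sketch below applies it to a bias-free convolution layer; the uniform variant, the layer sizes, and the function names are assumptions of this example, as the text does not give these details.

```python
import math
import random


def glorot_limit(fan_in, fan_out):
    """Bound a of the uniform Glorot distribution U(-a, a)."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def glorot_uniform_conv(in_channels, out_channels, kernel_size, seed=0):
    """Flat list of weights for a square-kernel convolution without bias."""
    rng = random.Random(seed)
    receptive = kernel_size * kernel_size
    limit = glorot_limit(in_channels * receptive, out_channels * receptive)
    n_weights = out_channels * in_channels * receptive
    return [rng.uniform(-limit, limit) for _ in range(n_weights)], limit


# Hypothetical layer: 1 input channel, 8 filters, 3x3 kernels.
weights, limit = glorot_uniform_conv(1, 8, 3)
print(len(weights), round(limit, 3))
```

The bound shrinks as layers get wider, which keeps activation and gradient variances roughly constant across layers.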

5.2. Experiments

Initially, we trained the basic model without preprocessing and obtained a subpar result. The dataset was then normalised and augmented at the start of each epoch, which substantially increased the ROC/AUC. In addition, our metric value was marginally improved by employing an FGSM adversarial attack on the dataset to help the model learn latent features. We utilised the output of four critical layers to calculate the loss, with $λ$ set to $0.01$. The student model was trained using the training set. The images from the validation set were then fed into the model to determine the optimal threshold value.

In our quest to establish the most effective threshold for our model, we aimed to strike a balance between precision and recall. To achieve this balance, we selected the threshold value that maximised the $F_{0.5}$ score. The $F_{0.5}$ score weights precision twice as much as recall, allowing us to tune our model to prioritise precision over recall. The high precision indicates that our model returns very few false positives, at the expense of missing some true positives. However, for our application, minimising false positives is critical so the model’s precision was paramount. The $F_{0.5}$ score allowed us to systematically select the threshold that struck the right balance between precision and recall, with an emphasis on precision, to meet the needs of our model. The optimal thresholds for the head, vacuole, and acrosome are $0.77$ , $0.70$ , and $0.67$ , respectively.
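The threshold selection described above can be sketched as a search over candidate thresholds taken from the validation loss values, keeping the one with the highest $F_{0.5}$. The sketch assumes that the validation set contains labelled abnormal samples (1 = abnormal, 0 = normal), and the loss values in the usage lines are invented.

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta score; defined as 0 when precision and recall are both 0."""
    if precision == 0 and recall == 0:
        return 0.0
    return ((1 + beta ** 2) * precision * recall
            / (beta ** 2 * precision + recall))


def best_threshold(loss_values, labels, beta=0.5):
    """Return (threshold, score) maximising F_beta; labels: 1 = abnormal."""
    best = (None, -1.0)
    for thr in sorted(set(loss_values)):
        flagged = [lv > thr for lv in loss_values]  # anomaly rule, Eq. 14
        tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = f_beta(precision, recall, beta)
        if score > best[1]:
            best = (thr, score)
    return best


# Invented validation losses: the two normal samples score lowest.
print(best_threshold([0.2, 0.3, 0.5, 0.8, 0.9], [0, 0, 1, 1, 1]))
```

Ties are resolved in favour of the lowest threshold because only a strictly better score replaces the current best.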

Finally, the metrics for the test set were measured. Table 2 displays the results of applying the same procedure to each part of the sperm, with varying outcomes.

Table 2. The outcomes of each part of sperm produced by the proposed method. The optimal setup of the suggested approach on the test set yielded these results.


A confusion matrix is an effective evaluation tool for AD systems. In our study, we analysed the performance of our suggested method and computed the aforementioned metrics on the test set using the confusion matrix. The confusion matrix shows the outcomes of our model's predictions clearly, allowing us to evaluate the accuracy of our method. Our model's confusion matrix is shown in Table 3.

Table 3. Confusion matrix of the proposed method for each part of the sperm. These results are achieved from the evaluation phase on the test set.


In our experiments, we evaluated the performance of KD and compared it to other techniques for training DL models. This comparison provides insight into the effectiveness of KD when working with smaller datasets, since we also ignore the abnormal samples entirely in the training phase. The performance comparison of our proposed method with Ghasemian et al. (Citation2015) and Javadi and Mirroshandel (Citation2019) is shown in Table 4. Our method significantly outperforms Ghasemian et al. (Citation2015). The approach of Javadi and Mirroshandel (Citation2019) achieves higher values than ours, which is acceptable given the limitation we imposed on our method: training the model without using any abnormal data.

Table 4. Comparison of our results with those achieved by Ghasemian et al. (Citation2015) and Javadi and Mirroshandel (Citation2019) on each part of sperm evaluated on the test set of the MHSMA dataset.


We found that while KD can be an effective technique for improving the performance of DL models, it did not outperform the fully supervised method in our experiments. However, KD allows the model to learn from a teacher model’s generalisation capabilities which can be particularly beneficial when the data for training is limited. Additionally, it allows for a softer target distribution which can provide more information for the student model to learn from. Moreover, KD can be used to train smaller models, which is beneficial with limited computational resources.

An Analysis of Variance (ANOVA) was conducted to compare the performance of our proposed method against that of the existing method by Javadi and Mirroshandel (Citation2019). Initially, we implemented the approach outlined by Javadi and Mirroshandel (Citation2019) and obtained predicted outputs using their method on our dataset. Similarly, our proposed technique was utilised to make predictions on the same dataset, and the sample mean total loss, $L_{TOTAL}$, was analogously computed. A boxplot visualising the distribution of sample means for both groups is provided in Figure 13.

Figure 13. Boxplot of total loss mean for two groups. This plot indicates that the mean of the two groups is different.

The null hypothesis for the ANOVA is that the population means of total loss are equal between the two sample groups. For head and vacuole, the ANOVA produced a p-value below the 0.05 significance level, indicating that the difference in sample means between the two groups is statistically significant. We therefore reject the null hypothesis for these two parts: the ANOVA provides evidence that the population means differ between the two methods. For the acrosome, however, we fail to reject the null hypothesis. The result of the test is shown in Table 5.

Table 5. ANOVA test results for comparison of our results with those achieved by Javadi and Mirroshandel (Citation2019).
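For readers unfamiliar with the test, a one-way ANOVA compares the variance between group means with the variance within groups. The sketch below computes only the F statistic for any number of groups; the sample values are invented, and the p-value would additionally require the F distribution with $(k - 1, N - k)$ degrees of freedom, which SciPy's `scipy.stats.f_oneway` provides.

```python
def one_way_anova_f(*groups):
    """F statistic: between-group mean square over within-group mean square."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Invented loss samples for two methods.
print(one_way_anova_f([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

A larger F means the group means are further apart relative to the spread within each group; with two groups the test is equivalent to a two-sample t-test with pooled variance.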


Another useful experiment is generating an Anomaly Localisation Map. An Anomaly Localisation Map is a method for identifying and highlighting anomalous or abnormal regions of an image or data sample. The map is generated by applying a trained model to the input data and highlighting the regions of the image in which the model’s output deviates significantly from the expected output. The output is a heatmap indicating which image regions most likely contain anomalies.

To generate a localisation map, we used the Guided Backpropagation (GBP) technique introduced in Springenberg et al. (Citation2014). In GBP, the gradient of the network's output with respect to the input is used to highlight the input regions that had the most influence on the final prediction. We also applied a Gaussian filter to the GBP output to reduce noise in the final image.

In GBP, the backward pass is modified to propagate only the positive gradients. This is done by setting all negative gradients to zero as they pass back through each ReLU. Keeping only the positive gradients corresponds to applying the ReLU function, which is defined as $ReLU (x) = \max (0, x)$ and is used as the activation function in many neural networks. In Figure 14, we show three examples of abnormal data; the maps highlight the pixels to which the model pays the most attention and the part of the image in which it detects the abnormality.

Figure 14. Anomaly localization map on abnormal samples from the test set. This map shows which parts of the image the model paid more attention to.
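The gradient rule that distinguishes GBP from ordinary backpropagation can be written for a single ReLU layer as follows. This is a sketch of the rule from Springenberg et al. on flat lists, not the authors' code: a gradient passes only where the forward input was positive and the incoming gradient is itself positive. The Gaussian smoothing step is not shown.

```python
def relu(x):
    """Forward pass: ReLU(x) = max(0, x), element-wise."""
    return [max(0.0, v) for v in x]


def guided_relu_backward(forward_input, grad_output):
    """Guided backpropagation through one ReLU.

    A gradient entry survives only if the forward input was positive
    (ordinary ReLU backward rule) and the gradient itself is positive.
    """
    return [g if (v > 0 and g > 0) else 0.0
            for v, g in zip(forward_input, grad_output)]


# Only the first entry has both a positive input and a positive gradient.
print(guided_relu_backward([1.0, -1.0, 2.0, 3.0], [0.5, 0.7, -0.2, 0.0]))
```

Applying this rule at every ReLU while backpropagating to the input yields a map that keeps only the evidence supporting the output, which is what makes the resulting heatmaps sharper than plain gradient maps.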

In our experiments, we investigated the impact of the $λ$ hyperparameter in our proposed loss function (Equation 4) on the accuracy of our DL model.

In developing our dual loss function, we needed to determine the optimal weighting between the CS and the MSE terms. To find the right balance, we first calculated the mean of the CS and MSE loss outputs individually on the validation set. With these mean loss values as a reference, we then tested various values for the $λ$ parameter that controls the weighting between the two loss terms. Specifically, we evaluated $λ$ values of 0.1, 0.01, and 0.001. For each $λ$ value, we re-trained the model and quantified the overall validation loss. When the MSE loss is given too much weight ( $λ > 0.1$ ), the student network tends to focus more on minimising the element-wise differences between its output and the teacher's output than on aligning the directional representations. This can lead to a sub-optimal learning process, as the student may learn to replicate the teacher's outputs without fully capturing the underlying semantic representations. Through our empirical evaluations, we found that $λ$ values within the range of 0 to 0.1 strike an appropriate balance between the two loss components.

Our findings suggest that using $λ$ as a scaling factor to balance multiple loss functions can effectively improve the performance of our proposed model. The results for multiple $λ$ values are shown in Table 6. Because changing $λ$ alters the output values of the loss function, we re-selected the threshold in each scenario and report it alongside the results. In this experiment, we chose the threshold value by maximising the $F_{0.5}$ score to balance precision and recall with a focus on precision, as described above.

Table 6. Results achieved with distinct $λ$ values on validation set. Different $λ$ values can balance the effect of multiple losses.


While the results show that different defect types (head, vacuole, and acrosome) achieve their highest F-scores at different optimal $λ$ values, we selected a $λ$ of 0.01 for our final model. Although having separate $λ$ values for each defect type could potentially maximise individual F-scores, we opted for a single value of 0.01 for simplicity, generalisation, computational considerations, and ease of interpretability. This value achieved reasonably high F-scores across all defect types, striking a good balance in overall performance. However, we acknowledge the importance of this finding and plan to investigate more sophisticated techniques in future work to dynamically adjust $λ$ based on input samples or defect types, aiming to combine optimal individual performance with a practical and generalisable model. The $λ$ of 0.01 indicates a 1:100 weighting between the MSE and CS terms is optimal for our application. By tuning $λ$ in this manner, we were able to introduce the benefits of MSE for feature space learning while retaining the similarity properties of CS. Our dual loss function with the optimised $λ$ parameter yielded improved performance compared to models trained on either CS or MSE alone.

Using a simpler model is an integral part of our analysis. The architecture selected for the student network significantly affects how well distinguishing features are learned. A simpler model also suits the nature of low-resolution microscopic images, improving accuracy on this dataset. This study shows that simplifying the model delivers better performance on the MHSMA dataset. Table 7 shows the number of trainable parameters for each student network introduced in the previous section. Figure 15 shows the results of using these networks evaluated on the validation set. As is visible in this figure, the best results were achieved with model B.

Figure 15. The result of using different architectures for student networks. Model A has a simple architecture. Model B has the proposed student architecture. Model C has the same architecture as the teacher network.

Table 7. Number of trainable parameters in three proposed models to show the effect of distillation. Changing the architecture of each network changes the number of its trainable parameters.


We report various metrics for different numbers of critical layers included in the loss function. The results show that considering more layers in the loss function can improve performance; however, using more than four layers no longer affects the accuracy. Table 8 shows the results of the evaluation of each part of the sperm on both validation and test sets. As mentioned previously, the threshold value utilised in this study was selected by optimising the $F_{0.5}$ score on the validation set.

Table 8. Effect of critical layers on the performance of the proposed method. These results show how changing the number of critical layers in loss calculation can affect performance.


In this experiment, we also study the effects of augmentation and adversarial attacks. Data augmentation involved generating extra training data by applying image transformations, such as rotation and flipping, to the original images. The adversarial technique consisted of perturbing the source images with barely noticeable noise in order to trick the model into generating inaccurate predictions, thereby compelling it to learn more robust features. We report AUC score, precision, recall, and $F_{0.5}$ score in three modes: (1) without augmentation and adversarial attack, (2) with augmentation but without adversarial attack, and (3) with both augmentation and adversarial attack. Table 9 shows the details of all metrics with and without these techniques on both validation and test sets. We found that data augmentation and adversarial attack increased our model's AUC score and also had a positive effect on precision.

Table 9. Model performance metrics comparison with and without data augmentation and adversarial attack.


These results demonstrate that the proposed approach with data augmentation and adversarial attack can significantly improve the model’s performance in detecting anomalies in low-quality sperm images.

The constraints within our work arise from various factors. Firstly, the images within our dataset exhibit blurriness and lack clarity due to limitations in equipment, posing a significant challenge to our analysis. Additionally, ethical considerations restricted the availability of data, thereby complicating the scope of our experiment. Despite these acknowledged limitations, our approach produced acceptable results, underscoring the significance of this methodology.

6. Conclusion

This paper presents a new method to detect and categorise anomalous data in SMA. Our model classifies sperm cells into normal and abnormal categories without using abnormal data during training. We showed that we could train a student to learn from anomaly-free data using a knowledgeable teacher from another field (i.e. trained on the ImageNet dataset). Using the intermediate knowledge of the expert network, the student network can distinguish abnormal data not seen in the training phase from normal data.

In this article, we transferred the knowledge of a pre-trained teacher network to the student network using the feature-based offline KD method. Using several intermediate layers helps the student network learn specific features using only positive samples. In the subsequent analysis, the teacher network’s extensive understanding of both sample types leads to a distinct encoding of images by the two models: differing for negative samples and converging for positive samples. Based on this difference, abnormal samples can be distinguished from normal ones.

This method has performed well in comparison to fully supervised methods. Considering the absence of abnormal data in training, it can be highly competitive with existing methods in terms of precision, F-score, accuracy, and other valuable metrics in this field. The use of simple models makes the approach suitable for edge devices. In addition, it functions with high accuracy on low-quality data obtained in a laboratory environment with inadequate imaging facilities. Imaging sperm at this scale is very difficult; for this reason, we attempted to train the model with as little data as possible and without the requirement for expert-labelled data.

In medicine, and especially in SMA, where imaging and labelling are challenging, models can be trained without labelled data with the help of unsupervised and self-supervised methods. New GAN-based data creation methods may also perform well in this field. Although we show the effectiveness of AD methods in an unsupervised way, our proposed method is less accurate than supervised methods in finding negative samples. New methods and attention-based models could improve these metrics.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Ali Nabipour

Ali Nabipour received his B.Sc. degree in Computer Engineering from the University of Guilan, Rasht, Iran, in 2024. He developed an interest in machine learning and its applications during his studies. He has published one paper in a peer-reviewed journal. His current research interests revolve around natural language processing and the application of large language models. Additionally, he is passionate about leveraging machine learning and deep learning algorithms to tackle challenges in interdisciplinary domains such as biomedical engineering.

Mohammad Javad Shams Nejati

Mohammad Javad Shams Nejati is a BSc degree student in computer engineering at the University of Guilan, located in Rasht, Iran. Since 2022 he has been working on the field of Computer Vision in the Faculty of Engineering at University of Guilan. He has published one paper in a peer-reviewed journal. His current research interests include Image Processing, Computer Vision and Machine Learning.

Yasaman Boreshban

Yasaman Boreshban completed her Ph.D. in Computer Engineering at the Department of Computer Engineering, Sharif University of Technology, in 2023. She has made significant contributions to the field of machine learning, particularly in Natural Language Processing (NLP), through her active participation in several publications in reputable journals. Her research primarily focuses on Machine Learning, Deep Learning, Natural Language Processing, and Image Processing.

Seyed Abolghasem Mirroshandel

Seyed Abolghasem Mirroshandel received his B.Sc. degree from the University of Tehran in 2005 and his M.Sc. and Ph.D. degrees from Sharif University of Technology, Tehran, Iran, in 2007 and 2012, respectively. Since 2012, he has been with the Faculty of Engineering at the University of Guilan in Rasht, Iran, where he is an Associate Professor of Computer Engineering. Dr. Mirroshandel has published more than 80 technical papers in peer-reviewed journals and conference proceedings. His current research interests focus on Natural Language Processing, Machine Learning, and Image Processing.

References

Abbasi A, Miahi E, Mirroshandel SA. 2021. Effect of deep transfer and multi-task learning on sperm abnormality detection. Comput Biol Med. 128:104121. doi: 10.1016/j.compbiomed.2020.104121.
Andrews J, Tanay T, Morton EJ, Griffin LD. 2016. Transfer representation-learning for anomaly detection. JMLR.
Chandra S, Gourisaria MK, Gm H, Konar D, Gao X, Wang T, Xu M. 2022. Prolificacy assessment of spermatozoan via state-of-the-art deep learning frameworks. IEEE Access. 10:13715–16. doi: 10.1109/ACCESS.2022.3146334.
Chang V, Garcia A, Hitschfeld N, Härtel S. 2017. Gold-standard for computer-assisted morphological sperm analysis. Comput Biol Med. 83:143–150. doi: 10.1016/j.compbiomed.2017.03.004.
Chen D, Mei JP, Zhang Y, Wang C, Wang Z, Feng Y, Chen C. 2021. Cross-layer distillation with semantic calibration. Proceedings of the AAAI Conference on Artificial Intelligence. p. 7028–7036.
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. 2009. Imagenet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. p. 248–255. doi: 10.1109/CVPR.2009.5206848.
Ghasemian F, Mirroshandel SA, Monji-Azad S, Azarnia M, Zahiri Z. 2015. An efficient method for automatic morphological abnormality detection from human sperm images. Comput Methods Programs Biomed. 122(3):409–420. doi: 10.1016/j.cmpb.2015.08.013.
Glorot X, Bengio Y. 2010. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings; Sardinia, Italy. p. 249–256.
Goodfellow IJ, Shlens J, Szegedy C. 2014. Explaining and harnessing adversarial examples. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings. doi: 10.48550/arxiv.1412.6572.
Gou J, Yu B, Maybank SJ, Tao D. 2021. Knowledge distillation: a survey. Int J Comput Vis. 129(6):1789–1819. doi: 10.1007/s11263-021-01453-z.
Hinton G, Vinyals O, Dean J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Ilhan H, Serbes G, Aydin N. 2022. Decision and feature level fusion of deep features extracted from public COVID-19 data-sets. Appl Intell. 52(8):8551–8571. doi: 10.17632/6XVDHC9FYB.1.
Ilhan HO, Serbes G, Aydin N. 2018a. Dual tree complex wavelet transform based sperm abnormality classification. 2018 41st International Conference on Telecommunications and Signal Processing (TSP); Athens, Greece. IEEE. p. 1–5.
Ilhan HO, Sigirci IO, Serbes G, Aydin N. 2018b. The effect of nonlinear wavelet transform based de-noising in sperm abnormality classification. 2018 3rd International Conference on Computer Science and Engineering (UBMK); Bosnia and Herzegovina. IEEE. p. 658–661.
Ilhan HO, Sigirci IO, Serbes G, Aydin N. 2020. A fully automated hybrid human sperm detection and classification system based on mobile-net and the performance comparison with conventional methods. Med Biol Eng Comput. 58(5):1047–1068. doi: 10.1007/s11517-019-02101-y.
Javadi S, Mirroshandel SA. 2019. A novel deep learning method for automatic assessment of human sperm images. Comput Biol Med. 109:182–194. doi: 10.1016/j.compbiomed.2019.04.030.
Krizhevsky A, Sutskever I, Hinton GE. 2017. Imagenet classification with deep convolutional neural networks. Commun ACM. 60(6):84–90. doi: 10.1145/3065386.
Li J, Tseng KK, Dong H, Li Y, Zhao M, Ding M. 2014. Human sperm health diagnosis with principal component analysis and k-nearest neighbor algorithm. 2014 International Conference on Medical Biometrics; Shenzhen, China. IEEE. p. 108–113.
Liu R, Wang M, Wang M, Yin J, Yuan Y, Liu J. 2021. Automatic microscopy analysis with transfer learning for classification of human sperm. Appl Sci. 11(12):5369. doi: 10.3390/app11125369.
Lv Q, Yuan X, Qian J, Li X, Zhang H, Zhan S. 2022. An improved u-net for human sperm head segmentation. Neural Process Lett. 54(1):537–557. doi: 10.1007/s11063-021-10643-2.
Melendez R, Castañón CB, Medina-Rodríguez R. 2021. Sperm cell segmentation in digital micrographs based on convolutional neural networks using u-net architecture. 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS). IEEE. p. 91–96.
Miahi E, Mirroshandel SA, Nasr A. 2022. Genetic neural architecture search for automatic assessment of human sperm images. Expert Syst Appl. 188:115937. doi: 10.1016/j.eswa.2021.115937.
Mirroshandel SA, Ghasemian F. 2018. Automated morphology detection from human sperm images. Intracytoplasmic Sperm Injection. p. 99–122. https://link.springer.com/chapter/10.1007/978-3-319-70497-5_8.
Mirzadeh SI, Farajtabar M, Li A, Levine N, Matsukawa A, Ghasemzadeh H. 2020. Improved knowledge distillation via teacher assistant. Proceedings of the AAAI conference on artificial intelligence; New York, USA. p. 5191–5198.
Perera P, Nallapati R, Xiang B. 2019. Ocgan: one-class novelty detection using gans with constrained latent representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, USA. p. 2898–2906.
Revollo NV, Sarmiento G, Delrieux C, Herrera M, González-José R. 2022. Supervised machine learning classification of human sperm head based on morphological features. Trends and advancements of image processing and its applications. Springer; p. 177–191.
Riordon J, McCallum C, Sinton D. 2019. Deep learning for the classification of human sperm. Comput Biol Med. 111:103342. doi: 10.1016/j.compbiomed.2019.103342.
Romero A, Ballas N, Kahou SE, Chassang A, Gatta C, Bengio Y. 2014. Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550.
Ronneberger O, Fischer P, Brox T. 2015. U-net: convolutional networks for biomedical image segmentation. International Conference on Medical image computing and computer-assisted intervention; Munich, Germany. Springer. p. 234–241.
Salehi M, Arya A, Pajoum B, Otoofi M, Shaeiri A, Rohban MH, Rabiee HR. 2021. Arae: adversarially robust training of autoencoders improves novelty detection. www.aaai.org.
Salehi M, Sadjadi N, Baselizadeh S, Rohban MH, Rabiee HR. 2021. Multiresolution knowledge distillation for anomaly detection. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; Nashville, USA. p. 14902–14912.
Schlegl T, Seeböck P, Waldstein SM, Langs G, Schmidt-Erfurth U. 2019. F-anogan: fast unsupervised anomaly detection with generative adversarial networks. Med Image Anal. 54:30–44. doi: 10.1016/j.media.2019.01.010.
Shaker F, Monadjemi SA, Alirezaie J. 2017. Classification of human sperm heads using elliptic features and lda. 2017 3rd International Conference on Pattern Recognition and Image Analysis (IPRIA); Shahrekord, Iran. IEEE. p. 151–155.
Shaker F, Monadjemi SA, Alirezaie J, Naghsh-Nilchi AR. 2017. A dictionary learning approach for human sperm heads classification. Comput Biol Med. 91:181–190. doi: 10.1016/j.compbiomed.2017.10.009.
Simonyan K, Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Skoracka K, Eder P, Łykowska Szuber L, Dobrowolska A, Krela-Kaźmierczak I. 2020. Diet and nutritional factors in male (in)fertility—underestimated factors. JCM. 9(5):1400. doi: 10.3390/JCM9051400.
Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M. 2014. Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
Tseng KK, Li Y, Hsu CY, Huang HN, Zhao M, Ding M. 2013. Computer-assisted system with multiple feature fused support vector machine for sperm morphology diagnosis. BioMed Res Int. 2013:1–13. doi: 10.1155/2013/687607.
WHO. 2021. WHO laboratory manual for the examination and processing of human semen. Geneva, Switzerland: World Health Organization.
Zhang L, Song J, Gao A, Chen J, Bao C, Ma K. 2019. Be your own teacher: improve the performance of convolutional neural networks via self distillation. Proceedings of the IEEE/CVF International Conference on Computer Vision; Seoul, Korea. p. 3713–3722.
Zhang Y, Xiang T, Hospedales TM, Lu H. 2018. Deep mutual learning. Proceedings of the IEEE conference on computer vision and pattern recognition; Salt Lake City, USA. p. 4320–4328.
Zhang Y, Zhang J, Zha X, Zhou Y, Cao Y, Chen D. 2022. Improving human sperm head morphology classification with unsupervised anatomical feature distillation. 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI); Kolkata, India. IEEE. p. 1–5.

Appendix

Appendix A. Table of abbreviations used in the text with their description.


