
[ Article ]
JOURNAL OF SENSOR SCIENCE AND TECHNOLOGY - Vol. 34, No. 3, pp. 174-179
Abbreviation: J. Sens. Sci. Technol.
ISSN: 1225-5475 (Print) 2093-7563 (Online)
Print publication date 31 May 2025
Received 27 Mar 2025 Revised 07 Apr 2025 Accepted 28 Apr 2025
DOI: https://doi.org/10.46670/JSST.2025.34.3.174

Vision Transformer-Based Pneumonia Detection for Chest X-Ray Images
Jomin Johnson1; E. Bijolin Edwin1; Ebenezer V2; Stewart Kirubakaran S1; M. Roshni Thanka2
1Division of Computer Science and Engineering, Karunya Institute of Technology and Sciences, Coimbatore 641114, Tamil Nadu, India
2Division of Data Science and Cyber Security, Karunya Institute of Technology and Sciences, Coimbatore 641114, Tamil Nadu, India

Correspondence to: bijolin@karunya.edu.in


This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Pneumonia remains a leading global cause of mortality, necessitating timely and accurate diagnosis to prevent severe outcomes. While conventional convolutional neural networks (CNNs) are effective in analyzing chest X-rays (CXRs), they often fail to capture the global contextual information critical for detecting subtle pneumonia features. This study proposes a Vision Transformer (ViT)-based framework optimized for pneumonia diagnosis. Leveraging self-attention mechanisms, the model captures long-range spatial dependencies in CXRs. To address class imbalance, the architecture incorporates oversampling and weighted loss functions. Training was conducted on a publicly available Kaggle dataset comprising 5,863 images (4,273 pneumonia and 1,583 normal). Key enhancements include learnable positional encodings, patch-based input (16 × 16 patches), and linear learning rate scheduling (initial rate: 5e-5; warmup ratio: 0.1) to ensure stable training. The ViT model outperformed CNN baselines, achieving 97.61% accuracy, 95% sensitivity, 98% specificity, and an AUC-ROC of 0.96, surpassing visual geometry group 16-layer network (92.14%) and residual network with 50 layers (82.37%). Attention maps provided interpretable visualizations, highlighting pathological regions (e.g., lung opacities) consistent with clinical findings. Although computationally intensive, the framework demonstrated scalability and stability across 15 training epochs using gradient accumulation (steps = 4) to emulate larger batch sizes. These results underscore the ViT's potential as a robust tool for early pneumonia detection, particularly in resource-constrained settings.


Keywords: Vision transformer (ViT), Pneumonia detection, Chest X-ray (CXR), Self-attention mechanisms, Medical imaging, Deep learning

1. INTRODUCTION

Pneumonia remains a primary cause of global morbidity and mortality, accounting for approximately 1.4 million annual deaths, particularly among young children and the elderly [1]. While early and accurate diagnosis is crucial for improved clinical outcomes, traditional methods relying on chest X-ray (CXR) interpretation are time-intensive and require specialized expertise. Although convolutional neural networks (CNNs) [2-6] have advanced automated pneumonia detection, their diagnostic accuracy is constrained by localized receptive fields, often overlooking subtle, globally distributed disease patterns such as lung opacities and consolidations.

Vision Transformers (ViTs) [7-9], initially developed for natural language processing, have emerged as a promising alternative for medical image analysis. By employing self-attention mechanisms, ViTs model long-range dependencies within images, enabling a comprehensive interpretation of CXRs. Despite their success in natural image tasks, ViTs remain underexplored for pneumonia identification, a domain where global context is vital for differentiating overlapping pathologies.

This study proposes a ViT-based framework optimized for pneumonia detection that addresses two significant obstacles:

Class imbalance: Weighted loss functions and deliberate oversampling are used to balance the dataset (5,863 CXRs: 4,273 pneumonia vs. 1,583 normal).

Interpretability: Attention maps highlight clinically significant regions (such as alveolar infiltrates), connecting the model's assessments with radiological evidence.

Trained on a publicly available Kaggle dataset [10], the proposed model outperforms CNNs such as the visual geometry group 16-layer network (VGG16, 92.14%) and the 50-layer residual network (ResNet50, 82.37%), achieving 97.61% accuracy, 95% sensitivity, and 98% specificity. The main improvements are:

  • Learnable positional encodings for patch-based processing (16 × 16 patches) that preserve spatial relationships.
  • Linear learning rate scheduling (5e-5 starting rate, 0.1 warmup ratio) for stable optimization.
  • Gradient accumulation (steps = 4) to replicate larger batch sizes on devices with constrained resources.

By demonstrating the ViT's superiority in identifying pneumonia, this study advances AI-driven diagnostic tools that can broaden access to healthcare.

The paper is organized as follows: Section 2 describes the dataset, the proposed ViT framework, and the training strategy. Section 3 presents the experimental results and discussion. Section 4 concludes the paper and outlines future directions.


2. EXPERIMENT

The proposed ViT-based framework detects pneumonia in CXR images, overcoming limitations of traditional CNNs by better capturing global patterns and spatial relationships. The following is a breakdown of the framework:

2.1 Dataset Preparation and Preprocessing

The study utilizes the publicly available Chest X-Ray Images (Pneumonia) dataset from Kaggle, comprising 5,863 anonymized posterior-anterior CXR images categorized into two classes: 4,273 pneumonia cases and 1,583 normal cases (Fig. 1). To ensure reproducibility, the dataset is partitioned into training (80%, n = 4,690), validation (10%, n = 586), and testing (10%, n = 587) subsets using a stratified split to preserve class distribution across splits and mitigate sampling bias. All images are uniformly resized to 224 × 224 pixels to align with the input dimensions of standard pre-trained models and normalized to the [0, 1] range using min-max scaling to stabilize gradient updates during training. To enhance model generalizability and reduce overfitting, real-time data augmentation techniques are applied to the training set, including random rotations (±15°), horizontal flipping, and zooming using the TensorFlow-Keras ImageDataGenerator API.


Fig. 1. 
Sample CXR (normal and pneumonia) images
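The exact augmentation pipeline is not included with the paper; the following is a minimal sketch of the preprocessing described above, assuming the TensorFlow-Keras ImageDataGenerator API named in the text and a hypothetical directory layout with NORMAL and PNEUMONIA subfolders.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation applied only to the training set; the exact zoom range is an
# assumption, as the paper does not report it.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,      # min-max normalization to [0, 1]
    rotation_range=15,      # random rotations of +/- 15 degrees
    horizontal_flip=True,   # random horizontal flipping
    zoom_range=0.1,         # random zooming (illustrative value)
)
eval_gen = ImageDataGenerator(rescale=1.0 / 255)  # no augmentation at evaluation

train_flow = train_gen.flow_from_directory(
    "data/train",           # hypothetical path
    target_size=(224, 224), batch_size=16, class_mode="binary",
)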

2.2 Vision Transformer Architecture

The ViT divides each CXR into 16 × 16-pixel patches, flattens them, and projects them into embeddings. Learnable positional encodings are added to retain spatial relationships. A transformer encoder processes these patches using multi-head self-attention (to identify global dependencies) and feed-forward neural networks. Finally, a classification head with a softmax layer labels the image as "normal" or "pneumonia" (Fig. 2).


Fig. 2. 
ViT Architecture for Chest X-ray Classification
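To make the patch-embedding front end concrete, here is a compact PyTorch sketch; the embedding width of 768 follows the standard ViT-Base configuration, which is an assumption since the paper does not state it.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split a 224 x 224 CXR into 16 x 16 patches and embed them (ViT front end)."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2         # (224/16)^2 = 196
        # A strided convolution flattens and linearly projects each patch.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable positional encodings retain spatial relationships.
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                                 # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)       # (B, 196, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)    # prepend [CLS] token
        return torch.cat([cls, x], dim=1) + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))    # shape (2, 197, 768)

The resulting token sequence is what the transformer encoder and classification head consume.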

2.3 Training Strategy

The model is trained for 30 epochs with a batch size of 16. Binary cross-entropy loss is used and weighted to prioritize the minority class. The Adam optimizer is applied with a learning rate of 1e-5 and gradual decay. Gradient accumulation stabilizes training by simulating larger batches. A full summary of hyperparameter settings is provided (Table 1).

Table 1. 
Hyperparameter settings used in the experiment

Hyperparameter                              Value
Batch size                                  16
Criterion                                   CrossEntropyLoss
Learning rate                               1e-05
Optimizer                                   Adam
Device                                      CUDA
Image resize                                224 × 224
Learning-rate multiplicative factor (γ)     0.995
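To tie the Table 1 settings together with the weighted loss and gradient accumulation described above, here is a minimal PyTorch-style training loop; `model` and `train_loader` are assumed to be defined elsewhere, and the inverse-frequency class weights are illustrative rather than values reported by the authors.

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# Illustrative inverse-frequency weights, assuming label 0 = normal (1,583
# images) and label 1 = pneumonia (4,273 images).
class_weights = torch.tensor([5863 / (2 * 1583), 5863 / (2 * 4273)], device=device)
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.995)

accum_steps = 4  # gradient accumulation emulates a 4x larger batch
for epoch in range(30):
    for step, (images, labels) in enumerate(train_loader):
        logits = model(images.to(device))
        loss = criterion(logits, labels.to(device)) / accum_steps
        loss.backward()                  # gradients accumulate across steps
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    scheduler.step()                     # multiply the learning rate by 0.995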

2.4 Comparison with Baseline Models

The ViT is benchmarked against CNNs such as VGG16, ResNet50, and the 121-layer densely connected convolutional network (DenseNet121). All baseline models were retrained under identical experimental conditions (epochs: 30, batch size: 16, optimizer: Adam with learning rate 1e-5, decay γ = 0.995, and weighted cross-entropy loss) for a fair comparison. The results (Fig. 3) show that the ViT outperforms these models in accuracy, sensitivity (recall), and specificity, demonstrating the ViT's strength in medical image analysis.


Fig. 3. 
Accuracy Comparison of Vision Transformer with Baseline Models for Pneumonia Detection

2.5 Workflow Overview

The Vision Transformer (ViT) architecture for medical image analysis follows a systematic workflow, as illustrated in Fig. 4. First, chest X-ray images are divided into patches and converted into vector embeddings. Positional encoding is applied to retain spatial relationships between patches, which are then processed by a transformer encoder using self-attention to capture global contextual dependencies. Finally, the output is fed into a classification head to distinguish between normal and pneumonia cases. This structured pipeline enables the ViT to learn discriminative features while maintaining spatial coherence, contributing to its superior diagnostic performance.


Fig. 4. 
ViT Workflow for Pneumonia Detection
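The abstract's hyperparameters (5e-5 initial rate, 0.1 warmup ratio, gradient accumulation steps = 4, 15 epochs) match the conventions of the Hugging Face transformers library [9]; the sketch below shows what such a pipeline could look like, assuming `train_ds` and `val_ds` are datasets yielding {"pixel_values", "labels"} dicts and that a ViT-Base checkpoint is used.

from transformers import Trainer, TrainingArguments, ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",   # checkpoint choice is an assumption
    num_labels=2,                          # normal vs. pneumonia
)
args = TrainingArguments(
    output_dir="vit-pneumonia",
    learning_rate=5e-5,                    # initial rate from the abstract
    lr_scheduler_type="linear",            # linear learning rate scheduling
    warmup_ratio=0.1,                      # warmup ratio from the abstract
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,         # emulates an effective batch of 64
    num_train_epochs=15,
)
Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds).train()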

2.6 Advantages of the Proposed Methodology
  • Global Context Modeling: ViTs capture long-range dependencies in CXRs, improving detection of subtle pneumonia patterns.
  • Scalability: The model handles high-resolution images without downsampling, preserving fine details.
  • Interpretability: Attention maps provide insights into model decisions, enhancing clinician trust.
  • Robustness: Data augmentation and class imbalance handling improve generalization to diverse CXR datasets.

3. RESULTS AND DISCUSSIONS
3.1 Model Performance Metrics

The ViT model displayed remarkable performance, achieving 97.61% accuracy, 95% sensitivity, and 98% specificity in diagnosing pneumonia, with an F1 score of 95.2% (balanced precision and recall). The model was trained for 30 epochs, and the accuracy and loss curves (Fig. 5 (a,b)) demonstrate stable learning without overfitting. To further validate robustness, 5-fold cross-validation on a held-out test subset (20% of the data) yielded a mean accuracy of 97.2% ± 0.3%. The evaluation process was fully automated, with the test set strictly reserved for blind evaluation. The area under the receiver operating characteristic curve (AUC-ROC) value of 0.96 (Fig. 6) confirms the model's strong ability to distinguish pneumonia from normal chest X-rays.


Fig. 5. 
(a) Loss curve demonstrating convergence (b) accuracy progression over 12 epochs


Fig. 6. 
Receiver operating characteristic curve
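For reference, the reported metrics can be computed from the test-set predictions with scikit-learn; a short sketch, assuming arrays `y_true` (0 = normal, 1 = pneumonia), `y_pred`, and `y_score` (predicted pneumonia probability) are available from the evaluation pipeline.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)

accuracy = accuracy_score(y_true, y_pred)            # 97.61% reported
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                         # recall on pneumonia cases
specificity = tn / (tn + fp)                         # recall on normal cases
f1 = f1_score(y_true, y_pred)                        # 95.2% reported
auc = roc_auc_score(y_true, y_score)                 # 0.96 reported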

3.2 Statistical Validation

Statistical tests confirmed the model's reliability. The 95% confidence interval places the model's accuracy between 96.2% and 98.9%. Its Matthews correlation coefficient of 0.9396 further indicates consistently reliable predictions across both classes. The model achieved a sensitivity of 98.1% and specificity of 95.7%, reflecting its robust ability to minimize both false negatives and false positives in clinical screening contexts. Additionally, precision-recall analysis yielded an F1-score of 0.974, demonstrating balanced performance between precision (97.3%) and recall (97.9%) despite the dataset's inherent class imbalance.
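A normal-approximation interval for accuracy recovers values close to the reported range; the sketch below also shows the Matthews correlation coefficient call, with `y_true` and `y_pred` as above.

import math

from sklearn.metrics import matthews_corrcoef

p, n = 0.9761, 586                        # observed accuracy, test-set size
half_width = 1.96 * math.sqrt(p * (1 - p) / n)
ci = (p - half_width, p + half_width)     # approx. (0.964, 0.989)
mcc = matthews_corrcoef(y_true, y_pred)   # 0.9396 reported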

3.3 Model Interpretability and Clinical Relevance

The attention maps highlight the regions that physicians inspect, allowing the model to identify indicators of pneumonia such as lung opacities. According to the confusion matrix (Fig. 7), the model produced only 6 false positives (normal classified as pneumonia) and 8 false negatives (pneumonia missed) out of 586 test images, i.e., 572/586 correct predictions, for 97.61% accuracy.


Fig. 7. 
Confusion Matrix of ViT for Pneumonia vs. Normal Classification on Chest X-ray Test Set
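One common way to obtain such attention maps, sketched here under the assumption of a transformers-style ViT model, is to average the final layer's attention heads and read off the [CLS] token's attention to the 196 image patches; `pixel_values` is assumed to be a preprocessed (1, 3, 224, 224) CXR tensor.

import torch

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, output_attentions=True)
# outputs.attentions: one (batch, heads, tokens, tokens) tensor per layer.
last_layer = outputs.attentions[-1].mean(dim=1)   # average heads: (1, 197, 197)
cls_to_patches = last_layer[0, 0, 1:]             # [CLS] attention to patches
attention_map = cls_to_patches.reshape(14, 14)    # upsample to 224 x 224 to overlay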

3.4 Comparison with Baseline Models

The ViT outperformed popular CNNs such as VGG16 (92.14% accuracy) and ResNet50 (82.37%), and exceeded the best-performing CNN (Quaternion Residual Network, 93.75%) by 3.86 percentage points (Table 2). The ViT's strength lies in analyzing the entire image for global patterns, unlike CNNs that focus on local details.

Table 2. 
Performance evaluation relative to other architectures

Architecture                   Reference    Accuracy (%)    F-score
VGG16                          [11]         92.14           0.9234
ResNet101V2                    [12]         92.62           0.9250
ResNet152V2                    [13]         92.94           0.9312
InceptionV3                    [14]         89.42           0.8937
InceptionResNetV2              [15]         90.70           0.8989
DenseNet169                    [16]         88.78           0.8874
DenseNet201                    [17]         91.71           0.9171
Quaternion Residual Network    [18]         93.75           0.9405
Vision Transformer             This work    97.61           0.9500

3.5 Training Strategy and Optimization

To address the class imbalance inherent in the dataset (where pneumonia cases significantly outnumbered normal cases), a weighted cross-entropy loss function was implemented. This approach assigns higher penalty weights to misclassifications of the minority class (normal cases) during training, effectively prioritizing their learning and mitigating bias toward the overrepresented pneumonia class. The weighting scheme was computed inversely proportional to class frequencies, ensuring the model did not overlook subtle patterns in underrepresented samples.
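For illustration (the following values are derived from the stated class counts, not reported by the authors): with 1,583 normal and 4,273 pneumonia images, an inverse-frequency scheme of the form w_c = N/(2 × N_c) gives w_normal = 5,863/(2 × 1,583) ≈ 1.85 and w_pneumonia = 5,863/(2 × 4,273) ≈ 0.69, so a misclassified normal case is penalized roughly 2.7 times as heavily as a misclassified pneumonia case.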

Adam optimizer was employed with an initial learning rate of 1 × 10−5 (or 1e-5) and an exponential decay schedule (γ = 0.995 per epoch). This configuration stabilized training dynamics by gradually reducing step sizes, avoiding overshooting local minima while maintaining responsiveness to critical gradients.

These methodological choices—coupled with data augmentation and gradient clipping—enhanced the model's sensitivity to latent features associated with early-stage pneumonia. As shown in Fig. 8, the refined architecture achieved superior performance in detecting mild pneumonia manifestations, particularly in cases with ambiguous radiographic signatures.


Fig. 8. 
(a) Precision curve, (b) Recall curve across classification thresholds on the held-out test set.

3.6 Limitations and Future Directions

While ViTs demonstrate superior performance in capturing global contextual relationships within medical images, their computational demands significantly exceed those of CNNs. This disparity stems from the ViT's reliance on self-attention mechanisms, which scale quadratically with the number of input tokens (O(n²) complexity), whereas CNNs' spatially invariant convolutional operations scale linearly with input size (O(n) complexity). Consequently, ViTs require specialized hardware (e.g., high-memory graphics processing units) and optimized frameworks (e.g., mixed-precision training) for practical deployment, a challenge in resource-constrained clinical environments.
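For intuition (a back-of-envelope estimate, not a figure from the paper): a 224 × 224 input with 16 × 16 patches yields n = (224/16)² = 196 tokens, so each self-attention layer materializes a 196 × 196 attention matrix per head. Doubling the input side to 448 pixels quadruples the token count to 784 and grows that matrix roughly sixteenfold, whereas convolutional cost grows only in proportion to the fourfold increase in pixel count.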

To ensure robustness and generalizability, rigorous validation on smaller, multi-center datasets (e.g., pediatric or geriatric cohorts) and cross-domain imaging modalities (e.g., X-rays from diverse manufacturers) is critical. Future work will incorporate full k-fold cross-validation to further mitigate sampling bias. Such evaluations would quantify the model's resilience to covariate shifts and demographic biases, addressing overfitting to narrowly curated training data.


4. CONCLUSIONS

This study presented a ViT-based framework for diagnosing pneumonia in CXRs that addresses the limitations of traditional CNNs by employing self-attention mechanisms to capture global context and spatial correlations. The proposed model outperformed established CNN architectures, including VGG16, ResNet50, and DenseNet121, achieving an accuracy of 97.61%, sensitivity of 95%, and specificity of 98%.

To mitigate class imbalance (4,273 pneumonia vs. 1,583 normal cases), strategic oversampling and a weighted loss function were applied to support balanced learning. The ViT's patch-based processing and multi-head self-attention enabled accurate detection of subtle pneumonia features, such as lung opacities, while optimized hyperparameters (learning rate: 5e-5, batch size: 16) ensured stable training, reaching 97.42% validation accuracy. Attention maps further enhanced interpretability by highlighting clinically relevant regions, such as consolidated lung areas, aligning the model's decision-making with radiological knowledge. Despite these advancements, challenges remain, including higher computational demands compared to CNNs and reliance on a single public dataset.


CRediT Authorship Contribution Statement

Jomin Johnson: Conceptualization, Data curation, Formal analysis, Writing-original draft, Writing-review & editing. E.Bijolin Edwin: Conceptualization, Formal analysis, Writing-review & editing. M.Roshni Thanka: Conceptualization, Writing-review & editing. Ebenezer: Data curation, Writing-review & editing. Stewart Kirubakaran: Formal analysis, Writing-review & editing.


Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Acknowledgments

The authors sincerely thank the SUSE Centre of Excellence for Cloud Computing at Karunya Institute of Technology and Sciences for providing the necessary resources and support for this research.


References
1. World Health Organization, Pneumonia in children. https://www.who.int/news-room/fact-sheets/detail/pneumonia, 2019 (accessed 15 March 2023).
2. S. H. Khan, J. Iqbal, S. A. Hassnain, M. Owais, S. M. Mostafa, M. Hadjouni, et al., COVID-19 detection and analysis from lung CT images using novel channel boosted CNNs, Expert Syst. Appl. 229 (2022) 120477.
3. S. H. Khan, A. Sohail, A. Khan, M. Hassan, Y. S. Lee, J. Alam, et al., COVID-19 detection in chest X-ray images using deep boosted hybrid learning, Comput. Biol. Med. 137 (2021) 104816.
4. G. Liang, L. Zheng, A transfer learning method with deep residual network for pediatric pneumonia diagnosis, Comput. Methods Programs Biomed. 187 (2020) 104964.
5. M. Nishio, S. Noguchi, H. Matsuo, T. Murakami, Automatic classification between COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy on chest X-ray image: Combination of data augmentation methods, Sci. Rep. 10 (2020) 17532.
6. S. Singh, B.K. Tripathi, S.S. Rawat, Deep quaternion convolutional neural networks for breast cancer classification, Multimedia Tools Appl. 82 (2023) 31285–31308.
7. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, Adv. Neural Inf. Process. Syst. 30 (2017) 5999–6009.
8. F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, et al., Residual attention network for image classification, Proceedings of IEEE Conf. Comput. Vis. Pattern Recognit., Honolulu, Hawaii, 2017, pp. 6450–6458.
9. T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, et al., Transformers: State-of-the-Art Natural Language Processing, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 2020, pp. 38–45.
10. L.T. Duong, P.T. Nguyen, L. Iovino, M. Flammini, Deep learning for automated recognition of COVID-19 from chest X-ray images, medRxiv (2020).
11. M. Hassan, VGG16—Convolutional network for classification and detection. https://neurohive.io/en/popular-networks/vgg16, 2018 (accessed 15 March 2023).
12. H.C. Lee, A.F. Aqil, Combination of transfer learning methods for kidney glomeruli image classification, Appl. Sci. 12 (2022) 1040.
13. S. Albahli, H.T. Rauf, A. Algosaibi, V.E. Balas, AI-driven deep CNN approach for multilabel pathology classification using chest X-rays, PeerJ Comput. Sci. 7 (2021) e495.
14. G.J. Chowdary, N.S. Punn, S.K. Sonbhadra, S. Agarwal, Face mask detection using transfer learning of inceptionV3, Proceedings of the 8th International Conference on Big Data Analytics (BDA 2020), Sonepat, India, 2020, pp. 81–90.
15. M.R.H. Mondal, S. Bharati, P. Podder, CO-IRv2: Optimized InceptionResNetV2 for COVID-19 detection from chest CT images, PLoS ONE 16 (2021) e0259179.
16. D. Ezzat, A.E. Hassanien, H.A. Ella, An optimized deep learning architecture for the diagnosis of COVID-19 disease based on gravitational search optimization, Appl. Soft Comput. 98 (2021) 106742.
17. F.D. Adhinata, D.P. Rakhmadani, M. Wibowo, A. Jayadi, A deep learning using DenseNet201 to detect masked or non-masked face, JUITA 9 (2021) 115–121.
18. S. Singh, B.K. Tripathi, Pneumonia classification using quaternion deep learning, Multimedia Tools Appl. 81 (2022) 1743–1764.
19. Q. Zhang, A novel ResNet101 model based on dense dilated convolution for image classification, SN Appl. Sci. 4 (2022) 1–13.
20. M. Rahimzadeh, A. Attar, A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of Xception and ResNet50V2, Inf. Med. Unlocked 19 (2020) 100360.
21. U.N. Oktaviana, Y. Azhar, Garbage classification using ensemble DenseNet169, J. RESTI 5 (2021) 1207–1215.
22. G. Yang, Y. He, Y. Yang, B. Xu, Fine-grained image classification for crop disease based on attention mechanism, Front. Plant Sci. 11 (2020) 600854.