[ Article ]
JOURNAL OF SENSOR SCIENCE AND TECHNOLOGY - Vol. 34, No. 3, pp. 174-179
Abbreviation: J. Sens. Sci. Technol.
ISSN: 1225-5475 (Print) 2093-7563 (Online)
Print publication date 31 May 2025
Received 27 Mar 2025 Revised 07 Apr 2025 Accepted 28 Apr 2025
DOI: https://doi.org/10.46670/JSST.2025.34.3.174
Vision Transformer-Based Pneumonia Detection for Chest X-Ray Images
Jomin Johnson1
1Division of Computer Science and Engineering, Karunya Institute of Technology and Sciences, Coimbatore 641114, Tamil Nadu, India
2Division of Data Science and Cyber Security, Karunya Institute of Technology and Sciences, Coimbatore 641114, Tamil Nadu, India
Correspondence to: +bijolin@karunya.edu.in
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Pneumonia remains a leading global cause of mortality, necessitating timely and accurate diagnosis to prevent severe outcomes. While conventional convolutional neural networks (CNNs) are effective in analyzing chest X-rays (CXRs), they often fail to capture the global contextual information critical for detecting subtle pneumonia features. This study proposes a Vision Transformer (ViT)-based framework optimized for pneumonia diagnosis. Leveraging self-attention mechanisms, the model captures long-range spatial dependencies in CXRs. To address class imbalance, the architecture incorporates oversampling and weighted loss functions. Training was conducted on a publicly available Kaggle dataset comprising 5,863 images (4,273 pneumonia and 1,583 normal). Key enhancements include learnable positional encodings, patch-based input (16 × 16 patches), and linear learning rate scheduling (initial rate: 5e-5; warmup ratio: 0.1) to ensure stable training. The ViT model outperformed CNN baselines, achieving 97.61% accuracy, 95% sensitivity, 98% specificity, and an AUC-ROC of 0.96, surpassing visual geometry group 16-layer network (92.14%) and residual network with 50 layers (82.37%). Attention maps provided interpretable visualizations, highlighting pathological regions (e.g., lung opacities) consistent with clinical findings. Although computationally intensive, the framework demonstrated scalability and stability across 15 training epochs using gradient accumulation (steps = 4) to emulate larger batch sizes. These results underscore the ViT's potential as a robust tool for early pneumonia detection, particularly in resource-constrained settings.
Keywords: Vision transformer (ViT), Pneumonia detection, Chest X-ray (CXR), Self-attention mechanisms, Medical imaging, Deep learning
Pneumonia remains a primary cause of global morbidity and mortality, accounting for approximately 1.4 million annual deaths, particularly among young children and the elderly [1]. While early and accurate diagnosis is crucial for improved clinical outcomes, traditional methods relying on chest X-ray (CXR) interpretation are time-intensive and require specialized expertise. Although convolutional neural networks (CNNs) [2-6] have advanced automated pneumonia detection, their diagnostic accuracy is constrained by localized receptive fields, often overlooking subtle, globally distributed disease patterns such as lung opacities and consolidations.
Vision Transformers (ViTs) [7-9], which adapt the transformer architecture originally developed for natural language processing, have emerged as a promising alternative for medical image analysis. By employing self-attention mechanisms, ViTs model long-range dependencies within images, enabling a comprehensive interpretation of CXRs. Despite their success in natural image tasks, ViTs remain underexplored for pneumonia identification, a domain where global context is vital for differentiating overlapping pathologies.
This study proposes a ViT-based framework optimized for pneumonia detection that addresses two significant obstacles:
Class imbalance: Weighted loss functions and deliberate oversampling are used to balance the dataset (5,863 CXRs: 4,273 pneumonia vs. 1,583 normal).
Interpretability: By emphasizing clinically significant regions (such as alveolar infiltrates), attention maps connect the model's assessments with radiological findings.
The proposed model outperforms CNNs such as visual geometry group 16-layer network (VGG16) (92.14%) and residual network with 50 layers (ResNet50) (82.37%) with 97.61% accuracy, 95% sensitivity, and 98% specificity after being trained on a publicly available Kaggle dataset [10]. The main improvements are
This study enhances AI-driven diagnostic tools for universal access to healthcare by demonstrating ViT's superiority in identifying pneumonia.
The paper is organized as follows: Section 2 reviews ViT architectures and medical imaging applications. Section 3 describes the dataset and approach. Section 4 presents the experimental results. Section 5 discusses the implications and future directions.
The proposed ViT-based framework detects pneumonia in CXR images, overcoming limitations of traditional CNNs by better capturing global patterns and spatial relationships. The following is a breakdown of the framework:
The study utilizes the publicly available Chest X-Ray Images (Pneumonia) dataset from Kaggle, comprising 5,863 anonymized posterior-anterior CXR images categorized into two classes: 4,273 pneumonia cases and 1,583 normal cases (Fig. 1). To ensure reproducibility, the dataset is partitioned into training (80%, n = 4,690), validation (10%, n = 586), and testing (10%, n = 587) subsets using a stratified split, which preserves the class distribution across subsets and mitigates sampling bias. All images are uniformly resized to 224 × 224 pixels to align with the input dimensions of standard pre-trained CNNs and normalized to the [0, 1] range using min-max scaling to stabilize gradient updates during training. To enhance model generalizability and reduce overfitting, real-time data augmentation is applied to the training set, including random rotations (±15°), horizontal flipping, and zooming, using the TensorFlow-Keras ImageDataGenerator API.
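The stratified 80/10/10 split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function name `stratified_split`, the seed, and the rounding rule are assumptions, and the labels are synthetic stand-ins built from the per-class counts quoted in the text.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut the class into three pieces.
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Synthetic labels standing in for image files (assumption: per-class counts from the text).
labels = ["pneumonia"] * 4273 + ["normal"] * 1583
train_idx, val_idx, test_idx = stratified_split(labels)
```

Because each class is split separately, the pneumonia-to-normal ratio in every subset stays close to the ratio in the full dataset.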
The ViT divides each CXR into 16 × 16-pixel patches, flattens them, and projects them into embeddings. Learnable positional encodings are added to retain spatial relationships. A transformer encoder processes these patches using multi-head self-attention (to identify global dependencies) and feed-forward neural networks. Finally, a classification head with a softmax layer labels the image as "normal" or "pneumonia" (Fig. 2).
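The patch-splitting step can be illustrated with a short sketch. It assumes a single-channel 224 × 224 image represented as nested Python lists; the helper `extract_patches` is hypothetical, and the linear projection and positional encodings that follow in a real ViT are omitted.

```python
def extract_patches(image, patch_size=16):
    """Split a 2-D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    # Walk the image in raster order, one patch_size x patch_size tile at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Synthetic single-channel 224 x 224 image standing in for a resized CXR.
image = [[(r * 224 + c) % 256 / 255.0 for c in range(224)] for r in range(224)]
patches = extract_patches(image)
num_patches, patch_dim = len(patches), len(patches[0])
```

With these sizes the image yields (224 / 16)² = 196 patches, each flattened to 16 × 16 = 256 values per channel.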
The model is trained for 30 epochs with a batch size of 16. Binary cross-entropy loss is used and weighted to prioritize the minority class. The Adam optimizer is applied with a learning rate of 1e-5 and gradual decay. Gradient accumulation stabilizes training by simulating larger batches. Table 1 summarizes the hyperparameter settings.
Hyperparameter | Value |
---|---|
Batch size | 16 |
Criterion | CrossEntropyLoss |
Learning rate | 1e-05 |
Optimizer | Adam |
Device | CUDA |
Image size | 224 × 224 |
Learning rate decay factor (γ) | 0.995 |
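The gradient accumulation mentioned in the training setup can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' training loop: it uses a one-parameter linear model with mean squared error, a micro-batch size of 16 as in Table 1, and 4 accumulation steps as stated in the abstract. The function names are hypothetical.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the 1-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_gradient(w, data, micro_batch_size, accumulation_steps):
    """Average gradients over several micro-batches before one optimizer step."""
    total = 0.0
    for step in range(accumulation_steps):
        start = step * micro_batch_size
        micro_batch = data[start:start + micro_batch_size]
        # Scale each micro-batch gradient so the sum is a mean over all samples.
        total += mse_gradient(w, micro_batch) / accumulation_steps
    return total

# Toy data: 64 samples from y = 3x, i.e. 4 micro-batches of 16.
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(64)]
w = 0.5
grad_accumulated = accumulated_gradient(w, data, micro_batch_size=16, accumulation_steps=4)
grad_full_batch = mse_gradient(w, data)
```

The accumulated gradient matches the gradient of a single batch of 64 samples up to floating-point error, which is why accumulation emulates a larger batch without the corresponding memory cost.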
ViT is benchmarked against CNNs such as VGG16, ResNet50, and the densely connected convolutional network with 121 layers (DenseNet121). All baseline models were retrained under identical experimental conditions (epochs: 30, batch size: 16, optimizer: Adam with learning rate 1e-5, decay γ = 0.995, and weighted cross-entropy loss) for fair comparison. The results (Fig. 3) show that ViT outperforms these models in accuracy, sensitivity (recall), and specificity, indicating the ViT's strength in medical image analysis.
The ViT architecture for medical image analysis follows a systematic workflow (Fig. 4). First, chest X-ray images are divided into patches and converted into vector embeddings. Positional encoding is applied to retain spatial relationships between patches, which are then processed by a transformer encoder using self-attention to capture global contextual dependencies. Finally, the output is fed into a classification head to distinguish between Normal and Pneumonia cases. This structured pipeline enables ViT to learn discriminative features while maintaining spatial coherence, contributing to its superior diagnostic performance.
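The self-attention step in this workflow can be sketched in a few lines. The example below is a simplified, single-head scaled dot-product attention with identity query, key, and value projections (an assumption made for brevity; a trained ViT learns these projections and uses several heads). The four two-dimensional tokens are arbitrary illustrative values.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    weights = []
    for query in tokens:
        # Every token is scored against every other token, hence the global context.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` is a probability distribution over all tokens, so every output embedding is a mixture of information from the whole image rather than from a local neighborhood.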
The ViT model achieved 97.61% accuracy, 95% sensitivity, and 98% specificity in diagnosing pneumonia, with an F1 score of 95.2% (balancing precision and recall). The model was trained for 30 epochs, and the accuracy and loss curves (Fig. 5 (a,b)) demonstrate stable learning without overfitting. To further validate robustness, 5-fold cross-validation on a held-out test subset (20% of the data) yielded a mean accuracy of 97.2% ± 0.3%. The evaluation process was fully automated, with the test set strictly reserved for blind evaluation. The area under the receiver operating characteristic curve (AUC-ROC) value of 0.96 (Fig. 6) indicates the model's strong ability to distinguish pneumonia from normal chest X-rays.
Statistical tests supported the model's reliability. The 95% confidence interval for accuracy was 96.2% to 98.9%. The Matthews correlation coefficient of 0.9396 indicates strong agreement between predictions and ground-truth labels across both classes. The model achieved a sensitivity of 98.1% and specificity of 95.7%, reflecting its ability to limit both false negatives and false positives in clinical screening contexts. Additionally, precision–recall analysis yielded an F1-score of 0.974, balancing precision (97.3%) and recall (97.9%) despite the dataset's inherent class imbalance.
Attention maps show that the model focuses on the regions physicians inspect when identifying indicators of pneumonia, such as lung opacities. According to the confusion matrix (Fig. 7), the model produced only 6 false positives (normal images classified as pneumonia) and 8 false negatives (pneumonia images classified as normal) out of 586 test images, resulting in 97.61% accuracy.
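The relationship between confusion-matrix counts and the reported metrics can be made concrete with a short sketch. The false-positive and false-negative counts (6 and 8 of 586 images) come from the text. The split of the remaining 572 correct predictions into true positives and true negatives is not stated, so the values below are an assumption (428 pneumonia and 158 normal test images, roughly what a stratified split would give). The helper `binary_metrics` is hypothetical.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# FP = 6 and FN = 8 out of 586 images come from the text.
# The TP/TN split is an ASSUMPTION (428 pneumonia, 158 normal test images).
metrics = binary_metrics(tp=420, tn=152, fp=6, fn=8)
```

Accuracy depends only on the total number of errors, (586 − 14) / 586 ≈ 97.61%, whereas sensitivity, specificity, and the Matthews correlation coefficient depend on how the test images are distributed between the two classes.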
The ViT outperformed popular CNNs such as VGG16 (92.14% accuracy) and ResNet50 (82.37%). It also exceeded the best-performing CNN (Quaternion Residual Network, 93.75%) by 3.86 percentage points in accuracy (Table 2). ViT's strength lies in analyzing the entire image for global patterns, whereas CNNs focus on local details.
Architecture | Reference | Accuracy (%) | F-score |
---|---|---|---|
VGG16 | [11] | 92.14 | 0.9234 |
ResNet101V2 | [12] | 92.62 | 0.9250 |
ResNet152V2 | [13] | 92.94 | 0.9312 |
InceptionV3 | [14] | 89.42 | 0.8937 |
InceptionResNetV2 | [15] | 90.70 | 0.8989 |
DenseNet169 | [16] | 88.78 | 0.8874 |
DenseNet201 | [17] | 91.71 | 0.9171 |
Quaternion Residual Network | [18] | 93.75 | 0.9405 |
Vision Transformer | This Work | 97.61 | 0.9500 |
To address the class imbalance inherent in the dataset (where pneumonia cases significantly outnumbered normal cases), a weighted cross-entropy loss function was implemented. This approach assigns higher penalty weights to misclassifications of the minority class (normal cases) during training, effectively prioritizing their learning and mitigating bias toward the overrepresented pneumonia class. The class weights were set inversely proportional to class frequencies, ensuring the model did not overlook subtle patterns in underrepresented samples.
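A minimal sketch of inverse-frequency class weighting follows. The text states only that weights are inversely proportional to class frequencies; the specific normalization used here (total / (number of classes × class count)) is an assumption, as are the function names.

```python
import math

def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * count); normalization is an assumption."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {name: total / (num_classes * count) for name, count in class_counts.items()}

def weighted_cross_entropy(prob_true_class, true_class, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    return -weights[true_class] * math.log(prob_true_class)

# Class counts quoted in the text.
class_counts = {"pneumonia": 4273, "normal": 1583}
weights = inverse_frequency_weights(class_counts)

# Same predicted probability for the true class, different penalties per class.
loss_normal = weighted_cross_entropy(0.6, "normal", weights)
loss_pneumonia = weighted_cross_entropy(0.6, "pneumonia", weights)
```

Under this scheme an equally confident mistake on a normal image costs more than one on a pneumonia image, and the weighted contribution of each class to the total loss is equalized.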
The Adam optimizer was employed with an initial learning rate of 1 × 10⁻⁵ (1e-5) and an exponential decay schedule (γ = 0.995 per epoch). This configuration stabilized training dynamics by gradually reducing step sizes, avoiding overshooting local minima while maintaining responsiveness to critical gradients.
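The exponential decay schedule can be written out directly. The sketch below assumes the learning rate is multiplied by γ once per epoch, starting from the initial value at epoch 0; the function name is hypothetical.

```python
def exponential_lr(initial_lr, gamma, epoch):
    """Learning rate after `epoch` multiplicative decay steps."""
    return initial_lr * gamma ** epoch

# Values from the text: initial rate 1e-5, gamma 0.995, 30 epochs.
schedule = [exponential_lr(1e-5, 0.995, epoch) for epoch in range(30)]
first_lr, last_lr = schedule[0], schedule[-1]
```

With γ = 0.995 the decay is gentle: over 30 epochs the learning rate falls by roughly 14% relative to its initial value.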
These methodological choices, coupled with data augmentation and gradient clipping, enhanced the model's sensitivity to latent features associated with early-stage pneumonia. As shown in Fig. 8, the refined architecture achieved superior performance in detecting mild pneumonia manifestations, particularly in cases with ambiguous radiographic signatures.
While ViTs demonstrate superior performance in capturing global contextual relationships within medical images, their computational demands significantly exceed those of CNNs. This disparity stems from ViTs' reliance on self-attention mechanisms, whose cost scales quadratically with the number of input patches n (O(n²) complexity), compared with CNNs' spatially invariant convolutional operations, whose cost scales linearly (O(n) complexity). Consequently, ViTs require specialized hardware (e.g., high-memory graphics processing units) and optimized frameworks (e.g., mixed-precision training) for practical deployment, which is a challenge in resource-constrained clinical environments.
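The quadratic growth can be checked with simple arithmetic. The sketch below counts the entries of one attention matrix per head for a given image size, assuming 16 × 16 patches and ignoring any extra classification token; the 448 × 448 resolution is a hypothetical comparison point, not a setting used in the study.

```python
def num_tokens(image_size, patch_size=16):
    """Number of non-overlapping patches in a square image."""
    return (image_size // patch_size) ** 2

def attention_matrix_entries(image_size, patch_size=16):
    """Entries in one n x n attention matrix, where n is the token count."""
    n = num_tokens(image_size, patch_size)
    return n * n

entries_224 = attention_matrix_entries(224)  # resolution used in the study
entries_448 = attention_matrix_entries(448)  # hypothetical higher resolution
ratio = entries_448 / entries_224
```

Doubling the side length quadruples the number of patches (196 to 784) and therefore increases the attention matrix sixteenfold, which illustrates why higher-resolution inputs quickly become expensive for ViTs.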
To ensure robustness and generalizability, rigorous validation on smaller, multi-center datasets (e.g., pediatric or geriatric cohorts) and cross-domain imaging modalities (e.g., X-rays from diverse manufacturers) is critical. Future work will incorporate full k-fold cross-validation to further mitigate sampling bias. Such evaluations would quantify the model's resilience to covariate shifts and demographic biases, addressing overfitting to narrowly curated training data.
This study addresses the limitations of traditional CNNs by employing self-attention mechanisms to capture global context and spatial correlations, resulting in a ViT-based framework for diagnosing pneumonia in CXRs. The proposed model outperformed established CNN architectures, including VGG16, ResNet50, and DenseNet121, achieving an accuracy of 97.61%, sensitivity of 95%, and specificity of 98%.
To mitigate class imbalance (4,273 pneumonia vs. 1,583 normal cases), strategic oversampling and a weighted loss function were applied to support balanced learning. The ViT's patch-based processing and multi-head self-attention enabled accurate detection of subtle pneumonia features, such as lung opacities, while optimized hyperparameters (learning rate: 5e-5, batch size: 16) ensured stable training, reaching 97.42% validation accuracy. Attention maps further enhanced interpretability by highlighting clinically relevant regions, such as consolidated lung areas, aligning the model's decision-making with radiological knowledge. Despite these advancements, challenges remain, including higher computational demands compared to CNNs and reliance on a single public dataset.
Jomin Johnson: Conceptualization, Data curation, Formal analysis, Writing-original draft, Writing-review & editing. E. Bijolin Edwin: Conceptualization, Formal analysis, Writing-review & editing. M. Roshni Thanka: Conceptualization, Writing-review & editing. Ebenezer: Data curation, Writing-review & editing. Stewart Kirubakaran: Formal analysis, Writing-review & editing.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
The authors sincerely thank the SUSE Centre of Excellence for Cloud Computing at Karunya Institute of Technology and Sciences for providing the necessary resources and support for this research.
1. World Health Organization, Pneumonia in children. https://www.who.int/news-room/fact-sheets/detail/pneumonia, 2019 (accessed 15 March 2023).
2. S. H. Khan, J. Iqbal, S. A. Hassnain, M. Owais, S. M. Mostafa, M. Hadjouni, et al., COVID-19 detection and analysis from lung CT images using novel channel boosted CNNs, Expert Syst. Appl. 229 (2022) 120477.
3. S. H. Khan, A. Sohail, A. Khan, M. Hassan, Y. S. Lee, J. Alam, et al., COVID-19 detection in chest X-ray images using deep boosted hybrid learning, Comput. Biol. Med. 137 (2021) 104816.
4. G. Liang, L. Zheng, A transfer learning method with deep residual network for pediatric pneumonia diagnosis, Computer Methods Programs Biomed. 187 (2020) 104964.
5. M. Nishio, S. Noguchi, H. Matsuo, T. Murakami, Automatic classification between COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy on chest X-ray image: Combination of data augmentation methods, Sci. Rep. 10 (2020) 17532.
6. S. Singh, B.K. Tripathi, S.S. Rawat, Deep quaternion convolutional neural networks for breast cancer classification, Multimedia Tools Appl. 82 (2023) 31285–31308.
7. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, Adv. Neural Inf. Process. Syst. 30 (2017) 5999–6009.
8. F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, et al., Residual attention network for image classification, Proceedings of IEEE Conf. Comput. Vis. Pattern Recognit., Honolulu, Hawaii, 2017, pp. 6450–6458.
9. T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, et al., Transformers: State-of-the-Art Natural Language Processing, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 2020, pp. 38–45.
10. L.T. Duong, P.T. Nguyen, L. Iovino, M. Flammini, Deep learning for automated recognition of COVID-19 from chest X-ray images, medRxiv (2020).
11. M. Hassan, VGG16—Convolutional network for classification and detection. https://neurohive.io/en/popular-networks/vgg16, 2018 (accessed 15 March 2023).
12. H.C. Lee, A.F. Aqil, Combination of transfer learning methods for kidney glomeruli image classification, Appl. Sci. 12 (2022) 1040.
13. S. Albahli, H.T. Rauf, A. Algosaibi, V.E. Balas, AI-driven deep CNN approach for multilabel pathology classification using chest X-rays, PeerJ Comput. Sci. 7 (2021) e495.
14. G.J. Chowdary, N.S. Punn, S.K. Sonbhadra, S. Agarwal, Face mask detection using transfer learning of inceptionV3, Proceedings of the 8th International Conference on Big Data Analytics (BDA 2020), Sonepat, India, 2020, pp. 81–90.
15. M.R.H. Mondal, S. Bharati, P. Podder, CO-IRv2: Optimized InceptionResNetV2 for COVID-19 detection from chest CT images, PLoS ONE 16 (2021) e0259179.
16. D. Ezzat, A.E. Hassanien, H.A. Ella, An optimized deep learning architecture for the diagnosis of COVID-19 disease based on gravitational search optimization, Appl. Soft Comput. 98 (2021) 106742.
17. F.D. Adhinata, D.P. Rakhmadani, M. Wibowo, A. Jayadi, A deep learning using DenseNet201 to detect masked or non-masked face, JUITA 9 (2021) 115–121.
18. S. Singh, B.K. Tripathi, Pneumonia classification using quaternion deep learning, Multimedia Tools Appl. 81 (2022) 1743–1764.
19. Q. Zhang, A novel ResNet101 model based on dense dilated convolution for image classification, SN Appl. Sci. 4 (2022) 1–13.
20. M. Rahimzadeh, A. Attar, A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of Xception and ResNet50V2, Inf. Med. Unlocked 19 (2020) 100360.
21. U.N. Oktaviana, Y. Azhar, Garbage classification using ensemble DenseNet169, J. RESTI 5 (2021) 1207–1215.
22. G. Yang, Y. He, Y. Yang, B. Xu, Fine-grained image classification for crop disease based on attention mechanism, Front. Plant Sci. 11 (2020) 600854.
The Journal of Sensor Science and Technology is the official journal of the Korean Sensors Society