AS-MRI Diagnostic Framework — Complete Technical Documentation

1. Aim & Objectives

Aim

To develop an explainable hybrid deep learning framework that accurately detects and stages Ankylosing Spondylitis from sacroiliac joint MRI scans, providing automated segmentation, multi-stage classification, and interpretable visual explanations to support clinical decision-making.

Objectives

To develop an Attention U-Net segmentation model for automatic extraction of sacroiliac joint regions from MRI scans, eliminating the need for manual ROI selection.
To build a multi-output CNN classifier for simultaneous binary AS detection and 4-stage disease severity classification (Normal → Early → Moderate → Advanced).
To implement a Hybrid CNN-Transformer model that combines EfficientNetB0 spatial features with Transformer attention for global context modeling.
To integrate Grad-CAM explainability with a novel region-focused variant that highlights sacroiliac joint areas influencing predictions.
To evaluate models using accuracy, precision, recall, F1-score, IoU, and confusion matrices on a publicly available dataset.
To deploy a complete Flask web application enabling clinicians to upload MRI scans, receive predictions with confidence scores, and view visual explanations.

Plan of Action

Phase	Duration	Activity	Description
Phase 1	3 Weeks	Literature Review	Deep dive into CNNs, Vision Transformers, Hybrid architectures, and XAI techniques for AS diagnosis
Phase 2	2 Weeks	Dataset Collection	Source MRI images from Kaggle Lumbar Coordinate Pretraining dataset, preprocess and organize
Phase 3	2 Weeks	Data Preprocessing	Resize, normalize, generate masks, apply feature-based labeling, split into train/test sets
Phase 4	5 Weeks	Model Development	Build Attention U-Net, Simple CNN, and Hybrid CNN-Transformer architectures
Phase 5	2 Weeks	Testing & Evaluation	Measure accuracy, precision, recall, F1-score, IoU, generate confusion matrices
Phase 6	2 Weeks	XAI Integration	Implement Grad-CAM and focused Grad-CAM for sacroiliac joint region visualization
Phase 7	2 Weeks	Web App Development	Build Flask application with authentication, upload, prediction, and history features
Phase 8	2 Weeks	Documentation	Compile findings, prepare technical report, finalize Dockerfile for deployment

2. Dataset Creation & Collection

Step 1: Original Source — Kaggle Lumbar Coordinate Pretraining

Source: Kaggle: Lumbar Coordinate Pretraining Dataset

The original raw dataset contains lumbar spine MRI scans in NumPy (.npy) and JPEG formats from 4 medical imaging sources:

Source Folder	Files (.npy)	Files (.jpg)	Description
`processed_lsd`	516	516	Lumbar Spine Degeneration dataset
`processed_spider`	211	210	SPIDER spine segmentation dataset
`processed_osf`	35	35	Open Science Framework spine images
`processed_tseg`	479	479	T-SEG thoracolumbar segmentation dataset
Total	1,241	1,240

Additional CSV files provide spine coordinate annotations:

coords_pretrain.csv (277 KB) — Maps filename → source → x, y coordinates → spine level (L1/L2 to L5/S1)
coords_rsna_improved.csv (5.2 MB) — RSNA improved coordinates with conditions (Neural Foraminal Narrowing, etc.)

Step 2: Dataset Curation — From 1,241 Raw Images to 900 Curated Images

From the original 1,241 raw MRI files, 900 images were selected and processed through the following pipeline:

Source Selection: Images were selected from all 4 source datasets to ensure diversity in MRI acquisition parameters and spinal anatomy variations.
Format Conversion: Raw NumPy arrays and JPEGs were converted to standardized 256×256 grayscale PNG format.
Quality Filtering: Images were filtered for quality — removing corrupt, blank, or duplicate scans to arrive at 900 final images.
Mask Generation: Corresponding segmentation masks (256×256 PNG) were generated for each image to delineate sacroiliac joint regions. Each mask is ~815 bytes.
Augmentation: 3× augmentation per image was applied during the dataset generation phase (rotation, flip, brightness adjustment, noise addition).
Annotation: A CSV file (dataset.csv) was created with image paths, mask paths, AS status labels, stage labels, and bounding box coordinates.
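
The augmentation step names four transform types but not their parameters. A minimal NumPy sketch of one augmentation pass follows, assuming 90-degree rotation steps, a brightness factor in [0.8, 1.2], and Gaussian noise with sigma 0.02. These ranges and the function name augment_once are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np

def augment_once(img, rng):
    """Return one randomly augmented copy of a float image in [0, 1].

    All parameter ranges are illustrative assumptions; the source only
    names the four transform types.
    """
    out = np.rot90(img, k=int(rng.integers(0, 4)))     # rotation in 90-degree steps (assumed)
    if rng.random() < 0.5:
        out = np.fliplr(out)                           # horizontal flip
    out = out * rng.uniform(0.8, 1.2)                  # brightness scaling (assumed range)
    out = out + rng.normal(0.0, 0.02, size=out.shape)  # additive Gaussian noise (assumed sigma)
    return np.clip(out, 0.0, 1.0).astype(np.float32)

rng = np.random.default_rng(42)
image = rng.random((256, 256))                         # random stand-in for a 256x256 MRI
augmented = [augment_once(image, rng) for _ in range(3)]  # 3 copies per image
```

Clipping back to [0, 1] keeps the augmented copies in the same value range that the preprocessing step produces.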

Important Note: The original Kaggle dataset does NOT contain AS-specific labels. The AS labels and stage labels were generated using a feature-based analysis approach (see Section 3: Labeling Strategy).

Step 3: Final Dataset Structure

Dataset/
├── images/              # 900 PNG files (img_0000.png to img_0899.png)
│   ├── img_0000.png     # 256×256 grayscale MRI
│   ├── img_0001.png
│   └── ... (900 files)
├── masks/               # 900 PNG files (mask_0000.png to mask_0899.png)
│   ├── mask_0000.png    # 256×256 binary segmentation mask (~815 bytes each)
│   ├── mask_0001.png
│   └── ... (900 files)
├── annotations/
│   └── dataset.csv      # 900 rows with labels and bounding boxes
└── dataset_info.txt     # Dataset metadata summary

CSV Annotation Schema

Column	Type	Description	Example
image_id	String	Unique image identifier	img_0000.png
image_path	String	Relative path to image	images/img_0000.png
mask_path	String	Relative path to segmentation mask	masks/mask_0000.png
AS_status	Integer	0 = Negative, 1 = Positive	1
stage	Integer	0 = Normal, 1 = Early, 2 = Moderate, 3 = Advanced	1
stage_name	String	Human-readable stage label	Early
bbox_x1, bbox_y1	Integer	Top-left bounding box corner	76, 102
bbox_x2, bbox_y2	Integer	Bottom-right bounding box corner	179, 204
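
To make the schema concrete, the sketch below parses one row built from the example values in the table and checks it against the documented types. It assumes the four bounding box values are stored as separate columns named bbox_x1, bbox_y1, bbox_x2, bbox_y2. The helper parse_row is hypothetical.

```python
import csv
import io

# Header and one row assembled from the schema table's example column.
SAMPLE_CSV = (
    "image_id,image_path,mask_path,AS_status,stage,stage_name,"
    "bbox_x1,bbox_y1,bbox_x2,bbox_y2\n"
    "img_0000.png,images/img_0000.png,masks/mask_0000.png,1,1,Early,76,102,179,204\n"
)

STAGE_NAMES = {0: "Normal", 1: "Early", 2: "Moderate", 3: "Advanced"}
INT_COLUMNS = ("AS_status", "stage", "bbox_x1", "bbox_y1", "bbox_x2", "bbox_y2")

def parse_row(row):
    """Convert one CSV row (dict of strings) to typed values and validate it."""
    rec = dict(row)
    for col in INT_COLUMNS:
        rec[col] = int(rec[col])
    if rec["AS_status"] not in (0, 1):
        raise ValueError(f"bad AS_status in {rec['image_id']}")
    if STAGE_NAMES.get(rec["stage"]) != rec["stage_name"]:
        raise ValueError(f"stage/stage_name mismatch in {rec['image_id']}")
    # Boxes must lie inside the 256x256 image with positive width and height.
    if not (0 <= rec["bbox_x1"] < rec["bbox_x2"] <= 256
            and 0 <= rec["bbox_y1"] < rec["bbox_y2"] <= 256):
        raise ValueError(f"bad bounding box in {rec['image_id']}")
    return rec

records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```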

Class Distribution (Initial CSV Labels)

Binary Classification

Class	Count	%
AS Positive	469	52.1%
AS Negative	431	47.9%

Stage Distribution

Stage	Count	%
Normal (0)	557	61.9%
Early (1)	111	12.3%
Moderate (2)	129	14.3%
Advanced (3)	103	11.4%

Note: These initial CSV labels were later replaced by feature-based labels (see next section) to create more meaningful labels that correlate with actual image characteristics. The final labels used for training are the balanced feature-based labels from Cell 9 of the notebook.

3. Labeling Strategy (Feature-Based)

Since the original Kaggle dataset did not include AS-specific labels, a two-iteration feature-based labeling approach was developed to assign clinically motivated labels based on actual image characteristics.

Iteration 1: Initial Feature-Based Labels (Notebook Cell 8)

Three features were extracted from each image:

Feature	Method	Thresholds
Brightness	`np.mean(img)`	45th percentile
Texture	`np.std(img)`	55th percentile
Structure (Edge Density)	`cv2.Canny(img, 50, 150)`	30th/50th/70th percentiles for staging

Rule: AS+ if brightness < 45th percentile AND texture > 55th percentile

Problem: This produced highly unbalanced labels — only 16 AS Positive vs 884 AS Negative. Staging was similarly skewed (8 Early, 8 Moderate, 0 Advanced). This was unusable for training.

Iteration 2: Balanced Feature-Based Labels (Notebook Cell 9) ✅ Final Version

Four features were extracted and combined into a composite score:

Feature	Method	Weight in Score	Clinical Rationale
Mean Intensity	`np.mean(img)`	40% of brightness	Overall tissue density
Lower-Half Intensity	Mean of bottom 50% of image	60% of brightness	Sacroiliac region is in lower spine
Texture (Std Dev)	`np.std(img)`	40% of total	Inflammatory changes vary texture
Edge Density	`cv2.Canny(img, 50, 150)`	30% of total	Structural damage increases edges

# Combined score formula
brightness_scores = mean_intensity * 0.4 + lower_half_intensity * 0.6
combined_score = brightness_scores * 0.3 + texture_scores * 0.4 + edge_density * 0.3

# Binary: AS+ if score > 52nd percentile
threshold_binary = np.percentile(combined_score, 52)

# Staging (for AS+ images only):
#   Advanced (3): score > 85th percentile
#   Moderate (2): score > 70th percentile
#   Early (1):    score > 60th percentile
#   Normal (0):   below 60th percentile (AS+ but no clear stage)

Resulting balanced distribution:

Label	Count	Result
AS Positive	432	Well balanced ✓
AS Negative	468	Well balanced ✓
Stage 0 (Normal)	540	Includes all AS- (468) + some AS+ (72)
Stage 1 (Early)	90	AS+ with moderate scores
Stage 2 (Moderate)	135	AS+ with high scores
Stage 3 (Advanced)	135	AS+ with highest scores
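
The percentile rules fix the class counts regardless of what the features look like, which the following sketch illustrates. It assumes the four per-image features are already scaled to a comparable range, uses random numbers in place of real features, and introduces the hypothetical helper label_from_features.

```python
import numpy as np

def label_from_features(mean_int, lower_half_int, texture, edge_density):
    """Composite score -> binary AS label and stage, per the percentile rules.

    Inputs are 1-D arrays with one value per image, assumed to be scaled
    to a comparable range before weighting.
    """
    brightness = 0.4 * mean_int + 0.6 * lower_half_int
    score = 0.3 * brightness + 0.4 * texture + 0.3 * edge_density

    p52, p60, p70, p85 = np.percentile(score, [52, 60, 70, 85])
    as_status = (score > p52).astype(int)

    stage = np.zeros(len(score), dtype=int)  # Normal (0) by default
    stage[score > p60] = 1                   # Early
    stage[score > p70] = 2                   # Moderate
    stage[score > p85] = 3                   # Advanced
    return score, as_status, stage

# Synthetic stand-in features for 900 images (random, not real data).
rng = np.random.default_rng(0)
features = [rng.random(900) for _ in range(4)]
score, as_status, stage = label_from_features(*features)
stage_counts = np.bincount(stage, minlength=4)
```

With 900 distinct scores, the thresholds yield 432 AS Positive images and stage counts of 540, 90, 135, and 135, the same numbers as the distribution table. The balance is therefore a property of the chosen percentiles rather than of the images.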

Train/Test Split

Split	Count	Ratio
Training	720 images	80%
Testing / Validation	180 images	20%

Method: train_test_split(test_size=0.2, random_state=42, stratify=y_binary) — stratified on binary labels to maintain class proportions in both sets.
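
To show what stratification does to the class counts, here is a standard-library sketch of a per-class 80/20 split. It is an illustration only: scikit-learn's train_test_split allocates samples with its own algorithm, and the function stratified_split is hypothetical.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class keeps roughly its share in both subsets."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    train_idx, test_idx = [], []
    for _, idxs in sorted(by_class.items()):
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)  # per-class test quota
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# Balanced feature-based labels: 432 AS Positive, 468 AS Negative.
labels = [1] * 432 + [0] * 468
train_idx, test_idx = stratified_split(labels)
test_pos = sum(labels[i] for i in test_idx)
test_neg = len(test_idx) - test_pos
```

For these label counts the per-class quotas come to 86 positive and 94 negative test images, 180 in total.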

4. Model Architectures

Model 1: Attention U-Net Segmentation

Purpose: Automatic segmentation of sacroiliac joint regions — eliminates manual ROI extraction.

Property	Value
Parameters	7,869,572
Input Shape	(256, 256, 1) — grayscale
Output	(256, 256, 1) — sigmoid binary mask
File	attention_unet_model.keras (~94.6 MB)
Encoder Filters	64 → 128 → 256 → 512 (bottleneck)
Decoder	3 levels with UpSampling2D + Conv2D(2) + Attention Gate + Concatenate
Activation	ReLU (hidden), Sigmoid (output)
Notebook Cell	Cell 3 (build) + Cell 4 (train)

Encoder Path (each level = 2× Conv2D + MaxPool2D):
  Level 1: Conv(64) → Conv(64) → Pool         # 256→128
  Level 2: Conv(128) → Conv(128) → Pool        # 128→64
  Level 3: Conv(256) → Conv(256) → Pool        # 64→32
  Bottleneck: Conv(512) → Conv(512)             # 32×32

Decoder Path (each level = UpSample + AttGate + Concat + 2× Conv2D):
  Up Level 3: UpSample(2) → Conv(256,2) → AttentionGate(conv3, up, 256) → Concat → Conv(256)×2
  Up Level 2: UpSample(2) → Conv(128,2) → AttentionGate(conv2, up, 128) → Concat → Conv(128)×2
  Up Level 1: UpSample(2) → Conv(64,2)  → AttentionGate(conv1, up, 64)  → Concat → Conv(64)×2

Output: Conv2D(1, kernel=1, activation='sigmoid')

Attention Gate Mechanism:

θ(x) = Conv2D(inter_ch, 1)(skip_connection)   # Transform skip features
φ(g) = Conv2D(inter_ch, 1)(gating_signal)     # Transform decoder features
ψ    = sigmoid(Conv2D(1, 1)(relu(θ + φ)))      # Compute attention coefficients
output = skip_connection × ψ                    # Apply learned attention weighting

Why Attention U-Net? Standard U-Net passes all skip connection features equally. The attention gates learn to suppress irrelevant background regions (muscle, fat) and highlight the small sacroiliac joint structures, which is critical since the joint occupies only ~15-20% of the full MRI field of view.
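
The gate equations can be traced in plain NumPy by treating each 1×1 convolution as a per-pixel linear layer. The sketch assumes the skip and gating tensors share the same spatial size (the decoder upsamples before gating) and uses small random weights. The names conv1x1 and params are hypothetical and do not come from the notebook.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution on a channels-last map: a per-pixel linear layer."""
    return x @ weight + bias

def attention_gate(skip, gating, params):
    """Additive attention gate; returns gated skip features and the map psi."""
    theta = conv1x1(skip, params["w_theta"], params["b_theta"])  # (H, W, inter)
    phi = conv1x1(gating, params["w_phi"], params["b_phi"])      # (H, W, inter)
    act = np.maximum(theta + phi, 0.0)                           # ReLU
    logits = conv1x1(act, params["w_psi"], params["b_psi"])      # (H, W, 1)
    psi = 1.0 / (1.0 + np.exp(-logits))                          # sigmoid
    return skip * psi, psi                                       # broadcast over channels

rng = np.random.default_rng(0)
H, W, C, inter = 8, 8, 16, 8
params = {
    "w_theta": rng.normal(scale=0.1, size=(C, inter)), "b_theta": np.zeros(inter),
    "w_phi": rng.normal(scale=0.1, size=(C, inter)), "b_phi": np.zeros(inter),
    "w_psi": rng.normal(scale=0.1, size=(inter, 1)), "b_psi": np.zeros(1),
}
skip = rng.normal(size=(H, W, C))
gating = rng.normal(size=(H, W, C))
gated, psi = attention_gate(skip, gating, params)
```

Because psi lies between 0 and 1, the gate can only attenuate skip features, never amplify them. This is how background regions get suppressed before concatenation.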

Model 2: Simple CNN Classifier (Primary)

Purpose: Dual-output classification — simultaneous AS detection + disease stage classification. ★ Best Performing Model

Property	Value
Parameters	16,870,790
Input Shape	(256, 256, 1)
Outputs	2 heads: binary_output (2 classes, softmax) + stage_output (4 classes, softmax)
File	cnn_classifier_model.keras (~202.5 MB)
Notebook Cell	Cell 7 (build_simple_cnn)

Input(256, 256, 1)
  → Conv2D(32, 3, relu, same) → MaxPooling2D(2)      # 256→128
  → Conv2D(64, 3, relu, same) → MaxPooling2D(2)      # 128→64
  → Conv2D(128, 3, relu, same) → MaxPooling2D(2)     # 64→32
  → Flatten()                                          # 32×32×128 = 131,072
  → Dense(128, relu) → Dropout(0.5)
  ├── Dense(2, softmax) → binary_output  [AS Negative / AS Positive]
  └── Dense(4, softmax) → stage_output   [Normal / Early / Moderate / Advanced]

Design choice: Despite its simplicity, this 3-block CNN significantly outperformed the more complex Hybrid model. The large Flatten+Dense layer (131,072→128) gives it strong discriminative power. The dual-output design allows simultaneous binary detection and staging from a single forward pass, with loss_weights={'binary': 1.0, 'stage': 0.5} prioritizing correct AS detection.
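
The parameter count in the table can be reproduced from the layer list with simple arithmetic. The sketch assumes the usual counting convention of weights plus one bias per output unit. The helper names are illustrative.

```python
def conv2d_params(kernel, in_ch, out_ch):
    """Weights plus biases of a square-kernel Conv2D layer."""
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(in_units, out_units):
    """Weights plus biases of a Dense layer."""
    return in_units * out_units + out_units

flat_units = 32 * 32 * 128  # Flatten output after three 2x2 poolings of a 256x256 input

layer_params = {
    "conv_32": conv2d_params(3, 1, 32),
    "conv_64": conv2d_params(3, 32, 64),
    "conv_128": conv2d_params(3, 64, 128),
    "dense_128": dense_params(flat_units, 128),
    "binary_output": dense_params(128, 2),
    "stage_output": dense_params(128, 4),
}
total_params = sum(layer_params.values())
dense_share = layer_params["dense_128"] / total_params
```

The total comes to 16,870,790, matching the table. The single Dense(128) layer after Flatten accounts for more than 99% of it, so almost all of the model's capacity sits in that one layer.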

Model 3: Hybrid CNN-Transformer v1

Purpose: Combine EfficientNetB0 CNN features with Transformer self-attention blocks for global context.

Property	Value
Parameters	~5.2M (approximate)
Backbone	EfficientNetB0 (ImageNet, frozen)
Transformer	2 × MultiHeadAttention blocks (4 heads, key_dim=256)
File	classifier_best_model.keras (~162.9 MB)
Notebook Cell	Cell 5 (build_hybrid_cnn_transformer)

Input(256, 256, 1) → Conv2D(3,1,same)           # Grayscale→RGB adapter
  → EfficientNetB0(frozen, imagenet)             # Feature extraction
  → GlobalAveragePooling2D()                     # (batch, 1280)
  → Reshape(1, 1280)                             # Sequence for Transformer
  → TransformerBlock(4 heads, mlp_dim=256)       # Self-attention + MLP + residual
  → TransformerBlock(4 heads, mlp_dim=256)       # Second Transformer layer
  → Flatten()
  → Dense(256, relu) → Dropout(0.3)
  ├── Dense(2, softmax) → binary_output
  └── Dense(4, softmax) → stage_output

Transformer Block internals:

LayerNorm → MultiHeadAttention(4 heads) → Residual Add
LayerNorm → Dense(mlp_dim, relu) → Dense(original_dim) → Residual Add

Performance: Binary: 54.44%, Stage: 49.44% (training ran 41 epochs; best epoch 21)

Model 4: Hybrid CNN-Transformer v2

Purpose: Simplified version replacing Transformer blocks with dense layers.

Property	Value
Parameters	4,838,319
Backbone	EfficientNetB0 (ImageNet, frozen)
Post-CNN	Dense layers only (no Transformer blocks)
File	hybrid_cnn_transformer_model.keras (~26.5 MB)
Notebook Cell	Cell 15 (build_hybrid_cnn_transformer_v2)

Input(256, 256, 1) → Conv2D(3,1,same)           # Grayscale→RGB adapter
  → EfficientNetB0(frozen, imagenet)             # Feature extraction
  → GlobalAveragePooling2D()                     # (batch, 1280)
  → Dense(512, relu) → Dropout(0.4)
  → Dense(256, relu) → Dropout(0.3)
  ├── Dense(2, softmax) → binary_output
  └── Dense(4, softmax) → stage_output

Performance: Binary: 52.22%, Stage: 57.78% (early stopped at epoch 16, best epoch 1)

Why Both Hybrid Models Underperformed: The EfficientNetB0 backbone was frozen (base_model.trainable = False), meaning it could not adapt its ImageNet features (trained on natural images like cats, dogs, cars) to the very different domain of medical MRI scans. Additionally, the single-token sequence (1×1280 after GAP) provided minimal benefit from the Transformer attention mechanism. Fine-tuning the last few EfficientNet layers would likely improve results significantly.
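
The single-token point can be checked directly: with a sequence length of 1, the softmax in scaled dot-product attention has only one entry, so the attention weight is always 1 whatever the query and key contain. The sketch below is a generic single-head illustration with random inputs, not the notebook's TransformerBlock.

```python
import numpy as np

def attention_weights(q, k):
    """Scaled dot-product attention weights; q and k are (seq_len, dim)."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# One token, as produced by GlobalAveragePooling2D followed by Reshape(1, 1280).
single = attention_weights(rng.normal(size=(1, 64)), rng.normal(size=(1, 64)))
# Sixteen tokens, for contrast: weights now depend on the content.
multi = attention_weights(rng.normal(size=(16, 64)), rng.normal(size=(16, 64)))
```

Here single is the 1×1 matrix [[1.0]], so the attention layer reduces to a fixed linear projection of its input. Keeping the spatial feature map as a token sequence instead of pooling it first would give the attention something to attend over.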

All Models Summary

Model	File	Size	Type	Cell	Status
Attention U-Net	attention_unet_model.keras	94.6 MB	Segmentation	3+4	Deployed
U-Net Best Checkpoint	unet_best_model.keras	94.6 MB	Segmentation	4	Deployed
Simple CNN Classifier ★	cnn_classifier_model.keras	202.5 MB	Classifier	7+10	Primary
Hybrid v1 (Transformer)	classifier_best_model.keras	162.9 MB	Classifier	5+6	Needs tuning
Hybrid v2 (Dense)	hybrid_cnn_transformer_model.keras	26.5 MB	Classifier	15	Needs tuning

Total model storage: ~581.1 MB

5. Training Configuration & Details

Implementation Steps

Data Loading (Cell 0-2): CSV loaded with pd.read_csv(), images read via cv2.imread(IMREAD_GRAYSCALE), normalized to [0,1], reshaped to (N, 256, 256, 1). Train/test split with stratification.
Feature-Based Relabeling (Cell 8-9): Two rounds of feature extraction (intensity, texture, edges, lower-half intensity) to create balanced, image-characteristic-based AS labels.
Attention U-Net Training (Cell 3-4): Segmentation model trained on image→mask pairs using binary crossentropy.
Hybrid CNN-Transformer v1 Training (Cell 5-6): EfficientNetB0 + Transformer blocks + simplified classifier trained on relabeled data.
Simple CNN Training (Cell 7+10): 3-block CNN with dual heads trained on balanced feature-based labels — achieved best results.
Evaluation (Cell 11): Full classification reports, confusion matrices, IoU computation.
Testing (Cell 12): Visual prediction on random test samples with overlays.
Grad-CAM (Cell 13-14): Standard and focused Grad-CAM heatmap generation and visualization.
Hybrid v2 Training (Cell 15): Simplified hybrid model with dense-only layers.

Training Hyperparameters Comparison

Parameter	Attention U-Net	Simple CNN ★	Hybrid v1	Hybrid v2
Optimizer	Adam (default lr)	Adam (lr=0.001)	Adam (default)	Adam (lr=0.001)
Loss	Binary CE	Sparse Cat. CE (×2)	Sparse Cat. CE (×2)	Sparse Cat. CE (×2)
Loss Weights	N/A	binary:1.0, stage:0.5	binary:1.0, stage:0.5	binary:1.0, stage:0.5
Max Epochs	50	100	50	50
Batch Size	16	32	16	16
EarlyStopping Monitor	val_loss	val_binary_output_accuracy	val_binary_output_accuracy	val_binary_output_accuracy
EarlyStopping Patience	10	20	15	15
ReduceLROnPlateau	Yes (factor=0.5, patience=5)	Yes (factor=0.5, patience=7)	Yes (factor=0.5, patience=5)	Yes (factor=0.5, patience=5)
ModelCheckpoint	Yes (unet_best_model.keras)	No	No	No
Actual Epochs	~50 (full run)	26 (best @ epoch 6)	41 (best @ epoch 21)	16 (best @ epoch 1)
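
The "Actual Epochs" row follows from how patience-based early stopping works: training ends once the monitored metric has gone `patience` consecutive epochs without improving. A small simulation shows this. The metric histories are made up, and only the best epochs and patience values come from the table. The function run_with_early_stopping is a hypothetical stand-in for the Keras callback logic.

```python
def run_with_early_stopping(val_metric, patience):
    """Return (last_epoch_run, best_epoch) for a higher-is-better metric."""
    best, best_epoch, wait = float("-inf"), 0, 0
    for epoch, value in enumerate(val_metric, start=1):
        if value > best:                 # strict improvement resets the counter
            best, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_metric), best_epoch

# Simple CNN: metric improves until epoch 6, then never again (values invented).
simple_cnn_history = [0.70 + 0.04 * e for e in range(1, 7)] + [0.90] * 94
simple_cnn_run = run_with_early_stopping(simple_cnn_history, patience=20)

# Hybrid v2: best value already at epoch 1 (values invented).
hybrid_v2_history = [0.52] + [0.50] * 49
hybrid_v2_run = run_with_early_stopping(hybrid_v2_history, patience=15)
```

The simulation stops at epoch 26 for the Simple CNN (best at 6, patience 20) and at epoch 16 for Hybrid v2 (best at 1, patience 15), in line with the table.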

Training Environment

Libraries

Library	Version
Python	3.x (Kaggle)
TensorFlow	2.19.0
Keras	3.10.0
NumPy	2.0.2
Pandas	2.2.2
OpenCV	4.12.0
Scikit-learn	1.6.1
Matplotlib	3.10.0
Seaborn	0.13.2

Hardware

Component	Spec
Platform	Kaggle Notebooks
GPU	1× GPU (T4 / P100)
RAM	~16 GB
CUDA	Enabled
Storage	Kaggle workspace

Model	Type	Parameters	Binary Acc	Stage Acc	IoU	Status
Attention U-Net	Segmentation	7,869,572	89.35% (val)	—	0.5652	✅ Deployed
Simple CNN ★	Classification	16,870,790	96.67%	82.22%	—	✅ Primary
Hybrid v1	Classification	~5.2M	54.44%	49.44%	—	⚠️ Needs tuning
Hybrid v2	Classification	4,838,319	52.22%	57.78%	—	⚠️ Needs tuning

Class	Precision	Recall	F1-Score	Support
AS Negative	0.99	0.95	0.97	94
AS Positive	0.94	0.99	0.97	86
Overall Accuracy	0.97 (96.67%)	180
Macro Avg	0.97	0.97	0.97	180
Weighted Avg	0.97	0.97	0.97	180

Stage	Precision	Recall	F1-Score	Support
Normal (0)	0.87	1.00	0.93	104
Early (1)	0.00	0.00	0.00	18
Moderate (2)	0.61	0.73	0.67	26
Advanced (3)	0.93	0.78	0.85	32
Accuracy	82.22%	180
Macro Avg	0.60	0.63	0.61	180
Weighted Avg	0.75	0.82	0.78	180
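
The gap between the 82.22% stage accuracy and the 0.63 macro recall is easier to see with the arithmetic written out. The sketch takes the per-class recall and support from the stage report and reconstructs the number of correct predictions per class. Those counts are inferred from rounded values, so they are approximations rather than figures read from the notebook's confusion matrix.

```python
# Per-class values copied from the stage report.
stage_report = {
    "Normal": {"recall": 1.00, "support": 104},
    "Early": {"recall": 0.00, "support": 18},
    "Moderate": {"recall": 0.73, "support": 26},
    "Advanced": {"recall": 0.78, "support": 32},
}

total = sum(r["support"] for r in stage_report.values())

# Reconstructed correct predictions per class (recall x support, rounded).
correct = {name: round(r["recall"] * r["support"]) for name, r in stage_report.items()}

accuracy = sum(correct.values()) / total
macro_recall = sum(r["recall"] for r in stage_report.values()) / len(stage_report)
normal_share_of_correct = correct["Normal"] / sum(correct.values())
```

About 148 of 180 predictions are correct, and 104 of those are Normal cases. Accuracy is therefore dominated by the majority class, while the macro average exposes the complete miss on the Early stage.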

7. Grad-CAM Explainability

Two Grad-CAM implementations provide visual explanations for the model's classification decisions:

Standard Grad-CAM (Cell 13)

1. Create sub-model: [input] → [last_conv_layer_output, model_predictions]
2. Forward pass with GradientTape
3. Compute gradients: d(predicted_class_score) / d(last_conv_layer_output)
4. Global average pooling of gradients → per-channel importance weights
5. Weighted combination: heatmap = conv_output @ pooled_grads
6. ReLU activation (keep only positive influence) + normalize to [0,1]
7. Resize to input dimensions, apply JET colormap, overlay with alpha=0.4
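
Steps 4 to 6 are plain array operations once the activations and gradients are available. The NumPy sketch below covers only those steps. In the real pipeline the two input arrays would come from a tf.GradientTape pass (steps 1 to 3); here they are random stand-ins, and the function name gradcam_heatmap is hypothetical.

```python
import numpy as np

def gradcam_heatmap(conv_output, grads):
    """Steps 4-6 of Grad-CAM for one image.

    conv_output: (H, W, C) activations of the target conv layer.
    grads:       (H, W, C) gradients of the class score w.r.t. conv_output.
    """
    pooled_grads = grads.mean(axis=(0, 1))   # step 4: per-channel weights, shape (C,)
    heatmap = conv_output @ pooled_grads     # step 5: weighted channel sum, shape (H, W)
    heatmap = np.maximum(heatmap, 0.0)       # step 6: keep positive influence only
    return heatmap / (heatmap.max() + 1e-8)  # step 6: normalize to [0, 1]

rng = np.random.default_rng(0)
conv_output = rng.random((64, 64, 128))      # stand-in activations
grads = rng.normal(size=(64, 64, 128))       # stand-in gradients
heatmap = gradcam_heatmap(conv_output, grads)
```

The small constant in the denominator guards against division by zero when no location has a positive contribution.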

Region-Focused Grad-CAM — Novel Enhancement (Cell 14)

# After standard Grad-CAM computation, apply sacroiliac joint ROI mask:
height, width = heatmap.shape       # assumes a 2-D heatmap
roi_y_start = int(height * 0.5)     # Start at 50% of image height
roi_y_end   = int(height * 0.9)     # Down to 90% (avoiding edge)
roi_x_start = int(width * 0.3)      # Central 40% horizontally
roi_x_end   = int(width * 0.7)

mask = np.zeros_like(heatmap)
mask[roi_y_start:roi_y_end, roi_x_start:roi_x_end] = 1.0
heatmap_focused = heatmap * mask
heatmap_focused = heatmap_focused / (np.max(heatmap_focused) + 1e-8)

Target Conv Layer: conv2d_49 (last convolutional layer in the Simple CNN)

Clinical Value: Standard Grad-CAM may highlight any discriminative region including background artifacts. The focused variant ensures explanations align with the sacroiliac joint area — where radiologists actually look for AS signs like bone marrow edema, erosion, and ankylosis.

8. End-to-End Prediction Pipeline

📤 Upload MRI → 🔄 Preprocess → 🔍 Segment → 🧠 Classify → 🔥 Grad-CAM → 💾 Save to DB → 📊 Display

Detailed Flow (from `predict.py` and `app.py`):

Upload (app.py) — User uploads PNG/JPG/JPEG via Flask form → saved to uploads/

Preprocessing (predict.py)

img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, (256, 256))
img = img / 255.0
img = np.expand_dims(img, axis=(0, -1))  # → shape (1, 256, 256, 1)

Segmentation — Selected model(s) predict mask → threshold at 0.5 → overlay on original (red highlight, alpha=0.3)
Classification — Selected model(s) predict dual outputs:
- binary_output: softmax(2) → argmax → AS Positive/Negative + confidence %
- stage_output: softmax(4) → argmax → Normal/Early/Moderate/Advanced + confidence %
Grad-CAM — Focused heatmap → JET colormap → overlay on MRI (alpha=0.4)
Storage — All outputs saved to static/results/{uuid}/, metadata to SQLite DB
Display — Results page shows original, mask, overlay, Grad-CAM, predictions with confidence
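
The segmentation and classification steps both end in a small post-processing stage. A NumPy sketch of that stage follows, assuming class index 0 is AS Negative / Normal (as in the model diagrams) and a strict greater-than comparison at the 0.5 threshold. The function names are hypothetical and are not taken from predict.py.

```python
import numpy as np

BINARY_LABELS = ["AS Negative", "AS Positive"]
STAGE_LABELS = ["Normal", "Early", "Moderate", "Advanced"]

def decode_classifier(binary_probs, stage_probs):
    """Turn the two softmax vectors into labels plus confidence percentages."""
    b = int(np.argmax(binary_probs))
    s = int(np.argmax(stage_probs))
    return {
        "as_status": BINARY_LABELS[b],
        "confidence": float(binary_probs[b]) * 100.0,
        "stage": STAGE_LABELS[s],
        "stage_confidence": float(stage_probs[s]) * 100.0,
    }

def binarize_mask(prob_mask, threshold=0.5):
    """Threshold the sigmoid output of the segmentation model."""
    return (prob_mask > threshold).astype(np.uint8)

def overlay_red(gray, mask, alpha=0.3):
    """Blend a red highlight into a grayscale image in [0, 1] where mask == 1."""
    rgb = np.stack([gray] * 3, axis=-1).astype(np.float64)
    selected = mask.astype(bool)
    rgb[selected] = (1.0 - alpha) * rgb[selected] + alpha * np.array([1.0, 0.0, 0.0])
    return rgb

result = decode_classifier(np.array([0.1, 0.9]), np.array([0.2, 0.5, 0.2, 0.1]))
mask = binarize_mask(np.array([[0.2, 0.7], [0.5, 0.9]]))
overlay = overlay_red(np.zeros((2, 2)), mask)
```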

Available Models at Runtime

Key	Display Name	Type	File
attention_unet	Attention U-Net	Segmentation	attention_unet_model.keras
unet_best	U-Net Best	Segmentation	unet_best_model.keras
cnn_classifier	CNN Classifier	Classifier	cnn_classifier_model.keras
classifier_best	Best Classifier	Classifier	classifier_best_model.keras
hybrid_cnn_transformer	Hybrid CNN-Transformer	Classifier	hybrid_cnn_transformer_model.keras

Users can select multiple models simultaneously for side-by-side comparison.
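
Since several models totalling roughly 581 MB can be selected per request, loading each one only on first use and then caching it keeps startup fast. The sketch below shows one way to do that with the runtime keys from the table. The contents of model_loader.py are not shown in this document, so this is an assumed design, and _load_from_disk is a placeholder for a real call such as keras.models.load_model.

```python
from functools import lru_cache

# Registry keyed by the runtime model keys listed in the table.
MODEL_REGISTRY = {
    "attention_unet": {"name": "Attention U-Net", "type": "segmentation",
                       "file": "attention_unet_model.keras"},
    "unet_best": {"name": "U-Net Best", "type": "segmentation",
                  "file": "unet_best_model.keras"},
    "cnn_classifier": {"name": "CNN Classifier", "type": "classifier",
                       "file": "cnn_classifier_model.keras"},
    "classifier_best": {"name": "Best Classifier", "type": "classifier",
                        "file": "classifier_best_model.keras"},
    "hybrid_cnn_transformer": {"name": "Hybrid CNN-Transformer", "type": "classifier",
                               "file": "hybrid_cnn_transformer_model.keras"},
}

def _load_from_disk(path):
    """Placeholder loader (hypothetical); a real app would load the Keras model here."""
    return {"loaded_from": path}

@lru_cache(maxsize=None)
def get_model(key):
    """Load a model on first request and reuse the cached object afterwards."""
    if key not in MODEL_REGISTRY:
        raise KeyError(f"unknown model key: {key}")
    return _load_from_disk(MODEL_REGISTRY[key]["file"])

def models_of_type(model_type):
    """Keys of all registered models of one type, e.g. for a selection form."""
    return [k for k, v in MODEL_REGISTRY.items() if v["type"] == model_type]
```

Calling get_model("cnn_classifier") twice returns the same object, so a side-by-side comparison request only pays the load cost for models that have not been used yet.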

9. Comparison with Existing Systems

Key Differentiators: Existing systems are typically binary-only, require manual ROI selection, or lack explainability. Our framework addresses all three with automatic segmentation, stage-wise classification, and visual Grad-CAM evidence.

Study	Dataset	Model	Task	Performance	Limitations
Lee et al. (2023)	296 patients, 4,746 slices	Faster R-CNN + VGG-19	Detect sacroiliitis	AUROC ~0.83, Sens ~0.725, Spec ~0.936	Binary only, no staging
Xie et al. (2025)	1,294 patients, 4 centers	ResNet50 + KNN-11	axSpA classification	AUC ~0.912, Acc ~86.9%	Manual ROI, binary only
Zhou et al. (2024)	485 patients	3D U-Net + ResNet50 + ensemble	Diagnose sacroiliitis	AUC ~0.910, Acc ~85.6%	No stage classification
Bordner et al. (2023)	362 images (DESIR)	Deep Learning	BME & sacroiliitis	—	Binary only
Deep Learning Chris (2023)	326 axSpA + 63 NSBP	Attention U-Net	Segmentation/Detection	AUC ~0.96, Sens ~0.90, Spec ~0.93	Segmentation only, no classification
Kumar et al. (2025)	—	Transfer Learning	AS detection	—	Binary only, no staging
Kocaoglu (2025)	—	FPGA DL	AS detection	—	Hardware-specific
Manikandan et al. (2023)	—	ASNET	AS diagnosis	—	No visual explainability
Our Framework	900 images	Att. U-Net + CNN + Hybrid + Grad-CAM	Seg + Binary + Staging	Binary: 96.67%, Stage: 82.22%, IoU: 0.5652	Fully automated, explainable, web-deployed

Feature Comparison Matrix

Feature	Most Existing Systems	Our Framework
ROI Extraction	Manual / Semi-automatic	✅ Fully automatic (Attention U-Net)
Classification Type	Binary only (AS+/AS-)	✅ Binary + 4-stage severity
Explainability	Probability scores only	✅ Region-focused Grad-CAM heatmaps
Multi-model Comparison	Single model	✅ 5 selectable models
Web Deployment	Research code only	✅ Flask app with auth + history
Clinician-Verified Labels	✅ Radiologist annotations	⚠️ Feature-based (synthetic)
Dataset Size	296–1,294 patients	⚠️ 900 images (single source)
Clinical Validation	✅ Some prospective studies	❌ Not yet clinically validated

Component	Minimum	Recommended
CPU	Intel i5 / AMD Ryzen 5	Intel i7 / AMD Ryzen 7 (multi-core)
GPU	Not required (CPU inference)	NVIDIA with CUDA (GTX 1660+ / RTX 2060+)
RAM	4 GB	16 GB (32 GB for training)
Storage	1 GB (models only)	500 GB SSD (models + dataset + outputs)

Software	Version
OS	Windows 10/11, Linux (Ubuntu 20.04+), macOS
Python	3.9+ (3.12 recommended)
TensorFlow	2.19.0
Keras	3.10.0
OpenCV	4.12.0
Flask	Latest
NumPy / Pandas / Scikit-learn	Latest compatible
Docker (optional)	For containerized deployment

Model File	Size
attention_unet_model.keras	94.6 MB
unet_best_model.keras	94.6 MB
cnn_classifier_model.keras	202.5 MB
classifier_best_model.keras	162.9 MB
hybrid_cnn_transformer_model.keras	26.5 MB
Total	~581.1 MB

11. Web Application Architecture

Component	Technology	Details
Backend	Flask (Python)	`app.py` — routes, auth, file handling
ML Inference	TensorFlow/Keras	`predict.py` — preprocessing, prediction, Grad-CAM
Model Management	Custom	`model_loader.py` — lazy loading, caching, model registry
Database	SQLite	`database.py` — users + predictions tables
Auth	Session-based	SHA-256 password hashing, Flask sessions
Frontend	HTML/CSS templates	6 pages: index, login, signup, upload, results, history
File Storage	Local filesystem	`uploads/` for input, `static/results/{uuid}/` for outputs
Port	8000	Configurable
Containerization	Docker	`Dockerfile` + `docker-compose.yml` available

Database Schema

CREATE TABLE users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT UNIQUE NOT NULL,
    password TEXT NOT NULL,       -- SHA-256 hash
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE predictions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE,             -- Unique run identifier
    user_id INTEGER NOT NULL,
    image_path TEXT NOT NULL,
    as_status TEXT NOT NULL,      -- "AS Positive" / "AS Negative"
    stage TEXT NOT NULL,           -- "Normal" / "Early" / "Moderate" / "Advanced"
    confidence REAL,               -- Binary confidence %
    stage_confidence REAL,         -- Stage confidence %
    segmentation_mask TEXT,        -- Path to predicted mask image
    gradcam_overlay TEXT,          -- Path to Grad-CAM overlay image
    segmentation_overlay TEXT,     -- Path to segmentation overlay image
    model_results TEXT,            -- JSON of all model outputs
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(id)
);
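
The schema can be exercised with Python's built-in sqlite3 module. The sketch below creates both tables in an in-memory database, stores one user and one prediction, and reads the history back. The email, password, paths, and confidence values are invented sample data, and hash_password is a hypothetical helper that mirrors the SHA-256 comment in the schema.

```python
import hashlib
import json
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT UNIQUE NOT NULL,
    password TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE predictions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE,
    user_id INTEGER NOT NULL,
    image_path TEXT NOT NULL,
    as_status TEXT NOT NULL,
    stage TEXT NOT NULL,
    confidence REAL,
    stage_confidence REAL,
    segmentation_mask TEXT,
    gradcam_overlay TEXT,
    segmentation_overlay TEXT,
    model_results TEXT,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(id)
);
"""

def hash_password(password):
    """SHA-256 hex digest of the password, as the schema comment describes."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

cur = conn.execute("INSERT INTO users (email, password) VALUES (?, ?)",
                   ("clinician@example.com", hash_password("sample-password")))
user_id = cur.lastrowid

run_id = str(uuid.uuid4())
conn.execute(
    "INSERT INTO predictions (uuid, user_id, image_path, as_status, stage, "
    "confidence, stage_confidence, model_results) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    (run_id, user_id, f"static/results/{run_id}/original.png", "AS Positive",
     "Early", 96.5, 71.0, json.dumps({"cnn_classifier": {"as_status": "AS Positive"}})),
)
conn.commit()

history = conn.execute(
    "SELECT as_status, stage, confidence FROM predictions "
    "WHERE user_id = ? ORDER BY timestamp DESC", (user_id,)).fetchall()
```

Parameterized queries (the ? placeholders) keep user-supplied values out of the SQL text.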

Source Files

File	Lines	Purpose
`app.py`	~173	Flask routes: signup, login, upload, predict, history, results
`predict.py`	~202	Core ML: preprocessing, segmentation, classification, Grad-CAM
`model_loader.py`	~57	Model registry, lazy loading, caching
`database.py`	~145	SQLite operations: init, CRUD for users and predictions
`ankylosings.ipynb`	18 cells	Full training notebook: data loading → training → evaluation → Grad-CAM

12. Project Status & Areas for Improvement

Current Status

Component	Status	Details
Dataset Collection	Complete	900 images curated from Kaggle lumbar dataset
Feature-Based Labeling	Complete	Balanced labels based on image characteristics
Attention U-Net (Segmentation)	Complete	89.35% val accuracy, IoU 0.5652
Simple CNN Classifier	Complete	96.67% binary, 82.22% stage — deployed as primary
Hybrid CNN-Transformer v1	Needs Improvement	54.44% binary — frozen backbone limits performance
Hybrid CNN-Transformer v2	Needs Improvement	52.22% binary — frozen backbone limits performance
Standard Grad-CAM	Complete	Working for CNN classifier
Focused Grad-CAM	Complete	Sacroiliac joint ROI-focused variant
Flask Web App	Complete	Auth, upload, predict, history — all working
Docker Deployment	Complete	Dockerfile + docker-compose.yml ready
Early Stage Detection	Needs Work	0% recall for Stage 1 (Early) — class imbalance issue
Clinical Validation	Not Started	No radiologist-verified labels or prospective trials
Transformer Attention Maps	Not Started	Visualizing Transformer self-attention patterns

Known Issues & Improvements Needed

Issue	Impact	Proposed Solution
Frozen EfficientNetB0 Backbone	Hybrid models stuck at ~52% (near random)	Unfreeze last 20-30 layers of EfficientNetB0, use differential learning rates (base: 1e-5, head: 1e-3)
Early Stage (Stage 1) 0% Recall	Model cannot detect early AS	Oversample early-stage images (SMOTE/augmentation), use focal loss instead of CE, add class weights
Synthetic Labels	Labels don't reflect true clinical ground truth	Partner with radiology department for expert annotations on subset; validate feature-label correlation
Single MRI Modality	Misses clinical context	Add HLA-B27 status, CRP levels, patient age/gender as additional inputs (multimodal fusion)
2D Slice Analysis Only	Misses inter-slice continuity	Implement 3D CNN or use multiple adjacent slices as input channels
Small Dataset (900 images)	Limited generalization	Expand to multi-center datasets (DESIR, ASAS cohorts), apply heavy augmentation
U-Net IoU = 0.5652	Moderate segmentation quality	Use Dice loss + BCE combined, increase training data, add boundary-aware loss
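
For the class-weight suggestion, a common heuristic is inverse-frequency weighting, weight = n_samples / (n_classes × class_count). The sketch applies it to the stage counts of the balanced feature-based labels. Using full-dataset counts rather than training-split counts is a simplification, and the function name is hypothetical.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * count_c)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Stage counts from the balanced feature-based labels (0=Normal ... 3=Advanced).
stage_counts = {0: 540, 1: 90, 2: 135, 3: 135}
stage_weights = balanced_class_weights(stage_counts)
```

Under this scheme an Early-stage error costs six times as much as a Normal-stage error (2.5 versus about 0.42). That directly counteracts the incentive to predict Normal for every borderline case.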

13. Expected Outcomes & Future Scope

Expected Project Outcomes

A complete Explainable AI system for multi-stage AS classification using sacroiliac joint MRI, deployable as a web application.
A hybrid CNN-Transformer model trained to capture both local spatial patterns (bone erosion, sclerosis) and global structural relationships.
Visual explanations (Grad-CAM heatmaps) highlighting the anatomical regions influencing predictions, enabling radiologists to verify AI decisions.
A practical, transparent diagnostic tool that bridges the gap between AI accuracy and clinical usability, supporting faster and more confident AS diagnosis.

Future Scope

Short-Term Improvements

Unfreeze Hybrid Model: Fine-tune EfficientNetB0 backbone layers for medical domain adaptation
Address Class Imbalance: Implement focal loss, SMOTE, or weighted sampling for early-stage detection
Improve Segmentation: Use combined Dice + BCE loss to boost IoU beyond 0.6
Add Confusion Matrix Visualization: Interactive confusion matrices in the web app
Batch Prediction: Enable processing multiple MRI scans in one upload

Medium-Term Goals

Multimodal Integration: Combine MRI data with clinical variables (HLA-B27, CRP, age, gender)
3D Analysis: Process volumetric MRI data to capture inter-slice relationships
Multi-Center Validation: Train on datasets from multiple hospitals to reduce institutional bias
Transformer Attention Maps: Extract and visualize self-attention patterns alongside Grad-CAM
Edge Deployment: Optimize models (quantization, pruning) for inference on mobile/tablet devices

Long-Term Vision

Clinical Trials: Prospective studies measuring real-world impact on radiologist decision-making
Telemedicine Integration: Deploy on cloud platforms for remote screening in underserved areas
Disease Progression Tracking: Longitudinal analysis comparing scans over time for treatment response
Multi-Disease Extension: Extend to other spondyloarthritis conditions and spinal pathologies
DICOM Integration: Direct integration with hospital PACS systems for seamless workflow

14. Glossary of Technical Terms

Definitions of key machine learning and medical imaging terms used in this documentation.

Core ML Concepts

Deep Learning (DL): A subset of machine learning using neural networks with many layers (deep) to learn complex patterns from data.
Convolutional Neural Network (CNN): A type of deep learning model specifically designed for image analysis. It uses "filters" to automatically detect features like edges, textures, and shapes.
Transformer: A newer deep learning architecture that uses "attention mechanisms" to weigh the importance of different parts of the input data. Originally for text, now used for images (Vision Transformers).
Epoch: One complete pass of the entire training dataset through the model during training.
Batch Size: The number of training examples used in one iteration to update the model's internal parameters.
Overfitting: When a model learns the training data too well, including noise, and performs poorly on new, unseen data.
Fine-Tuning: Taking a pre-trained model (e.g., trained on millions of general images) and training it further on a specific dataset (e.g., MRI scans) to adapt it to a new task.

Metrics & Evaluation

Accuracy: The percentage of correct predictions made by the model. (Correct / Total).
Precision: Of all images predicted as Positive, the fraction that were actually Positive: TP / (TP + FP).
Recall (Sensitivity): Of all actual Positive images, the fraction the model correctly identified: TP / (TP + FN).
F1-Score: The harmonic mean of Precision and Recall. A balanced metric useful when classes are uneven.
IoU (Intersection over Union): A metric for segmentation. Measures the overlap between the predicted mask and the ground truth mask. 0 = no overlap, 1 = perfect match.
Confusion Matrix: A table showing correct and incorrect predictions for each class, helping to see where the model is making mistakes.
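
As a worked example of the IoU definition, the sketch below compares two toy 8×8 masks. The masks are invented for illustration, and the convention of returning 1.0 when both masks are empty is an assumption.

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Intersection over Union of two binary masks of equal shape."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0                     # both masks empty: treated as a perfect match
    return float(np.logical_and(pred, true).sum() / union)

true_mask = np.zeros((8, 8), dtype=np.uint8)
true_mask[2:6, 2:6] = 1                # 16 ground-truth pixels

pred_mask = np.zeros((8, 8), dtype=np.uint8)
pred_mask[3:7, 2:6] = 1                # same size, shifted down by one row

score = iou(pred_mask, true_mask)      # 12 shared pixels / 20 in the union
```

A one-row shift of a 4×4 region already drops the IoU to 0.6. This gives a sense of scale for the reported 0.5652 on small joint regions.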

Project-Specific Terms

Segmentation: The process of partitioning an image into different regions. Here, separating the sacroiliac joint from the background.
Classification: Categorizing an entire image into a class (e.g., AS Positive or Negative).
ROI (Region of Interest): A specific part of an image identified for further analysis (e.g., the joint area).
Grad-CAM: "Gradient-weighted Class Activation Mapping". A technique to visualize which parts of an image were most important for the model's decision (displayed as a heatmap).
Augmentation: Artificially increasing the training dataset size by creating modified versions of images (rotating, flipping, adding noise) to help the model generalize better.