Knowledge Discovery · Clinical Data Mining · 2,149 Patients

What can data reveal
about Alzheimer's disease
that clinicians might miss?

"Early detection is our most powerful intervention — yet many patients go undiagnosed until it's too late. This study set out to discover hidden patterns in routine clinical data that could improve screening."

93.8%
Model Accuracy
39
Rules Discovered
2,149
Patients
35
Clinical Features
2,149
Patients Analyzed
35 clinical & lifestyle features
93.8%
Classification Accuracy
91% sensitivity · 95% specificity
39
Association Rules
confidence up to 84%
5
Analytical Methods
EDA · clustering · rules · trees
Chapter One

Why this matters — and who it affects

Alzheimer's disease affects over 50 million people worldwide. It is the most common cause of dementia — and one of medicine's most heartbreaking challenges. By the time most patients are diagnosed, the disease has already progressed significantly. Families have noticed the memory slips. The forgotten appointments. The personality changes. But the clinical diagnosis comes late, and the window for early intervention narrows.

The question this study asked is deceptively simple: can routine clinical data — the kind collected at any standard doctor's visit — reveal patterns that help us detect Alzheimer's earlier?

We applied a full knowledge discovery pipeline to a dataset of 2,149 patients spanning demographics, lifestyle, cardiovascular health, cognitive tests, and behavioral assessments. Our approach was systematic: explore, cluster, mine for rules, classify, and — critically — check whether the same features emerge as important across multiple independent methods.

The goal was not just prediction accuracy. It was interpretability — building models that a clinician can understand, audit, and trust in practice.

The Patients

Who is in this dataset?

Before any modeling, we needed to understand the data: who these patients are, how diagnoses are distributed, and which features show the strongest relationships with Alzheimer's.

📊 Figure 1 · Diagnosis Distribution
Diagnosis distribution pie chart
35.4% of patients have Alzheimer's

The dataset has a 1.83:1 class imbalance — more patients without AD than with. This reflects clinical reality and motivated stratified sampling and class-weighted models throughout the analysis.
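The stratified sampling and class weighting described above can be sketched as follows — a minimal illustration on synthetic data with the 35.4% positive rate from Figure 1; the model settings and column handling here are assumptions, not the study's exact code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.354).astype(int)  # ~1.83:1 class imbalance

# stratify=y preserves the 1.83:1 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" reweights samples inversely to class frequency,
# so the minority (diagnosed) class is not drowned out during training
clf = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10,
                             class_weight="balanced", random_state=42)
clf.fit(X_tr, y_tr)

# the positive rate is (nearly) identical in train and test
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Without `stratify`, a random 80/20 split of imbalanced data can leave the test set with a noticeably different positive rate, biasing every downstream metric.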

📊 Figure 2 · Age Distribution by Diagnosis
Age distribution analysis
Age overlaps substantially — it's not the whole story

Both diagnosed and non-diagnosed patients span the full 60–90 age range with similar distributions. This tells us that age alone cannot diagnose Alzheimer's — and motivates the multivariate approach that follows.

🔥 Figure 3 · Full Correlation Matrix — Top 10 Features Correlated with Diagnosis
Correlation heatmap
Functional decline matters more than cognitive tests alone

The top predictors are FunctionalAssessment (r=−0.36) and ADL (r=−0.33) — measures of daily living ability — which outrank even MMSE (r=−0.24). Perhaps most importantly, subjective MemoryComplaints (r=+0.31) ranks third, validating the clinical wisdom of listening to patients and caregivers. No single feature achieves r>0.36 — no single test is enough.
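The ranking in Figure 3 is a straightforward correlation computation. A minimal sketch with pandas, using synthetic stand-ins for three of the named columns (the effect sizes below are invented for illustration, not the study's data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2149
diagnosis = (rng.random(n) < 0.354).astype(int)
df = pd.DataFrame({
    # diagnosed patients score lower on functional measures (negative r)
    "FunctionalAssessment": 7 - 2.0 * diagnosis + rng.normal(0, 2.5, n),
    "ADL": 7 - 1.8 * diagnosis + rng.normal(0, 2.5, n),
    # diagnosed patients report memory complaints more often (positive r)
    "MemoryComplaints": (rng.random(n) < 0.15 + 0.3 * diagnosis).astype(int),
    "Diagnosis": diagnosis,
})

# Pearson correlation of every feature with the diagnosis label,
# sorted by absolute strength — the ranking shown in Figure 3
corr = df.drop(columns="Diagnosis").corrwith(df["Diagnosis"])
print(corr.reindex(corr.abs().sort_values(ascending=False).index))
```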

🫀 Figure 4 · Clinical Measurements
Clinical measurements violin plots
Blood pressure & cholesterol are surprisingly weak predictors

Cardiovascular markers show near-identical distributions between diagnosed and non-diagnosed patients. This is a meaningful finding: for Alzheimer's specifically (vs vascular dementia), functional and behavioral measures are far more discriminative than cardiovascular risk factors.

🌿 Figure 5 · Lifestyle Factors
Lifestyle factors analysis
Lifestyle shows weaker direct associations

Smoking, alcohol, diet, and physical activity show minimal differences between groups in this cross-sectional data. Lifestyle factors likely act over decades rather than showing strong single-point effects — but this doesn't diminish their preventive importance.

Chapter Two

Do discrete patient subtypes exist?

Medical literature suggests Alzheimer's may present differently across patients — memory-predominant, executive function-predominant, language-predominant forms. If distinct patient subtypes exist in this data, unsupervised clustering should reveal them.

We tested k-means clustering for k=2 through 7, evaluating each solution with four independent metrics: silhouette score, Davies-Bouldin index, Calinski-Harabasz score, and inertia. We then cross-checked the results with hierarchical clustering using Ward linkage.
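That k-sweep can be sketched as follows; the patient features are replaced here with standardized random data, which — like the study's finding — produces no meaningful separation. This is an illustration of the validation loop, not the study's code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 10)))

results = {}
for k in range(2, 8):  # k = 2 .. 7, as in the study
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = silhouette_score(X, km.labels_)           # higher = better
    print(k, round(results[k], 3),
          round(davies_bouldin_score(X, km.labels_), 2),   # lower = better
          round(calinski_harabasz_score(X, km.labels_), 1),# higher = better
          round(km.inertia_, 1))                           # for the elbow curve
```

On genuinely structureless data all four numbers drift without a clear optimum — the same signature reported in Figure 6.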

📉 Figure 6 · Elbow Method + Clustering Validation Metrics (k=2–7)
Elbow method clustering analysis
All four metrics tell the same story: no clear clusters

Silhouette scores of 0.051–0.059 across all k values — far below the 0.3 threshold for meaningful separation. No "elbow" appears in the inertia curve. The Davies-Bouldin index remains high. All metrics converge on the same conclusion: discrete patient subtypes do not exist in this data.

🔵 Figure 7 · PCA Visualization of Clustering Results
PCA visualization
Patients blend together — no separable regions

In 2D PCA projection, even true diagnosis labels (right panel) show complete mixing. There is no region of space that "belongs" to Alzheimer's patients — the disease varies continuously across multiple dimensions simultaneously.

🔵 Figure 8 · K-Means vs Hierarchical Clustering
K-means vs hierarchical clustering comparison
Two algorithms disagree entirely (ARI = 0.025)

If real clusters existed, both algorithms would find similar groupings. An Adjusted Rand Index of 0.025 — near zero — means k-means and hierarchical clustering agree barely better than chance; their partitions are essentially independent. The clusters are algorithmic artifacts, not real structure.
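A quick illustration of how the Adjusted Rand Index behaves — it ignores label names and corrects for chance agreement, which is why a value of 0.025 means essentially independent partitions. The toy labelings below are invented for the demonstration.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# identical partitions under different label names score a perfect 1.0
a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([2, 2, 0, 0, 1, 1])
print(adjusted_rand_score(a, b))

# two independent random labelings score near 0 — chance-level agreement,
# the regime the study's k-means vs Ward comparison landed in
rng = np.random.default_rng(0)
x = rng.integers(0, 3, 3000)
y = rng.integers(0, 3, 3000)
print(round(adjusted_rand_score(x, y), 3))
```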

We reframe this as a meaningful scientific finding, not a failure: Alzheimer's severity in this population lies on a continuum rather than forming discrete patient subtypes. This aligns with the NIA-AA Research Framework and current biological understanding of dementia progression. No clustering method will find what doesn't exist — and recognizing that takes rigor.

Chapter Three

Which combinations of symptoms predict a diagnosis?

Since clustering revealed no discrete groups, we pivoted to a different question: not "what type of patient is this?" but "when these specific symptoms co-occur, how predictive is that combination?" Association rule mining answers exactly this.

Using the Apriori algorithm on discretized clinical features, we discovered 39 interpretable IF-THEN rules predicting Alzheimer's diagnosis, each with at least 60% confidence and lift ≥ 1.5.

The strongest rule: IF MemoryComplaints = 1 AND MMSE_Category = Severe Impairment → AD diagnosis with 84% confidence and 2.36× lift. Patients with this combination are 2.36 times more likely to have Alzheimer's than the average patient in the dataset.
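The confidence and lift arithmetic behind such a rule is worth making explicit. A sketch with pandas on a small synthetic one-hot table — the actual mining used the Apriori algorithm via mlxtend, and the probabilities below are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 2000
mem = rng.random(n) < 0.25                  # MemoryComplaints = 1
severe = rng.random(n) < 0.20               # MMSE_Category = Severe
# make the diagnosis much more likely when both antecedents hold
ad = rng.random(n) < np.where(mem & severe, 0.85, 0.30)
df = pd.DataFrame({"MemoryComplaints": mem, "Severe_MMSE": severe, "AD": ad})

antecedent = df["MemoryComplaints"] & df["Severe_MMSE"]
support = (antecedent & df["AD"]).mean()                # P(A and AD)
confidence = (antecedent & df["AD"]).sum() / antecedent.sum()  # P(AD | A)
lift = confidence / df["AD"].mean()                     # vs. AD base rate

print(round(support, 3), round(confidence, 2), round(lift, 2))
```

Lift is confidence divided by the base rate — so a 2.36× lift reads directly as "2.36 times more likely than the average patient," exactly as stated above.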

📊 Figure 9 · Most Common Antecedent Items in AD Diagnosis Rules
Association rule mining antecedents
MemoryComplaints dominates — appearing in 51% of all rules

This chart shows how often each feature appears in the 39 high-confidence rules. MemoryComplaints is overwhelmingly the most common antecedent — validating that patient- and caregiver-reported memory concerns are not "soft" data but among the strongest clinical signals available. MMSE impairment and BehavioralProblems follow. Notably, the top 3 features are all behavioral or subjective — not blood tests, not imaging.

Chapter Four

Can a machine learning model match clinical judgment?

Association rules provide local, specific patterns. But for a complete diagnostic tool, we need a model that covers every patient — not just those matching a specific symptom combination. Decision trees offer exactly this: a single, unified model that is also fully transparent.

We ran a grid search across 12 hyperparameter configurations. The best model — maximum depth of 5, minimum 10 samples per leaf — achieved 93.8% accuracy, 0.912 F1-score, 91% sensitivity, and 95% specificity. It correctly identified 91% of all Alzheimer's cases in the test set.

For comparison: always predicting "no Alzheimer's" achieves 64.7% accuracy, and logistic regression achieves 81.6%. Our decision tree at 93.8% outperforms both while remaining fully interpretable — clinicians can follow the exact reasoning path for any patient.
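The grid search can be sketched as follows on synthetic data. The exact grid here (4 depths × 3 leaf sizes = 12 configurations) is an assumption consistent with the counts above, not the study's recorded grid:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2149, 35))           # same shape as the dataset
y = ((X[:, 0] + X[:, 1] + rng.normal(0, 1, 2149)) > 0.6).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# 12 candidate configurations, scored by cross-validated F1
grid = {"max_depth": [3, 5, 7, 9], "min_samples_leaf": [5, 10, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print(search.best_params_, round(search.score(X_te, y_te), 3))
```

Scoring by F1 rather than raw accuracy matters with a 1.83:1 imbalance, since accuracy alone rewards models that under-call the minority (diagnosed) class.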

🌳 Figure 10 · Top 10 Features from the Decision Tree
Decision tree feature importance
Five features drive 98.4% of the model's predictive power

FunctionalAssessment (23.3%) is the single most important feature — more than MMSE. Together with ADL (18.4%) and MMSE (21.2%), these three functional and cognitive measures account for 62.9% of all predictive information. Orange bars show features that also appear prominently in association rules — providing convergent evidence from two completely independent methods that MemoryComplaints and BehavioralProblems are clinically central.
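The percentages in Figure 10 are the standard Gini importances read off the fitted tree. A minimal sketch on synthetic data — feature names match the study's but the effect sizes are invented:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "FunctionalAssessment": rng.normal(size=500),
    "MMSE": rng.normal(size=500),
    "ADL": rng.normal(size=500),   # pure noise in this toy example
})
# outcome driven mostly by FunctionalAssessment, partly by MMSE
y = (X["FunctionalAssessment"] + 0.5 * X["MMSE"] < 0).astype(int)

tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10,
                              random_state=42).fit(X, y)

# impurity-based (Gini) importance per feature, normalized to sum to 1
imp = pd.Series(tree.feature_importances_, index=X.columns)
print(imp.sort_values(ascending=False))
```

Because the importances sum to 1, they can be read directly as shares of the model's total predictive information — the basis for the "62.9% from three features" figure above.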

The Findings

Six things this analysis revealed about Alzheimer's

Each finding has direct implications for how clinicians screen, diagnose, and prioritize patients.

93.8%
Interpretable Models Work
A transparent decision tree outperforms logistic regression by 12 percentage points — evidence that black-box models are not required for excellent clinical prediction.
Continuum
AD Has No Discrete Subtypes
All clustering metrics (silhouette ≈ 0.06, ARI = 0.025) confirm that Alzheimer's severity varies continuously in this population — consistent with current biological frameworks.
84%
Best Rule Confidence
When a patient reports memory complaints AND has severe MMSE impairment, there is 84% confidence they have Alzheimer's — a clinically actionable screening guideline.
Functional
Function Outranks Cognition
FunctionalAssessment (23.3% importance) consistently outperforms MMSE (21.2%) across all methods. Assessing daily living ability may be a more sensitive early indicator than cognitive testing alone.
51%
Subjective Reports Matter
MemoryComplaints appears in 51% of all 39 high-confidence rules. Patient- and caregiver-reported concerns are not soft anecdotes — they are among the strongest clinical signals available.
2
Convergent Features Validated
MemoryComplaints and BehavioralProblems appear as important in correlation analysis, association rules, AND decision tree splits — three independent methods providing convergent evidence of their clinical centrality.
The Methods

How every number in this study was produced

Every finding is backed by a specific, replicable method. The pipeline moves systematically from exploration to prediction, with each stage building on insights from the last.

Method | Question it answered | Key result | Verdict
Exploratory Data Analysis | Which features relate most strongly to diagnosis? | FunctionalAssessment r=−0.36 · no single perfect predictor | Foundation ✓
K-Means Clustering (k=2–7) | Do discrete patient subtypes exist? | Silhouette = 0.051–0.059 · no clear elbow | Scientific Finding
Hierarchical Clustering | Does a different algorithm find stable clusters? | ARI = 0.025 vs k-means | Confirms Continuum
Association Rules (Apriori) | Which symptom combinations predict diagnosis? | 39 rules · confidence 60–84% · lift up to 2.36 | Actionable ✓
Decision Tree (depth=5) | Can we build an interpretable classifier? | 93.8% accuracy · 0.912 F1 · 91% sensitivity | Excellent ✓
Cross-Method Integration | Do findings replicate across independent methods? | MemoryComplaints & BehavioralProblems — both convergent | Validated ✓
Built With

Tools & Technologies

Python 3.10 scikit-learn 1.3 mlxtend 0.22 pandas 2.1 NumPy 1.24 matplotlib 3.7 seaborn 0.12 Google Colab GitHub

Transparent models can achieve excellent clinical performance.

93.8% accuracy. 91% sensitivity. 95% specificity. All from a decision tree that a clinician can read, audit, and trust — no black box required.

The findings point toward a shift in Alzheimer's screening: prioritize functional assessment alongside cognitive testing, take subjective complaints seriously, and look for the combination of behavioral and cognitive signals that appear across every method in this study.

Primary contributor: Cynthia Mutua · Co-authors: Halee Belghouthi, Fedi Naimi, Jhansi Nalla · CIS 635 · GVSU 2025

View Full Code on GitHub ↗