DS4200  ·  Northeastern University  ·  Spring 2026

Visualizing the Complexity
of Genetic Disorders

An interactive multi-variate analysis system exploring the relationships between genetic mutations, inheritance patterns, and patient outcomes through dynamic visualization.

22K Patient Records
45 Features
5 Visualizations
3 Disorder Types

Introduction

Project Overview & Tasks

Genetic disorders affect millions of people worldwide and represent a critical area of medical research. Understanding the relationships between genetic mutations, inheritance patterns, and patient outcomes is essential for advancing diagnosis, treatment, and genetic counseling.

The multidimensional nature of genomic data makes identifying patterns challenging through traditional methods. This project builds an interactive visualization system that enables researchers and clinicians to explore genetic disorder data dynamically.

We examine quantitative variables, such as blood cell counts and patient age, alongside qualitative factors like disorder type, inheritance pattern, and environmental risk factors. Our goal is to make complex genomic data more accessible and actionable for medical researchers and students.

The project uses the Kaggle AI Buzz Genetic Disorders training dataset, which contains approximately 22,000 patient records with 45 features spanning demographic information, clinical measurements, and genetic disorder classifications.

Task 01

Clinical Measurements vs. Disorder Type

Do biomarkers like blood cell count and white blood cell count vary across the three genetic disorder categories? Can these values help identify diagnostic patterns?

Task 02

Inheritance Patterns & Outcomes

How do maternal vs. paternal inheritance pathways connect to specific disorder types? Are certain inheritance routes more strongly associated with particular disorders?

Task 03

Demographic & Environmental Risk

How do parental substance abuse history and serious maternal illness influence the type and severity of genetic disorders in patients?


About the Data

The dataset used in this project is the AI Buzz Genetic Disorders Dataset, obtained from Kaggle. Specifically, we use the training file train_genetic_disorders.csv.

The raw dataset contains approximately 22,000 patient records with 45 features covering demographic information (patient and parental ages, gender), clinical measurements (blood cell count, white blood cell count), genetic disorder classifications (3 disorder types, 9 subclasses), and patient outcomes (status, birth defects). After removing rows with missing values, the cleaned dataset contains approximately 5,000 records, which are used for all visualizations.
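
The cleaning step described above amounts to dropping every record that has an empty field. The following is a minimal sketch using only the Python standard library, assuming missing values appear as empty strings in the CSV; the column names and sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Hypothetical sample standing in for train_genetic_disorders.csv;
# column names and values are illustrative assumptions.
SAMPLE_CSV = """patient_age,blood_cell_count,disorder_type
4,4.76,Mitochondrial
,4.91,Single Gene
7,,Multifactorial
12,4.70,Single Gene
"""

def drop_incomplete_rows(rows):
    """Keep only rows in which every field is present and non-empty."""
    return [
        row for row in rows
        if all(value is not None and value.strip() for value in row.values())
    ]

raw_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
clean_rows = drop_incomplete_rows(raw_rows)
print(len(raw_rows), "raw rows ->", len(clean_rows), "complete rows")
```

Dropping whole rows is simple but costly when many columns have gaps, which is consistent with the reduction from roughly 22,000 to roughly 5,000 records reported here.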

Because the test and symptom columns (tests 1–5, symptoms 1–5) are anonymized as binary values (0 or 1), we engineered a new feature called the symptom severity score, calculated as the number of symptoms for which a patient is positive. This derived variable is used across several visualizations.
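
The symptom severity score can be sketched as a sum over the five binary symptom flags. The column names below are assumptions for illustration; the actual headers in the CSV may differ.

```python
# Column names are hypothetical; the real dataset headers may differ.
SYMPTOM_COLUMNS = [f"symptom_{i}" for i in range(1, 6)]

def symptom_severity_score(patient):
    """Sum the binary symptom flags (0 or 1) for one patient record."""
    return sum(int(patient[column]) for column in SYMPTOM_COLUMNS)

# Illustrative record: positive for symptoms 1, 3, and 4.
patient = {"symptom_1": 1, "symptom_2": 0, "symptom_3": 1, "symptom_4": 1, "symptom_5": 0}
score = symptom_severity_score(patient)
print(score)  # 3
```

The score ranges from 0 (no positive symptoms) to 5 (all positive), so it can be treated as an ordinal variable for grouping or color encoding.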


Category | Variables | Used For
Quantitative | Patient age, parental ages, blood cell count, white blood cell count, previous abortions | Scatter plots, box plots, distribution analysis
Categorical | Disorder type (3), subclass (9), gender, inheritance source, birth defects, blood test results | Grouping, filtering, comparative analysis
Engineered | Symptom severity score (sum of positive symptoms across symptoms 1–5) | All visualizations

🧬   Source

Kaggle — AI Buzz Genetic Disorders Dataset (train_genetic_disorders.csv)

📋   Raw Size

~22,000 rows × 45 columns

🔬   After Cleaning

~5,000 rows after removing missing values

📊   Disorder Types

3 genetic disorder types, 9 subclasses

🔒   Privacy Note

Test and symptom columns are anonymized; only a derived severity score is shown.

Visualizations

Data Visualizations

Five visualizations addressing all three analysis tasks, built with Altair, D3.js, and Plotly. Each visualization includes a takeaway explaining key findings.

Plot 1  ·  Altair  ·  Static + Interactive

Blood Cell Count vs. White Blood Cell Count by Disorder Type - Scatter Plot

Interaction: Drag on the scatter plot to select a region of interest. All four charts below update automatically to show the composition of your selection. The top two bar charts reveal the distribution of test results and clusters in the selected area, while the bottom two charts break down cluster membership by test result and gender. This linked interaction enables exploration of relationships between blood cell measurements, cluster assignments, test outcomes, and demographic patterns.

Plot 2  ·  Altair  ·  Static + Interactive

Disorder Count by Gender and Inheritance Source - Grouped Bar Chart

Interaction: Click any test result bar in the top gender panels to highlight that result across all views. The bottom inheritance panels automatically filter to show only the selected test result, revealing how inheritance patterns (Both, Maternal, Paternal, Neither) distribute for that specific outcome. The linked highlighting enables comparison of gender and inheritance patterns for each test result category. Click on empty space to reset the filter and view all data.

Plot 3  ·  D3.js  ·  Interactive

Ancestral and Parental Inheritance to Genetic Disorder - Sankey Diagram

Interaction: Use the dropdown to toggle between the full 4-layer ancestral view and a simplified 3-layer parental view. Hover any flow to see exact counts. Nodes can be dragged to reposition.
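
A Sankey diagram is drawn from a list of links, each with a source node, a target node, and a count. The sketch below shows one way such links could be aggregated from patient records by counting transitions between adjacent layers. The field names, layer order, and sample records are hypothetical, and the project's own preprocessing may differ.

```python
from collections import Counter

def build_sankey_links(records, layers):
    """Count patients flowing between each pair of adjacent layers."""
    counts = Counter()
    for record in records:
        for source_layer, target_layer in zip(layers, layers[1:]):
            # Prefix node names with the layer so equal values in
            # different layers (e.g. "Yes") stay distinct nodes.
            source = f"{source_layer}: {record[source_layer]}"
            target = f"{target_layer}: {record[target_layer]}"
            counts[(source, target)] += 1
    return [{"source": s, "target": t, "value": v} for (s, t), v in counts.items()]

# Hypothetical records for a 3-layer view.
records = [
    {"maternal_gene": "Yes", "paternal_gene": "No", "disorder": "Mitochondrial"},
    {"maternal_gene": "Yes", "paternal_gene": "No", "disorder": "Single Gene"},
    {"maternal_gene": "No", "paternal_gene": "Yes", "disorder": "Single Gene"},
]
links = build_sankey_links(records, ["maternal_gene", "paternal_gene", "disorder"])
for link in links:
    print(link)
```

Switching between a 4-layer and a 3-layer view then only requires calling the function with a different `layers` list.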

Plot 4  ·  D3.js  ·  Interactive

Hierarchical Inheritance and Disorder Classification - Tree Diagram

Interaction: Click any node to zoom into that branch and reveal deeper levels, including disorder subclasses. Click outside the tree to return to the previous view. Hover over any node to see patient counts and percentage breakdowns.

Plot 5  ·  D3.js  ·  Interactive

Blood Cell Count Distribution by Genetic Disorder Type - Box Plot

Interaction: Use the dropdown menu to switch between blood cell count and white blood cell count.
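
Each box in a box plot summarizes one group with five numbers: minimum, first quartile, median, third quartile, and maximum. The sketch below computes them with the Python standard library on hypothetical values. The `method="inclusive"` option is chosen because it uses linear interpolation between data points, which should match the convention of d3.quantile; whether the plot computes its quartiles this way is an assumption.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return the min, quartiles, and max that define a box plot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}

# Hypothetical blood cell counts per disorder type (illustrative values only).
groups = {
    "Mitochondrial": [4.5, 4.7, 4.8, 4.9, 5.1],
    "Single Gene": [4.6, 4.7, 4.9, 5.0, 5.2],
}
summaries = {name: five_number_summary(values) for name, values in groups.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Switching the metric in the dropdown corresponds to recomputing these summaries from a different column.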

Summary

Findings & Future Work

What we learned from visualizing the genetic disorder dataset, and directions for future exploration.

🔍   Key Findings

Task 1: Clinical Measurements vs. Disorder Type
K-means clustering identified four biomarker patterns. Clusters 3 and 4 showed high white blood cell (WBC) counts (~9–10k/μL), while Clusters 1 and 2 showed low counts (~5–6k/μL). Test results were distributed evenly across clusters, indicating that blood measurements alone cannot distinguish disorder types.
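
As a rough illustration of the clustering step, the sketch below runs Lloyd's algorithm (the standard K-means procedure) in plain Python on two biomarkers. It is not the project's code: the sample values, the use of two clusters instead of four, and the fixed starting centroids are assumptions made to keep the example small and deterministic.

```python
import math

def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical (blood cell count, white blood cell count) pairs.
points = [(4.7, 5.2), (4.9, 5.8), (4.8, 9.1), (5.0, 9.9)]
labels, centroids = kmeans(points, centroids=[points[0], points[-1]])
print(labels)
print(centroids)
```

In this toy data the clusters separate almost entirely along the white blood cell axis, which mirrors the high-WBC versus low-WBC split described above. With features on different scales, standardizing them before clustering is usually advisable.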

Task 2: Inheritance Patterns & Outcomes
Gender and inheritance patterns showed no strong association with test outcomes. All groups exhibited similar distributions, suggesting these factors alone don't predict disorder outcomes.

Task 3: Demographic & Environmental Risk
Blood counts showed no distinct patterns after controlling for maternal illness, radiation, and substance abuse, suggesting either that these factors have minimal impact or that confounding variables mask their effect.

Additional Insights
Mitochondrial disorders more frequently show maternal history; Single Gene disorders show more patients without parental history. Disorders don't follow straightforward inheritance paths.

🔭   Limitations & Future Work

Limitations
Data: Dataset reduced from 22K to 5K records after removing missing values. Anonymized features limit interpretation. Cross-sectional data prevents temporal analysis.

Methods: K-means assumes spherical clusters. Only two biomarkers used. Associations don't establish causation; unknown confounders may influence results.

Generalizability: Kaggle dataset may not represent clinical populations. Findings require real-world validation.

Future Work
• Incorporate symptom severity into clustering analysis
• Apply advanced clustering (hierarchical, DBSCAN) for non-spherical patterns
• Build predictive models (Random Forest, XGBoost) for disorder classification
• Add temporal analysis if longitudinal data become available
• Integrate genetic mutation data to map mutations to biomarker clusters


References

Related Work

Academic sources and datasets that informed this project.


Who Built This


Dalia

Data & Visualization
  • Data collection & exploration
  • Plot 1 — Altair scatter plot
  • Plot 2 — Altair grouped bar chart
  • Website layout for plots 1 & 2
  • Introduction & related work (report)

Yuqi

Engineering & Visualization
  • Data processing & feature engineering
  • Plot 3 — D3.js Sankey diagram
  • Plot 5 — D3.js grouped box plot
  • Website implementation for plots 3 & 5
  • Data description & analysis plan (report)

Jeff

QA & Integration
  • Data validation & quality assurance
  • Plot 4 — D3.js tree diagram
  • Website integration & interactivity
  • Team coordination & final presentation

All members: collaborative design decisions, user testing, debugging, and documentation.