DS4200 · Northeastern University · Spring 2026
An interactive multi-variate analysis system exploring the relationships between genetic mutations, inheritance patterns, and patient outcomes through dynamic visualization.
Introduction
Genetic disorders affect millions of people worldwide and represent a critical area of medical research. Understanding the relationships between genetic mutations, inheritance patterns, and patient outcomes is essential for advancing diagnosis, treatment, and genetic counseling.
The multidimensional nature of genomic data makes identifying patterns challenging through traditional methods. This project builds an interactive visualization system that enables researchers and clinicians to explore genetic disorder data dynamically.
We examine quantitative variables, such as blood cell counts and patient age, alongside qualitative factors like disorder type, inheritance pattern, and environmental risk factors. Our goal is to make complex genomic data more accessible and actionable for medical researchers and students.
The project uses the Kaggle AI Buzz Genetic Disorders training dataset, which contains approximately 22,000 patient records with 45 features spanning demographic information, clinical measurements, and genetic disorder classifications.
Do biomarkers like blood cell count and white blood cell count vary across the three genetic disorder categories? Can these values help identify diagnostic patterns?
How do maternal vs. paternal inheritance pathways connect to specific disorder types? Are certain inheritance routes more strongly associated with particular disorders?
How do parental substance abuse history and serious maternal illness influence the type and severity of genetic disorders in patients?
Dataset
The dataset used in this project is the AI Buzz Genetic Disorders Dataset, obtained from Kaggle. Specifically, we use the training file train_genetic_disorders.csv.
The raw dataset contains approximately 22,000 patient records with 45 features covering demographic information (patient and parental ages, gender), clinical measurements (blood cell count, white blood cell count), genetic disorder classifications (3 disorder types, 9 subclasses), and patient outcomes (status, birth defects). After removing rows with missing values, the cleaned dataset contains approximately 5,000 records, which is used for all visualizations.
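The cleaning step described above (dropping every row that has a missing value) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline; the column names and the three sample records are hypothetical stand-ins for train_genetic_disorders.csv.

```python
def drop_incomplete_rows(rows):
    """Keep only the rows in which every field has a value."""
    def is_missing(value):
        # Treat None, empty strings, and whitespace-only strings as missing.
        return value is None or (isinstance(value, str) and value.strip() == "")

    return [row for row in rows if not any(is_missing(v) for v in row.values())]


# Hypothetical miniature records (assumed column names, made-up values).
records = [
    {"Patient Age": "4", "Blood cell count": "4.9", "Genetic Disorder": "Mitochondrial"},
    {"Patient Age": "", "Blood cell count": "5.1", "Genetic Disorder": "Single Gene"},
    {"Patient Age": "7", "Blood cell count": None, "Genetic Disorder": "Mitochondrial"},
]

cleaned = drop_incomplete_rows(records)
print(len(records), "->", len(cleaned))  # 3 -> 1
```

Dropping whole rows is the simplest strategy, and it accounts for the large reduction in record count reported above.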
Because some columns, specifically the test and symptom columns (test 1–5, symptoms 1–5), are anonymized with binary values (0 or 1), we engineered a new feature called symptom severity score, calculated by summing the number of symptoms a patient tested positive for. This derived variable is used across several visualizations.
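As a rough sketch of how the symptom severity score described above could be computed, the snippet below sums the five binary symptom flags for one patient. The column names ("Symptom 1" through "Symptom 5") and the sample patient are assumptions made for illustration.

```python
# Assumed column names for the five anonymized symptom flags.
SYMPTOM_COLUMNS = [f"Symptom {i}" for i in range(1, 6)]


def symptom_severity_score(row):
    """Count how many of the five binary symptom flags are positive (1)."""
    # int(float(...)) accepts values stored as 1, 1.0, "1", or "1.0".
    return sum(int(float(row[column])) for column in SYMPTOM_COLUMNS)


# Hypothetical patient record with three positive symptoms.
patient = {
    "Symptom 1": 1,
    "Symptom 2": 0,
    "Symptom 3": "1.0",
    "Symptom 4": 1,
    "Symptom 5": 0,
}

print(symptom_severity_score(patient))  # 3
```

The score ranges from 0 (no positive symptoms) to 5 (all positive), which makes it usable as an ordinal variable in the visualizations.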
| Category | Variables | Used For |
|---|---|---|
| Quantitative | Patient age, parental ages, blood cell count, white blood cell count, previous abortions | Scatter plots, box plots, distribution analysis |
| Categorical | Disorder type (3), subclass (9), gender, inheritance source, birth defects, blood test results | Grouping, filtering, comparative analysis |
| Engineered | Symptom severity score — sum of positive symptoms across symptoms 1–5 | All visualizations |
Visualizations
Five visualizations address all three analysis tasks and are built with Altair, D3.js, and Plotly. Each visualization includes a takeaway explaining its key findings.
Summary
What we learned from visualizing the genetic disorder dataset, and directions for future exploration.
Task 1: Clinical Measurements vs. Disorder Type
K-means clustering identified four biomarker patterns. Clusters 3–4 showed high WBC counts (~9–10k/μL), while Clusters 1–2 showed low counts (~5–6k/μL). Test results were distributed evenly across clusters, indicating that blood measurements alone cannot distinguish disorder types.
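For readers unfamiliar with the method, the following is a minimal standard-library sketch of k-means (Lloyd's algorithm) on two biomarkers. It is not the project's implementation, which is not shown here. The sample points are hypothetical (WBC count, blood cell count) pairs, and the example uses k=2 on toy data for brevity, whereas the analysis above reports four clusters.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical (WBC count, blood cell count) pairs: three low-WBC, three high-WBC.
samples = [
    (5.0, 4.8), (5.5, 4.9), (6.0, 5.0),
    (9.0, 4.8), (9.5, 4.9), (10.0, 5.0),
]
labels, centroids = kmeans(samples, k=2)
print(labels)
print(sorted(centroids))
```

Because each point is assigned by Euclidean distance to the nearest centroid, the method favors compact, roughly spherical groups, and features on very different scales should be standardized before clustering.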
Task 2: Inheritance Patterns & Outcomes
Gender and inheritance patterns showed no strong association with test outcomes. All groups exhibited similar distributions, suggesting these factors alone don't predict disorder outcomes.
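One simple way to compare outcome distributions across groups, as described above, is a row-normalized cross-tabulation. The sketch below is an illustration only; the column names ("Gender", "Blood test result") and the records are assumptions, not values taken from the dataset.

```python
from collections import Counter, defaultdict


def row_proportions(rows, group_key, outcome_key):
    """Return, for each group, the share of rows falling in each outcome."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[group_key]][row[outcome_key]] += 1
    return {
        group: {
            outcome: n / sum(outcomes.values())
            for outcome, n in outcomes.items()
        }
        for group, outcomes in counts.items()
    }


# Hypothetical records with assumed column names.
rows = [
    {"Gender": "Female", "Blood test result": "normal"},
    {"Gender": "Female", "Blood test result": "abnormal"},
    {"Gender": "Male", "Blood test result": "normal"},
    {"Gender": "Male", "Blood test result": "normal"},
    {"Gender": "Male", "Blood test result": "abnormal"},
    {"Gender": "Male", "Blood test result": "abnormal"},
]

shares = row_proportions(rows, "Gender", "Blood test result")
print(shares["Female"])  # {'normal': 0.5, 'abnormal': 0.5}
print(shares["Male"])    # {'normal': 0.5, 'abnormal': 0.5}
```

When the rows of such a table look alike across groups, as in this toy example, the grouping variable carries little information about the outcome.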
Task 3: Demographic & Environmental Risk
Blood counts showed no distinct patterns after controlling for maternal illness, radiation, and substance abuse, suggesting either that these factors have minimal impact or that their effects are masked by confounding variables.
Additional Insights
Mitochondrial disorders more frequently show maternal history; Single Gene disorders show more patients without parental history. Disorders don't follow straightforward inheritance paths.
Limitations
Data: The dataset shrank from approximately 22,000 to approximately 5,000 records after rows with missing values were removed. Anonymized features limit interpretation. Cross-sectional data prevents temporal analysis.
Methods: K-means assumes spherical clusters. Only two biomarkers used. Associations don't establish causation; unknown confounders may influence results.
Generalizability: Kaggle dataset may not represent clinical populations. Findings require real-world validation.
Future Work
• Incorporate symptom severity into clustering analysis
• Apply advanced clustering (hierarchical, DBSCAN) for non-spherical patterns
• Build predictive models (Random Forest, XGBoost) for disorder classification
• Add temporal analysis if longitudinal data become available
• Integrate genetic mutation data to map mutations to biomarker clusters
References
Academic sources and datasets that informed this project.
The Team
All members: collaborative design decisions, user testing, debugging, and documentation.