Visualizing Genetic Disorders

Introduction

Project Overview & Tasks

Genetic disorders affect millions of people worldwide and represent a critical area of medical research. Understanding the relationships between genetic mutations, inheritance patterns, and patient outcomes is essential for advancing diagnosis, treatment, and genetic counseling.

The multidimensional nature of genomic data makes identifying patterns challenging through traditional methods. This project builds an interactive visualization system that enables researchers and clinicians to explore genetic disorder data dynamically.

We examine quantitative variables, such as blood cell counts and patient age, alongside qualitative factors like disorder type, inheritance pattern, and environmental risk factors. Our goal is to make complex genomic data more accessible and actionable for medical researchers and students.

The project uses the Kaggle AI Buzz Genetic Disorders training dataset, which contains approximately 22,000 patient records with 45 features spanning demographic information, clinical measurements, and genetic disorder classifications.

Task 01

Clinical Measurements vs. Disorder Type

Do biomarkers like blood cell count and white blood cell count vary across the three genetic disorder categories? Can these values help identify diagnostic patterns?

Task 02

Inheritance Patterns & Outcomes

How do maternal vs. paternal inheritance pathways connect to specific disorder types? Are certain inheritance routes more strongly associated with particular disorders?

Task 03

Demographic & Environmental Risk

How do parental substance abuse history and serious maternal illness influence the type and severity of genetic disorders in patients?

Dataset

About the Data

The dataset used in this project is the AI Buzz Genetic Disorders Dataset, obtained from Kaggle. Specifically, we use the training file train_genetic_disorders.csv.

The raw dataset contains approximately 22,000 patient records with 45 features covering demographic information (patient and parental ages, gender), clinical measurements (blood cell count, white blood cell count), genetic disorder classifications (3 disorder types, 9 subclasses), and patient outcomes (status, birth defects). After removing rows with missing values, the cleaned dataset contains approximately 5,000 records. All visualizations are built from this cleaned subset.
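The cleaning step described above amounts to dropping every record that has at least one empty field. The following is a minimal sketch using only the Python standard library. The sample rows and column names are made up for illustration, and the project's actual cleaning code may differ (for instance, it may treat other placeholder values as missing).

```python
import csv
import io

def drop_incomplete_rows(rows):
    """Keep only rows in which every field is a non-empty string."""
    return [
        row for row in rows
        if all(isinstance(value, str) and value.strip() for value in row.values())
    ]

# Tiny made-up sample standing in for train_genetic_disorders.csv.
# Column names here are assumptions, not the real header.
SAMPLE_CSV = """Patient Age,Blood cell count,Genetic Disorder
4,4.91,Mitochondrial
,4.70,Single-gene
7,,Mitochondrial
11,5.02,Single-gene
"""

raw_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
clean_rows = drop_incomplete_rows(raw_rows)
print(len(raw_rows), "raw rows,", len(clean_rows), "complete rows")
```

Dropping whole rows is the simplest policy, and it is also why the record count falls so sharply: a single missing field anywhere in the 45 columns removes the patient.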

Because some columns, specifically the test and symptom columns (test 1–5, symptoms 1–5), are anonymized with binary values (0 or 1), we engineered a new feature called symptom severity score: the number of symptoms, out of five, for which a patient tested positive. This derived variable is used across several visualizations.

Category	Variables	Used For
Quantitative	Patient age, parental ages, blood cell count, white blood cell count, previous abortions	Scatter plots, box plots, distribution analysis
Categorical	Disorder type (3), subclass (9), gender, inheritance source, birth defects, blood test results	Grouping, filtering, comparative analysis
Engineered	Symptom severity score — sum of positive symptoms across symptoms 1–5	All visualizations
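As a rough illustration of the engineered feature, the score is a sum of five binary flags, so it ranges from 0 to 5. The column names below ("Symptom 1" through "Symptom 5") and the string encoding of the flags are assumptions about the CSV, not something confirmed by the project.

```python
def symptom_severity_score(record, symptom_columns):
    """Count how many binary symptom flags are set to 1 for one patient."""
    return sum(int(float(record[column])) for column in symptom_columns)

# Assumed column names; adjust to match the real CSV header.
SYMPTOM_COLUMNS = [f"Symptom {i}" for i in range(1, 6)]

# Made-up patient record with flags stored as strings, as a CSV reader would return them.
patient = {
    "Symptom 1": "1.0",
    "Symptom 2": "0.0",
    "Symptom 3": "1.0",
    "Symptom 4": "1.0",
    "Symptom 5": "0.0",
}

score = symptom_severity_score(patient, SYMPTOM_COLUMNS)
print(score)  # 3
```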

🧬 Source

Kaggle — AI Buzz Genetic Disorders Dataset (train_genetic_disorders.csv)

📋 Raw Size

~22,000 rows × 45 columns

🔬 After Cleaning

~5,000 rows after removing missing values

📊 Disorder Types

3 genetic disorder types, 9 subclasses

🔒 Privacy Note

Test and symptom columns are anonymized; only a derived severity score is shown.

Visualizations

Data Visualizations

Five visualizations addressing all three analysis tasks, built with Altair, D3.js, and Plotly. Each visualization includes a takeaway explaining key findings.

Plot 1 · Altair · Static + Interactive

Blood Cell Count vs. White Blood Cell Count by Disorder Type - Scatter Plot

Interaction: Drag on the scatter plot to select a region of interest. All four charts below update automatically to show the composition of your selection. The top two bar charts reveal the distribution of test results and clusters in the selected area, while the bottom two charts break down cluster membership by test result and gender. This linked interaction enables exploration of relationships between blood cell measurements, cluster assignments, test outcomes, and demographic patterns.

Takeaway

K-means clustering reveals four distinct biomarker patterns in the blood cell data. Clusters 3 and 4 (upper region) show elevated white blood cell counts (roughly 9–10 thousand/μL), while Clusters 1 and 2 (lower region) show lower counts (roughly 5–6 thousand/μL). However, test results (abnormal, normal, slightly abnormal, inconclusive) are distributed fairly evenly across all four clusters: the cluster composition charts show similar proportions of each test result in every cluster. Blood cell and white blood cell counts alone are therefore unlikely to be sufficient diagnostic indicators, and additional biomarkers would be needed for accurate disorder classification. This addresses Task 1: natural groupings exist in the biomarker space, but they do not correspond strongly to disorder outcomes.
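For readers new to clustering, the sketch below shows the basic K-means loop (Lloyd's algorithm) on two-dimensional points. It is an illustration only, not the project's implementation: the points are made-up values in arbitrary units, and the deterministic farthest-first seeding is a choice made here to keep the example reproducible.

```python
def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def farthest_first_centroids(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point that is farthest from every centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]

def kmeans(points, k, max_iterations=100):
    """Alternate assignment and mean-update steps until centroids stop moving."""
    centroids = farthest_first_centroids(points, k)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                updated.append(centroids[i])  # leave an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Made-up points forming four well-separated groups (arbitrary units).
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
    (5.0, 1.0), (5.2, 1.2), (4.8, 0.8),
    (1.0, 5.0), (0.8, 4.8), (1.2, 5.2),
    (5.0, 5.0), (5.2, 4.8), (4.8, 5.2),
]

centroids, labels = kmeans(points, k=4)
print(sorted((round(x, 6), round(y, 6)) for x, y in centroids))
print(labels)
```

Because each point is assigned purely by straight-line distance to a centroid, the method favors compact, roughly round groups, which is the limitation noted later under Methods.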

Altair · Task 1 · Dalia

Plot 2 · Altair · Static + Interactive

Disorder Count by Gender and Inheritance Source - Grouped Bar Chart

Interaction: Click any test result bar in the top gender panels to highlight that result across all views. The bottom inheritance panels automatically filter to show only the selected test result, revealing how inheritance patterns (Both, Maternal, Paternal, Neither) distribute for that specific outcome. The linked highlighting enables comparison of gender and inheritance patterns for each test result category. Click on empty space to reset the filter and view all data.

Takeaway

Test results are distributed in a relatively balanced way across both gender and inheritance source. All three gender groups (Male, Female, Ambiguous) show similar proportions of abnormal, normal, slightly abnormal, and inconclusive results, with each test result category containing 300–380 patients per gender. Likewise, the four inheritance sources (Both, Maternal, Paternal, Neither) show comparable distributions across test result categories, with no strong association between any source and a specific outcome. "Neither" (no inheritance from either parent) is the most common pattern, followed by "Maternal," "Paternal," and "Both." These findings suggest that gender and inheritance source alone are not strong predictors of disorder outcomes in this dataset. This addresses Task 2: inheritance patterns vary, but they do not correspond to distinct disorder profiles in the available test results.
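One way to check a claim of "similar proportions" is to normalize the counts within each group, so that groups of different sizes can be compared directly. The sketch below does this for any pair of categorical fields. The field names and records are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter, defaultdict

def row_proportions(records, group_key, outcome_key):
    """For each group, return the share of its records in each outcome category."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[group_key]][record[outcome_key]] += 1
    return {
        group: {outcome: n / sum(outcomes.values()) for outcome, n in outcomes.items()}
        for group, outcomes in counts.items()
    }

# Hypothetical records; the real dataset has more categories and far more rows.
records = [
    {"gender": "Male", "test_result": "normal"},
    {"gender": "Male", "test_result": "normal"},
    {"gender": "Male", "test_result": "abnormal"},
    {"gender": "Male", "test_result": "abnormal"},
    {"gender": "Female", "test_result": "normal"},
    {"gender": "Female", "test_result": "abnormal"},
    {"gender": "Female", "test_result": "abnormal"},
    {"gender": "Female", "test_result": "abnormal"},
]

proportions = row_proportions(records, "gender", "test_result")
for group, shares in proportions.items():
    print(group, shares)
```

If the per-group shares are close to one another, as the charts suggest for the real data, the grouping variable carries little information about the outcome.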

Altair · Task 2 · Dalia

Plot 3 · D3.js · Interactive

Ancestral and Parental Inheritance to Genetic Disorder - Sankey Diagram

Interaction: Use the dropdown to toggle between the full 4-layer ancestral view and a simplified 3-layer parental view. Hover any flow to see exact counts. Nodes can be dragged to reposition.

Takeaway

The Sankey diagram shows that patients with Leigh syndrome and mitochondrial myopathy, both of which belong to the class of mitochondrial genetic inheritance disorders, more often have a maternal history of genetic disorder. For single-gene disorders, by contrast, patients without any parental genetic disorder history outnumber patients with a maternal history.
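The flow widths in a Sankey diagram are just counts of patients moving between adjacent layers. The sketch below aggregates records into a list of source, target, and value entries, a shape commonly used for Sankey link data. The layer names and records are hypothetical and do not reflect the project's actual layers or counts.

```python
from collections import Counter

def sankey_links(records, layers):
    """Count records flowing between each pair of consecutive layers."""
    flows = Counter()
    for record in records:
        for left, right in zip(layers, layers[1:]):
            # Prefix node names with the layer so equal values in
            # different layers stay distinct nodes.
            source = f"{left}: {record[left]}"
            target = f"{right}: {record[right]}"
            flows[(source, target)] += 1
    return [
        {"source": source, "target": target, "value": value}
        for (source, target), value in sorted(flows.items())
    ]

# Hypothetical three-layer view: maternal history, paternal history, disorder.
LAYERS = ["maternal_history", "paternal_history", "disorder"]
records = [
    {"maternal_history": "Yes", "paternal_history": "No", "disorder": "Mitochondrial"},
    {"maternal_history": "Yes", "paternal_history": "No", "disorder": "Mitochondrial"},
    {"maternal_history": "No", "paternal_history": "No", "disorder": "Single-gene"},
]

links = sankey_links(records, LAYERS)
for link in links:
    print(link)
```

Switching between a four-layer and a three-layer view, as the dropdown does, would correspond to calling the function with a different list of layers.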

Plotly · Task 2 · Yuqi

Plot 4 · D3.js · Interactive

Hierarchical Inheritance and Disorder Classification — Tree Diagram

Interaction: Click any node to zoom into that branch and reveal deeper levels, including disorder subclasses. Click outside the tree to go back to where you were. Hover over any node to see patient counts and percentage breakdowns.

Takeaway

The tree reveals that genetic disorders do not always follow straightforward inheritance paths. Some disorders frequently appear under unexpected pathways, such as a gene that runs on the maternal side but is passed down by the father, which suggests that both parents can be silent carriers. This insight could help genetic counselors identify hidden risk in families where neither parent shows symptoms, and it highlights the importance of screening beyond surface-level family history.
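The patient counts and percentage breakdowns shown on hover can be derived by nesting records level by level and keeping a running count at each node. The sketch below is one way to build such a hierarchy. The level names and records are hypothetical, and the project's own tree may be organized differently.

```python
def build_tree(records, levels):
    """Nest records by the given levels, tracking a patient count at every node."""
    root = {"name": "All patients", "count": 0, "children": {}}
    for record in records:
        node = root
        node["count"] += 1
        for level in levels:
            key = record[level]
            node = node["children"].setdefault(
                key, {"name": key, "count": 0, "children": {}}
            )
            node["count"] += 1
    return root

def share_of_parent(parent, child_name):
    """Percentage of the parent's patients that fall under the named child."""
    return 100 * parent["children"][child_name]["count"] / parent["count"]

# Hypothetical levels and records, for illustration only.
LEVELS = ["inherited_from", "disorder", "subclass"]
records = [
    {"inherited_from": "Mother", "disorder": "Mitochondrial", "subclass": "Leigh syndrome"},
    {"inherited_from": "Mother", "disorder": "Mitochondrial", "subclass": "Leigh syndrome"},
    {"inherited_from": "Mother", "disorder": "Mitochondrial", "subclass": "Mitochondrial myopathy"},
    {"inherited_from": "Father", "disorder": "Mitochondrial", "subclass": "Leigh syndrome"},
]

tree = build_tree(records, LEVELS)
print(tree["count"])                    # 4
print(share_of_parent(tree, "Mother"))  # 75.0
```

Storing children in a dictionary keyed by name keeps insertion simple; a D3 hierarchy would typically expect the children as an array, so a final conversion step would be needed before rendering.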

D3.js · Tasks 2 & 3 · Jeff

Plot 5 · D3.js · Interactive

Blood Cell Count Distribution by Genetic Disorders - Boxplot

Interaction: Use the dropdown menu to switch between blood cell count and white blood cell count.
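Each box in a box plot summarizes one group with five numbers: minimum, first quartile, median, third quartile, and maximum. The sketch below computes them per disorder type with the standard library. The field names and values are made up, and the quartile method and whisker convention used by the actual plot are not known, so this is only one reasonable choice.

```python
import statistics
from collections import defaultdict

def box_stats_by_group(records, group_key, value_key):
    """Return the five-number summary of value_key for each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(float(record[value_key]))
    summary = {}
    for group, values in groups.items():
        # "inclusive" treats the data as the full population when interpolating.
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary[group] = {
            "min": min(values),
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": max(values),
        }
    return summary

# Made-up measurements; switching the plotted metric means changing value_key.
records = [
    {"disorder": "Mitochondrial", "blood_cell_count": v}
    for v in (4.6, 4.8, 5.0, 5.2, 5.4)
] + [
    {"disorder": "Single-gene", "blood_cell_count": v}
    for v in (4.5, 4.7, 4.9, 5.1, 5.3)
]

stats = box_stats_by_group(records, "disorder", "blood_cell_count")
for disorder, summary in stats.items():
    print(disorder, summary)
```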

Summary

Findings & Future Work

What we learned from visualizing the genetic disorder dataset, and directions for future exploration.

🔍 Key Findings

Task 1: Clinical Measurements vs. Disorder Type
K-means clustering identified four biomarker patterns. Clusters 3–4 showed high WBC counts (~9-10k/μL), Clusters 1–2 showed low counts (~5-6k/μL). Test results distributed evenly across clusters, indicating blood measurements alone cannot distinguish disorder types.

Task 2: Inheritance Patterns & Outcomes
Gender and inheritance patterns showed no strong association with test outcomes. All groups exhibited similar distributions, suggesting these factors alone don't predict disorder outcomes.

Task 3: Demographic & Environmental Risk
Blood counts showed no distinct patterns after controlling for maternal illness, radiation, and substance abuse, suggesting minimal impact or confounding variables.

Additional Insights
Mitochondrial disorders more frequently show maternal history; Single Gene disorders show more patients without parental history. Disorders don't follow straightforward inheritance paths.

🔭 Limitations & Future Work

Limitations
Data: Dataset reduced from 22K to 5K records after removing missing values. Anonymized features limit interpretation. Cross-sectional data prevents temporal analysis.

Methods: K-means assumes spherical clusters. Only two biomarkers used. Associations don't establish causation; unknown confounders may influence results.

Generalizability: Kaggle dataset may not represent clinical populations. Findings require real-world validation.

Future Work
• Incorporate symptom severity into clustering analysis
• Apply advanced clustering (hierarchical, DBSCAN) for non-spherical patterns
• Build predictive models (Random Forest, XGBoost) for disorder classification
• Add temporal analysis if longitudinal data available
• Integrate genetic mutation data to map mutations to biomarker clusters

References

Related Work

Academic sources and datasets that informed this project.

Glusman, G., Caballero, J., Mauldin, D. E., Hood, L., & Roach, J. C. (2011). Kaviar: an accessible system for testing SNV novelty. Bioinformatics, 27(22), 3216–3217.
https://doi.org/10.1093/bioinformatics/btr540
This paper presents a system for analyzing single nucleotide variants (SNVs) in genetic data, informing how we present mutation data across different disorder types in our system.

Schriml, L. M., Mitraka, E., Munro, J., et al. (2019). Human Disease Ontology 2018 update: Classification, content and workflow expansion. Nucleic Acids Research, 47(D1), D955–D962.
https://doi.org/10.1093/nar/gky1032
This work on disease classification and ontology provides the framework we use for organizing and filtering between different disorder types and subclasses.

AI Buzz Genetic Disorders Dataset (2023). Kaggle.
https://www.kaggle.com/datasets/aibuzz/predict-the-genetic-disorders-datasetof-genomes?select=train_genetic_disorders.csv
The primary dataset used in this project, containing ~22,000 patient records with 45 features covering demographics, clinical measurements, and genetic disorder classifications.

Bostock, M., Ogievetsky, V., & Heer, J. (2011). D³: Data-Driven Documents. IEEE Transactions on Visualization and Computer Graphics, 17(12), 2301–2309.
https://doi.org/10.1109/TVCG.2011.185
The foundational paper for D3.js, the JavaScript library used to build the Sankey diagram, box plot, and tree diagram in this project.

Who Built This

Dalia

Yuqi

Jeff
