
Data Mapping Guide

Transform messy spreadsheet columns into a clean, standardized dataset with AI-assisted mapping.


Overview

Data mapping is the critical step between raw data and analysis. MedTWIN's AI:

  1. Profiles your data (types, distributions, missing values)
  2. Suggests mappings to standard clinical variables
  3. Validates data quality and consistency
  4. Transforms data into analysis-ready format

Upload Process

Supported File Types

Format          Max Size   Notes
Excel (.xlsx)   100 MB     Multiple sheets supported
CSV             100 MB     UTF-8 encoding recommended
JSON            50 MB      Flat or nested structures
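
If you want to confirm a file is within these limits before uploading, a quick local check is easy to script. A minimal Python sketch (the file name is a placeholder; the size limits and UTF-8 check simply mirror the table above):

from pathlib import Path

SIZE_LIMITS_MB = {".xlsx": 100, ".csv": 100, ".json": 50}   # limits from the table above

def preflight(path: str) -> list[str]:
    """Return any problems that would block or complicate the upload."""
    p = Path(path)
    problems = []
    limit = SIZE_LIMITS_MB.get(p.suffix.lower())
    if limit is None:
        problems.append(f"Unsupported format: {p.suffix}")
    elif p.stat().st_size > limit * 1024 * 1024:
        problems.append(f"{p.name} exceeds the {limit} MB limit")
    if p.suffix.lower() == ".csv":
        try:
            p.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append("CSV is not UTF-8; re-save with UTF-8 encoding")
    return problems

print(preflight("demographics.csv"))   # hypothetical file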

Auto-Profiling

After upload, MedTWIN automatically profiles each column, detecting its type, format, completeness, and range. For example:

📊 Column Profile: "admission_date"
├── Type: Date
├── Format: YYYY-MM-DD (detected)
├── Non-null: 1,245 / 1,247 (99.8%)
├── Unique: 892
├── Range: 2019-01-03 to 2024-11-15
└── Suggested mapping: admission_date
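
You can reproduce a similar profile locally to sanity-check a column before upload. A rough pandas sketch (the file name is a placeholder, and the 90% date-parse threshold is an assumption for the example, not MedTWIN's internal rule):

import pandas as pd

def profile_column(df: pd.DataFrame, col: str) -> dict:
    """Summarize a column: detected type, completeness, uniqueness, range."""
    s = df[col]
    if s.dtype == "object":                              # only try date parsing on text columns
        as_dates = pd.to_datetime(s, errors="coerce")
    else:
        as_dates = pd.Series(pd.NaT, index=s.index)
    is_date = as_dates.notna().mean() > 0.9              # assumption: mostly parseable = date column
    return {
        "column": col,
        "type": "Date" if is_date else str(s.dtype),
        "non_null": f"{s.notna().sum()} / {len(s)} ({s.notna().mean():.1%})",
        "unique": int(s.nunique()),
        "range": (as_dates.min(), as_dates.max()) if is_date else (s.min(), s.max()),
    }

df = pd.read_csv("demographics.csv")                     # hypothetical export
print(profile_column(df, "admission_date"))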

AI Mapping Engine

How It Works

The mapping engine uses:

  1. Column name matching - Semantic similarity to known variables (see the sketch after this list)
  2. Data pattern recognition - Values that look like dates, IDs, lab values
  3. Statistical profiling - Distribution shape suggests variable type
  4. Context from other columns - If one column is "patient_id", nearby columns are likely patient data
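
Column name matching on its own can be approximated with ordinary string similarity. A minimal illustration using only the Python standard library (the variable list is a small excerpt from the library below, and the 0.6 cutoff is an arbitrary choice for the example, not MedTWIN's threshold):

from difflib import SequenceMatcher

STANDARD_VARS = ["patient_id", "admission_date", "hemoglobin", "systolic_bp", "hba1c"]

def suggest_mapping(column_name: str, cutoff: float = 0.6):
    """Return (standard_variable, similarity) for the closest name, or None below the cutoff."""
    name = column_name.strip().lower().replace(" ", "_")
    scored = [(SequenceMatcher(None, name, v).ratio(), v) for v in STANDARD_VARS]
    score, best = max(scored)
    return (best, round(score, 2)) if score >= cutoff else None

print(suggest_mapping("Admission Date"))   # ('admission_date', 1.0)
print(suggest_mapping("hgb_value"))        # None: the name alone isn't enough, so the other signals matter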

Confidence Levels

Level      Badge   Meaning             Action
High       🟢      >90% confident      Usually accept as-is
Medium     🟡      60-90% confident    Review recommended
Low        🔴      <60% confident      Manual review required
Unmapped           No suggestion       Map manually or skip
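
The thresholds translate directly into a badge rule. A trivial sketch (how the exact 60% and 90% boundary values are treated is an assumption; the table leaves it open):

def badge(confidence: float | None) -> str:
    """Map a mapping-confidence score in [0, 1] to the badge shown above."""
    if confidence is None:
        return "Unmapped"       # no suggestion at all
    if confidence > 0.90:
        return "🟢 High"        # usually accept as-is
    if confidence >= 0.60:
        return "🟡 Medium"      # review recommended
    return "🔴 Low"             # manual review required

print(badge(0.95), badge(0.72), badge(0.41), badge(None))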

Standard Variable Library

MedTWIN maps to a standardized schema including:

Demographics
  • patient_id, age, sex, race, ethnicity
  • date_of_birth, admission_date, discharge_date

Vitals
  • height_cm, weight_kg, bmi
  • systolic_bp, diastolic_bp, heart_rate
  • temperature, respiratory_rate, spo2

Labs
  • hba1c, glucose, creatinine, egfr
  • hemoglobin, wbc, platelets
  • alt, ast, bilirubin, albumin
  • sodium, potassium, chloride

Outcomes
  • mortality_30day, mortality_90day, mortality_1year
  • readmission_30day, los_days (length of stay)
  • icu_admission, ventilator_days
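
Naming your columns to match this schema up front makes the name-matching step trivial. If it helps to keep the library on hand as a checkable structure, a convenience copy looks like this (the grouping follows the lists above; it is a reference aid, not an exhaustive export of MedTWIN's schema):

STANDARD_SCHEMA = {
    "demographics": ["patient_id", "age", "sex", "race", "ethnicity",
                     "date_of_birth", "admission_date", "discharge_date"],
    "vitals": ["height_cm", "weight_kg", "bmi", "systolic_bp", "diastolic_bp",
               "heart_rate", "temperature", "respiratory_rate", "spo2"],
    "labs": ["hba1c", "glucose", "creatinine", "egfr", "hemoglobin", "wbc", "platelets",
             "alt", "ast", "bilirubin", "albumin", "sodium", "potassium", "chloride"],
    "outcomes": ["mortality_30day", "mortality_90day", "mortality_1year",
                 "readmission_30day", "los_days", "icu_admission", "ventilator_days"],
}

known = {v for group in STANDARD_SCHEMA.values() for v in group}
my_columns = ["patient_id", "hgb_value", "sbp"]            # hypothetical upload headers
print([c for c in my_columns if c not in known])           # columns that will need AI or manual mapping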


Manual Mapping

Edit a Mapping

Click any row in the mapping table to open the editor:

Column: "hgb_value"
├── Current mapping: hemoglobin
├── Confidence: 🟢 High
├── Change to: [dropdown of standard variables]
├── Transform: [optional transformation]
└── Notes: [free text]

Transformations

Apply transformations during mapping:

Transform              Example
Unit conversion        mg/dL → mmol/L
Date parsing           "Jan 5, 2024" → 2024-01-05
Categorical encoding   "M"/"F" → 0/1
Text extraction        "BP: 120/80" → systolic=120
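
Each of these has a straightforward pre-upload equivalent if you prefer to clean the data yourself. A pandas sketch of the four examples (the 18.016 factor is specific to glucose, the 0/1 coding for sex is an arbitrary choice, and the "BP: 120/80" pattern is just the example above):

import pandas as pd

df = pd.DataFrame({
    "glucose_mgdl": [99, 126],
    "visit_date": ["Jan 5, 2024", "Feb 10, 2024"],
    "sex": ["M", "F"],
    "bp_text": ["BP: 120/80", "BP: 135/85"],
})

# Unit conversion: glucose mg/dL -> mmol/L (divide by 18.016; the factor is analyte-specific)
df["glucose_mmol"] = df["glucose_mgdl"] / 18.016
# Date parsing: free-text dates -> ISO format
df["visit_date"] = pd.to_datetime(df["visit_date"], format="%b %d, %Y").dt.strftime("%Y-%m-%d")
# Categorical encoding: "M"/"F" -> 0/1 (which level gets 1 is a project decision)
df["sex_code"] = df["sex"].map({"M": 0, "F": 1})
# Text extraction: pull systolic/diastolic out of "BP: 120/80"
df[["systolic_bp", "diastolic_bp"]] = df["bp_text"].str.extract(r"BP:\s*(\d+)/(\d+)").astype(int)

print(df[["glucose_mmol", "visit_date", "sex_code", "systolic_bp", "diastolic_bp"]])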

Skip a Column

For columns you don't need:

  1. Click the column
  2. Select "Do not map"
  3. Column will be excluded from the master dataset

Multi-Sheet Relationships

Automatic Detection

MedTWIN detects relationships between sheets:

Sheet: "Demographics" (1,247 rows)
└── patient_id (primary key)

Sheet: "Lab Results" (8,432 rows)
├── patient_id (foreign key → Demographics)
└── Multiple rows per patient (time series)

Sheet: "Outcomes" (1,247 rows)
└── patient_id (foreign key → Demographics)
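
A relationship like this boils down to two properties: the parent key is unique, and every child-sheet key exists in the parent. You can verify both before upload (the sheet and file names follow the example above; this is an independent check, not MedTWIN's detector):

import pandas as pd

sheets = pd.read_excel("study_data.xlsx", sheet_name=["Demographics", "Lab Results"])
demo, labs = sheets["Demographics"], sheets["Lab Results"]

# One row per patient in the parent sheet
assert demo["patient_id"].is_unique, "Demographics has duplicate patient_id values"
# Referential integrity: every lab row points at an existing patient
orphans = labs.loc[~labs["patient_id"].isin(demo["patient_id"]), "patient_id"]
print(f"{orphans.nunique()} patient_ids in Lab Results are missing from Demographics")
# One-to-many: multiple lab rows per patient is expected for time-series data
print(labs["patient_id"].value_counts().describe())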

Define Relationships Manually

If auto-detection fails:

  1. Go to Relationships tab
  2. Select two sheets
  3. Choose the linking columns
  4. Define relationship type:
     • One-to-One: Demographics ↔ Outcomes
     • One-to-Many: Demographics → Lab Results

Validation

Required Fields

Each analysis type requires specific variables:

Analysis              Required Fields
Logistic Regression   Outcome (binary), ≥1 predictor
Cox Regression        Time variable, event indicator, ≥1 predictor
Linear Regression     Outcome (continuous), ≥1 predictor
Descriptive           ≥1 variable
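
Before committing, it can help to check that your mapped variables actually cover the analysis you plan to run. A small sketch encoding the table above (the role names are illustrative, not MedTWIN identifiers):

REQUIRED_ROLES = {
    "logistic_regression": ["binary outcome", "predictor"],
    "cox_regression": ["time variable", "event indicator", "predictor"],
    "linear_regression": ["continuous outcome", "predictor"],
    "descriptive": ["variable"],
}

def missing_roles(analysis: str, covered: set[str]) -> list[str]:
    """Return the required roles not yet covered by any mapped column."""
    return [role for role in REQUIRED_ROLES[analysis] if role not in covered]

# Example: an outcome and a predictor are mapped, but no time-to-event variables yet
print(missing_roles("cox_regression", {"binary outcome", "predictor"}))
# -> ['time variable', 'event indicator']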

Data Quality Checks

Before committing, MedTWIN validates:

Check                   Description
Duplicate IDs           Each patient_id must be unique in the main table
Missing required        Required fields cannot be entirely NULL
Type consistency        Values match the declared type
Range validity          Age > 0, dates within a reasonable range
Referential integrity   Foreign keys exist in the parent table
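
You can run much the same checks locally before upload and avoid a failed validation round-trip. A pandas sketch (column names follow the examples in this guide; the plausible-date window is an assumption):

import pandas as pd

def quality_report(demo: pd.DataFrame, labs: pd.DataFrame) -> dict:
    """Mirror the checks above: duplicates, all-NULL fields, ranges, referential integrity."""
    adm = pd.to_datetime(demo["admission_date"], errors="coerce")
    return {
        # Duplicate IDs: patient_id must be unique in the main table
        "duplicate_ids": demo.loc[demo["patient_id"].duplicated(), "patient_id"].unique().tolist(),
        # Missing required: a required field can't be entirely NULL
        "all_null_columns": [c for c in demo.columns if demo[c].isna().all()],
        # Range validity: age > 0 and admission dates in a plausible window
        "nonpositive_ages": int((demo["age"] <= 0).sum()),
        "out_of_range_dates": int(((adm < "1990-01-01") | (adm > pd.Timestamp.now())).sum()),
        # Referential integrity: every lab row's patient_id exists in the main table
        "orphan_lab_rows": int((~labs["patient_id"].isin(demo["patient_id"])).sum()),
    }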

Handling Validation Errors

⚠️ Validation Error: 3 issues found

1. Duplicate patient_id
   └── IDs "P001", "P002" appear twice
   └── [View duplicates] [Auto-fix: keep first]

2. Invalid date format
   └── Column "dob" has 12 unparseable values
   └── [View invalid] [Auto-fix: set to NULL]

3. Missing required field
   └── "outcome_30day" is 100% NULL
   └── [Map different column] [Mark as unmapped]
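
The auto-fix options correspond to simple operations you could also apply to the source file yourself. For the first two issues above, a pandas equivalent (keep-first and set-to-NULL mirror the buttons shown; the file names are placeholders):

import pandas as pd

df = pd.read_csv("demographics.csv")                        # hypothetical source file
# Fix 1: duplicate patient_id -> keep the first occurrence
df = df.drop_duplicates(subset="patient_id", keep="first")
# Fix 2: unparseable values in "dob" -> NULL (NaT)
df["dob"] = pd.to_datetime(df["dob"], errors="coerce")
# Issue 3 (an all-NULL outcome column) has no auto-fix: map a different column or leave it unmapped
df.to_csv("demographics_clean.csv", index=False)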

Commit Mapping

Once validation passes:

  1. Click Commit Mapping
  2. MedTWIN creates the Master Dataset
  3. Original data is preserved (non-destructive)
  4. Mapping is version-controlled

What Happens on Commit

1. Apply all transformations
2. Join sheets using defined relationships
3. Create master_dataset table
4. Generate ETL script (for reproducibility)
5. Log to audit trail
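
Conceptually this is an ETL run, and the generated script is what makes it reproducible. A condensed illustration of the same steps (the SQLite target, function names, and transform registry are placeholders for the sketch, not MedTWIN internals):

import sqlite3
import pandas as pd

def commit_mapping(demo: pd.DataFrame, labs: pd.DataFrame, transforms: dict) -> None:
    """Illustrative commit: apply transforms, join sheets, persist the master dataset."""
    for column, fn in transforms.items():                   # 1. apply all transformations
        demo[column] = fn(demo[column])
    master = demo.merge(labs, on="patient_id", how="left")  # 2. join via the one-to-many relationship
    with sqlite3.connect("study.db") as conn:               # 3. create the master_dataset table
        master.to_sql("master_dataset", conn, if_exists="replace", index=False)
    # 4-5. the ETL script and audit-trail entry record exactly what ran and when
    print(f"Committed {len(master)} rows to master_dataset")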

Re-mapping

Need to change mappings later?

  1. Go to Data Mapping History
  2. Click Create New Mapping Version
  3. Edit and re-commit
  4. Previous analyses still reference the old mapping

Best Practices

Before Upload

  • ✅ Remove any PHI from column headers
  • ✅ Use consistent date formats within columns
  • ✅ Include a unique patient/record ID
  • ✅ Document your column naming conventions

During Mapping

  • ✅ Review all 🟡 Medium confidence mappings
  • ✅ Manually verify 🟢 High confidence for critical variables
  • ✅ Add notes explaining non-obvious mappings
  • ✅ Check the data preview for each mapped column

After Commit

  • ✅ Spot-check the master dataset
  • ✅ Download the ETL script for your records
  • ✅ Document any manual transformations applied

Troubleshooting

"Column not detected as date"

Your date format may be non-standard. Options:

  1. Pre-format in Excel before upload
  2. Use the Date parsing transformation
  3. Map as text, transform in analysis
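
If you go with option 1 or 3, the conversion is one line in pandas once you know the incoming format (the format string and file names below are examples; match them to your data):

import pandas as pd

df = pd.read_csv("raw_export.csv")                          # hypothetical export with "05/01/2024"-style dates
df["admission_date"] = pd.to_datetime(df["admission_date"], format="%d/%m/%Y").dt.strftime("%Y-%m-%d")
df.to_csv("raw_export_isodates.csv", index=False)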

"Too many unmapped columns"

If most columns are unmapped:

  1. Check column headers are in the first row
  2. Ensure headers are descriptive (not "Col1", "Col2")
  3. Use Bulk Map to map similar columns at once

"Validation taking too long"

For large files (>50k rows):

  1. Validation runs in the background
  2. You'll get an email when complete
  3. Check the Jobs tab for progress

Next Steps