
Data Mapping Guide

Transform messy spreadsheet columns into a clean, standardized dataset with AI-assisted mapping.


Overview

Data mapping is the critical step between raw data and analysis. MedTWIN's AI:

  1. Profiles your data (types, distributions, missing values)
  2. Suggests mappings to standard clinical variables
  3. Validates data quality and consistency
  4. Transforms data into analysis-ready format

Upload Process

Supported File Types

Format          Max Size   Notes
Excel (.xlsx)   100 MB     Multiple sheets supported
CSV             100 MB     UTF-8 encoding recommended
JSON            50 MB      Flat or nested structures
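
If you want to confirm a file is within these limits before uploading, a quick local check is easy to script. A minimal Python sketch (the file name is a placeholder; the size limits and UTF-8 check simply mirror the table above):

from pathlib import Path

SIZE_LIMITS_MB = {".xlsx": 100, ".csv": 100, ".json": 50}   # limits from the table above

def preflight(path: str) -> list[str]:
    """Return any problems that would block or complicate the upload."""
    p = Path(path)
    problems = []
    limit = SIZE_LIMITS_MB.get(p.suffix.lower())
    if limit is None:
        problems.append(f"Unsupported format: {p.suffix}")
    elif p.stat().st_size > limit * 1024 * 1024:
        problems.append(f"{p.name} exceeds the {limit} MB limit")
    if p.suffix.lower() == ".csv":
        try:
            p.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append("CSV is not UTF-8; re-save with UTF-8 encoding")
    return problems

print(preflight("demographics.csv"))   # hypothetical file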

Auto-Profiling

After upload, MedTWIN automatically profiles each column, detecting its type, format, completeness, and range. For example:

📊 Column Profile: "admission_date"
├── Type: Date
├── Format: YYYY-MM-DD (detected)
├── Non-null: 1,245 / 1,247 (99.8%)
├── Unique: 892
├── Range: 2019-01-03 to 2024-11-15
└── Suggested mapping: admission_date
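
You can reproduce a similar profile locally to sanity-check a column before upload. A rough pandas sketch (the file name is a placeholder, and the 90% date-parse threshold is an assumption for the example, not MedTWIN's internal rule):

import pandas as pd

def profile_column(df: pd.DataFrame, col: str) -> dict:
    """Summarize a column: detected type, completeness, uniqueness, range."""
    s = df[col]
    if s.dtype == "object":                              # only try date parsing on text columns
        as_dates = pd.to_datetime(s, errors="coerce")
    else:
        as_dates = pd.Series(pd.NaT, index=s.index)
    is_date = as_dates.notna().mean() > 0.9              # assumption: mostly parseable = date column
    return {
        "column": col,
        "type": "Date" if is_date else str(s.dtype),
        "non_null": f"{s.notna().sum()} / {len(s)} ({s.notna().mean():.1%})",
        "unique": int(s.nunique()),
        "range": (as_dates.min(), as_dates.max()) if is_date else (s.min(), s.max()),
    }

df = pd.read_csv("demographics.csv")                     # hypothetical export
print(profile_column(df, "admission_date"))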

AI Mapping Engine

How It Works

The mapping engine uses:

  1. Column name matching - Semantic similarity to known variables (see the sketch after this list)
  2. Data pattern recognition - Values that look like dates, IDs, lab values
  3. Statistical profiling - Distribution shape suggests variable type
  4. Context from other columns - If one column is "patient_id", nearby columns are likely patient data
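
Column name matching on its own can be approximated with ordinary string similarity. A minimal illustration using only the Python standard library (the variable list is a small excerpt from the library below, and the 0.6 cutoff is an arbitrary choice for the example, not MedTWIN's threshold):

from difflib import SequenceMatcher

STANDARD_VARS = ["patient_id", "admission_date", "hemoglobin", "systolic_bp", "hba1c"]

def suggest_mapping(column_name: str, cutoff: float = 0.6):
    """Return (standard_variable, similarity) for the closest name, or None below the cutoff."""
    name = column_name.strip().lower().replace(" ", "_")
    scored = [(SequenceMatcher(None, name, v).ratio(), v) for v in STANDARD_VARS]
    score, best = max(scored)
    return (best, round(score, 2)) if score >= cutoff else None

print(suggest_mapping("Admission Date"))   # ('admission_date', 1.0)
print(suggest_mapping("hgb_value"))        # None: the name alone isn't enough, so the other signals matter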

Confidence Levels

Level      Badge   Meaning             Action
High       🟢      >90% confident      Usually accept as-is
Medium     🟡      60-90% confident    Review recommended
Low        🔴      <60% confident      Manual review required
Unmapped           No suggestion       Map manually or skip
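
The thresholds translate directly into a badge rule. A trivial sketch (how the exact 60% and 90% boundary values are treated is an assumption; the table leaves it open):

def badge(confidence: float | None) -> str:
    """Map a mapping-confidence score in [0, 1] to the badge shown above."""
    if confidence is None:
        return "Unmapped"       # no suggestion at all
    if confidence > 0.90:
        return "🟢 High"        # usually accept as-is
    if confidence >= 0.60:
        return "🟡 Medium"      # review recommended
    return "🔴 Low"             # manual review required

print(badge(0.95), badge(0.72), badge(0.41), badge(None))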

Standard Variable Library

MedTWIN maps to a standardized schema including:

Demographics
  • patient_id, age, sex, race, ethnicity
  • date_of_birth, admission_date, discharge_date

Vitals
  • height_cm, weight_kg, bmi
  • systolic_bp, diastolic_bp, heart_rate
  • temperature, respiratory_rate, spo2

Labs
  • hba1c, glucose, creatinine, egfr
  • hemoglobin, wbc, platelets
  • alt, ast, bilirubin, albumin
  • sodium, potassium, chloride

Outcomes
  • mortality_30day, mortality_90day, mortality_1year
  • readmission_30day, los_days (length of stay)
  • icu_admission, ventilator_days
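
Naming your columns to match this schema up front makes the name-matching step trivial. If it helps to keep the library on hand as a checkable structure, a convenience copy looks like this (the grouping follows the lists above; it is a reference aid, not an exhaustive export of MedTWIN's schema):

STANDARD_SCHEMA = {
    "demographics": ["patient_id", "age", "sex", "race", "ethnicity",
                     "date_of_birth", "admission_date", "discharge_date"],
    "vitals": ["height_cm", "weight_kg", "bmi", "systolic_bp", "diastolic_bp",
               "heart_rate", "temperature", "respiratory_rate", "spo2"],
    "labs": ["hba1c", "glucose", "creatinine", "egfr", "hemoglobin", "wbc", "platelets",
             "alt", "ast", "bilirubin", "albumin", "sodium", "potassium", "chloride"],
    "outcomes": ["mortality_30day", "mortality_90day", "mortality_1year",
                 "readmission_30day", "los_days", "icu_admission", "ventilator_days"],
}

known = {v for group in STANDARD_SCHEMA.values() for v in group}
my_columns = ["patient_id", "hgb_value", "sbp"]            # hypothetical upload headers
print([c for c in my_columns if c not in known])           # columns that will need AI or manual mapping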


Manual Mapping

Edit a Mapping

Click any row in the mapping table to open the editor:

Column: "hgb_value"
├── Current mapping: hemoglobin
├── Confidence: 🟢 High
├── Change to: [dropdown of standard variables]
├── Transform: [optional transformation]
└── Notes: [free text]

Transformations

Apply transformations during mapping:

Transform              Example
Unit conversion        mg/dL → mmol/L
Date parsing           "Jan 5, 2024" → 2024-01-05
Categorical encoding   "M"/"F" → 0/1
Text extraction        "BP: 120/80" → systolic=120
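
Each of these has a straightforward pre-upload equivalent if you prefer to clean the data yourself. A pandas sketch of the four examples (the 18.016 factor is specific to glucose, the 0/1 coding for sex is an arbitrary choice, and the "BP: 120/80" pattern is just the example above):

import pandas as pd

df = pd.DataFrame({
    "glucose_mgdl": [99, 126],
    "visit_date": ["Jan 5, 2024", "Feb 10, 2024"],
    "sex": ["M", "F"],
    "bp_text": ["BP: 120/80", "BP: 135/85"],
})

# Unit conversion: glucose mg/dL -> mmol/L (divide by 18.016; the factor is analyte-specific)
df["glucose_mmol"] = df["glucose_mgdl"] / 18.016
# Date parsing: free-text dates -> ISO format
df["visit_date"] = pd.to_datetime(df["visit_date"], format="%b %d, %Y").dt.strftime("%Y-%m-%d")
# Categorical encoding: "M"/"F" -> 0/1 (which level gets 1 is a project decision)
df["sex_code"] = df["sex"].map({"M": 0, "F": 1})
# Text extraction: pull systolic/diastolic out of "BP: 120/80"
df[["systolic_bp", "diastolic_bp"]] = df["bp_text"].str.extract(r"BP:\s*(\d+)/(\d+)").astype(int)

print(df[["glucose_mmol", "visit_date", "sex_code", "systolic_bp", "diastolic_bp"]])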

Skip a Column

For columns you don't need:

  1. Click the column
  2. Select "Do not map"
  3. Column will be excluded from the master dataset

Multi-Sheet Relationships

Automatic Detection

MedTWIN detects relationships between sheets:

Sheet: "Demographics" (1,247 rows)
└── patient_id (primary key)

Sheet: "Lab Results" (8,432 rows)
├── patient_id (foreign key → Demographics)
└── Multiple rows per patient (time series)

Sheet: "Outcomes" (1,247 rows)
└── patient_id (foreign key → Demographics)
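
A relationship like this boils down to two properties: the parent key is unique, and every child-sheet key exists in the parent. You can verify both before upload (the sheet and file names follow the example above; this is an independent check, not MedTWIN's detector):

import pandas as pd

sheets = pd.read_excel("study_data.xlsx", sheet_name=["Demographics", "Lab Results"])
demo, labs = sheets["Demographics"], sheets["Lab Results"]

# One row per patient in the parent sheet
assert demo["patient_id"].is_unique, "Demographics has duplicate patient_id values"
# Referential integrity: every lab row points at an existing patient
orphans = labs.loc[~labs["patient_id"].isin(demo["patient_id"]), "patient_id"]
print(f"{orphans.nunique()} patient_ids in Lab Results are missing from Demographics")
# One-to-many: multiple lab rows per patient is expected for time-series data
print(labs["patient_id"].value_counts().describe())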

Define Relationships Manually

If auto-detection fails:

  1. Go to Relationships tab
  2. Select two sheets
  3. Choose the linking columns
  4. Define relationship type:
     • One-to-One: Demographics ↔ Outcomes
     • One-to-Many: Demographics → Lab Results

Validation

Required Fields

Each analysis type requires specific variables:

Analysis              Required Fields
Logistic Regression   Outcome (binary), ≥1 predictor
Cox Regression        Time variable, event indicator, ≥1 predictor
Linear Regression     Outcome (continuous), ≥1 predictor
Descriptive           ≥1 variable
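
Before committing, it can help to check that your mapped variables actually cover the analysis you plan to run. A small sketch encoding the table above (the role names are illustrative, not MedTWIN identifiers):

REQUIRED_ROLES = {
    "logistic_regression": ["binary outcome", "predictor"],
    "cox_regression": ["time variable", "event indicator", "predictor"],
    "linear_regression": ["continuous outcome", "predictor"],
    "descriptive": ["variable"],
}

def missing_roles(analysis: str, covered: set[str]) -> list[str]:
    """Return the required roles not yet covered by any mapped column."""
    return [role for role in REQUIRED_ROLES[analysis] if role not in covered]

# Example: an outcome and a predictor are mapped, but no time-to-event variables yet
print(missing_roles("cox_regression", {"binary outcome", "predictor"}))
# -> ['time variable', 'event indicator']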

Data Quality Checks

Before committing, MedTWIN validates:

Check                   Description
Duplicate IDs           Each patient_id must be unique in the main table
Missing required        Required fields cannot be entirely NULL
Type consistency        Values match the declared type
Range validity          Age > 0, dates within a reasonable range
Referential integrity   Foreign keys exist in the parent table
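
You can run much the same checks locally before upload and avoid a failed validation round-trip. A pandas sketch (column names follow the examples in this guide; the plausible-date window is an assumption):

import pandas as pd

def quality_report(demo: pd.DataFrame, labs: pd.DataFrame) -> dict:
    """Mirror the checks above: duplicates, all-NULL fields, ranges, referential integrity."""
    adm = pd.to_datetime(demo["admission_date"], errors="coerce")
    return {
        # Duplicate IDs: patient_id must be unique in the main table
        "duplicate_ids": demo.loc[demo["patient_id"].duplicated(), "patient_id"].unique().tolist(),
        # Missing required: a required field can't be entirely NULL
        "all_null_columns": [c for c in demo.columns if demo[c].isna().all()],
        # Range validity: age > 0 and admission dates in a plausible window
        "nonpositive_ages": int((demo["age"] <= 0).sum()),
        "out_of_range_dates": int(((adm < "1990-01-01") | (adm > pd.Timestamp.now())).sum()),
        # Referential integrity: every lab row's patient_id exists in the main table
        "orphan_lab_rows": int((~labs["patient_id"].isin(demo["patient_id"])).sum()),
    }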

Handling Validation Errors

⚠️ Validation Error: 3 issues found

1. Duplicate patient_id
   └── IDs "P001", "P002" appear twice
   └── [View duplicates] [Auto-fix: keep first]

2. Invalid date format
   └── Column "dob" has 12 unparseable values
   └── [View invalid] [Auto-fix: set to NULL]

3. Missing required field
   └── "outcome_30day" is 100% NULL
   └── [Map different column] [Mark as unmapped]
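
The auto-fix options correspond to simple operations you could also apply to the source file yourself. For the first two issues above, a pandas equivalent (keep-first and set-to-NULL mirror the buttons shown; the file names are placeholders):

import pandas as pd

df = pd.read_csv("demographics.csv")                        # hypothetical source file
# Fix 1: duplicate patient_id -> keep the first occurrence
df = df.drop_duplicates(subset="patient_id", keep="first")
# Fix 2: unparseable values in "dob" -> NULL (NaT)
df["dob"] = pd.to_datetime(df["dob"], errors="coerce")
# Issue 3 (an all-NULL outcome column) has no auto-fix: map a different column or leave it unmapped
df.to_csv("demographics_clean.csv", index=False)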

Commit Mapping

Once validation passes:

  1. Click Commit Mapping
  2. MedTWIN creates the Master Dataset
  3. Original data is preserved (non-destructive)
  4. Mapping is version-controlled

What Happens on Commit

1. Apply all transformations
2. Join sheets using defined relationships
3. Create master_dataset table
4. Generate ETL script (for reproducibility)
5. Log to audit trail
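
Conceptually this is an ETL run, and the generated script is what makes it reproducible. A condensed illustration of the same steps (the SQLite target, function names, and transform registry are placeholders for the sketch, not MedTWIN internals):

import sqlite3
import pandas as pd

def commit_mapping(demo: pd.DataFrame, labs: pd.DataFrame, transforms: dict) -> None:
    """Illustrative commit: apply transforms, join sheets, persist the master dataset."""
    for column, fn in transforms.items():                   # 1. apply all transformations
        demo[column] = fn(demo[column])
    master = demo.merge(labs, on="patient_id", how="left")  # 2. join via the one-to-many relationship
    with sqlite3.connect("study.db") as conn:               # 3. create the master_dataset table
        master.to_sql("master_dataset", conn, if_exists="replace", index=False)
    # 4-5. the ETL script and audit-trail entry record exactly what ran and when
    print(f"Committed {len(master)} rows to master_dataset")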

Re-mapping

Need to change mappings later?

  1. Go to Data Mapping History
  2. Click Create New Mapping Version
  3. Edit and re-commit
  4. Previous analyses still reference the old mapping

Best Practices

Before Upload

  • ✅ Remove any PHI from column headers
  • ✅ Use consistent date formats within columns
  • ✅ Include a unique patient/record ID
  • ✅ Document your column naming conventions

During Mapping

  • ✅ Review all 🟡 Medium confidence mappings
  • ✅ Manually verify 🟢 High confidence for critical variables
  • ✅ Add notes explaining non-obvious mappings
  • ✅ Check the data preview for each mapped column

After Commit

  • ✅ Spot-check the master dataset
  • ✅ Download the ETL script for your records
  • ✅ Document any manual transformations applied

Troubleshooting

"Column not detected as date"

Your date format may be non-standard. Options:

  1. Pre-format in Excel before upload
  2. Use the Date parsing transformation
  3. Map as text, transform in analysis
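
If you go with option 1 or 3, the conversion is one line in pandas once you know the incoming format (the format string and file names below are examples; match them to your data):

import pandas as pd

df = pd.read_csv("raw_export.csv")                          # hypothetical export with "05/01/2024"-style dates
df["admission_date"] = pd.to_datetime(df["admission_date"], format="%d/%m/%Y").dt.strftime("%Y-%m-%d")
df.to_csv("raw_export_isodates.csv", index=False)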

"Too many unmapped columns"

If most columns are unmapped:

  1. Check column headers are in the first row
  2. Ensure headers are descriptive (not "Col1", "Col2")
  3. Use Bulk Map to map similar columns at once

"Validation taking too long"

For large files (>50k rows):

  1. Validation runs in the background
  2. You'll get an email when complete
  3. Check the Jobs tab for progress

Next Steps