Data Mapping Guide
Transform messy spreadsheet columns into a clean, standardized dataset with AI-assisted mapping.
Overview
Data mapping is the critical step between raw data and analysis. MedTWIN's AI:
- Profiles your data (types, distributions, missing values)
- Suggests mappings to standard clinical variables
- Validates data quality and consistency
- Transforms data into analysis-ready format
Upload Process
Supported File Types
| Format | Max Size | Notes |
|---|---|---|
| Excel (.xlsx) | 100 MB | Multiple sheets supported |
| CSV | 100 MB | UTF-8 encoding recommended |
| JSON | 50 MB | Flat or nested structures |
Auto-Profiling
After upload, MedTWIN automatically profiles each column. For example:
📊 Column Profile: "admission_date"
├── Type: Date
├── Format: YYYY-MM-DD (detected)
├── Non-null: 1,245 / 1,247 (99.8%)
├── Unique: 892
├── Range: 2019-01-03 to 2024-11-15
└── Suggested mapping: admission_date
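The statistics in the profile card above can be reproduced with a minimal sketch. This is an illustration of what profiling computes, not MedTWIN's internal implementation; the function name and output keys are made up for the example.

```python
from datetime import date

def profile_date_column(values):
    """Compute the profile stats shown above for a date column.

    `values` is a list of ISO-format date strings, with None for missing cells.
    """
    non_null = [v for v in values if v is not None]
    parsed = [date.fromisoformat(v) for v in non_null]
    return {
        "type": "Date",
        "non_null": len(non_null),
        "total": len(values),
        "unique": len(set(non_null)),
        "range": (min(parsed).isoformat(), max(parsed).isoformat()),
    }

profile = profile_date_column(["2019-01-03", None, "2024-11-15", "2024-11-15"])
# non_null: 3 / 4, unique: 2, range: 2019-01-03 to 2024-11-15
```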
AI Mapping Engine
How It Works
The mapping engine uses:
- Column name matching - Semantic similarity to known variables
- Data pattern recognition - Values that look like dates, IDs, lab values
- Statistical profiling - Distribution shape suggests variable type
- Context from other columns - If one column is "patient_id", nearby columns are likely patient data
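The first signal, column name matching, can be sketched with plain string similarity. MedTWIN's engine uses semantic similarity, so this `difflib`-based version is only a simplified stand-in; the candidate list below is a small illustrative subset of the standard variable library.

```python
from difflib import SequenceMatcher

# Illustrative subset of the standard variable library
STANDARD_VARIABLES = ["patient_id", "admission_date", "hemoglobin", "systolic_bp"]

def suggest_mapping(column_name, candidates=STANDARD_VARIABLES):
    """Rank standard variables by string similarity to the column name.

    A crude stand-in for semantic matching: SequenceMatcher.ratio()
    returns a 0-1 similarity score based on shared character runs.
    """
    def score(candidate):
        return SequenceMatcher(None, column_name.lower(), candidate).ratio()
    best = max(candidates, key=score)
    return best, round(score(best), 2)

suggest_mapping("hgb_value")  # best match: "hemoglobin"
```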
Confidence Levels
| Level | Badge | Meaning | Action |
|---|---|---|---|
| High | 🟢 | ≥90% confident | Usually accept as-is |
| Medium | 🟡 | 60-89% confident | Review recommended |
| Low | 🔴 | <60% confident | Manual review required |
| Unmapped | ⚪ | No suggestion | Map manually or skip |
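Bucketing a similarity score into these levels is a simple threshold check. A sketch, treating 90% and 60% as the cutoffs from the table above:

```python
def confidence_level(score):
    """Map a 0-1 confidence score to the badge levels in the table."""
    if score is None:
        return "Unmapped"   # ⚪ no suggestion at all
    if score >= 0.90:
        return "High"       # 🟢 usually accept as-is
    if score >= 0.60:
        return "Medium"     # 🟡 review recommended
    return "Low"            # 🔴 manual review required

confidence_level(0.95)  # "High"
confidence_level(0.72)  # "Medium"
```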
Standard Variable Library
MedTWIN maps to a standardized schema including:
Demographics
- patient_id, age, sex, race, ethnicity
- date_of_birth, admission_date, discharge_date
Vitals
- height_cm, weight_kg, bmi
- systolic_bp, diastolic_bp, heart_rate
- temperature, respiratory_rate, spo2
Labs
- hba1c, glucose, creatinine, egfr
- hemoglobin, wbc, platelets
- alt, ast, bilirubin, albumin
- sodium, potassium, chloride
Outcomes
- mortality_30day, mortality_90day, mortality_1year
- readmission_30day, los_days (length of stay)
- icu_admission, ventilator_days
Manual Mapping
Edit a Mapping
Click any row in the mapping table to open the editor:
Column: "hgb_value"
├── Current mapping: hemoglobin
├── Confidence: 🟢 High
│
├── Change to: [dropdown of standard variables]
├── Transform: [optional transformation]
└── Notes: [free text]
Transformations
Apply transformations during mapping:
| Transform | Example |
|---|---|
| Unit conversion | mg/dL → mmol/L |
| Date parsing | "Jan 5, 2024" → 2024-01-05 |
| Categorical encoding | "M"/"F" → 0/1 |
| Text extraction | "BP: 120/80" → systolic=120 |
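Each transformation in the table is a small, deterministic function. A sketch of the four examples, using Python's standard library; the function names and the 0/1 sex encoding are illustrative, so confirm the conventions your project actually uses:

```python
import re
from datetime import datetime

def glucose_to_mmol(mg_dl):
    """Unit conversion: glucose mg/dL → mmol/L (molar mass ≈ 18.016 g/mol)."""
    return round(mg_dl / 18.016, 2)

def parse_date(text):
    """Date parsing: "Jan 5, 2024" → "2024-01-05"."""
    return datetime.strptime(text, "%b %d, %Y").date().isoformat()

# Categorical encoding; example convention only — document whichever you choose
SEX_CODES = {"M": 0, "F": 1}

def extract_systolic(text):
    """Text extraction: pull systolic BP out of a string like "BP: 120/80"."""
    match = re.search(r"(\d+)/(\d+)", text)
    return int(match.group(1)) if match else None

parse_date("Jan 5, 2024")       # "2024-01-05"
extract_systolic("BP: 120/80")  # 120
glucose_to_mmol(100)            # 5.55
```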
Skip a Column
For columns you don't need:
- Click the column
- Select "Do not map"
- Column will be excluded from the master dataset
Multi-Sheet Relationships
Automatic Detection
MedTWIN detects relationships between sheets:
Sheet: "Demographics" (1,247 rows)
└── patient_id (primary key)
Sheet: "Lab Results" (8,432 rows)
├── patient_id (foreign key → Demographics)
└── Multiple rows per patient (time series)
Sheet: "Outcomes" (1,247 rows)
└── patient_id (foreign key → Demographics)
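The detection heuristic can be sketched in a few lines: a linking column with unique values that all exist in the parent key suggests one-to-one, while repeated values (as in a lab time series) suggest one-to-many. This is a simplified model of the behavior described above, not MedTWIN's actual algorithm:

```python
def classify_relationship(child_ids, parent_ids):
    """Classify how a child sheet's key column relates to the parent's key.

    Unique values that all reference the parent → one-to-one;
    repeated values that all reference the parent → one-to-many.
    """
    is_unique = len(set(child_ids)) == len(child_ids)
    refs_parent = set(child_ids) <= set(parent_ids)
    if not refs_parent:
        return "unrelated"
    return "one-to-one" if is_unique else "one-to-many"

demographics = ["P001", "P002", "P003"]
labs = ["P001", "P001", "P002"]        # repeated IDs → time series
outcomes = ["P001", "P002", "P003"]

classify_relationship(labs, demographics)      # "one-to-many"
classify_relationship(outcomes, demographics)  # "one-to-one"
```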
Define Relationships Manually
If auto-detection fails:
- Go to Relationships tab
- Select two sheets
- Choose the linking columns
- Define relationship type:
- One-to-One: Demographics ↔ Outcomes
- One-to-Many: Demographics → Lab Results
Validation
Required Fields
Analysis types require specific variables:
| Analysis | Required Fields |
|---|---|
| Logistic Regression | Outcome (binary), ≥1 predictor |
| Cox Regression | Time variable, Event indicator, ≥1 predictor |
| Linear Regression | Outcome (continuous), ≥1 predictor |
| Descriptive | ≥1 variable |
Data Quality Checks
Before committing, MedTWIN validates:
| Check | Description |
|---|---|
| Duplicate IDs | Each patient_id must be unique in the main table |
| Missing required | Required fields cannot be entirely NULL |
| Type consistency | Values match the declared type |
| Range validity | Ages and dates fall within plausible ranges |
| Referential integrity | Foreign keys exist in parent table |
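Three of these checks are sketched below, operating on rows as dictionaries. The field names and the 0-120 age bound are illustrative assumptions, not MedTWIN's configuration:

```python
def validate(rows, required=("patient_id", "outcome_30day")):
    """Run duplicate-ID, missing-required, and range-validity checks.

    Returns a list of human-readable issue strings (empty = passed).
    """
    issues = []

    # Duplicate IDs: each patient_id must be unique in the main table
    ids = [r.get("patient_id") for r in rows]
    dupes = sorted({i for i in ids if ids.count(i) > 1})
    if dupes:
        issues.append(f"Duplicate patient_id: {dupes}")

    # Missing required: a required field can't be entirely NULL
    for field in required:
        if all(r.get(field) is None for r in rows):
            issues.append(f"Required field '{field}' is 100% NULL")

    # Range validity: illustrative plausibility bound on age
    for r in rows:
        age = r.get("age")
        if age is not None and not 0 <= age <= 120:
            issues.append(f"Implausible age {age} for {r.get('patient_id')}")

    return issues
```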
Handling Validation Errors
⚠️ Validation Error: 3 issues found
1. Duplicate patient_id
└── IDs "P001", "P002" appear twice
└── [View duplicates] [Auto-fix: keep first]
2. Invalid date format
└── Column "dob" has 12 unparseable values
└── [View invalid] [Auto-fix: set to NULL]
3. Missing required field
└── "outcome_30day" is 100% NULL
└── [Map different column] [Mark as unmapped]
Commit Mapping
Once validation passes:
- Click Commit Mapping
- MedTWIN creates the Master Dataset
- Original data is preserved (non-destructive)
- Mapping is version-controlled
What Happens on Commit
1. Apply all transformations
2. Join sheets using defined relationships
3. Create master_dataset table
4. Generate ETL script (for reproducibility)
5. Log to audit trail
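Step 2, joining sheets on their defined relationships, amounts to a keyed merge. A minimal sketch of a one-to-one join (Demographics ↔ Outcomes); names are illustrative, and the real commit also applies transformations and writes the ETL script and audit log:

```python
def join_one_to_one(demographics, outcomes, key="patient_id"):
    """Merge two sheets on a shared key, one row per patient.

    Rows missing from `outcomes` simply keep their demographic fields.
    """
    outcomes_by_id = {row[key]: row for row in outcomes}
    master = []
    for row in demographics:
        merged = dict(row)                       # non-destructive copy
        merged.update(outcomes_by_id.get(row[key], {}))
        master.append(merged)
    return master

demo = [{"patient_id": "P001", "age": 62}, {"patient_id": "P002", "age": 48}]
out = [{"patient_id": "P001", "mortality_30day": 0}]
join_one_to_one(demo, out)  # P001 gains mortality_30day; P002 is unchanged
```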
Re-mapping
Need to change mappings later?
- Go to Data → Mapping History
- Click Create New Mapping Version
- Edit and re-commit
- Previous analyses still reference old mapping
Best Practices
Before Upload
- ✅ Remove any PHI from column headers
- ✅ Use consistent date formats within columns
- ✅ Include a unique patient/record ID
- ✅ Document your column naming conventions
During Mapping
- ✅ Review all 🟡 Medium confidence mappings
- ✅ Manually verify 🟢 High confidence for critical variables
- ✅ Add notes explaining non-obvious mappings
- ✅ Check the data preview for each mapped column
After Commit
- ✅ Spot-check the master dataset
- ✅ Download the ETL script for your records
- ✅ Document any manual transformations applied
Troubleshooting
"Column not detected as date"
Your date format may be non-standard. Options:
- Pre-format in Excel before upload
- Use the Date parsing transformation
- Map as text, transform in analysis
"Too many unmapped columns"
If most columns are unmapped:
- Check column headers are in the first row
- Ensure headers are descriptive (not "Col1", "Col2")
- Use Bulk Map to map similar columns at once
"Validation taking too long"
For large files (>50k rows):
- Validation runs in background
- You'll get an email when complete
- Check Jobs tab for progress