Challenges in Relational Multi-Table Synthetic Data Generation

1. Introduction

Synthetic data generation is increasingly important when working with sensitive or regulated datasets. While generating synthetic data for single tables is straightforward using GANs or statistical models, generating relational multi-table synthetic data is significantly more complex.

Relational databases do not exist in isolation. They contain relationships that define how information flows across the system:

- Foreign keys (parent → child)
- Many-to-one (Fees → Student)
- One-to-many (Teacher → Courses)
- Many-to-many join tables (Student ↔ Courses)
- Deep multi-level dependencies (Student → StudentCourses → Marks → Rank)

Every table influences others, and these dependencies must be preserved in the synthetic version. The need to preserve relational structure and statistical realism at the same time is what makes multi-table synthetic generation much harder than the single-table case.

2. Problems With Relational Multi-Table Synthetic Data

Relational synthetic data must simultaneously maintain two distinct but equally critical properties.

A. Structural Correctness

Every foreign key must point to a valid parent record:

- No orphan rows
- No mismatched or invalid IDs
- Correct table row counts and link consistency

B. Statistical Realism

Beyond structural correctness, the synthetic dataset must behave like the real dataset:

- Distributions of numeric values (mean, variance, skewness)
- Categorical patterns and frequencies
- Joint relationships between columns
- Cross-table correlations
- Behavioral patterns (e.g., students with more courses tend to pay more fees)
- Cardinality patterns (fees per student, courses per teacher)

Most simple generators fail because they satisfy one of these properties but not the other. GANs can capture statistical realism but have no notion of foreign key rules. Statistical models handle structural constraints well but tend to miss deeper correlations.

Relational synthetic data requires both sides to work together.

3. Our Use Case: A School Management Database

To illustrate the complexity of relational multi-table synthetic data generation, let’s look at a real example: a school management system. This system tracks students, teachers, courses, financial activity, academic performance, and various administrative operations.

Students
student_id; enrollment_number; first_name; last_name; date_of_birth; gender; address_line1; address_line2; city; state; postal_code; phone; email; guardian_name; guardian_contact; enrollment_date; status; homeroom_teacher_id

Teachers
teacher_id; employee_code; first_name; last_name; qualification; experience_years; department; phone; email; hire_date; status

Courses
course_id; course_code; name; description; credits; level; department; is_active; lead_teacher_id

Student Courses
student_course_id; student_id; course_id; academic_year; term; enrollment_status; grade_letter; grade_points

Marks
mark_id; student_course_id; assessed_by_teacher_id; assessment_type; max_score; score_obtained; weightage_percent; assessment_date

Fees
fee_id; student_id; fee_type; amount_due; amount_paid; due_date; payment_date; payment_mode; status; approved_by_teacher_id

Games Participation
games_participation_id; student_id; game_name; team_name; level; position_played; achievement; season_year; coach_teacher_id; manager_teacher_id

Canteen Transactions
transaction_id; student_id; transaction_date; item_description; quantity; amount; payment_method; authorized_by_teacher_id

Transport Assignments
transport_id; student_id; route_name; pickup_point; dropoff_point; vehicle_number; driver_name; driver_contact; valid_from; valid_to; monthly_fee; route_incharge_teacher_id

Course Instructors
course_instructor_id; course_id; teacher_id; academic_year; term; role

This schema alone shows why relational synthetic data is challenging:

- Multiple one-to-many relationships
- Several many-to-many join tables
- Deep dependency chains (e.g., students → student_courses → marks)
- Multiple foreign keys pointing to teachers
- Highly diverse data types (UUIDs, dates, numbers, categorical values)
- Behavioral/transactional tables (canteen, transport, fees)

This is the dataset we use to evaluate relational synthetic data techniques, and it clearly goes beyond what simple or single-table generative models can handle.

4. First Approach Tested: SDV Multitable – HMA Synthesizer

SDV offers HMASynthesizer for multi-table generation. HMA (Hierarchical Modeling Algorithm) is a statistical, non-GAN method.

Why we tested HMA

- Supports relational structures
- Enforces referential integrity automatically
- Easy to implement for 1 parent → 1 child scenarios

Tables tested

- students (parent)
- fees (child, references student_id)

This is the minimum relational test case.

5. SDV HMA Experimental Output & Results

A. Synthetic table generation

HMA successfully produced synthetic versions of:

- synthetic_students
- synthetic_fees

B. Referential integrity check

HMA automatically enforces FK relationships.

Result: FK Violations: 0

This means:

- Every value in fees.student_id referenced an existing ID in students.student_id.
- Structural integrity was preserved.
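
A check of this kind can be sketched in a few lines of plain Python. This is a minimal illustration, not the script used for the result above: the helper name `count_fk_violations` and the three sample rows are hypothetical, and tables are assumed to be lists of dictionaries.

```python
def count_fk_violations(child_rows, parent_rows, fk_column, pk_column):
    """Count child rows whose foreign key has no matching parent primary key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return sum(1 for row in child_rows if row[fk_column] not in parent_keys)


# Tiny hypothetical sample, not the real data.
students = [{"student_id": "s1"}, {"student_id": "s2"}]
fees = [
    {"fee_id": "f1", "student_id": "s1"},
    {"fee_id": "f2", "student_id": "s2"},
    {"fee_id": "f3", "student_id": "s9"},  # orphan: no such student
]

violations = count_fk_violations(fees, students, "student_id", "student_id")
print(f"FK Violations: {violations}")
```

On the sample rows the orphan fee is counted once; a result of 0 on the synthetic tables corresponds to the outcome reported above.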

C. Cardinality Distribution Comparison

Real fees-per-student stats

count    991
mean     5.045
std      2.164
min      1
25%      3
50%      5
75%      6
max      15

Synthetic fees-per-student stats

count    988
mean     5.060
std      2.209
min      1
25%      4
50%      5
75%      7
max      11
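
Summaries like the two above can be produced by counting child rows per parent key and describing the resulting counts. The sketch below uses only the standard library; the function names and sample rows are hypothetical. Parents with no child rows do not appear in the counts, which is consistent with the minimum of 1 in both summaries.

```python
from collections import Counter
from statistics import mean, quantiles, stdev


def children_per_parent(child_rows, fk_column):
    """Return the number of child rows for each parent key that appears."""
    counts = Counter(row[fk_column] for row in child_rows)
    return sorted(counts.values())


def summarize(counts):
    """Summary statistics for a list of per-parent child counts."""
    q1, q2, q3 = quantiles(counts, n=4, method="inclusive")
    return {
        "count": len(counts),
        "mean": mean(counts),
        "std": stdev(counts),  # sample standard deviation
        "min": min(counts),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(counts),
    }


# Hypothetical rows: s1 has three fees, s2 has one, s3 has two.
fees = [{"student_id": s} for s in ["s1", "s1", "s1", "s2", "s3", "s3"]]
summary = summarize(children_per_parent(fees, "student_id"))
for name, value in summary.items():
    print(f"{name:<6}{value}")
```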

Interpretation

- The synthetic distribution closely matches the real one in mean (5.045 vs. 5.060) and standard deviation (2.164 vs. 2.209), and the median is unchanged at 5
- Quartile shifts (3→4, 6→7) are mild
- Maximum child count is reduced (15→11), a common issue in statistical models
- Overall, the HMA output shows good statistical alignment for this simple scenario
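
To put numbers on these observations, the relative change of each statistic can be computed directly from the two summaries reported above. The values below are copied from those summaries; the helper name `relative_change` is only illustrative.

```python
# Values copied from the real and synthetic fees-per-student summaries.
real = {"mean": 5.045, "std": 2.164, "25%": 3, "50%": 5, "75%": 6, "max": 15}
synthetic = {"mean": 5.060, "std": 2.209, "25%": 4, "50%": 5, "75%": 7, "max": 11}


def relative_change(real_value, synthetic_value):
    """Signed change of the synthetic value relative to the real one."""
    return (synthetic_value - real_value) / real_value


changes = {name: relative_change(real[name], synthetic[name]) for name in real}
for name, change in changes.items():
    print(f"{name:>4}: {change:+.1%}")
```

The mean and standard deviation differ by roughly 0.3% and 2.1%, while the maximum drops by about 27%, which is where the loss of the long tail shows up.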

6. Relational Score Using Non-GAN Approach (HMA)

Metric	Result
Foreign Key Integrity	100% (0 violations)
Cardinality Preservation	~94% similarity
Distribution Similarity (mean/std)	High match
Relational Realism Score	High for 1→N relations

Taken together, the FK check and the cardinality comparison show that HMA successfully handled this simple hierarchical relationship.
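
The source does not state how the cardinality similarity figure was computed. One common choice for comparing two child-count distributions is the complement of the two-sample Kolmogorov-Smirnov statistic, where 1.0 means the empirical distributions are identical. The sketch below is an assumption about how such a score could be obtained, not the method behind the number in the table; the sample counts are hypothetical.

```python
from bisect import bisect_right


def ks_complement(real_counts, synthetic_counts):
    """Return 1 minus the two-sample KS statistic (1.0 = identical distributions)."""
    real_sorted = sorted(real_counts)
    synthetic_sorted = sorted(synthetic_counts)
    max_gap = 0.0
    # The largest gap between two empirical CDFs occurs at an observed value.
    for value in set(real_sorted) | set(synthetic_sorted):
        real_cdf = bisect_right(real_sorted, value) / len(real_sorted)
        synthetic_cdf = bisect_right(synthetic_sorted, value) / len(synthetic_sorted)
        max_gap = max(max_gap, abs(real_cdf - synthetic_cdf))
    return 1.0 - max_gap


# Hypothetical fees-per-student counts, not the real data.
real_counts = [1, 2, 2, 3, 5, 8]
synthetic_counts = [1, 2, 3, 3, 4, 5]
score = ks_complement(real_counts, synthetic_counts)
print(f"Cardinality similarity: {score:.3f}")
```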

7. Limitations of HMA for Our Full Schema

While HMA worked for two tables, it fundamentally cannot scale to the full complexity of our school management database.

No Support for Many-to-Many Tables

HMA requires a tree-shaped relational structure. Tables like:
- student_courses (student_id, course_id)
- course_instructors (course_id, teacher_id)

represent graphs, not trees. HMA cannot model a child table with two parents.
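
Whether a schema is tree-shaped can be checked mechanically by listing the parents of each table. In the sketch below, the foreign key targets are inferred from the `*_id` column names in the schema listing, which is an assumption; the variable names are illustrative.

```python
from collections import defaultdict

# (child table, foreign key column, parent table); targets are inferred
# from column names in the schema listing, not confirmed by the source.
foreign_keys = [
    ("students", "homeroom_teacher_id", "teachers"),
    ("courses", "lead_teacher_id", "teachers"),
    ("student_courses", "student_id", "students"),
    ("student_courses", "course_id", "courses"),
    ("marks", "student_course_id", "student_courses"),
    ("marks", "assessed_by_teacher_id", "teachers"),
    ("fees", "student_id", "students"),
    ("fees", "approved_by_teacher_id", "teachers"),
    ("games_participation", "student_id", "students"),
    ("games_participation", "coach_teacher_id", "teachers"),
    ("games_participation", "manager_teacher_id", "teachers"),
    ("canteen_transactions", "student_id", "students"),
    ("canteen_transactions", "authorized_by_teacher_id", "teachers"),
    ("transport_assignments", "student_id", "students"),
    ("transport_assignments", "route_incharge_teacher_id", "teachers"),
    ("course_instructors", "course_id", "courses"),
    ("course_instructors", "teacher_id", "teachers"),
]

parents_of = defaultdict(set)
for child, _column, parent in foreign_keys:
    parents_of[child].add(parent)

# A table with more than one distinct parent breaks the tree shape.
multi_parent = {t: sorted(p) for t, p in parents_of.items() if len(p) > 1}
for table, parents in sorted(multi_parent.items()):
    print(f"{table}: {', '.join(parents)}")
```

Under these assumptions, seven of the ten tables have more than one parent. That includes fees, once approved_by_teacher_id is modeled alongside student_id.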

No Support for GAN Training

HMA is purely statistical. This means:
- No ability to learn high-dimensional correlations
- Poor performance on complex interactions

Cannot Learn Multi-Table Patterns

Relationships like:
- “Students with tougher courses tend to score lower marks”
- “Students in certain batches pay fees differently”
- “Teachers influencing student performance across multiple tables”

cannot be learned by statistical hierarchies.

Fails on Synthetic Transactional or Behavioral Data

Tables like:
- canteen_transactions
- events_participation
- attendance_logs

contain high-frequency behavioral data. These require GAN-based sequence modeling or temporal modeling, which HMA simply cannot handle.

Struggles With UUID-Based Identifiers

Our database uses UUIDs for identifiers such as:
- student_id
- fee_id
- course_id

UUIDs have extremely high cardinality, and statistical models cannot learn their structure. This results in:
- Reused IDs
- Incorrect string formats
- Potential FK mismatches

We had to manually enforce regex-based UUID generation to fix this.
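
A related post-processing step is to replace placeholder keys with fresh UUIDs and rewrite the child foreign keys through the same mapping, so that references stay valid. This is a sketch of that idea, not the regex-based fix described above: `remap_to_uuids`, `UUID_PATTERN`, and the sample rows are hypothetical, and parent keys are assumed to be unique before remapping.

```python
import re
import uuid

# Lowercase version-4 UUID, as produced by str(uuid.uuid4()).
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}"
)


def remap_to_uuids(parent_rows, child_rows, pk_column, fk_column):
    """Assign a fresh UUID to every parent and rewrite child FKs to match."""
    id_map = {}
    for row in parent_rows:
        old_id = row[pk_column]
        if old_id in id_map:
            raise ValueError(f"duplicate parent key: {old_id!r}")
        id_map[old_id] = str(uuid.uuid4())
    new_parents = [{**row, pk_column: id_map[row[pk_column]]} for row in parent_rows]
    new_children = [{**row, fk_column: id_map[row[fk_column]]} for row in child_rows]
    return new_parents, new_children


# Hypothetical synthetic output with placeholder integer keys.
students = [{"student_id": 0}, {"student_id": 1}]
fees = [{"fee_id": 10, "student_id": 0}, {"fee_id": 11, "student_id": 0},
        {"fee_id": 12, "student_id": 1}]

new_students, new_fees = remap_to_uuids(students, fees, "student_id", "student_id")
for row in new_fees:
    print(row["fee_id"], row["student_id"])
```

A child row that references a missing parent raises a `KeyError` here, which surfaces orphan rows instead of silently passing them through.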

Cannot Handle Deep Graph-Shaped Schemas

Our dependency chains are not simple:
- students → student_courses → marks
- students → fees
- courses → course_instructors → teachers
- students → events_participation → event_details

HMA cannot:
- Propagate relationships through multiple levels
- Learn cross-table correlations
- Handle graph-centric relational patterns

It is limited to shallow, tree-like schemas only.

Cardinality Drift

HMA tends to:
- Under-estimate maximum values
- Smooth out spikes
- Lose long-tail behavior

This leads to synthetic datasets that look “average” but lose realistic extremes.

8. Conclusion

Our initial experiments show:

What HMA can do well

- Works for simple 1→N relationships
- Perfect foreign key integrity
- Good basic distribution alignment
- Fast and simple to use

Where HMA fails

- Many-to-many tables
- Multi-parent relationships
- Deep dependency structures
- UUID-heavy schemas
- High-dimensional correlations
- Behavioral or transactional datasets
- Any graph-shaped schema

Given all these limitations, HMA cannot be used for our full school-management database.

To generate realistic, structurally correct synthetic data for the entire relational system, a more advanced approach is required:

A multi-table GAN-based pipeline that models each table individually, conditions child tables on parent embeddings, and reconstructs relational integrity after generation.

This approach enables:

- High realism
- Support for many-to-many tables
- Deep relational consistency
- True cross-table correlation learning
- Correct UUID formatting
- Full graph-level reconstruction

This method is significantly more powerful than HMA and is suitable for real-world relational databases like ours.
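
One small, model-independent piece of such a pipeline is deciding the order in which tables are generated, since a child table can only be conditioned on parents that already exist. The sketch below uses the standard library's `graphlib` (Python 3.9+) on a subset of the schema. The parent sets are inferred from column names and are an assumption, and nothing here implements the GAN itself.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of parent tables it references.
# Inferred from the *_id column names in the schema; treat as an assumption.
parents = {
    "teachers": set(),
    "students": {"teachers"},
    "courses": {"teachers"},
    "student_courses": {"students", "courses"},
    "marks": {"student_courses", "teachers"},
    "fees": {"students", "teachers"},
    "course_instructors": {"courses", "teachers"},
}

# static_order() yields every table after all of its parents.
generation_order = list(TopologicalSorter(parents).static_order())
print(generation_order)
```

`TopologicalSorter` raises `CycleError` if the foreign keys form a cycle, which would need to be broken before a parent-first generation order exists.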