Challenges in Relational Multi-Table Synthetic Data Generation

1. Introduction

Synthetic data generation is increasingly important when working with sensitive or regulated datasets. While generating synthetic data for a single table is relatively straightforward with GANs or statistical models, generating relational multi-table synthetic data is significantly more complex.

Relational databases do not exist in isolation. They contain relationships that define how information flows across the system:

  • Foreign keys (parent → child)
  • Many-to-one (Fees → Student)
  • One-to-many (Teacher → Courses)
  • Many-to-many join tables (Student ↔ Courses)
  • Deep multi-level dependencies (Student → StudentCourses → Marks → Rank)

Every table influences others, and these dependencies must be preserved in the synthetic version. Having to satisfy relational structure and statistical realism at the same time is what makes multi-table synthetic generation so much harder than the single-table case.

2. Problems With Relational Multi-Table Synthetic Data

Relational synthetic data must simultaneously maintain two distinct but equally critical properties.

A. Structural Correctness

Every foreign key must point to a valid parent record:

  • No orphan rows
  • No mismatched or invalid IDs
  • Correct table row counts and link consistency

B. Statistical Realism

Beyond structural correctness, the synthetic dataset must behave like the real dataset:

  • Distributions of numeric values (mean, variance, skewness)
  • Categorical patterns and frequencies
  • Joint relationships between columns
  • Cross-table correlations
  • Behavioral patterns (e.g., students with more courses tend to pay more fees)
  • Preserving cardinality patterns (fees per student, courses per teacher)

Most simple generators fail because they satisfy either structure or realism, but not both. GANs excel at statistical realism but know nothing about foreign key rules. Statistical models excel at structural constraints but miss deep correlations.

Relational synthetic data requires both sides to work together.

3. Our Use Case: A School Management Database

To illustrate the complexity of relational multi-table synthetic data generation, let’s look at a real example: a school management system. This system tracks students, teachers, courses, financial activity, academic performance, and various administrative operations.

Students
student_id; enrollment_number; first_name; last_name; date_of_birth; gender; address_line1; address_line2; city; state; postal_code; phone; email; guardian_name; guardian_contact; enrollment_date; status; homeroom_teacher_id

Teachers
teacher_id; employee_code; first_name; last_name; qualification; experience_years; department; phone; email; hire_date; status

Courses
course_id; course_code; name; description; credits; level; department; is_active; lead_teacher_id

Student Courses
student_course_id; student_id; course_id; academic_year; term; enrollment_status; grade_letter; grade_points

Marks
mark_id; student_course_id; assessed_by_teacher_id; assessment_type; max_score; score_obtained; weightage_percent; assessment_date

Fees
fee_id; student_id; fee_type; amount_due; amount_paid; due_date; payment_date; payment_mode; status; approved_by_teacher_id

Games Participation
games_participation_id; student_id; game_name; team_name; level; position_played; achievement; season_year; coach_teacher_id; manager_teacher_id

Canteen Transactions
transaction_id; student_id; transaction_date; item_description; quantity; amount; payment_method; authorized_by_teacher_id

Transport Assignments
transport_id; student_id; route_name; pickup_point; dropoff_point; vehicle_number; driver_name; driver_contact; valid_from; valid_to; monthly_fee; route_incharge_teacher_id

Course Instructors
course_instructor_id; course_id; teacher_id; academic_year; term; role

This schema alone shows why relational synthetic data is challenging:

  • Multiple one-to-many relationships
  • Several many-to-many join tables
  • Deep dependency chains (e.g., students → student_courses → marks)
  • Multiple foreign keys pointing to teachers
  • Highly diverse data types (UUIDs, dates, numbers, categorical values)
  • Behavioral/transactional tables (canteen, transport, fees)

This is the dataset we use to evaluate relational synthetic data techniques — and it clearly goes beyond what simple or single-table generative models can handle.
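
To make the shape of this schema concrete, the sketch below writes the foreign keys down as edges and lists the tables that have more than one parent. The edge list is read off the column lists above under the assumption that every column ending in `_teacher_id`, `student_id`, `course_id`, or `student_course_id` is a foreign key to the matching table. The lower-case table names and the helper functions are illustrative, not part of the original system.

```python
from collections import defaultdict

# Foreign-key edges as (child_table, fk_column, parent_table).
# Assumption: inferred from the column names in the schema listing.
FOREIGN_KEYS = [
    ("students", "homeroom_teacher_id", "teachers"),
    ("courses", "lead_teacher_id", "teachers"),
    ("student_courses", "student_id", "students"),
    ("student_courses", "course_id", "courses"),
    ("marks", "student_course_id", "student_courses"),
    ("marks", "assessed_by_teacher_id", "teachers"),
    ("fees", "student_id", "students"),
    ("fees", "approved_by_teacher_id", "teachers"),
    ("games_participation", "student_id", "students"),
    ("games_participation", "coach_teacher_id", "teachers"),
    ("games_participation", "manager_teacher_id", "teachers"),
    ("canteen_transactions", "student_id", "students"),
    ("canteen_transactions", "authorized_by_teacher_id", "teachers"),
    ("transport_assignments", "student_id", "students"),
    ("transport_assignments", "route_incharge_teacher_id", "teachers"),
    ("course_instructors", "course_id", "courses"),
    ("course_instructors", "teacher_id", "teachers"),
]


def parents_by_table(foreign_keys):
    """Map each child table to the set of distinct parent tables it references."""
    parents = defaultdict(set)
    for child, _column, parent in foreign_keys:
        parents[child].add(parent)
    return dict(parents)


def multi_parent_tables(foreign_keys):
    """Return, sorted, the tables that reference more than one parent table."""
    by_table = parents_by_table(foreign_keys)
    return sorted(table for table, parents in by_table.items() if len(parents) > 1)


teacher_fk_count = sum(1 for _, _, parent in FOREIGN_KEYS if parent == "teachers")
print("Foreign keys pointing to teachers:", teacher_fk_count)
print("Tables with more than one parent:", multi_parent_tables(FOREIGN_KEYS))
```

Under these assumptions, only students and courses have a single parent table; every other child table references at least two, which is what makes the schema a graph rather than a tree.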

4. First Approach Tested: SDV Multitable – HMA Synthesizer

SDV offers HMASynthesizer for multi-table generation. HMA (Hierarchical Modeling Algorithm) is a statistical, non-GAN method.

Why we tested HMA

  • Supports relational (parent → child) structures
  • Enforces referential integrity automatically
  • Is easy to set up for one parent table with one child table

Tables tested

  • students (parent)
  • fees (child, references student_id)

This is the minimum relational test case.
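
A setup of this kind might look roughly like the sketch below. It assumes the SDV 1.x API, where `MultiTableMetadata` and `HMASynthesizer` are available (newer SDV releases also offer a unified `Metadata` class), and it assumes `students` and `fees` are pandas DataFrames with the columns listed earlier. The function name `fit_hma_students_fees` is hypothetical, and this is not necessarily the exact code used in the experiment.

```python
def fit_hma_students_fees(students, fees, scale=1.0):
    """Fit SDV's HMASynthesizer on a students -> fees pair and sample from it.

    SDV is imported inside the function so that this module can still be
    loaded in environments where SDV is not installed.
    """
    from sdv.metadata import MultiTableMetadata
    from sdv.multi_table import HMASynthesizer

    data = {"students": students, "fees": fees}

    # Detect column types per table, then declare keys explicitly
    # instead of relying on auto-detection.
    metadata = MultiTableMetadata()
    for table_name, frame in data.items():
        metadata.detect_table_from_dataframe(table_name=table_name, data=frame)

    metadata.update_column(table_name="students", column_name="student_id", sdtype="id")
    metadata.update_column(table_name="fees", column_name="fee_id", sdtype="id")
    metadata.update_column(table_name="fees", column_name="student_id", sdtype="id")
    metadata.set_primary_key(table_name="students", column_name="student_id")
    metadata.set_primary_key(table_name="fees", column_name="fee_id")

    # The single parent -> child relationship under test.
    metadata.add_relationship(
        parent_table_name="students",
        child_table_name="fees",
        parent_primary_key="student_id",
        child_foreign_key="student_id",
    )

    synthesizer = HMASynthesizer(metadata)
    synthesizer.fit(data)
    # Returns a dict of synthetic DataFrames keyed by table name.
    return synthesizer.sample(scale=scale)
```

Calling `fit_hma_students_fees(students, fees)` would return a dictionary with synthetic `students` and `fees` tables of roughly the original size when `scale=1.0`.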

5. SDV HMA Experimental Output & Results

A. Synthetic table generation

HMA successfully produced synthetic versions of:

  • synthetic_students
  • synthetic_fees

B. Referential integrity check

HMA automatically enforces FK relationships.

Result: FK Violations: 0

This means:

  • Every row in fees.student_id correctly referenced an ID in students.student_id.
  • Structural integrity was preserved.
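
A check of this kind can be expressed in a few lines of pandas. The sketch below is one way to count orphan rows. The helper name `count_fk_violations` and the tiny tables are made up for illustration and are not the experiment's data.

```python
import pandas as pd


def count_fk_violations(child, parent, child_fk, parent_pk):
    """Count child rows whose foreign key has no matching parent primary key.

    Null foreign keys are counted as violations as well, because a missing
    value never matches a parent key.
    """
    orphan_mask = ~child[child_fk].isin(parent[parent_pk])
    return int(orphan_mask.sum())


# Made-up example tables: "s9" does not exist in students.
students = pd.DataFrame({"student_id": ["s1", "s2", "s3"]})
fees = pd.DataFrame({
    "fee_id": ["f1", "f2", "f3", "f4"],
    "student_id": ["s1", "s1", "s3", "s9"],
})

violations = count_fk_violations(fees, students, "student_id", "student_id")
print("FK Violations:", violations)
```

On the toy tables this reports one violation. On a pair of tables with intact references it reports zero, which is the result HMA produced here.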

C. Cardinality Distribution Comparison

Real fees-per-student stats

count    991
mean     5.045
std      2.164
min      1
25%      3
50%      5
75%      6
max      15

Synthetic fees-per-student stats

count    988
mean     5.060
std      2.209
min      1
25%      4
50%      5
75%      7
max      11

Interpretation

  • The synthetic distribution closely matches the real one in mean (5.045 vs 5.060) and standard deviation (2.164 vs 2.209)
  • Quartile shifts (3→4, 6→7) are mild
  • Maximum child count is reduced (15→11), a common issue in statistical models
  • Overall, the HMA output shows good statistical alignment for this simple scenario
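
Summary statistics like the ones above can be produced by counting child rows per parent and calling `describe()` on the counts. The sketch below shows the idea on a made-up fees table. The helper name `children_per_parent` is an assumption for illustration.

```python
import pandas as pd


def children_per_parent(child, fk_column):
    """Number of child rows per parent key, for parents with at least one child."""
    return child.groupby(fk_column).size()


# Made-up example: s1 has three fee rows, s2 has one, s3 has two.
fees = pd.DataFrame({"student_id": ["s1", "s1", "s1", "s2", "s3", "s3"]})

counts = children_per_parent(fees, "student_id")
print(counts.describe())
```

Because the counts come from the child table, parents with no child rows do not appear at all. Running the same function on the real and synthetic fees tables gives two summaries that can be compared side by side, including the maximum, which is where the long tail shows up.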

6. Relational Score Using Non-GAN Approach (HMA)

Metric                               Result
Foreign Key Integrity                100% (0 violations)
Cardinality Preservation             ~94% similarity
Distribution Similarity (mean/std)   High match
Relational Realism Score             High for 1→N relations

Taken together, the FK check and the cardinality alignment show that HMA handled this simple hierarchical relationship well.

7. Limitations of HMA for Our Full Schema

While HMA worked for two tables, it fundamentally cannot scale to the full complexity of our school management database.

  1. No Support for Many-to-Many Tables
    HMA requires a tree-shaped relational structure. Tables like:
    • student_courses (student_id, course_id)
    • course_instructors (course_id, teacher_id)
      represent graphs, not trees. HMA cannot model a child table with two parents.
  2. No Support for GAN Training
    HMA is purely statistical. This means:
    • No ability to learn high-dimensional correlations
    • Poor performance on complex interactions
  3. Cannot Learn Multi-Table Patterns
    Relationships like:
    • “Students with tougher courses tend to score lower marks”
    • “Students in certain batches pay fees differently”
    • “Teachers influencing student performance across multiple tables”
      cannot be learned by statistical hierarchies.
  4. Fails on Synthetic Transactional or Behavioral Data
    Tables like:
    • canteen_transactions
    • events_participation
    • attendance_logs
      contain high-frequency behavioral data. These require GAN-based sequence modeling or temporal modeling, which HMA simply cannot handle.
  5. Struggles With UUID-Based Identifiers
    Our database uses UUIDs for:
    • student_id
    • fee_id
    • course_id
      etc.
      UUIDs have extremely high cardinality, and statistical models cannot learn their structure. This results in:
    • Reused IDs
    • Incorrect string formats
    • Potential FK mismatches
      We had to manually enforce regex-based UUID generation to fix this.
  6. Cannot Handle Deep Graph-Shaped Schemas
    Our dependency chains are not simple:
    • students → student_courses → marks
    • students → fees
    • courses → course_instructors → teachers
    • students → events_participation → event_details
      HMA cannot:
    • Propagate relationships through multiple levels
    • Learn cross-table correlations
    • Handle graph-centric relational patterns
      It is limited to shallow, tree-like schemas only.
  7. Cardinality Drift
    HMA tends to:
    • Under-estimate maximum values
    • Smooth out spikes
    • Lose long-tail behavior
      This leads to synthetic datasets that look “average” but lose realistic extremes.
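
For the UUID issue in point 5, one alternative to regex-based generation is to let the synthesizer emit any unique surrogate keys and swap them for real UUIDs afterwards, rewriting the child foreign keys with the same mapping. The sketch below illustrates that idea with the standard library only. It is a different technique from the regex fix described above, and the function name and the rows-as-dicts layout are assumptions.

```python
import uuid


def remap_keys_to_uuids(parent_rows, child_rows, pk, fk):
    """Replace parent primary keys with fresh UUIDs and rewrite child FKs to match.

    Rows are lists of dicts. Assumes parent keys are unique and every child
    foreign key refers to an existing parent key.
    """
    mapping = {row[pk]: str(uuid.uuid4()) for row in parent_rows}
    if len(mapping) != len(parent_rows):
        raise ValueError("parent primary keys must be unique before remapping")

    # A child FK with no parent raises KeyError here, surfacing orphans early.
    new_parents = [{**row, pk: mapping[row[pk]]} for row in parent_rows]
    new_children = [{**row, fk: mapping[row[fk]]} for row in child_rows]
    return new_parents, new_children


# Made-up example with integer surrogate keys.
students = [{"student_id": 1}, {"student_id": 2}]
fees = [
    {"fee_id": 10, "student_id": 1},
    {"fee_id": 11, "student_id": 1},
    {"fee_id": 12, "student_id": 2},
]

new_students, new_fees = remap_keys_to_uuids(students, fees, "student_id", "student_id")
print(new_students[0]["student_id"])
```

Since every identifier comes from `uuid.uuid4()`, the string format is valid by construction, and the shared mapping keeps each fee row attached to the same student as before.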

8. Conclusion

Our initial experiments show:

What HMA can do well

  • Works for simple 1→N relationships
  • Perfect foreign key integrity
  • Good basic distribution alignment
  • Fast and simple to use

Where HMA fails

  • Many-to-many tables
  • Multi-parent relationships
  • Deep dependency structures
  • UUID-heavy schemas
  • High-dimensional correlations
  • Behavioral or transactional datasets
  • Any graph-shaped schema

Given all these limitations, HMA cannot be used for our full school-management database.

To generate realistic, structurally correct synthetic data for the entire relational system, a more advanced approach is required:

A multi-table GAN-based pipeline that models each table individually, conditions child tables on parent embeddings, and reconstructs relational integrity after generation.

This approach enables:

  • High realism
  • Support for many-to-many tables
  • Deep relational consistency
  • True cross-table correlation learning
  • Correct UUID formatting
  • Full graph-level reconstruction

We expect this method to be considerably better suited than HMA to real-world relational databases like ours.
