Code¶
In this section we’ll look at the sample application. This comes in two parts: the schema and the main application to emit rows.
Schema¶
We’ll look at the sample_schema.py module.
First, the imports.
import datetime
from typing import Annotated
from pydantic import BaseModel, Field
Next, the Employee model.
class Employee(BaseModel):
    id: Annotated[
        int,
        Field(json_schema_extra={"sql": {"key": "primary"}}),
    ]
    name: Annotated[
        str,
        Field(
            max_length=40, json_schema_extra={"domain": "name"}
        ),
    ]
    hire_date: Annotated[
        datetime.datetime,
        Field(ge=datetime.datetime(2021, 1, 18)),
    ]
    velocity: Annotated[
        float,
        Field(
            ge=2,
            le=21,
            json_schema_extra={"distribution": "normal"},
        ),
    ]
    manager: Annotated[
        int,
        Field(
            json_schema_extra={
                "sql": {
                    "key": "foreign",
                    "reference": "Manager.id",
                }
            }
        ),
    ]
Note that each field has annotations to define the desired synthetic data.
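These extra annotations travel with the model and can be read back at run time. A minimal sketch, independent of synthdata, showing where `json_schema_extra` lands in Pydantic v2 (the Employee definition is abbreviated here to the one field being inspected):

```python
from typing import Annotated

from pydantic import BaseModel, Field


class Employee(BaseModel):
    # Abbreviated to a single field for the example.
    id: Annotated[
        int,
        Field(json_schema_extra={"sql": {"key": "primary"}}),
    ]


# The extra metadata is available on the FieldInfo object...
field_info = Employee.model_fields["id"]
print(field_info.json_schema_extra)  # {'sql': {'key': 'primary'}}

# ...and is merged into the generated JSON schema for the field.
schema = Employee.model_json_schema()
print(schema["properties"]["id"]["sql"])  # {'key': 'primary'}
```

A synthesizer can therefore discover the `"sql"` rules by walking `model_fields`, without any extra registration step.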
Finally, the Manager model.
class Manager(BaseModel):
    id: Annotated[
        int,
        Field(json_schema_extra={"sql": {"key": "primary"}}),
    ]
    employee_id: Annotated[
        int,
        Field(
            json_schema_extra={
                "sql": {
                    "key": "foreign",
                    "reference": "Employee.id",
                }
            }
        ),
    ]
    department_id: Annotated[str, Field(max_length=8)]
Again, each field is annotated to describe the desired synthetic data.
Yes, these are what are often called “anemic” models. They lack any methods or processing related to the relationships between the two items.
Data Generator App¶
Here’s the sample_app.py module.
Starting with the imports:
import csv
from pathlib import Path
from sample_schema import *
from synthdata import SchemaSynthesizer, synth_class_iter
The main function does four separate things:

1. Dump the available synthesizers.
2. Build a schema.
3. Export Employee data.
4. Export Manager data.
def main():
    print("Available synth rules:")
    for n, v in synth_class_iter():
        if v.match.__doc__:
            print(f" {n:24} {v.match.__doc__}")
The synthdata.base.synth_class_iter() function emits a sequence of (name, class) pairs.
Generally, if a class lacks a docstring on its synthdata.base.Synthesizer.match() method, the class is abstract.
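The registry behavior can be mimicked with a small, purely illustrative sketch. This is not the library's implementation; the class names and the subclass traversal here are invented for the example:

```python
class Synthesizer:
    # Base class: match() has no docstring, so this sketch treats the
    # class as abstract and a listing can filter it out.
    def match(self, field):
        raise NotImplementedError


class SynthesizeName(Synthesizer):
    def match(self, field):
        """Match string fields with a "name" domain annotation."""
        return True


def synth_class_iter(base=Synthesizer):
    # Walk the subclass tree, emitting (name, class) pairs.
    yield base.__name__, base
    for sub in base.__subclasses__():
        yield from synth_class_iter(sub)


concrete = [n for n, v in synth_class_iter() if v.match.__doc__]
print(concrete)  # ['SynthesizeName']
```

The docstring check is what lets the dump in main() show only the usable, concrete rules.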
s = SchemaSynthesizer()
s.add(Employee, 100)
s.add(Manager, 10)
The schema, s, is populated with two classes.
This will build synthdata.base.ModelSynthesizer()
instances for each class.
Any pooled synthesizers will be used to fill the needed PK pools.
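The pooling idea can be sketched independently of the library. The pool names and the pick_fk() helper below are illustrative inventions, not synthdata's API; the pool sizes follow the schema above:

```python
import random

# Hypothetical primary-key pools, sized per the schema above:
# 100 Employee rows and 10 Manager rows.
pk_pools = {
    "Employee.id": list(range(1, 101)),
    "Manager.id": list(range(1, 11)),
}

rng = random.Random(42)


def pick_fk(reference: str) -> int:
    # A foreign-key field draws its value from the referenced PK pool,
    # so every generated FK matches some existing row.
    return rng.choice(pk_pools[reference])


manager_fk = pick_fk("Manager.id")
print(manager_fk in pk_pools["Manager.id"])  # True
```

Drawing foreign keys from a pre-built pool is what keeps the generated Employee and Manager rows referentially consistent.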
with open(
    Path("data/employee.csv"), "w", newline=""
) as output:
    writer = csv.DictWriter(
        output,
        fieldnames=list(Employee.model_fields.keys()),
    )
    for row in s.rows(Employee):
        writer.writerow(row.dict())
This writes Employee instances to a file.
The csv module’s DictWriter class is initialized with the field names from the Employee class.
Then each object’s dict() result is used to write a row to the file.
Note that only key values are pooled; the Employee instances are built as needed from the key pool.
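The DictWriter pattern itself can be exercised without synthdata. A small self-contained sketch with a hand-made row; the field names match the Employee model, and io.StringIO stands in for the opened file:

```python
import csv
import io

fieldnames = ["id", "name", "hire_date", "velocity", "manager"]

# io.StringIO stands in for the file opened in the application.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(
    {"id": 1, "name": "Ada", "hire_date": "2021-02-01",
     "velocity": 10.5, "manager": 7}
)

print(buffer.getvalue().splitlines()[0])  # id,name,hire_date,velocity,manager
```

DictWriter emits the values in fieldnames order regardless of the dict's own ordering, which is why initializing it from model_fields keeps the CSV columns aligned with the model.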
with open(
    Path("data/manager.csv"), "w", newline=""
) as output:
    writer = csv.DictWriter(
        output,
        fieldnames=list(Manager.model_fields.keys()),
    )
    writer.writerows(row.dict() for row in s.rows(Manager))
This writes Manager instances to a file.
The csv module’s DictWriter class is initialized with the field names from the Manager class.
This uses a streamlined approach: the dict() method is applied to each object, and writerows() writes all of the resulting rows at once.
Since the schema specifies 100 employees and 10 managers, the average cardinality of the manager-to-employee relationship is going to be \(1:10\). This distribution tends to be relatively flat in the current implementation. For more nuanced database query design issues, a more complicated weighted pool is required to create the needed bias in the relationships.
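The flat distribution can be seen in a quick simulation. This uses a plain uniform random.choice, not synthdata's pool logic, so it only illustrates the expected shape:

```python
import random
from collections import Counter

rng = random.Random(42)
manager_ids = list(range(1, 11))  # 10 managers

# Assign each of 100 employees a manager drawn uniformly from the pool.
assignments = [rng.choice(manager_ids) for _ in range(100)]
counts = Counter(assignments)

# On average each manager gets 100 / 10 = 10 employees.
print(sum(counts.values()) / len(manager_ids))  # 10.0
```

A weighted pool would replace the uniform draw with biased weights, concentrating many employees under a few managers when a skewed relationship is needed.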