Code¶
In this section we’ll look at the sample application. This comes in two parts: the schema and the main application to emit rows.
Schema¶
We’ll look at the sample_schema.py module.
First, the imports.
import datetime
from typing import Annotated
from pydantic import BaseModel, Field
Next, the Employee model.
class Employee(BaseModel):
    id: Annotated[
        int,
        Field(json_schema_extra={"sql": {"key": "primary"}}),
    ]
    name: Annotated[
        str,
        Field(
            max_length=40, json_schema_extra={"domain": "name"}
        ),
    ]
    hire_date: Annotated[
        datetime.datetime,
        Field(ge=datetime.datetime(2021, 1, 18)),
    ]
    velocity: Annotated[
        float,
        Field(
            ge=2,
            le=21,
            json_schema_extra={"distribution": "normal"},
        ),
    ]
    manager: Annotated[
        int,
        Field(
            json_schema_extra={
                "sql": {
                    "key": "foreign",
                    "reference": "Manager.id",
                }
            }
        ),
    ]
Note that each field has annotations to define the desired synthetic data.
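These extra annotations travel with the model and can be read back at run time. A minimal sketch, independent of synthdata, showing where `json_schema_extra` lands in Pydantic v2 (the Employee definition is abbreviated here to the one field being inspected):

```python
from typing import Annotated

from pydantic import BaseModel, Field


class Employee(BaseModel):
    # Abbreviated to a single field for the example.
    id: Annotated[
        int,
        Field(json_schema_extra={"sql": {"key": "primary"}}),
    ]


# The extra metadata is available on the FieldInfo object...
field_info = Employee.model_fields["id"]
print(field_info.json_schema_extra)  # {'sql': {'key': 'primary'}}

# ...and is merged into the generated JSON schema for the field.
schema = Employee.model_json_schema()
print(schema["properties"]["id"]["sql"])  # {'key': 'primary'}
```

A synthesizer can therefore discover the `"sql"` rules by walking `model_fields`, without any extra registration step.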
Finally, the Manager model.
class Manager(BaseModel):
    id: Annotated[
        int,
        Field(json_schema_extra={"sql": {"key": "primary"}}),
    ]
    employee_id: Annotated[
        int,
        Field(
            json_schema_extra={
                "sql": {
                    "key": "foreign",
                    "reference": "Employee.id",
                }
            }
        ),
    ]
    department_id: Annotated[str, Field(max_length=8)]
Again, each field is annotated to describe the desired synthetic data.
Yes, these are what are often called “anemic” models. They lack any methods or processing related to the relationships between the two items.
Data Generator App¶
Here’s the sample_app.py module.
Starting with the imports:
import csv
from pathlib import Path
from sample_schema import *
from synthdata import SchemaSynthesizer, synth_class_iter
The main function does four separate things:

1. Dump the available synthesizers.
2. Build a schema.
3. Export Employee data.
4. Export Manager data.
def main():
    print("Available synth rules:")
    for n, v in synth_class_iter():
        if v.match.__doc__:
            print(f" {n:24} {v.match.__doc__}")
The synthdata.base.synth_class_iter() function emits a sequence of (name, class) pairs.
Generally, if a class lacks a docstring on its synthdata.base.Synthesizer.match() method, the class is abstract.
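The registry behavior can be mimicked with a small, purely illustrative sketch. This is not the library's implementation; the class names and the subclass traversal here are invented for the example:

```python
class Synthesizer:
    # Base class: match() has no docstring, so this sketch treats the
    # class as abstract and a listing can filter it out.
    def match(self, field):
        raise NotImplementedError


class SynthesizeName(Synthesizer):
    def match(self, field):
        """Match string fields with a "name" domain annotation."""
        return True


def synth_class_iter(base=Synthesizer):
    # Walk the subclass tree, emitting (name, class) pairs.
    yield base.__name__, base
    for sub in base.__subclasses__():
        yield from synth_class_iter(sub)


concrete = [n for n, v in synth_class_iter() if v.match.__doc__]
print(concrete)  # ['SynthesizeName']
```

The docstring check is what lets the dump in main() show only the usable, concrete rules.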
s = SchemaSynthesizer()
s.add(Employee, 100)
s.add(Manager, 10)
The schema, s, is populated with two classes.
This will build synthdata.base.ModelSynthesizer()
instances for each class.
Any pooled synthesizers will be used to fill the needed PK pools.
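The pooling idea can be sketched independently of the library. The pool names and the pick_fk() helper below are illustrative inventions, not synthdata's API; the pool sizes follow the schema above:

```python
import random

# Hypothetical primary-key pools, sized per the schema above:
# 100 Employee rows and 10 Manager rows.
pk_pools = {
    "Employee.id": list(range(1, 101)),
    "Manager.id": list(range(1, 11)),
}

rng = random.Random(42)


def pick_fk(reference: str) -> int:
    # A foreign-key field draws its value from the referenced PK pool,
    # so every generated FK matches some existing row.
    return rng.choice(pk_pools[reference])


manager_fk = pick_fk("Manager.id")
print(manager_fk in pk_pools["Manager.id"])  # True
```

Drawing foreign keys from a pre-built pool is what keeps the generated Employee and Manager rows referentially consistent.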
with open(
    Path("data/employee.csv"), "w", newline=""
) as output:
    writer = csv.DictWriter(
        output,
        fieldnames=list(Employee.model_fields.keys()),
    )
    for row in s.rows(Employee):
        writer.writerow(row.dict())
This writes Employee instances to a file.
The csv module’s DictWriter class is initialized with the field names from the Employee class.
Then each object’s dict() result is used to write a row to the file.
Note that only key values are pooled; the Employee instances are built as needed from the key pool.
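The DictWriter pattern itself can be exercised without synthdata. A small self-contained sketch with a hand-made row; the field names match the Employee model, and io.StringIO stands in for the opened file:

```python
import csv
import io

fieldnames = ["id", "name", "hire_date", "velocity", "manager"]

# io.StringIO stands in for the file opened in the application.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(
    {"id": 1, "name": "Ada", "hire_date": "2021-02-01",
     "velocity": 10.5, "manager": 7}
)

print(buffer.getvalue().splitlines()[0])  # id,name,hire_date,velocity,manager
```

DictWriter emits the values in fieldnames order regardless of the dict's own ordering, which is why initializing it from model_fields keeps the CSV columns aligned with the model.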
with open(
    Path("data/manager.csv"), "w", newline=""
) as output:
    writer = csv.DictWriter(
        output,
        fieldnames=list(Manager.model_fields.keys()),
    )
    writer.writerows(row.dict() for row in s.rows(Manager))
This writes Manager instances to a file.
The csv module’s DictWriter class is initialized with the field names from the Manager class.
This uses a streamlined approach: the dict() method is applied to each object, and writerows() writes all of the resulting rows at once.
Since the schema specifies 100 employees and 10 managers, the average cardinality of the manager-to-employee relationship is going to be \(1:10\). This distribution tends to be relatively flat in the current implementation. For more nuanced database query design issues, a more complicated weighted pool is required to create the needed bias in the relationships.
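The flat distribution can be seen in a quick simulation. This uses a plain uniform random.choice, not synthdata's pool logic, so it only illustrates the expected shape:

```python
import random
from collections import Counter

rng = random.Random(42)
manager_ids = list(range(1, 11))  # 10 managers

# Assign each of 100 employees a manager drawn uniformly from the pool.
assignments = [rng.choice(manager_ids) for _ in range(100)]
counts = Counter(assignments)

# On average each manager gets 100 / 10 = 10 employees.
print(sum(counts.values()) / len(manager_ids))  # 10.0
```

A weighted pool would replace the uniform draw with biased weights, concentrating many employees under a few managers when a skewed relationship is needed.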