Synthetic Data¶
Not focused on Generative Adversarial Networks (although that's possible)
Focused on Database, ETL, and data analytics
Steven Lott
About Me¶
I hold a patent on some synthetic data processing.
Not for generative machine learning models. (I need to say that early and often.)
Agenda¶
What do you mean "Synthetic Data"?
Isn't "synthetic" a silly distinction?
Why bother?
Some tooling for creating useful synthetic data
The Cool Stuff
Part 1. What do you mean Synthetic?¶
Created.
Not collected.
Free from difficult-to-explain features.
Not necessarily useful for Generative AI Models
Test Cases¶
Test cases use made-up (synthetic) data.
```python
from string import punctuation, whitespace

def some_func(x):
    """
    >>> some_func("hello world") == {'d', 'e', 'h', 'l', 'o', 'r', 'w'}
    True
    """
    return set(x) - (set(punctuation) | set(whitespace))
```
We can test by monkeypatching `set()`, replacing it with a `MagicMock`,
and ensuring that some `sentinel` objects were operated on.
Check the call history to look for `__sub__()` and `__or__()` operations.
But...¶
Using `sentinel` and `Mock` isn't very compelling.
It tells us the implementation does some specific steps.
But does it tell us the implementation will get the right answer?
(Hint: Not really.)
Question of Scale¶
A single example is "artisanal data"
A bunch of examples is "data"
How big a "bunch" do we need?
Boundary Value Analysis¶
7 common cases:
Min - 1, Min, Min + 1, "Nominal Value", Max - 1, Max, Max + 1
Is that enough? One value in range?
```python
>>> factorial(5)
120.0
>>> factorial(4.5)
52.34277778455352
>>> factorial(-0.5)**2
3.1415926535897927
```
Unfair!¶
Boundary analysis was clear: -0.5 was out-of-bounds.
Testing float values for what appears to be an int function isn't right.
Like lifting your lawn mower to trim your hedge.
Not the point of synthetic data.
Better Examples¶
Database design.
Does the SQL perform?
Or is a different non-relational model better?
Algorithm design.
Network design.
Does it scale?
Consider¶
What locations does a department use?
When there are 40,000 employees? 5,000 managers?
Common Approach¶
Mock up a few Departments with a few Locations.
Use Boundary Value Analysis to handle optional, missing, and special-case relationships.
One -- maybe two -- of each.
Lovingly hand-craft the `INSERT` statements.
Better Approach¶
Define the schema in considerable, realistic detail.
Build representative data in representative volumes.
Load a database. Populate optimizer statistics.
Run the queries.
Refine the schema when it performs poorly.
Define transactions and updates to examine statistics processing.
Impediments¶
SQL Schema is a flaming dumpster fire.
Even JSONSchema lacks the needed details.
Distributions and Histograms are required.
- For new development, guess wisely.
- You'll guess wrong: plan for change.
Part 3. Some tooling for creating useful synthetic data¶
Schema Definition
Data Generation
Schema Definition¶
Define schema with JSONSchema + Extensions.
Use JSONSchema to synthesize data.
If you're using a SQL database, map JSONSchema to SQL.
It's fun to use Pydantic to generate JSONSchema.
Starting Point -- Design¶
Use PlantUML to sketch the data model.
It draws diagrams.
```plantuml
@startuml
class Employee {
    id: int
    {field} name: string(40)
    hire_date: timestamp
    velocity: float
    manager_id: int
}
@enduml
```
Refine to Pydantic Model¶
```python
import datetime
from typing import Annotated

from pydantic import BaseModel, Field

class Employee(BaseModel):
    id: int
    name: Annotated[
        str, Field(max_length=40)]
    hire_date: Annotated[
        datetime.datetime, Field(ge=datetime.datetime(2021, 1, 18))]
    velocity: Annotated[
        float, Field(ge=2, le=21,
                     json_schema_extra={"distribution": "Normal"})]
    manager: Annotated[
        int, Field(json_schema_extra={"sql": {'foreignKey': 'Manager.id'}})]
```
Create JSONSchema from class definition¶
```python
from pprint import pprint
pprint(Employee.model_json_schema())
```

```
{'properties': {'hire_date': {'format': 'date-time',
                              'title': 'Hire Date',
                              'type': 'string'},
                'id': {'title': 'Id', 'type': 'integer'},
                'manager': {'sql': {'foreignKey': 'Manager.id'},
                            'title': 'Manager',
                            'type': 'integer'},
                'name': {'maxLength': 40, 'title': 'Name', 'type': 'string'},
                'velocity': {'distribution': 'Normal',
                             'maximum': 21.0,
                             'minimum': 2.0,
                             'title': 'Velocity',
                             'type': 'number'}},
 'required': ['id', 'name', 'hire_date', 'velocity', 'manager'],
 'title': 'Employee',
 'type': 'object'}
```
Build Field Values¶
```python
import datetime
from operator import attrgetter
from typing import Any

from annotated_types import Ge, Le, MaxLen
from pydantic.fields import FieldInfo

def get_meta(meta_class: type, getter, field: FieldInfo) -> Any:
    """Extract one attribute from the field's annotated-types metadata.

    (Helper not shown on the original slide; this is the evident
    implementation from the calls below.)
    """
    for meta in field.metadata:
        if isinstance(meta, meta_class):
            return getter(meta)
    return None

def field_generate(field: FieldInfo) -> Any:
    if issubclass(field.annotation, str):
        size = get_meta(MaxLen, attrgetter("max_length"), field)
        print("String", size, field.metadata, field.json_schema_extra)
        return "String"
    elif issubclass(field.annotation, int):
        min_value = get_meta(Ge, attrgetter("ge"), field)
        max_value = get_meta(Le, attrgetter("le"), field)
        print("Integer", min_value, max_value, field.metadata, field.json_schema_extra)
        return min_value if min_value is not None else 0
    elif issubclass(field.annotation, datetime.datetime):
        default = datetime.datetime(1970, 1, 1)
        min_value = get_meta(Ge, attrgetter("ge"), field)
        print("Date", min_value, field.metadata, field.json_schema_extra)
        return min_value or default
    elif issubclass(field.annotation, float):
        min_value = get_meta(Ge, attrgetter("ge"), field)
        max_value = get_meta(Le, attrgetter("le"), field)
        print("Float", min_value, max_value, field.metadata, field.json_schema_extra)
        return min_value if min_value is not None else 0.0
    else:
        raise ValueError(f"unsupported {field.annotation=} {field.metadata=} {field.json_schema_extra=}")
```
This design isn't particularly good.
But.
It gets you started immediately.
Build Objects¶
```python
def row_generate(cls_: type[BaseModel]) -> BaseModel:
    data = {
        name: field_generate(field)
        for name, field in cls_.model_fields.items()
    }
    # Derived values (2NF and 3NF violations.)
    return cls_(**data)

row_generate(Employee)
```

```
Integer None None [] None
String 40 [MaxLen(max_length=40)] None
Date 2021-01-18 00:00:00 [Ge(ge=datetime.datetime(2021, 1, 18, 0, 0))] None
Float 2 21 [Ge(ge=2), Le(le=21)] {'distribution': 'Normal'}
Integer None None [] {'sql': {'foreignKey': 'Manager.id'}}
Employee(id=0, name='String', hire_date=datetime.datetime(2021, 1, 18, 0, 0), velocity=2.0, manager=0)
```
Serialize as needed¶
While SQL-like databases are common...
No-SQL databases and non-database processing are even more common.
In some cases, you need to persist the data:
- CSV
- Newline-Delimited JSON
- SQL INSERT script
Make SQL Table Definitions (if you must)¶
```python
def SQLSchema(some_class):
    js = some_class.model_json_schema()
    print(f"CREATE TABLE IF NOT EXISTS {js['title']} (")
    columns = [
        f"    {name} {p['type'].upper()}"
        for name, p in js['properties'].items()
    ]
    print(",\n".join(columns))
    print(")")

SQLSchema(Employee)
```

```
CREATE TABLE IF NOT EXISTS Employee (
    id INTEGER,
    name STRING,
    hire_date STRING,
    velocity NUMBER,
    manager INTEGER
)
```

The JSONSchema type names (STRING, NUMBER) still need mapping to your database's real SQL types.
Part 4. The Cool Stuff¶
Fine-Tuning Data Domains
- String sizes and patterns
- Numbers with distributions
- Dates and DateTimes
- Enumerated values with histograms
- Benford's Law
- Foreign Keys
- Optional Values
Plan on revising and extending these constantly.
Strings¶
- `minLength` and `maxLength`
- `pattern`: a regular expression
- `format`: a set of data domains like "date", "time", "date-time", "duration", "email", etc.
Outside the pattern and format, supplemental details are needed:
- length distribution
- "subdomain" of string: text, name, address, part number, zip code, etc.
You'll often have to invent -- and re-invent -- these.
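Those supplemental details have to come from somewhere. Here's a minimal sketch of one such invention, a hypothetical name-subdomain generator; the triangular length distribution peaked at 7 characters is an assumption, not any standard:

```python
import random
import string

def synth_name(max_length: int = 40) -> str:
    # Invented length distribution: triangular over [2, max_length],
    # peaked at a typical name length of 7 characters.
    size = int(random.triangular(2, max_length, 7))
    # Random lowercase letters, then title-cased to look name-ish.
    return "".join(random.choices(string.ascii_lowercase, k=size)).title()
```

The strings are gibberish but have realistic lengths, which is often what the optimizer statistics actually care about.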
Numbers (int, float, currency, durations)¶
- Needs min, max, distribution.
Some distributions are directly supported by `random`: uniform, triangular, exponential, normal, etc.
Benford is not directly available.
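One way to sketch the dispatch from a schema's `distribution` extra to the `random` module's generators; clamping the unbounded normal and exponential draws into [min, max], and the chosen mean/stddev parameters, are simplifying assumptions of mine:

```python
import random

# Hypothetical mapping from the "distribution" json_schema_extra value
# to a generator. Normal and exponential are unbounded, so they are
# clamped into the [low, high] range here.
DISTRIBUTIONS = {
    "Uniform": lambda low, high: random.uniform(low, high),
    "Triangular": lambda low, high: random.triangular(low, high),
    "Normal": lambda low, high: min(high, max(low,
        random.gauss((low + high) / 2, (high - low) / 6))),
    "Exponential": lambda low, high: min(high,
        low + random.expovariate(4 / (high - low))),
}

def number_value(minimum: float, maximum: float,
                 distribution: str = "Uniform") -> float:
    return DISTRIBUTIONS[distribution](minimum, maximum)
```

A `velocity` field with `{"distribution": "Normal"}` and bounds 2..21 would call `number_value(2, 21, "Normal")`.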
Dates¶
Needs min, max, distribution.
Convert min and max to ordinal dates or timestamps.
Use ordinary numeric generator functions you already wrote.
Use `fromordinal()` or `fromtimestamp()` to create a valid datetime object.
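Those steps can be sketched as follows (the helper name is mine, and a uniform draw stands in for whatever numeric generator you already wrote):

```python
import datetime
import random

def date_value(min_date: datetime.date, max_date: datetime.date) -> datetime.date:
    # Convert the bounds to ordinal day numbers, draw with an ordinary
    # numeric generator, then convert back with fromordinal().
    ordinal = random.randint(min_date.toordinal(), max_date.toordinal())
    return datetime.date.fromordinal(ordinal)

hired = date_value(datetime.date(2021, 1, 18), datetime.date(2024, 12, 31))
```

The same shape works for timestamps with `datetime.datetime.fromtimestamp()`.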
Enumerated Values¶
- Codes and status values and what-not
- Very narrow domains
- Very lumpy distributions
Use `random.choices()` with relative weights.
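For example, with an invented status-code domain and a suitably lumpy histogram of relative weights:

```python
import random
from collections import Counter

# Hypothetical status codes with a lumpy histogram.
STATUS = ["ACTIVE", "ON_LEAVE", "TERMINATED", "ERROR"]
WEIGHTS = [85, 10, 4, 1]

# random.choices() samples with replacement using the relative weights.
codes = random.choices(STATUS, weights=WEIGHTS, k=10_000)
histogram = Counter(codes)
```

The resulting histogram approximates the weights, which is what the optimizer statistics will eventually see.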
Benford's Law¶
The leading digits of measured values are not always uniformly distributed.
Bank balance amounts, for example.
Tricky.
Use ordinary numeric generator functions you already wrote.
Replace the 1st digit following Benford's Law via `random.choices()`.
Iterate if the result is no longer in the valid range.
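A sketch of that generate-replace-iterate loop; the two-decimal formatting and positive bounds are assumptions of this sketch, and the function name is mine:

```python
import math
import random

# P(leading digit = d) = log10(1 + 1/d) -- Benford's Law.
BENFORD_WEIGHTS = [math.log10(1 + 1 / d) for d in range(1, 10)]

def benford_value(low: float, high: float) -> float:
    # Generate with an ordinary generator, rewrite the leading digit
    # per Benford's Law, and iterate when the result leaves the range.
    while True:
        text = f"{random.uniform(low, high):.2f}"
        first = next(i for i, ch in enumerate(text) if ch in "123456789")
        digit = random.choices("123456789", weights=BENFORD_WEIGHTS)[0]
        candidate = float(text[:first] + digit + text[first + 1:])
        if low <= candidate <= high:
            return candidate
```

Drawn in volume, values starting with 1 show up roughly 30% of the time, values starting with 9 under 5%.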
Foreign Keys¶
Needs a "cardinality" factor.
- Define table row counts: Employee has 10,000 rows, Manager has 2,000.
- Gives 1:m where mean(m) = 5.
Generate primary key domains for all tables in the schema.
Shuffle each domain into a random order to build PKs.
Select from the domain to build FKs.
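A sketch with those assumed row counts (10,000 Employees over 2,000 Managers gives mean(m) = 5); the dict layout is illustrative, not a real schema:

```python
import random

# Primary-key domains for both tables.
manager_pks = list(range(1, 2_001))
employee_pks = list(range(1, 10_001))
random.shuffle(employee_pks)   # PKs arrive in random order, as in real data

employee_rows = [
    # Each FK is selected from the Manager primary-key domain.
    {"id": pk, "manager_id": random.choice(manager_pks)}
    for pk in employee_pks
]
```

Uniform selection gives every manager about 5 reports; a lumpier weighting (see the enumerated-values slide) gives a more realistic org chart.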
Optional Values¶
Needs an "optionality" factor: the probability of None.
Sometimes too simplistic: consider "domain-specific nulls" --
a special value that is semantically a null,
not literally None or NULL or whatever.
Social Security Number 999-99-9999, for example.
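A minimal sketch of both kinds of null; the helper name and the example field values are invented, and the 999-99-9999 domain-specific null follows the slide:

```python
import random
from typing import Any

def optional_value(value: Any, optionality: float, null: Any = None) -> Any:
    # With probability `optionality`, return the null value instead.
    return null if random.random() < optionality else value

# Literal None about 10% of the time.
phone = optional_value("555-0123", optionality=0.10)

# Domain-specific null: semantically null, but not literally None.
ssn = optional_value("123-45-6789", optionality=0.05, null="999-99-9999")
```

Synthesizing the domain-specific nulls matters: downstream code that forgets to treat 999-99-9999 as a null is exactly the bug this data should expose.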
There's Still More¶
Code often imposes Yet More Constraints.
Unions (also known as Subentities)
- A discriminator
- Optional fields depending on discriminator value
Relationships among fields (A = B + C)
2NF optimizations
- One field has a Foreign Key
- Another field repeats non-key values from the foreign table
Conclusion¶
One-size-fits-all data synthesis will not solve very many problems.
Each design problem will add yet more interesting complications and relationships.
Plan to validate synth data against actual data.
- And revise designs as reality intervenes.
- And revise again with each lesson learned.
Plan for ongoing synth to handle usage changes and new features.