Synthetic Data Tool Components¶
The Synthetic Data Tool package contains a number of field-level synthesizers, a model synthesizer, and a schema synthesizer.
base Module¶
Contains base class definitions for all synthesizers. This includes the field-level synthesizers, plus model-level and schema-level synthesizers.
A field-level synthesizer has two aspects.
The essential synth algorithms for various data types. These generally produce randomly-generated sequences of values.
The behavior of the overall synth process. This falls into two variant strategies:
Independent. This means duplicates are possible, and the values are not referenced by other synthesizers.
Dependent (“Pooled”). Duplicates are prevented, and the pool of values can be referenced by a
synthdata.synths.SynthesizeReference instance.
In order to handle foreign key references, a synthdata.synths.SynthesizeReference
extracts values from a pooled synthesizer.
Important
This relies on Pydantic’s rich set of type annotations.
Behavior Strategy¶
The synthdata.base.Behavior class hierarchy defines a Strategy that
tailors the behavior of a synthdata.base.Synthesizer.
- class synthdata.base.Behavior(synth: Synthesizer)¶
General features of the alternative Strategy plug-ins for a
synthdata.base.Synthesizer.
- abstract next() Any¶
Returns a next value for a Synthesizer.
- choice() Any¶
Returns a value chosen from a pool, only overridden by the Pooled subclass.
- class synthdata.base.Independent(synth: Synthesizer)¶
Defines a
synthdata.base.Synthesizer that cannot be the target of a SynthesizerReference. Each value is generated randomly, with no assurance of uniqueness.
This is, in effect, the default behavior of a
synthdata.base.Synthesizer. The methods are delegated back to the synthdata.base.Synthesizer instance.
- __init__(synth: Synthesizer) None¶
Constructs a Behavior strategy bound to a specific
synthdata.base.Synthesizer.
- next() Any¶
Returns a value from
synthdata.base.Synthesizer.value_gen().
- class synthdata.base.Pooled(synth: Synthesizer)¶
Defines a
synthdata.base.Synthesizer that creates a pool of values. This can be the target of a synthdata.base.SynthesizerReference instance. This requires an associated synthdata.base.ModelSynthesizer to provide the target number of rows in the pool.
Values in the pool will be unique.
- __init__(synth: Synthesizer) None¶
Constructs a Behavior strategy bound to a specific
synthdata.base.Synthesizer.
- fill()¶
Populate the
self.pool collection of unique values. This uses synthdata.base.Synthesizer.value_gen() to build instances. This is used by the model’s _prepare() method.
- next() Any¶
When
next(synth_instance) is used, synthdata.base.Synthesizer.__next__() calls this method to pick a value at random from the pool.
- choice() Any¶
Returns a value chosen from a pool.
Synthesizer Class¶
This is the superclass for the various definitions in the synthdata.synths module.
- class synthdata.base.Synthesizer(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Abstract Base Class for all synthesizers.
- __init__(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>) None¶
Builds a Synthesizer that’s part of a
synthdata.base.ModelSynthesizer. This is associated with a specific field of the BaseModel.
The
synthdata.base.Behavior is provided as a class name. The instance is created here after the field information is parsed by the prepare() method. This permits a synthdata.base.Pooled synthesizer to immediately create the pool of unique values.
- next() Any¶
Get the next value from the
synthdata.base.Behavior Strategy.
- choice() Any¶
Get an arbitrary value from a
synthdata.base.Pooled Strategy.
- abstract initialize() None¶
Parse the field information and prepare the synthesizer. This is the last step of
__init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
- abstract value_gen(sequence: int | None = None) Any¶
Low-level value synthesis.
- noise_gen(sequence: int | None = None) Any¶
Low-level noise synthesis. Pick one of the
noise_synth functions.
- get_meta(cls_: type[T], getter: Callable[[T], Any]) Any¶
Extract metadata from the Pydantic
FieldInfo instance.
Todo
Reduce reliance on the Pydantic FieldInfo and annotation classes.
Model and Schema Synthesizers¶
The synthdata.base.BaseModelSynthesizer supports the pydantic BaseModel.
The synthdata.base.SchemaSynthesizer permits references among Synthesizers.
This permits FK references to PK pools.
- synthdata.base.synth_name_map() dict[str, type[Synthesizer]]¶
Finds all subclasses of
synthdata.base.Synthesizer.
This is used once during model and schema creation to create a mapping from a name to a class.
Yields name and class from most specific to most general.
- class synthdata.base.DataIter(model: ModelSynthesizer, noise: float = 0.0)¶
Iterate through values of a
synthdata.base.ModelSynthesizer instance. This will create dictionaries suitable for creating Pydantic BaseModel instances.
A trash injector can replace good data with invalid values: bad numbers, bad strings, None, etc. The resulting object will not be a valid Pydantic
BaseModel instance.
Todo
Handle recursive structures here.
- __init__(model: ModelSynthesizer, noise: float = 0.0) None¶
Creates an iterator, bound to a
synthdata.base.ModelSynthesizer instance.
- __next__() dict[str, Any]¶
Creates the next
dict[str, Any] object. Each field is created by the Synthesizer’s attached Behavior.
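The noise probability described above can be pictured with a small standard-library sketch. This is not the library's implementation: `noisy_rows`, the `fields` mapping of (value generator, noise generator) pairs, and the `count` parameter are all hypothetical stand-ins used only to show how a per-field probability decides between a valid value and an invalid one.

```python
import random
from typing import Any, Callable, Iterator

Gen = Callable[[], Any]


def noisy_rows(
    fields: dict[str, tuple[Gen, Gen]],
    noise: float,
    count: int,
    rng: random.Random,
) -> Iterator[dict[str, Any]]:
    """Yield `count` dictionaries, one value per field.

    Hypothetical helper: each field independently gets a noise value
    with probability `noise`, otherwise a normally synthesized value.
    """
    for _ in range(count):
        yield {
            name: noise_gen() if rng.random() < noise else value_gen()
            for name, (value_gen, noise_gen) in fields.items()
        }


rng = random.Random(3)
fields = {
    # (value generator, noise generator) pairs; both are illustrative.
    "age": (lambda: rng.randint(0, 120), lambda: "!?#"),
    "name": (lambda: "Abc", lambda: None),
}
clean = list(noisy_rows(fields, noise=0.0, count=5, rng=rng))
dirty = list(noisy_rows(fields, noise=1.0, count=5, rng=rng))
```

With `noise=0.0` every row is built from valid values; with `noise=1.0` every field is replaced, so the dictionaries would be expected to fail model validation.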
- class synthdata.base.ModelIter(model: ModelSynthesizer, noise: float = 0.0)¶
Iterate through values of a
synthdata.base.ModelSynthesizer instance. This can only create the Pydantic BaseModel instances with valid data.
- __init__(model: ModelSynthesizer, noise: float = 0.0) None¶
Creates an iterator, bound to a
synthdata.base.ModelSynthesizer instance.
- class synthdata.base.ModelSynthesizer(cls_: type, rows: int | None = None)¶
Abstract Base Class for various model synthesizers.
- abstract __init__(cls_: type, rows: int | None = None) None¶
Initialize a model synthesizer.
- Parameters:
cls_ – a class definition with fields of some kind.
rows – the number of rows to generate when creating key pools.
noise – the probability of invalid data in the resulting object. Note that non-zero noise values may lead to Pydantic validation errors.
- class synthdata.base.BaseModelSynthesizer(cls_: type[BaseModel], rows: int | None = None)¶
Synthesizes instances of Pydantic
BaseModel.
This expects a flat collection of atomic fields. In that way, it’s biased to work with SQL-oriented schemas, which tend to be flat.
Todo
Handle recursive structures here.
- __init__(cls_: type[BaseModel], rows: int | None = None) None¶
Initializes a synthesizer for all fields of a given
BaseModel. The number of rows is only needed if there are any synthdata.base.Pooled synthesizers, generally because of SQL primary keys.
Construction of the synthesizers is part of initialization. The
make_field_synth() method is exposed so subclasses can extend the processing.
The general use case is to use the
fields attribute, which is a mapping of field names to synthdata.base.Synthesizer instances.
Todo
Handle nested models.
- sql_rule(field: FieldInfo) tuple[type[Behavior] | None, dict[str, type[Synthesizer]] | None]¶
Rule 0 – PK’s are pooled, FK’s are references to PK pools. This also means
"sql": {'key': "foreign", ... will get a reference synthesizer.
Todo
FK’s may have optionality rules.
The
SynthesizeReference instance may need a subdomain distribution.
Todo
What kind of error for invalid values?
For now, we simply create an Independent behavior. Perhaps a ValueError is better? Or a warning?
- explicit_rule(field: FieldInfo) tuple[type[Behavior] | None, dict[str, type[Synthesizer]] | None]¶
Rule 1 – json_schema_extra names a synthesizer class to use. Generally {"synthesizer": "SomeSynth"}.
- match_rule(field: FieldInfo) tuple[type[Behavior] | None, dict[str, type[Synthesizer] | None]]¶
Rule 2 – deduce the synthesizer from the annotation class and json_schema_extra. This requires that the map keys name all subclasses first and all superclasses last.
In most cases there’s a single
{"type": type[Synthesizer]}.
Optionality returns
{"type": type[Synthesizer], "NoneType": type[Synthesizer]}. Further details may be required to get the frequency of null values correct.
match field.annotation:
    case _UnionGenericAlias() | UnionType():
        # If the union has two parts, one of which is None, then it’s simply optional.
        # Without further details, it’s not clear what fraction should be None. Assume 5%.
        # In general, a distribution table is required in json_schema_extra.
        # {"subDomain": {"None": 1, "int": 99}} for simple optional.
        # {"subDomain": {"int": 3, "float": 1}} for multiple values.
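One way to read a `subDomain` table is as a set of relative weights over the members of a union. The following is a minimal sketch under that assumption. `pick_subdomain` is a hypothetical helper, not part of the package, and the real selection logic may differ.

```python
import random


def pick_subdomain(sub_domain: dict[str, int], rng: random.Random) -> str:
    """Pick one type name, treating the table values as relative weights.

    Hypothetical helper: {"None": 1, "int": 99} is read as
    "None about 1 time in 100, int the rest of the time".
    """
    names = list(sub_domain)
    weights = [sub_domain[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(1)
optional_int = {"None": 1, "int": 99}
picks = [pick_subdomain(optional_int, rng) for _ in range(1000)]
never_none = [pick_subdomain({"None": 0, "int": 1}, rng) for _ in range(50)]
```

The chosen name would then select which of the per-type synthesizers produces the next value for the field.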
- make_field_synth(field: FieldInfo) Synthesizer¶
Given a specific field, apply the matching rules to locate a synthesizer.
Rule 0. PK’s lead to a
synthdata.base.Pooled Strategy. FK’s will be references to PK pools. Otherwise, synthesizers are synthdata.base.Independent.
Rule 1.
json_schema_extra can name a synthesizer to use. The key is "synthesizer". Example: {"synthesizer": "SomeSynth"}
Rule 2. Deduce synthesizer from annotation class and
json_schema_extra properties. This is done by evaluating synthdata.base.Synthesizer.match() for all subclasses of synthdata.base.Synthesizer from most specific subclass to most general superclass. First match is assigned.
Todo
Confirm all alternatives defined in
synth_class_map.
A None in
synth_class_map means an unknown synth, possibly buried in a Union.
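The three rules can be pictured as a cascade in which the first applicable rule wins. The sketch below is an illustration only. It uses plain strings in place of behavior and synthesizer classes, the `MATCHERS` table and `resolve` function are hypothetical, and the `"primary"` key value is an assumption (the text above only shows `"foreign"`).

```python
from typing import Any, Callable

# Hypothetical matcher table, ordered from most specific to most general.
MATCHERS: list[tuple[str, Callable[[type, dict[str, Any]], bool]]] = [
    ("NameSynth", lambda ann, extra: ann is str and extra.get("domain") == "name"),
    ("StringSynth", lambda ann, extra: ann is str),
    ("IntegerSynth", lambda ann, extra: ann is int),
]


def resolve(annotation: type, extra: dict[str, Any]) -> tuple[str, str]:
    """Return (behavior name, synthesizer name) for one field."""
    # Rule 0: SQL keys decide the behavior; FKs become references.
    key = extra.get("sql", {}).get("key")
    if key == "foreign":
        return "Independent", "ReferenceSynth"
    behavior = "Pooled" if key == "primary" else "Independent"
    # Rule 1: an explicitly named synthesizer wins.
    if "synthesizer" in extra:
        return behavior, extra["synthesizer"]
    # Rule 2: first matcher that accepts the annotation and extras.
    for name, matches in MATCHERS:
        if matches(annotation, extra):
            return behavior, name
    raise ValueError(f"no synthesizer matches {annotation!r}")


pk = resolve(int, {"sql": {"key": "primary"}})
named = resolve(str, {"domain": "name"})
explicit = resolve(str, {"synthesizer": "SomeSynth"})
```

Ordering the table from specific to general matters: if the generic string matcher came first, a field tagged `{"domain": "name"}` would never reach the name matcher.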
- class synthdata.base.SchemaSynthesizer¶
Synthesizes collections of Models.
A Schema has multiple models to handle SQL Foreign Key references to another model’s Primary Key.
The PK is a
synthdata.base.Synthesizer with synthdata.base.Pooled behavior.
The FK is a
synthdata.base.SynthesizeReference that extracts values from the PK’s pool.
First, use
add() to add a BaseModel to the schema.
Then, after all models have been added, use
rows() to get rows for a BaseModel.
The
noise value is the probability of noise: invalid values, or None values if None is not permitted.
Important
All models with PK’s must be defined before creating rows for any model with FK’s.
It’s best to define all model classes before trying to emit any data.
- __init__() None¶
- add(model_class: type[BaseModel], rows: int | None = None) None¶
Adds a
BaseModel to the working schema.
- Raises:
KeyError – if the FK reference (
"Model.field") cannot be parsed.
- prepare() None¶
For Pooled synthesizers, this will populate the pools. Once. After that, it does nothing.
- reset() None¶
Reset the synthesizers to repopulate pools.
- rows(model_class: type[BaseModel]) Iterator[BaseModel]¶
Returns the iterator for the
synthdata.synth.SynthesizeModel. If any FK reference is not resolved, this will raise an exception.
- Raises:
KeyError – if the FK reference (
"Model.field") cannot be found.
ValueError – if the FK reference is not a Pooled synthesizer.
- data(model_class: type[BaseModel], noise: float = 0.0) Iterator[dict[str, Any]]¶
Returns an iterator for potential
synthdata.synth.SynthesizeModel instances. Noise is injected and the values may not be valid.
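The ordering constraint for schemas (PK pools must exist before FK rows are built) can be seen in a standard-library sketch. It does not use the package's classes: the `parents` and `children` rows and their field names are invented for illustration.

```python
import random

rng = random.Random(7)

# Step 1: the parent model's primary keys form a pool of unique values.
pk_pool = rng.sample(range(1, 10_000), k=5)
parents = [{"id": pk} for pk in pk_pool]

# Step 2: only once the pool exists can child rows draw foreign keys
# from it. Drawing before the pool is filled would have nothing to pick.
children = [
    {"id": n, "parent_id": rng.choice(pk_pool)}
    for n in range(20)
]
```

Every `parent_id` is guaranteed to match some parent row because it is drawn from the pool rather than generated independently.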
synths Module¶
Synthesizer Definitions.
- class synthdata.synths.SynthesizeNone(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
Synthesizer
Returns None. Handles
Optional[type] and type | None annotated types.
- initialize()¶
Parse the field information and prepare the synthesizer. This is the last step of
__init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
- value_gen(sequence: int | None = None) Any¶
Low-level value synthesis.
- class synthdata.synths.SynthesizeString(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
Synthesizer
Synthesizes printable strings with no whitespace or
\.
Uses
Field attributes min_length (default 1) and max_length (default 32).
- initialize() None¶
Parse the field information and prepare the synthesizer. This is the last step of
__init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
- value_gen(sequence: int | None = None) Any¶
Low-level value synthesis.
- classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool¶
Generic
Annotated[str, ...].
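As a rough picture of the string rules described for this class (printable characters, no whitespace, no backslash, length between `min_length` and `max_length`), here is a standard-library sketch. `random_string` and `ALPHABET` are hypothetical names, and the package may choose lengths and characters differently.

```python
import random
import string

# Printable ASCII minus whitespace and the backslash character.
ALPHABET = [c for c in string.printable if not c.isspace() and c != "\\"]


def random_string(
    rng: random.Random, min_length: int = 1, max_length: int = 32
) -> str:
    """Build one random string whose length lies in [min_length, max_length]."""
    size = rng.randint(min_length, max_length)
    return "".join(rng.choice(ALPHABET) for _ in range(size))


rng = random.Random(11)
samples = [random_string(rng) for _ in range(200)]
```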
- class synthdata.synths.SynthesizeName(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
SynthesizeString
Synthesizes name-like strings. This title-cases a random string.
Todo
Improve the name generator with better pattern (and anti-pattern).
Options
Get first names from census data; get digraph frequency from last names.
Use NLTK digraph frequencies to generate plausible English-like words.
Default min_length is 3 to avoid 1-char names.
- value_gen(sequence: int | None = None) Any¶
Low-level value synthesis. Picks a length and creates a string of that length in Title Case.
- classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool¶
Requires
Annotated[str, ...] and json_schema_extra with {"domain": "name"}.
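A title-cased random string can be sketched in a few lines. This is an illustration under the assumption that letters are drawn uniformly from the lowercase ASCII alphabet. `random_name` is a hypothetical helper, and the default `max_length` of 32 is borrowed from the string synthesizer's documented default.

```python
import random
import string


def random_name(
    rng: random.Random, min_length: int = 3, max_length: int = 32
) -> str:
    """Pick a length, build a lowercase string, then title-case it."""
    size = rng.randint(min_length, max_length)
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(size))
    return letters.title()


rng = random.Random(5)
names = [random_name(rng) for _ in range(100)]
```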
- class synthdata.synths.SynthesizeNumber(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
Synthesizer
Abstract number synthesizer.
Uses
Field attributes ge (default 0) and le (default 2**32 - 1, $2^{32} - 1$).
Uses
json_schema_extra values: "distribution" – can be "normal" or "uniform".
- initialize() None¶
Parse the field information and prepare the synthesizer. This is the last step of
__init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
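How the `"distribution"` setting interacts with the `ge` and `le` bounds is not spelled out here, so the following is only one plausible reading. It assumes a normal distribution centered on the midpoint of the range, with a standard deviation of one sixth of the range, and it rejects samples that fall outside the bounds. `bounded_number` is a hypothetical helper.

```python
import random


def bounded_number(
    rng: random.Random,
    ge: float = 0,
    le: float = 2**32 - 1,
    distribution: str = "uniform",
) -> float:
    """Draw one value in [ge, le] from the named distribution."""
    if distribution == "uniform":
        return rng.uniform(ge, le)
    if distribution == "normal":
        # Assumption: center on the midpoint, put the bounds at 3 sigma.
        mean = (ge + le) / 2
        sigma = (le - ge) / 6
        while True:
            value = rng.gauss(mean, sigma)
            if ge <= value <= le:  # reject the rare out-of-range draw
                return value
    raise ValueError(f"unknown distribution {distribution!r}")


rng = random.Random(2)
uniform_values = [bounded_number(rng, 10, 20) for _ in range(500)]
normal_values = [bounded_number(rng, 10, 20, "normal") for _ in range(500)]
```

An integer variant could round or truncate the result, at the cost of slightly distorting the distribution near the bounds.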
- class synthdata.synths.SynthesizeInteger(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
SynthesizeNumber
Extends
synthdata.synths.SynthesizeNumber to offer only integer values.
- initialize() None¶
Parse the field information and prepare the synthesizer. This is the last step of
__init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
- value_gen(sequence: int | None = None) Any¶
Creates integer in range with given distribution.
- classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool¶
Requires
Annotated[int, ...].
- class synthdata.synths.SynthesizeFloat(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
SynthesizeNumber
Extends
synthdata.synths.SynthesizeNumber to offer only float values.
- initialize() None¶
Parse the field information and prepare the synthesizer. This is the last step of
__init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
- value_gen(sequence: int | None = None) Any¶
Creates float in range with given distribution.
- classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool¶
Requires
Annotated[float, ...].
- class synthdata.synths.SynthesizeDate(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
SynthesizeNumber
Extends
synthdata.synths.SynthesizeNumber to offer only datetime.datetime values.
Default range is 1970-Jan-1 to 2099-Dec-31.
- initialize() None¶
Parse the field information and prepare the synthesizer. This is the last step of
__init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
- value_gen(sequence: int | None = None) Any¶
Creates date in range with given distribution.
- classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool¶
Requires
Annotated[datetime.datetime, ...]
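Because the date synthesizer extends the number synthesizer, one natural reading is that a date is a numeric offset from the start of the range. The sketch below follows that reading with a uniform offset in whole seconds. `random_datetime`, `START`, and `END` are hypothetical names, and the package may use a different resolution or distribution.

```python
import random
from datetime import datetime, timedelta

# Default range stated above: 1970-Jan-1 to 2099-Dec-31.
START = datetime(1970, 1, 1)
END = datetime(2099, 12, 31)


def random_datetime(
    rng: random.Random, start: datetime = START, end: datetime = END
) -> datetime:
    """Pick a uniformly random whole-second offset within [start, end]."""
    span_seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=rng.randint(0, span_seconds))


rng = random.Random(9)
stamps = [random_datetime(rng) for _ in range(200)]
```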