Synthetic Data Tool Components¶
The Synthetic Data Tool package contains a number of field-level synthesizers, a model synthesizer, and a schema synthesizer.
base Module¶
Contains base class definitions for all synthesizers. This includes the field-level synthesizers, plus model-level and schema-level synthesizers.
A field-level synthesizer has two aspects:
- The essential synth algorithm for a data type. These algorithms generally produce randomly generated sequences of values.
- The behavior of the overall synth process. This falls into two variant strategies:
  - Independent. Duplicates are possible, and the values are not referenced by other synthesizers.
  - Dependent (“Pooled”). Duplicates are prevented, and the pool of values can be referenced by a synthdata.synths.SynthesizeReference instance.
To handle foreign key references, a synthdata.synths.SynthesizeReference extracts values from a pooled synthesizer.
Important
This relies on Pydantic’s rich set of type annotations.
Behavior Strategy¶
The synthdata.base.Behavior
class hierarchy defines a Strategy that
tailors the behavior of a synthdata.base.Synthesizer
.
- class synthdata.base.Behavior(synth: Synthesizer)¶
General features of the alternative Strategy plug-ins for a synthdata.base.Synthesizer.
- abstract next() Any ¶
Returns a next value for a Synthesizer.
- choice() Any ¶
Returns a value chosen from a pool. Only the Pooled subclass overrides this.
- class synthdata.base.Independent(synth: Synthesizer)¶
Defines a synthdata.base.Synthesizer that cannot be the target of a SynthesizerReference. Each value is generated randomly, with no assurance of uniqueness.
This is, in effect, the default behavior of a synthdata.base.Synthesizer. The methods are delegated back to the synthdata.base.Synthesizer instance.
- __init__(synth: Synthesizer) None ¶
Constructs a Behavior strategy bound to a specific
synthdata.base.Synthesizer
.
- next() Any ¶
Returns a value from
synthdata.base.Synthesizer.value_gen()
.
- class synthdata.base.Pooled(synth: Synthesizer)¶
Defines a synthdata.base.Synthesizer that creates a pool of values. This can be the target of a synthdata.base.SynthesizerReference instance. This requires an associated synthdata.base.ModelSynthesizer to provide the target number of rows in the pool.
Values in the pool will be unique.
- __init__(synth: Synthesizer) None ¶
Constructs a Behavior strategy bound to a specific
synthdata.base.Synthesizer
.
- fill()¶
Populate the self.pool collection of unique values. This uses synthdata.base.Synthesizer.value_gen() to build instances. This is used by the model’s _prepare() method.
- next() Any ¶
When next(synth_instance) is used, synthdata.base.Synthesizer.__next__() calls this method to pick a value at random from the pool.
- choice() Any ¶
Returns a value chosen from a pool.
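The difference between the two strategies can be illustrated with a minimal sketch. The classes below (ToySynth, ToyIndependent, ToyPooled) are hypothetical stand-ins written only for this illustration; they are not the synthdata classes and make no claim about how the package implements them.

```python
import random
from typing import Any


class ToySynth:
    """Hypothetical stand-in for a synthesizer; not the synthdata class."""

    def __init__(self, rows: int, seed: int = 42) -> None:
        self.rows = rows
        self.rng = random.Random(seed)

    def value_gen(self) -> int:
        # Low-level synthesis: a random integer from a small domain.
        return self.rng.randint(1, 1000)


class ToyIndependent:
    """Each call generates a fresh value; duplicates are possible."""

    def __init__(self, synth: ToySynth) -> None:
        self.synth = synth

    def next(self) -> Any:
        return self.synth.value_gen()


class ToyPooled:
    """Builds a pool of unique values once, then picks from it."""

    def __init__(self, synth: ToySynth) -> None:
        self.synth = synth
        self.pool: list[Any] = []

    def fill(self) -> None:
        # Keep generating until the pool holds `rows` distinct values.
        seen: set[Any] = set()
        while len(seen) < self.synth.rows:
            seen.add(self.synth.value_gen())
        self.pool = sorted(seen)

    def next(self) -> Any:
        return self.choice()

    def choice(self) -> Any:
        return self.synth.rng.choice(self.pool)


pooled = ToyPooled(ToySynth(rows=10))
pooled.fill()
independent = ToyIndependent(ToySynth(rows=10))
sample = [independent.next() for _ in range(5)]
```

In this sketch the pooled strategy pays the cost of uniqueness once, in fill(), so that any later reference can draw from a fixed set of values.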
Synthesizer Class¶
This is the superclass for the various definitions in the synthdata.synths
module.
- class synthdata.base.Synthesizer(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Abstract Base Class for all synthesizers.
- __init__(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>) None ¶
Builds a Synthesizer that’s part of a synthdata.base.ModelSynthesizer. This is associated with a specific field of the BaseModel.
The synthdata.base.Behavior is provided as a class name. The instance is created here after the field information is parsed by the prepare() method. This permits a synthdata.base.Pooled synthesizer to immediately create the pool of unique values.
- next() Any ¶
Get the next value from the
synthdata.base.Behavior
Strategy.
- choice() Any ¶
Get an arbitrary value from a
synthdata.base.Pooled
Strategy.
- abstract initialize() None ¶
Parse the field information and prepare the synthesizer. This is the last step of __init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
- abstract value_gen(sequence: int | None = None) Any ¶
Low-level value synthesis.
- noise_gen(sequence: int | None = None) Any ¶
Low-level noise synthesis. Pick one of the
noise_synth
functions.
- get_meta(cls_: type[T], getter: Callable[[T], Any]) Any ¶
Extract metadata from the Pydantic
FieldInfo
instance.
Todo
Reduce reliance on the Pydantic FieldInfo
and annotation classes.
Model and Schema Synthesizers¶
The synthdata.base.BaseModelSynthesizer supports the Pydantic BaseModel.
The synthdata.base.SchemaSynthesizer
permits references among Synthesizers.
This permits FK references to PK pools.
- synthdata.base.synth_name_map() dict[str, type[Synthesizer]] ¶
Finds all subclasses of synthdata.base.Synthesizer.
This is used once during model and schema creation to create a mapping from a name to a class.
Yields name and class from most specific to most general.
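One way to build a name-to-class mapping ordered from most specific to most general is a post-order walk over __subclasses__(). The sketch below uses a hypothetical class tree (Base, Text, Name, Number, Integer) and a hypothetical name_map() helper; it is an assumption about the general technique, not the package's code.

```python
class Base:
    """Hypothetical root class standing in for a synthesizer base."""


class Text(Base): ...
class Name(Text): ...
class Number(Base): ...
class Integer(Number): ...


def name_map(root: type) -> dict[str, type]:
    """Map class names to classes, most specific first (post-order walk)."""
    result: dict[str, type] = {}

    def walk(cls: type) -> None:
        for sub in cls.__subclasses__():
            walk(sub)
        # A class is added only after all of its subclasses.
        result.setdefault(cls.__name__, cls)

    walk(root)
    return result


mapping = name_map(Base)
order = list(mapping)
```

Because dictionaries preserve insertion order, iterating over the mapping visits every subclass before its superclass. In this sketch the root class is included last.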
- class synthdata.base.DataIter(model: ModelSynthesizer, noise: float = 0.0)¶
Iterate through values of a synthdata.base.ModelSynthesizer instance. This will create dictionaries suitable for creating Pydantic BaseModel instances.
A trash injector can replace good data with invalid values: bad numbers, bad strings, None, etc. The resulting object will not be a valid Pydantic BaseModel instance.
Todo
Handle recursive structures here.
- __init__(model: ModelSynthesizer, noise: float = 0.0) None ¶
Creates an iterator, bound to a
synthdata.base.ModelSynthesizer
instance.
- __next__() dict[str, Any] ¶
Creates the next dict[str, Any] object. Each field is created by the Synthesizer’s attached Behavior.
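The effect of the noise probability can be sketched as follows. The noisy_row() function and the generators mapping are hypothetical names for this illustration; the real iterator may choose junk values differently (for example, out-of-bounds numbers).

```python
import random
import string
from typing import Any, Callable


def noisy_row(
    generators: dict[str, Callable[[], Any]],
    noise: float,
    rng: random.Random,
) -> dict[str, Any]:
    """Build one row; each field is replaced by junk with probability `noise`."""
    row: dict[str, Any] = {}
    for name, gen in generators.items():
        if rng.random() < noise:
            # Junk: a short run of punctuation, unlikely to validate.
            row[name] = "".join(rng.choices(string.punctuation, k=4))
        else:
            row[name] = gen()
    return row


rng = random.Random(1)
generators = {"id": lambda: rng.randint(1, 99), "score": lambda: rng.random()}
clean = noisy_row(generators, noise=0.0, rng=rng)
dirty = noisy_row(generators, noise=1.0, rng=rng)
```

With noise=0.0 every field comes from its generator; with noise=1.0 every field is junk. Values in between produce rows that may fail validation.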
- class synthdata.base.ModelIter(model: ModelSynthesizer, noise: float = 0.0)¶
Iterate through values of a synthdata.base.ModelSynthesizer instance. This can only create the Pydantic BaseModel instances with valid data.
- __init__(model: ModelSynthesizer, noise: float = 0.0) None ¶
Creates an iterator, bound to a
synthdata.base.ModelSynthesizer
instance.
- class synthdata.base.ModelSynthesizer(cls_: type, rows: int | None = None)¶
Abstract Base Class for various model synthesizers.
- abstract __init__(cls_: type, rows: int | None = None) None ¶
Initialize a model synthesizer.
- Parameters:
cls_ – a class definition with fields of some kind.
rows – the number of rows to generate when creating key pools.
noise – the probability of invalid data in the resulting object. Note that non-zero noise values may lead to Pydantic validation errors.
- class synthdata.base.BaseModelSynthesizer(cls_: type[BaseModel], rows: int | None = None)¶
Synthesizes instances of Pydantic BaseModel.
This expects a flat collection of atomic fields. In that way, it’s biased to work with SQL-oriented schemas, which tend to be flat.
Todo
Handle recursive structures here.
- __init__(cls_: type[BaseModel], rows: int | None = None) None ¶
Initializes a synthesizer for all fields of a given BaseModel. The number of rows is only needed if there are any synthdata.base.Pooled synthesizers, generally because of SQL primary keys.
Construction of the synthesizers is part of initialization. The make_field_synth() method is exposed so subclasses can extend the processing.
The general use case is to use the fields attribute, which is a mapping of field names to synthdata.base.Synthesizer instances.
Todo
Handle nested models.
- sql_rule(field: FieldInfo) tuple[type[Behavior] | None, dict[str, type[Synthesizer]] | None] ¶
Rule 0 – PK’s are pooled, FK’s are references to PK pools. This also means "sql": {'key': "foreign", ... will get a reference synthesizer.
Todo
FK’s may have optionality rules.
The SynthesizeReference instance may need a subdomain distribution.
Todo
What kind of error for invalid values?
For now, we simply create an Independent behavior. Perhaps a ValueError is better? Or a warning?
- explicit_rule(field: FieldInfo) tuple[type[Behavior] | None, dict[str, type[Synthesizer]] | None] ¶
Rule 1 – json_schema_extra names a synthesizer class to use. Generally {"synthesizer": "SomeSynth"}.
- match_rule(field: FieldInfo) tuple[type[Behavior] | None, dict[str, type[Synthesizer] | None]] ¶
Rule 2 – deduce synthesizer from annotation class and json_schema_extra. Requires that the map keys name all subclasses first and all superclasses last.
In most cases there’s a single {"type": type[Synthesizer]}.
Optionality returns {"type": type[Synthesizer], "NoneType": type[Synthesizer]}. Further details may be required to get the frequency of null values correct.
- match field.annotation:
- case _UnionGenericAlias() | UnionType():
# If the union has two parts, one of which is None, then it’s simply optional.
# Without further details, it’s not clear what fraction should be None. Assume 5%.
# In general, a distribution table is required in json_schema_extra.
# {"subDomain": {"None": 1, "int": 99}} for simple optional.
# {"subDomain": {"int": 3, "float": 1}} for multiple values.
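A "subDomain" table of this shape reads naturally as a set of relative weights. The sketch below shows one way to sample from such a table with random.choices(); the pick_type() helper is a hypothetical name, and treating the numbers as weights is an assumption about how the table is meant to be used.

```python
import random
from collections import Counter

sub_domain = {"None": 1, "int": 99}  # relative weights, as in the simple optional case


def pick_type(table: dict[str, int], rng: random.Random) -> str:
    """Pick a type name with probability proportional to its weight."""
    names = list(table)
    weights = [table[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(7)
counts = Counter(pick_type(sub_domain, rng) for _ in range(10_000))
```

With weights of 1 and 99, roughly one value in a hundred would be None under this reading.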
- make_field_synth(field: FieldInfo) Synthesizer ¶
Given a specific field, apply the matching rules to locate a synthesizer.
Rule 0. PK’s lead to a synthdata.base.Pooled Strategy. FK’s will be references to PK pools. Otherwise, synthesizers are synthdata.base.Independent.
Rule 1. json_schema_extra can name a synthesizer to use. The key is "synthesizer". Example: {"synthesizer": "SomeSynth"}
Rule 2. Deduce synthesizer from annotation class and json_schema_extra properties. This is done by evaluating synthdata.base.Synthesizer.match() for all subclasses of synthdata.base.Synthesizer from most specific subclass to most general superclass. First match is assigned.
Todo
Confirm all alternatives defined in synth_class_map.
A None in synth_class_map means an unknown synth, possibly buried in a Union.
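Rules 1 and 2 amount to "explicit name first, otherwise first match in specificity order". A minimal sketch, using hypothetical classes (AnySynth, StringSynth, NameSynth) and a hypothetical choose() function rather than the package's own code:

```python
from typing import Any, Optional


class AnySynth:
    """Hypothetical base; matches nothing by itself."""

    @classmethod
    def match(cls, field_type: type, extra: dict[str, Any]) -> bool:
        return False


class StringSynth(AnySynth):
    @classmethod
    def match(cls, field_type: type, extra: dict[str, Any]) -> bool:
        return field_type is str


class NameSynth(StringSynth):
    @classmethod
    def match(cls, field_type: type, extra: dict[str, Any]) -> bool:
        return field_type is str and extra.get("domain") == "name"


# Ordered most specific first, so NameSynth is tested before StringSynth.
CANDIDATES: list[type[AnySynth]] = [NameSynth, StringSynth, AnySynth]
BY_NAME = {c.__name__: c for c in CANDIDATES}


def choose(field_type: type, extra: dict[str, Any]) -> Optional[type[AnySynth]]:
    # Rule 1: an explicit "synthesizer" name wins.
    if "synthesizer" in extra:
        return BY_NAME[extra["synthesizer"]]
    # Rule 2: the first class whose match() accepts the field.
    for cls in CANDIDATES:
        if cls.match(field_type, extra):
            return cls
    return None
```

The ordering matters: if StringSynth were tested first, a field with {"domain": "name"} would never reach NameSynth.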
- class synthdata.base.SchemaSynthesizer¶
Synthesizes collections of Models.
A Schema has multiple models to handle SQL Foreign Key references to another model’s Primary Key.
The PK is a synthdata.base.Synthesizer with synthdata.base.Pooled behavior.
The FK is a synthdata.base.SynthesizeReference that extracts values from the PK’s pool.
First, use add() to add a BaseModel to the schema.
Then, after all models have been added, use rows() to get rows for a BaseModel.
The noise value is the probability of noise: invalid values or None values if None is not permitted.
Important
All models with PK’s must be defined before creating rows for any model with FK’s.
It’s best to define all model classes before trying to emit any data.
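The ordering requirement follows from how a foreign key gets its values. The sketch below uses plain dictionaries and hypothetical "customer" and "order" tables to show the idea; it does not use the synthdata API.

```python
import random

rng = random.Random(3)

# Parent model: the primary keys form a pool of unique values.
customer_ids = rng.sample(range(1000, 10_000), k=5)
customers = [{"id": pk, "name": f"Customer{n}"} for n, pk in enumerate(customer_ids)]

# Child model: each foreign key is drawn from the parent's pool,
# so every order refers to a customer that actually exists.
orders = [
    {"order_id": n, "customer_id": rng.choice(customer_ids)}
    for n in range(20)
]
```

If the child rows were generated before the parent pool existed, there would be nothing to draw the foreign key values from.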
- __init__() None ¶
- add(model_class: type[BaseModel], rows: int | None = None) None ¶
Adds a BaseModel to the working schema.
- Raises:
KeyError – if the FK reference ("Model.field") cannot be parsed.
- prepare() None ¶
For Pooled synthesizers, this will populate the pools. Once. After that, it does nothing.
- reset() None ¶
Reset the synthesizers to repopulate pools.
- rows(model_class: type[BaseModel]) Iterator[BaseModel] ¶
Returns the iterator for the synthdata.synth.SynthesizeModel. If any FK reference is not resolved, raises an exception.
- Raises:
KeyError – if the FK reference ("Model.field") cannot be found.
ValueError – if the FK reference is not a Pooled synthesizer.
- data(model_class: type[BaseModel], noise: float = 0.0) Iterator[dict[str, Any]] ¶
Returns an iterator for potential
synthdata.synth.SynthesizeModel
instances. Noise is injected and the values may not be valid.
synths Module¶
Synthesizer Definitions.
- class synthdata.synths.SynthesizeNone(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
Synthesizer
Returns None. Handles Optional[type] and type | None annotated types.
- initialize()¶
Parse the field information and prepare the synthesizer. This is the last step of __init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
- value_gen(sequence: int | None = None) Any ¶
Low-level value synthesis.
- class synthdata.synths.SynthesizeString(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
Synthesizer
Synthesizes printable strings with no whitespace or \.
Uses Field attributes
min_length (default 1)
max_length (default 32)
- initialize() None ¶
Parse the field information and prepare the synthesizer. This is the last step of __init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
- value_gen(sequence: int | None = None) Any ¶
Low-level value synthesis.
- classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool ¶
Generic
Annotated[str, ...]
.
- class synthdata.synths.SynthesizeName(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
SynthesizeString
Synthesizes name-like strings. This title-cases a random string.
Todo
Improve the name generator with better pattern (and anti-pattern).
Options
get first names from census data; get digraph frequency from last names.
Use NLTK digraph frequencies to generate plausible English-like words.
Default min_length is 3 to avoid 1-char names.
- value_gen(sequence: int | None = None) Any ¶
Low-level value synthesis. Creates length and a string of the desired length in Title Case.
- classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool ¶
Requires Annotated[str, ...] and json_schema_extra with {"domain": "name"}.
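The "title-cased random string" idea can be sketched in a few lines. The name_like() function is a hypothetical illustration; limiting the alphabet to lowercase ASCII letters is an assumption, and the real synthesizer may draw from a wider character set.

```python
import random
import string


def name_like(rng: random.Random, min_length: int = 3, max_length: int = 32) -> str:
    """Pick a length in [min_length, max_length], then title-case random letters."""
    size = rng.randint(min_length, max_length)
    letters = "".join(rng.choices(string.ascii_lowercase, k=size))
    return letters.title()


rng = random.Random(11)
names = [name_like(rng, max_length=8) for _ in range(5)]
```

The results look like names only in their capitalization, which is why the Todo above suggests digraph frequencies as an improvement.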
- class synthdata.synths.SynthesizeNumber(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
Synthesizer
Abstract number synthesizer.
Uses Field attributes
ge (default 0)
le (default 2**32 - 1)
Uses json_schema_extra values
"distribution" – can be "normal" or "uniform".
- initialize() None ¶
Parse the field information and prepare the synthesizer. This is the last step of __init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
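The ge, le, and "distribution" settings described for SynthesizeNumber can be sketched as follows. The bounded_value() function is hypothetical, and the choice of mean and standard deviation for the "normal" case is purely an assumption for illustration; the document does not say how the package parameterizes it.

```python
import random


def bounded_value(
    rng: random.Random, ge: float, le: float, distribution: str = "uniform"
) -> float:
    """Draw a value in [ge, le] using a uniform or (clamped) normal distribution."""
    if distribution == "uniform":
        return rng.uniform(ge, le)
    if distribution == "normal":
        # Assumption: center on the midpoint, use one sixth of the range as sigma,
        # then clamp the rare outliers back into the bounds.
        mean = (ge + le) / 2
        sigma = (le - ge) / 6
        return min(le, max(ge, rng.gauss(mean, sigma)))
    raise ValueError(f"unknown distribution: {distribution!r}")


rng = random.Random(5)
uniform_samples = [bounded_value(rng, 0, 100) for _ in range(1000)]
normal_samples = [bounded_value(rng, 0, 100, "normal") for _ in range(1000)]
```

An integer variant could round the result; a float variant could return it unchanged.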
- class synthdata.synths.SynthesizeInteger(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
SynthesizeNumber
Extends synthdata.synths.SynthesizeNumber to offer only integer values.
- initialize() None ¶
Parse the field information and prepare the synthesizer. This is the last step of __init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
- value_gen(sequence: int | None = None) Any ¶
Creates integer in range with given distribution.
- classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool ¶
Requires
Annotated[int, ...]
.
- class synthdata.synths.SynthesizeFloat(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
SynthesizeNumber
Extends synthdata.synths.SynthesizeNumber to offer only float values.
- initialize() None ¶
Parse the field information and prepare the synthesizer. This is the last step of __init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
- value_gen(sequence: int | None = None) Any ¶
Creates float in range with given distribution.
- classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool ¶
Requires
Annotated[float, ...]
.
- class synthdata.synths.SynthesizeDate(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶
Bases:
SynthesizeNumber
Extends synthdata.synths.SynthesizeNumber to offer only datetime.datetime values.
Default range is 1970-Jan-1 to 2099-Dec-31.
- initialize() None ¶
Parse the field information and prepare the synthesizer. This is the last step of __init__().
This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
- value_gen(sequence: int | None = None) Any ¶
Creates date in range with given distribution.
- classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool ¶
Requires
Annotated[datetime.datetime, ...]
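Treating a date as a number is one plausible reason for building the date synthesizer on the number synthesizer: a datetime in a range is a start point plus a random offset. The random_datetime() function below is a hypothetical sketch of that idea with a uniform distribution and whole-second resolution, both of which are assumptions.

```python
import datetime
import random

START = datetime.datetime(1970, 1, 1)
END = datetime.datetime(2099, 12, 31)


def random_datetime(
    rng: random.Random,
    start: datetime.datetime = START,
    end: datetime.datetime = END,
) -> datetime.datetime:
    """Pick a uniformly distributed whole second between start and end."""
    span = int((end - start).total_seconds())
    return start + datetime.timedelta(seconds=rng.randint(0, span))


rng = random.Random(2)
stamps = [random_datetime(rng) for _ in range(100)]
```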