Synthetic Data Tool Components¶

The Synthetic Data Tool package contains a number of field-level synthesizers, a model synthesizer, and a schema synthesizer.

`base` Module¶

Contains base class definitions for all synthesizers. This includes the field-level synthesizers, plus model-level and schema-level synthesizers.

A field-level synthesizer has two aspects.

An essential synth algorithm for each of the various data types. These are generally randomly-generated sequences of values.
The behavior of the overall synth process. This falls into two variant strategies:
- Independent. Duplicates are possible, and the values are not referenced by other synthesizers.
- Dependent (“Pooled”). Duplicates are prevented, and the pool of values can be referenced by a synthdata.synths.SynthesizeReference instance.

In order to handle foreign key references, a synthdata.synths.SynthesizeReference extracts values from a pooled synthesizer.

Important

This relies on Pydantic’s rich set of type annotations.

Behavior Strategy¶

The synthdata.base.Behavior class hierarchy defines a Strategy that tailors the behavior of a synthdata.base.Synthesizer.

class synthdata.base.Behavior(synth: Synthesizer)¶

General features of the alternative Strategy plug-ins for a synthdata.base.Synthesizer.

abstract next() → Any¶: Returns a next value for a Synthesizer.

choice() → Any¶: Returns a value chosen from a pool, only overridden by the Pooled subclass.

class synthdata.base.Independent(synth: Synthesizer)¶

Defines a synthdata.base.Synthesizer that cannot be the target of a SynthesizerReference. Each value is generated randomly, with no assurance of uniqueness.

This is – in effect – the default behavior of a synthdata.base.Synthesizer. The methods are delegated back to the synthdata.base.Synthesizer instance.

__init__(synth: Synthesizer) → None¶: Constructs a Behavior strategy bound to a specific synthdata.base.Synthesizer.

next() → Any¶: Returns a value from synthdata.base.Synthesizer.value_gen().

class synthdata.base.Pooled(synth: Synthesizer)¶

Defines a synthdata.base.Synthesizer that creates a pool of values. This can be the target of a synthdata.base.SynthesizerReference instance. This requires an associated synthdata.base.SynthesizerModel to provide the target number of rows in the pool.

Values in the pool will be unique.

__init__(synth: Synthesizer) → None¶: Constructs a Behavior strategy bound to a specific synthdata.base.Synthesizer.

fill()¶: Populate the self.pool collection of unique values. This uses synthdata.base.Synthesizer.value_gen() to build instances. This is used by the model’s _prepare() method.

next() → Any¶: When next(synth_instance) is used, synthdata.base.Synthesizer.__next__() calls this method to pick a value at random from the pool.

choice() → Any¶: Returns a value chosen from a pool.
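The relationship between the two strategies can be sketched with the standard library alone. This is an illustration of the Strategy idea described above, not the synthdata code: the `Sketch*` class names are hypothetical, a bare callable stands in for the bound Synthesizer, and the base-class `choice()` raising `NotImplementedError` is an assumption.

```python
import random
from abc import ABC, abstractmethod
from typing import Any, Callable


class SketchBehavior(ABC):
    """Hypothetical stand-in for a Behavior strategy."""

    def __init__(self, value_gen: Callable[[], Any]) -> None:
        # Assumption: a bare callable replaces the bound Synthesizer.
        self.value_gen = value_gen

    @abstractmethod
    def next(self) -> Any:
        """Return the next value."""

    def choice(self) -> Any:
        # Assumption: without a pool there is nothing to choose from.
        raise NotImplementedError("only a pooled behavior has a pool")


class SketchIndependent(SketchBehavior):
    def next(self) -> Any:
        # Delegate straight to the generator; duplicates are possible.
        return self.value_gen()


class SketchPooled(SketchBehavior):
    def __init__(self, value_gen: Callable[[], Any], rows: int, rng: random.Random) -> None:
        super().__init__(value_gen)
        self.rows = rows
        self.rng = rng
        self.pool: list[Any] = []

    def fill(self) -> None:
        # A dict keeps insertion order and silently drops duplicate keys.
        unique: dict[Any, None] = {}
        while len(unique) < self.rows:
            unique[self.value_gen()] = None
        self.pool = list(unique)

    def next(self) -> Any:
        # Both next() and choice() pick at random from the pool.
        return self.choice()

    def choice(self) -> Any:
        return self.rng.choice(self.pool)


rng = random.Random(7)
independent = SketchIndependent(lambda: rng.randint(1, 3))
pooled = SketchPooled(lambda: rng.randint(1, 1_000_000), rows=20, rng=rng)
pooled.fill()
```

Note that `fill()` in this sketch loops until it has enough distinct values, so the generator's domain must be at least as large as the requested number of rows.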

Synthesizer Class¶

This is the superclass for the various definitions in the synthdata.synths module.

class synthdata.base.Synthesizer(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶

Abstract Base Class for all synthesizers.

__init__(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>) → None¶

Builds a Synthesizer that’s part of a synthdata.base.ModelSynthesizer. This is associated with a specific field of the BaseModel.

The synthdata.base.Behavior is provided as a class name. The instance is created here after the field information is parsed by the prepare() method. This permits a synthdata.base.Pooled synthesizer to immediately create the pool of unique values.

next() → Any¶: Get the next value from the synthdata.base.Behavior Strategy.

choice() → Any¶: Get an arbitrary value from a synthdata.base.Pooled Strategy.

abstract initialize() → None¶

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

abstract value_gen(sequence: int | None = None) → Any¶: Low-level value synthesis.

noise_gen(sequence: int | None = None) → Any¶: Low-level noise synthesis. Pick one of the noise_synth functions.

get_meta(cls_: type[T], getter: Callable[[T], Any]) → Any¶: Extract metadata from the Pydantic FieldInfo instance.

Todo

Reduce reliance on the Pydantic FieldInfo and annotation classes.

Model and Schema Synthesizers¶

The synthdata.base.BaseModelSynthesizer supports the pydantic BaseModel. The synthdata.base.SchemaSynthesizer permits references among Synthesizers. This permits FK references to PK pools.

synthdata.base.synth_name_map() → dict[str, type[Synthesizer]]¶

Finds all subclasses of synthdata.base.Synthesizer.

This is used once during model and schema creation to create a mapping from a name to a class.

Yields name and class from most specific to most general.
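One way to obtain the "most specific to most general" ordering is a post-order walk over `__subclasses__()`. The sketch below shows that technique on a toy hierarchy; `subclass_name_map` and the `Sketch*` classes are hypothetical, and the real function may be implemented differently.

```python
from typing import Iterator


class SketchSynth: ...
class SketchString(SketchSynth): ...
class SketchName(SketchString): ...
class SketchNumber(SketchSynth): ...
class SketchInteger(SketchNumber): ...


def subclass_name_map(base: type) -> dict[str, type]:
    """Map class names to classes, listing descendants before their parents."""

    def walk(cls: type) -> Iterator[type]:
        for sub in cls.__subclasses__():
            yield from walk(sub)  # deeper (more specific) classes first
            yield sub

    # Dicts preserve insertion order, so the walk order is the lookup order.
    return {cls.__name__: cls for cls in walk(base)}


name_map = subclass_name_map(SketchSynth)
```

Because a dict preserves insertion order, iterating over the mapping visits `SketchName` before `SketchString`, which is what a first-match search needs.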

class synthdata.base.DataIter(model: ModelSynthesizer, noise: float = 0.0)¶

Iterate through values of a synthdata.base.ModelSynthesizer instance. This will create dictionaries suitable for creating Pydantic BaseModel instances.

A trash injector can replace good data with invalid values – bad numbers, bad strings, None, etc. The resulting object will not be a valid Pydantic BaseModel instance.

Todo

Handle recursive structures here.

__init__(model: ModelSynthesizer, noise: float = 0.0) → None¶: Creates an iterator, bound to a synthdata.base.ModelSynthesizer instance.

__next__() → dict[str, Any]¶: Creates the next dict[str, Any] object. Each field is created by the Synthesizer’s attached Behavior.

class synthdata.base.ModelIter(model: ModelSynthesizer, noise: float = 0.0)¶

Iterate through values of a synthdata.base.ModelSynthesizer instance. This can only create the Pydantic BaseModel instances with valid data.

__init__(model: ModelSynthesizer, noise: float = 0.0) → None¶: Creates an iterator, bound to a synthdata.base.ModelSynthesizer instance.

class synthdata.base.ModelSynthesizer(cls_: type, rows: int | None = None)¶

Abstract Base Class for various model synthesizers.

abstract __init__(cls_: type, rows: int | None = None) → None¶

Initialize a model synthesizer.

Parameters:

cls_ – a class definition with fields of some kind.
rows – the number of rows to generate when creating key pools.
noise – the probability of invalid data in the resulting object. Note that non-zero noise values may lead to Pydantic validation errors.

class synthdata.base.BaseModelSynthesizer(cls_: type[BaseModel], rows: int | None = None)¶

Synthesizes instances of Pydantic BaseModel.

This expects a flat collection of atomic fields. In that way, it’s biased toward SQL-oriented schemas, which tend to be flat.

Todo

Handle recursive structures here.

__init__(cls_: type[BaseModel], rows: int | None = None) → None¶

Initializes a synthesizer for all fields of a given BaseModel. The number of rows is only needed if there are any synthdata.base.Pooled synthesizers, generally because of SQL primary keys.

Construction of the synthesizers is part of initialization. The make_field_synth() method is exposed so subclasses can extend the processing.

The general use case is to use the fields attribute, which is a mapping of field names to synthdata.base.Synthesizer instances.

Todo

Handle nested models.

sql_rule(field: FieldInfo) → tuple[type[Behavior] | None, dict[str, type[Synthesizer]] | None]¶: Rule 0 – PK’s are pooled, FK’s are references to PK pools. This also means "sql": {'key': "foreign", ... will get a reference synthesizer.

Todo

FK’s may have optionality rules.

The SynthesizeReference instance may need a subdomain distribution.

Todo

What kind of error for invalid values?

For now, we simply create an Independent behavior. Perhaps a ValueError is better? Or a warning?

explicit_rule(field: FieldInfo) → tuple[type[Behavior] | None, dict[str, type[Synthesizer]] | None]¶: Rule 1 – json_schema_extra names a synthesizer class to use. Generally {"synthesizer": "SomeSynth"}

match_rule(field: FieldInfo) → tuple[type[Behavior] | None, dict[str, type[Synthesizer] | None]]¶

Rule 2 – deduce the synthesizer from the annotation class and json_schema_extra. This requires that the map keys name all subclasses first and all superclasses last.

In most cases there’s a single {"type": type[Synthesizer]}.

Optionality returns {"type": type[Synthesizer], "NoneType": type[Synthesizer]}. Further details may be required to get the frequency of null values correct.

match field.annotation:
    case _UnionGenericAlias() | UnionType():
        # If the union has two parts, one of which is None, then it’s simply optional.
        # Without further details, it’s not clear what fraction should be None. Assume 5%.
        # In general, a distribution table is required in json_schema_extra.
        # {"subDomain": {"None": 1, "int": 99}} for simple optional.
        # {"subDomain": {"int": 3, "float": 1}} for multiple values.
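A "subDomain" table of this shape can be read as relative weights for choosing which member of the union supplies the next value. The sketch below shows one way to do that with `random.choices`; `pick_type_name` is a hypothetical helper, and how the package actually consumes the table is not shown in this reference.

```python
import random
from collections import Counter


def pick_type_name(rng: random.Random, sub_domain: dict[str, int]) -> str:
    """Choose one type name, weighted by the table's relative frequencies."""
    names = list(sub_domain)
    weights = list(sub_domain.values())
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(11)
simple_optional = {"None": 1, "int": 99}
counts = Counter(pick_type_name(rng, simple_optional) for _ in range(10_000))
```

With weights of 1 and 99, roughly one value in a hundred would be None; the counts vary from run to run unless the generator is seeded.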

make_field_synth(field: FieldInfo) → Synthesizer¶

Given a specific field, apply the matching rules to locate a synthesizer.

Rule 0. PK’s lead to a synthdata.base.Pooled Strategy. FK’s will be references to PK pools. Otherwise, synthesizers are synthdata.base.Independent.
Rule 1. json_schema_extra can name a synthesizer to use. The key is "synthesizer". Example: {"synthesizer": "SomeSynth"}
Rule 2. Deduce synthesizer from annotation class and json_schema_extra properties. This is done by evaluating synthdata.base.Synthesizer.match() for all subclasses of synthdata.base.Synthesizer from most specific subclass to most general superclass. First match is assigned.
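Rules 1 and 2 amount to a first-match search, which is why the ordering of the class map matters. The following sketch illustrates the idea with toy classes; `pick_synth`, `SYNTH_MAP`, and the `Toy*` classes are hypothetical, Rule 0 is left out, and raising `ValueError` when nothing matches is an assumption.

```python
from typing import Any


class ToyString:
    @classmethod
    def match(cls, field_type: type, json_schema_extra: dict[str, Any]) -> bool:
        return field_type is str


class ToyName(ToyString):
    @classmethod
    def match(cls, field_type: type, json_schema_extra: dict[str, Any]) -> bool:
        return field_type is str and json_schema_extra.get("domain") == "name"


class ToyInteger:
    @classmethod
    def match(cls, field_type: type, json_schema_extra: dict[str, Any]) -> bool:
        return field_type is int


# Most specific subclass first, so ToyName is tested before ToyString.
SYNTH_MAP: dict[str, type] = {
    "ToyName": ToyName,
    "ToyString": ToyString,
    "ToyInteger": ToyInteger,
}


def pick_synth(field_type: type, json_schema_extra: dict[str, Any]) -> type:
    # Rule 1: an explicit "synthesizer" name wins outright.
    explicit = json_schema_extra.get("synthesizer")
    if explicit is not None:
        return SYNTH_MAP[explicit]
    # Rule 2: otherwise, the first class whose match() accepts the field.
    for synth_class in SYNTH_MAP.values():
        if synth_class.match(field_type, json_schema_extra):
            return synth_class
    raise ValueError(f"no synthesizer matches {field_type!r}")
```

If `ToyString` came first in the map, a field tagged `{"domain": "name"}` would never reach `ToyName`, because the more general class also accepts plain `str`.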

Todo

Confirm all alternatives defined in synth_class_map.

A None in synth_class_map means an unknown synth, possibly buried in a Union.

class synthdata.base.SchemaSynthesizer¶

Synthesizes collections of Models.

A Schema has multiple models to handle SQL Foreign Key references to another model’s Primary Key.

The PK is a synthdata.base.Synthesizer with synthdata.base.Pooled behavior.
The FK is a synthdata.base.SynthesizeReference that extracts values from the PK’s pool.

First, use add() to add a BaseModel to the schema.

Then – after all models have been added – use rows() to get rows for a BaseModel.

The noise value is the probability of noise – invalid values or None values if None is not permitted.

Important

All models with PK’s must be defined before creating rows for any model with FK’s.

It’s best to define all model classes before trying to emit any data.
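The ordering requirement follows from how a reference works: an FK value can only be drawn once the PK pool it points at exists. The stdlib-only sketch below illustrates this; `SketchSchema`, `add_pool`, `reference`, and the "Employee" and "Manager" names are all hypothetical and are not the SchemaSynthesizer API.

```python
import random


class SketchSchema:
    """Hypothetical registry of primary-key pools, keyed by "Model.field"."""

    def __init__(self, rng: random.Random) -> None:
        self.rng = rng
        self.pools: dict[str, list[int]] = {}

    def add_pool(self, target: str, rows: int) -> None:
        # Build `rows` distinct key values for the PK side.
        unique: set[int] = set()
        while len(unique) < rows:
            unique.add(self.rng.randint(1, 1_000_000))
        self.pools[target] = sorted(unique)

    def reference(self, target: str) -> int:
        # Raises KeyError when the referenced pool has not been defined yet.
        return self.rng.choice(self.pools[target])


schema = SketchSchema(random.Random(5))
schema.add_pool("Employee.id", rows=5)
assignments = [{"employee_id": schema.reference("Employee.id")} for _ in range(20)]

# Referencing a pool that was never added fails, mirroring the KeyError above.
try:
    schema.reference("Manager.id")
    missing_detected = False
except KeyError:
    missing_detected = True
```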

__init__() → None¶

add(model_class: type[BaseModel], rows: int | None = None) → None¶

Adds a BaseModel to the working schema.

Raises: KeyError – if the FK reference ("Model.field") cannot be parsed.

prepare() → None¶: For Pooled synthesizers, this will populate the pools. Once. After that, it does nothing.

reset() → None¶: Reset the synthesizers to repopulate pools.

rows(model_class: type[BaseModel]) → Iterator[BaseModel]¶

Returns the iterator for the synthdata.synth.SynthesizeModel. If any FK reference cannot be resolved, this raises an exception.

Raises:

KeyError – if the FK reference ("Model.field") cannot be found.
ValueError – if the FK reference is not a Pooled synthesizer.

data(model_class: type[BaseModel], noise: float = 0.0) → Iterator[dict[str, Any]]¶: Returns an iterator for potential synthdata.synth.SynthesizeModel instances. Noise is injected and the values may not be valid.

`synths` Module¶

Synthesizer Definitions.

class synthdata.synths.SynthesizeNone(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶

Bases: Synthesizer

Returns None. Handles Optional[type] and type | None annotated types.

initialize()¶

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

value_gen(sequence: int | None = None) → Any¶: Low-level value synthesis.

class synthdata.synths.SynthesizeString(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶

Bases: Synthesizer

Synthesizes printable strings with no whitespace or \.

Uses Field attributes

min_length (default 1)
max_length (default 32)

initialize() → None¶

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

value_gen(sequence: int | None = None) → Any¶: Low-level value synthesis.

classmethod match(field_type: type, json_schema_extra: dict[str, Any]) → bool¶: Generic Annotated[str, ...].
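The kind of value this class describes can be sketched as follows: printable ASCII without whitespace or the backslash, with a length between the documented defaults. `synth_string` and `ALPHABET` are hypothetical names, and choosing the length uniformly is an assumption.

```python
import random
import string

# Printable characters, minus all whitespace and the backslash.
ALPHABET = [c for c in string.printable if not c.isspace() and c != "\\"]


def synth_string(rng: random.Random, min_length: int = 1, max_length: int = 32) -> str:
    """Build a random printable string within the length bounds."""
    length = rng.randint(min_length, max_length)
    return "".join(rng.choice(ALPHABET) for _ in range(length))


rng = random.Random(1)
samples = [synth_string(rng) for _ in range(200)]
```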

class synthdata.synths.SynthesizeName(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶

Bases: SynthesizeString

Synthesizes name-like strings. This title-cases a random string.

Todo

Improve the name generator with better pattern (and anti-pattern).

Options

get first names from census data; get digraph frequency from last names.
Use NLTK digraph frequencies to generate plausible English-like words.

Default min_length is 3 to avoid 1-char names.

value_gen(sequence: int | None = None) → Any¶: Low-level value synthesis. Creates length and a string of the desired length in Title Case.

classmethod match(field_type: type, json_schema_extra: dict[str, Any]) → bool¶: Requires Annotated[str, ...] and json_schema_extra with {"domain": "name"}
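A title-cased random string of the kind described here might be produced as below. This is only a sketch: `synth_name` is a hypothetical helper, and drawing from lowercase ASCII letters is an assumption, since the reference does not state the alphabet.

```python
import random
import string


def synth_name(rng: random.Random, min_length: int = 3, max_length: int = 32) -> str:
    """Pick a length, build a lowercase string, then title-case it."""
    length = rng.randint(min_length, max_length)
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return letters.title()


rng = random.Random(2)
names = [synth_name(rng) for _ in range(50)]
```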

class synthdata.synths.SynthesizeNumber(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶

Bases: Synthesizer

Abstract number synthesizer.

Uses Field attributes

ge (default 0)
le (default 2**32 - 1, that is, $2^{32} - 1$)

Uses json_schema_extra values

"distribution" – can be "normal" or "uniform".

initialize() → None¶

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.
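How a "distribution" setting could interact with the ge and le bounds is sketched below. The reference does not say how a normal distribution is fitted to the range, so the choices here are assumptions: the mean sits at the midpoint, the standard deviation is one sixth of the range, and out-of-range draws are rejected. `synth_number` is a hypothetical helper.

```python
import random


def synth_number(
    rng: random.Random,
    ge: float = 0,
    le: float = 2**32 - 1,
    distribution: str = "uniform",
) -> float:
    """Draw a float in [ge, le] using the named distribution."""
    if distribution == "uniform":
        return rng.uniform(ge, le)
    if distribution == "normal":
        mean = (ge + le) / 2
        sigma = (le - ge) / 6  # assumption: the range spans about six sigma
        while True:
            value = rng.gauss(mean, sigma)
            if ge <= value <= le:  # reject the rare out-of-range draw
                return value
    raise ValueError(f"unknown distribution {distribution!r}")


rng = random.Random(4)
uniform_values = [synth_number(rng, ge=10, le=20) for _ in range(500)]
normal_values = [synth_number(rng, ge=10, le=20, distribution="normal") for _ in range(500)]
```

An integer variant could wrap the result in `round()` or use `rng.randint` for the uniform case.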

class synthdata.synths.SynthesizeInteger(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶

Bases: SynthesizeNumber

Extends synthdata.synths.SynthesizeNumber to offer only integer values.

initialize() → None¶

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

value_gen(sequence: int | None = None) → Any¶: Creates integer in range with given distribution.

classmethod match(field_type: type, json_schema_extra: dict[str, Any]) → bool¶: Requires Annotated[int, ...].

class synthdata.synths.SynthesizeFloat(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶

Bases: SynthesizeNumber

Extends synthdata.synths.SynthesizeNumber to offer only float values.

initialize() → None¶

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

value_gen(sequence: int | None = None) → Any¶: Creates float in range with given distribution.

classmethod match(field_type: type, json_schema_extra: dict[str, Any]) → bool¶: Requires Annotated[float, ...].

class synthdata.synths.SynthesizeDate(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)¶

Bases: SynthesizeNumber

Extends synthdata.synths.SynthesizeNumber to offer only datetime.datetime values.

Default range is 1970-Jan-1 to 2099-Dec-31.

initialize() → None¶

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds value or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

value_gen(sequence: int | None = None) → Any¶: Creates date in range with given distribution.

classmethod match(field_type: type, json_schema_extra: dict[str, Any]) → bool¶: Requires Annotated[datetime.datetime, ...]
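Treating dates as numbers, as the base class suggests, can be sketched by drawing an offset in seconds from the start of the range. `synth_datetime` is a hypothetical helper, the uniform draw and one-second resolution are assumptions, and the bounds are the documented defaults.

```python
import datetime
import random

START = datetime.datetime(1970, 1, 1)
END = datetime.datetime(2099, 12, 31)


def synth_datetime(
    rng: random.Random,
    start: datetime.datetime = START,
    end: datetime.datetime = END,
) -> datetime.datetime:
    """Pick a whole number of seconds past `start`, no later than `end`."""
    span_seconds = int((end - start).total_seconds())
    return start + datetime.timedelta(seconds=rng.randint(0, span_seconds))


rng = random.Random(6)
stamps = [synth_datetime(rng) for _ in range(100)]
```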
