Synthetic Data Tool Components

The Synthetic Data Tool package contains a number of field-level synthesizers, a model synthesizer, and a schema synthesizer.

top to bottom direction

package "synths" {
    class SynthesizeString
    class SynthesizeName
    SynthesizeString <|-- SynthesizeName
    class SynthesizeNumber
    class SynthesizeInteger
    SynthesizeNumber <|-- SynthesizeInteger
    class SynthesizeReference
}

package "base" {
    class Synthesizer
    class Behavior
    class Pooled
    class Independent
    class SynthesizeModel {
        fields: dict[str, Synthesizer]
    }
    class SynthesizeSchema {
        add(Model)
        rows(Model)
    }

    SynthesizeSchema o-- SynthesizeModel
    SynthesizeModel o-- Synthesizer

    Synthesizer - Behavior
    Behavior <|-- Pooled
    Behavior <|-- Independent
}

synths.SynthesizeString -up-|> base.Synthesizer
synths.SynthesizeNumber -up-|> base.Synthesizer
synths.SynthesizeReference -up-|> base.Synthesizer

base Module

Contains base class definitions for all synthesizers. This includes the field-level synthesizers, plus model-level and schema-level synthesizers.

A field-level synthesizer has two aspects.

  • The essential synth algorithm for a given data type. These algorithms generally produce randomly generated sequences of values.

  • The behavior of the overall synth process. This falls into two variant strategies:

    • Independent. This means duplicates are possible, and the values are not referenced by other synthesizers.

    • Dependent (“Pooled”). Duplicates are prevented, and the pool of values can be referenced by a synthdata.synths.SynthesizeReference instance.

In order to handle foreign key references, a synthdata.synths.SynthesizeReference extracts values from a pooled synthesizer.

Important

This relies on Pydantic’s rich set of type annotations.

Behavior Strategy

The synthdata.base.Behavior class hierarchy defines a Strategy that tailors the behavior of a synthdata.base.Synthesizer.

class synthdata.base.Behavior(synth: Synthesizer)

General features of the alternative Strategy plug-ins for a synthdata.base.Synthesizer.

abstract next() Any

Returns a next value for a Synthesizer.

choice() Any

Returns a value chosen from a pool, only overridden by the Pooled subclass.

class synthdata.base.Independent(synth: Synthesizer)

Defines a synthdata.base.Synthesizer that cannot be the target of a synthdata.synths.SynthesizeReference. Each value is generated randomly, with no assurance of uniqueness.

This is – in effect – the default behavior of a synthdata.base.Synthesizer. The methods are delegated back to the synthdata.base.Synthesizer instance.

__init__(synth: Synthesizer) None

Constructs a Behavior strategy bound to a specific synthdata.base.Synthesizer.

next() Any

Returns a value from synthdata.base.Synthesizer.value_gen().

class synthdata.base.Pooled(synth: Synthesizer)

Defines a synthdata.base.Synthesizer that creates a pool of values. This can be the target of a synthdata.synths.SynthesizeReference instance. This requires an associated synthdata.base.ModelSynthesizer to provide the target number of rows in the pool.

Values in the pool will be unique.

__init__(synth: Synthesizer) None

Constructs a Behavior strategy bound to a specific synthdata.base.Synthesizer.

fill()

Populate the self.pool collection of unique values. This uses synthdata.base.Synthesizer.value_gen() to build instances. This is used by the model’s _prepare() method.

next() Any

When next(synth_instance) is called, synthdata.base.Synthesizer.__next__() uses this method to pick a value at random from the pool.

choice() Any

Returns a value chosen from a pool.
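The relationship between a synthesizer and its Behavior can be sketched as follows. This is a minimal illustration of the Strategy idea described above, not the synthdata source: SketchSynth, SketchBehavior, SketchIndependent, SketchPooled, the rows attribute, and the loop inside fill() are all assumptions made for the example.

```python
import random
from typing import Any


class SketchBehavior:
    """Hypothetical Strategy base; bound to one synthesizer."""

    def __init__(self, synth: "SketchSynth") -> None:
        self.synth = synth

    def next(self) -> Any:
        raise NotImplementedError

    def choice(self) -> Any:
        raise NotImplementedError("only a pooled behavior has a pool")


class SketchIndependent(SketchBehavior):
    def next(self) -> Any:
        # Delegate straight back to the synthesizer; duplicates are possible.
        return self.synth.value_gen()


class SketchPooled(SketchBehavior):
    def __init__(self, synth: "SketchSynth") -> None:
        super().__init__(synth)
        self.pool: list[Any] = []

    def fill(self) -> None:
        # Generate until the pool holds the requested number of unique values.
        self.pool = []
        seen: set[Any] = set()
        while len(seen) < self.synth.rows:
            value = self.synth.value_gen()
            if value not in seen:
                seen.add(value)
                self.pool.append(value)

    def next(self) -> Any:
        return self.choice()

    def choice(self) -> Any:
        return self.synth.rng.choice(self.pool)


class SketchSynth:
    """Hypothetical stand-in for a synthesizer; not the synthdata class."""

    def __init__(self, behavior: type[SketchBehavior], rows: int = 5, seed: int = 42) -> None:
        self.rng = random.Random(seed)
        self.rows = rows  # assumption: the owning model supplies the pool size
        self.behavior = behavior(self)  # the behavior is passed as a class

    def value_gen(self) -> int:
        return self.rng.randint(1, 1000)

    def next(self) -> Any:
        return self.behavior.next()

    def choice(self) -> Any:
        return self.behavior.choice()
```

Passing the behavior as a class, and instantiating it inside the synthesizer, mirrors the constructor signature documented for synthdata.base.Synthesizer.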

Synthesizer Class

This is the superclass for the various definitions in the synthdata.synths module.

class synthdata.base.Synthesizer(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)

Abstract Base Class for all synthesizers.

__init__(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>) None

Builds a Synthesizer that’s part of a synthdata.base.ModelSynthesizer. This is associated with a specific field of the BaseModel.

The synthdata.base.Behavior is provided as a class name. The instance is created here after the field information is parsed by the prepare() method. This permits a synthdata.base.Pooled synthesizer to immediately create the pool of unique values.

next() Any

Get the next value from the synthdata.base.Behavior Strategy.

choice() Any

Get an arbitrary value from a synthdata.base.Pooled Strategy.

abstract initialize() None

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

abstract value_gen(sequence: int | None = None) Any

Low-level value synthesis.

noise_gen(sequence: int | None = None) Any

Low-level noise synthesis. Pick one of the noise_synth functions.
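A rough sketch of the two noise cases may help. The function name sketch_noise, its ge and le parameters, and the offsets of 1 to 100 are assumptions for illustration only; the actual noise_synth functions are not shown in this document.

```python
import random
import string
from typing import Optional, Union


def sketch_noise(
    rng: random.Random, ge: Optional[int] = None, le: Optional[int] = None
) -> Union[int, str]:
    """Hypothetical helper: return a deliberately invalid value."""
    if ge is not None and le is not None:
        # Right type, wrong range: step outside one of the two bounds.
        below = ge - rng.randint(1, 100)
        above = le + rng.randint(1, 100)
        return rng.choice([below, above])
    # No bounds known: a short run of punctuation is unlikely to validate.
    size = rng.randint(1, 5)
    return "".join(rng.choice(string.punctuation) for _ in range(size))
```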

get_meta(cls_: type[T], getter: Callable[[T], Any]) Any

Extract metadata from the Pydantic FieldInfo instance.

Todo

Reduce reliance on the Pydantic FieldInfo and annotation classes.

Model and Schema Synthesizers

The synthdata.base.BaseModelSynthesizer supports the pydantic BaseModel. The synthdata.base.SchemaSynthesizer permits references among Synthesizers. This permits FK references to PK pools.

synthdata.base.synth_name_map() dict[str, type[Synthesizer]]

Finds all subclasses of synthdata.base.Synthesizer.

This is used once during model and schema creation to create a mapping from a name to a class.

Yields name and class from most specific to most general.
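One way to collect subclasses in most-specific-first order is to walk __subclasses__() and sort by depth. This is only a sketch: sketch_name_map and the four placeholder classes are hypothetical, and the real synth_name_map() may order its results by a different method.

```python
def sketch_name_map(base: type) -> dict[str, type]:
    """Map class names to classes, deepest subclasses first."""
    found: list[tuple[int, type]] = []

    def walk(cls: type, depth: int) -> None:
        for sub in cls.__subclasses__():
            found.append((depth, sub))
            walk(sub, depth + 1)

    walk(base, 1)
    # Stable sort by descending depth puts the most specific classes first.
    found.sort(key=lambda pair: -pair[0])
    return {cls.__name__: cls for _, cls in found}


# Placeholder hierarchy used only to exercise the function.
class SketchBase: ...
class SketchText(SketchBase): ...
class SketchPersonName(SketchText): ...
class SketchNumeric(SketchBase): ...
```

Because dictionaries preserve insertion order, iterating the result visits SketchPersonName before its superclass SketchText, so a first-match search tries the narrower class first.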

class synthdata.base.DataIter(model: ModelSynthesizer, noise: float = 0.0)

Iterate through values of a synthdata.base.ModelSynthesizer instance. This will create dictionaries suitable for creating Pydantic BaseModel instances.

A trash injector can replace good data with invalid values – bad numbers, bad strings, None, etc. The resulting object will not be a valid Pydantic BaseModel instance.

Todo

Handle recursive structures here.

__init__(model: ModelSynthesizer, noise: float = 0.0) None

Creates an iterator, bound to a synthdata.base.ModelSynthesizer instance.

__next__() dict[str, Any]

Creates the next dict[str, Any] object. Each field is created by the Synthesizer’s attached Behavior.
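The noise parameter can be read as a per-field probability of replacing a good value with an invalid one. The sketch below assumes that reading; SketchDataIter and its (valid generator, noise generator) pairs are hypothetical and stand in for the real synthesizers.

```python
import random
from collections.abc import Callable
from typing import Any

# Each field maps to a (valid value generator, noise generator) pair.
FieldGens = dict[str, tuple[Callable[[], Any], Callable[[], Any]]]


class SketchDataIter:
    """Hypothetical row iterator that yields plain dictionaries."""

    def __init__(self, fields: FieldGens, noise: float = 0.0, seed: int = 0) -> None:
        self.fields = fields
        self.noise = noise
        self.rng = random.Random(seed)

    def __iter__(self) -> "SketchDataIter":
        return self

    def __next__(self) -> dict[str, Any]:
        row: dict[str, Any] = {}
        for name, (good, bad) in self.fields.items():
            # With probability `noise`, swap in an invalid value for this field.
            row[name] = bad() if self.rng.random() < self.noise else good()
        return row
```

With noise=0.0 every row is clean; with noise=1.0 every field is replaced, because random() never returns 1.0.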

class synthdata.base.ModelIter(model: ModelSynthesizer, noise: float = 0.0)

Iterate through values of a synthdata.base.ModelSynthesizer instance. This can only create the Pydantic BaseModel instances with valid data.

__init__(model: ModelSynthesizer, noise: float = 0.0) None

Creates an iterator, bound to a synthdata.base.ModelSynthesizer instance.

class synthdata.base.ModelSynthesizer(cls_: type, rows: int | None = None)

Abstract Base Class for various model synthesizers.

abstract __init__(cls_: type, rows: int | None = None) None

Initialize a model synthesizer.

Parameters:
  • cls_ – a class definition with fields of some kind.

  • rows – the number of rows to generate when creating key pools.

  • noise – the probability of invalid data in the resulting object. Note that non-zero noise values may lead to Pydantic validation errors.

class synthdata.base.BaseModelSynthesizer(cls_: type[BaseModel], rows: int | None = None)

Synthesizes instances of Pydantic BaseModel.

This expects a flat collection of atomic fields. In that way, it’s biased to work with SQL-oriented schema, which tend to be flat.

Todo

Handle recursive structures here.

__init__(cls_: type[BaseModel], rows: int | None = None) None

Initializes a synthesizer for all fields of a given BaseModel. The number of rows is only needed if there are any synthdata.base.Pooled synthesizers, generally because of SQL primary keys.

Construction of the synthesizers is part of initialization. The make_field_synth() method is exposed so subclasses can extend the processing.

The general use case is to use the fields attribute, which is a mapping of field names to synthdata.base.Synthesizer instances.

Todo

Handle nested models.

sql_rule(field: FieldInfo) tuple[type[Behavior] | None, dict[str, type[Synthesizer]] | None]

Rule 0 – PK’s are pooled, FK’s are references to PK pools. This also means "sql": {'key': "foreign", ... will get a reference synthesizer.

Todo

FK’s may have optionality rules.

The SynthesizeReference instance may need a subdomain distribution.

Todo

What kind of error for invalid values?

For now, we simply create an Independent behavior. Perhaps a ValueError is better? Or a warning?

explicit_rule(field: FieldInfo) tuple[type[Behavior] | None, dict[str, type[Synthesizer]] | None]

Rule 1 – json_schema_extra names a synthesizer class to use. Generally {"synthesizer": "SomeSynth"}.

match_rule(field: FieldInfo) tuple[type[Behavior] | None, dict[str, type[Synthesizer] | None]]

Rule 2 – deduce the synthesizer from the annotation class and json_schema_extra. This requires that the map keys name all subclasses first and all superclasses last.

In most cases there’s a single {"type": type[Synthesizer]}.

Optionality returns {"type": type[Synthesizer], "NoneType": type[Synthesizer]}. Further details may be required to get the frequency of null values correct.

match field.annotation:
    case _UnionGenericAlias() | UnionType():
        # If the union has two parts, one of which is None, then it’s simply optional.
        # Without further details, it’s not clear what fraction should be None. Assume 5%.
        # In general, a distribution table is required in json_schema_extra.
        # {"subDomain": {"None": 1, "int": 99}} for simple optional.
        # {"subDomain": {"int": 3, "float": 1}} for multiple values.

make_field_synth(field: FieldInfo) Synthesizer

Given a specific field, apply the matching rules to locate a synthesizer.

  • Rule 0. PK’s lead to a synthdata.base.Pooled Strategy. FK’s will be references to PK pools. Otherwise, synthesizers are synthdata.base.Independent.

  • Rule 1. json_schema_extra can name a synthesizer to use. The key is "synthesizer". Example: {"synthesizer": "SomeSynth"}

  • Rule 2. Deduce synthesizer from annotation class and json_schema_extra properties. This is done by evaluating synthdata.base.Synthesizer.match() for all subclasses of synthdata.base.Synthesizer from most specific subclass to most general superclass. First match is assigned.

    Todo

    Confirm all alternatives defined in synth_class_map.

    A None in synth_class_map means an unknown synth, possibly buried in a Union.
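The rule ordering above amounts to a first-match chain. The sketch below is a simplified model of it, under stated assumptions: rules take an annotation and a json_schema_extra dictionary and return a synthesizer name or None, whereas the documented methods take a FieldInfo and return a tuple. All function and class names here are hypothetical, and Rule 0 covers only the foreign-key half.

```python
from typing import Any, Optional


def sketch_sql_rule(annotation: type, extra: dict[str, Any]) -> Optional[str]:
    # Rule 0: a foreign key gets a reference synthesizer.
    if extra.get("sql", {}).get("key") == "foreign":
        return "SketchReference"
    return None


def sketch_explicit_rule(annotation: type, extra: dict[str, Any]) -> Optional[str]:
    # Rule 1: json_schema_extra names the synthesizer directly.
    return extra.get("synthesizer")


# Rule 2 candidates, ordered from most specific to most general.
CANDIDATES = [
    ("SketchName", lambda t, e: t is str and e.get("domain") == "name"),
    ("SketchString", lambda t, e: t is str),
    ("SketchInteger", lambda t, e: t is int),
]


def sketch_match_rule(annotation: type, extra: dict[str, Any]) -> Optional[str]:
    for name, matches in CANDIDATES:
        if matches(annotation, extra):
            return name
    return None


def sketch_make_field_synth(annotation: type, extra: dict[str, Any]) -> str:
    # Apply the rules in order; the first one that answers wins.
    for rule in (sketch_sql_rule, sketch_explicit_rule, sketch_match_rule):
        found = rule(annotation, extra)
        if found is not None:
            return found
    raise ValueError(f"no synthesizer for {annotation!r}")
```

Listing SketchName before SketchString matters: both accept str, so the general class would shadow the specific one if the order were reversed.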

class synthdata.base.SchemaSynthesizer

Synthesizes collections of Models.

A Schema has multiple models to handle SQL Foreign Key references to another model’s Primary Key.

First, use add() to add a BaseModel to the schema.

Then – after all models have been added – use rows() to get rows for a BaseModel.

The noise value is the probability of noise – invalid values or None values if None is not permitted.

Important

All models with PK’s must be defined before creating rows for any model with FK’s.

It’s best to define all model classes before trying to emit any data.
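The ordering constraint follows from how FK values are drawn: they can only come from a PK pool that already exists. A toy model of that dependency, with hypothetical names (SketchSchema, add_pk, fk_rows) that are not the SchemaSynthesizer API, might look like this:

```python
import random


class SketchSchema:
    """Toy schema: PK pools keyed by a "Model.field" string."""

    def __init__(self, seed: int = 7) -> None:
        self.rng = random.Random(seed)
        self.pools: dict[str, list[int]] = {}

    def add_pk(self, target: str, rows: int) -> None:
        # sample() draws without replacement, so key values are unique.
        self.pools[target] = self.rng.sample(range(1, 10_000), rows)

    def fk_rows(self, target: str, rows: int) -> list[int]:
        # Raises KeyError when the referenced PK pool was never defined.
        pool = self.pools[target]
        return [self.rng.choice(pool) for _ in range(rows)]
```

Asking for FK values before the matching add_pk() call fails with KeyError, which is the same failure mode the Raises sections in this class describe for unresolved references.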

__init__() None
add(model_class: type[BaseModel], rows: int | None = None) None

Adds a BaseModel to the working schema.

Raises:

KeyError – if the FK reference ("Model.field") cannot be parsed.

prepare() None

For Pooled synthesizers, this will populate the pools. Once. After that, it does nothing.

reset() None

Reset the synthesizers to repopulate pools.

rows(model_class: type[BaseModel]) Iterator[BaseModel]

Returns the iterator for the synthdata.synth.SynthesizeModel. Raises an exception if any FK reference cannot be resolved.

Raises:
  • KeyError – if the FK reference ("Model.field") cannot be found.

  • ValueError – if the FK reference is not a Pooled synthesizer.

data(model_class: type[BaseModel], noise: float = 0.0) Iterator[dict[str, Any]]

Returns an iterator for potential synthdata.synth.SynthesizeModel instances. Noise is injected and the values may not be valid.

synths Module

Synthesizer Definitions.

class synthdata.synths.SynthesizeNone(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)

Bases: Synthesizer

Returns None. Handles Optional[type] and type | None annotated types.

initialize()

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

value_gen(sequence: int | None = None) Any

Low-level value synthesis.

class synthdata.synths.SynthesizeString(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)

Bases: Synthesizer

Synthesizes printable strings with no whitespace or \.

Uses Field attributes

  • min_length (default 1)

  • max_length (default 32)

initialize() None

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

value_gen(sequence: int | None = None) Any

Low-level value synthesis.
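As an illustration of the string rules above, the following sketch draws a length between min_length and max_length and fills it from printable ASCII with whitespace and the backslash removed. The function name and the exact alphabet are assumptions; the real value_gen() may use a different character set.

```python
import random
import string

# Printable ASCII minus whitespace and the backslash.
ALPHABET = [c for c in string.printable if not c.isspace() and c != "\\"]


def sketch_string(rng: random.Random, min_length: int = 1, max_length: int = 32) -> str:
    """Hypothetical helper: a random string within the length bounds."""
    size = rng.randint(min_length, max_length)
    return "".join(rng.choice(ALPHABET) for _ in range(size))
```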

classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool

Generic Annotated[str, ...].

class synthdata.synths.SynthesizeName(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)

Bases: SynthesizeString

Synthesizes name-like strings. This title-cases a random string.

Todo

Improve the name generator with better pattern (and anti-pattern).

Options

  1. Get first names from census data; get digraph frequency from last names.

  2. Use NLTK digraph frequencies to generate plausible English-like words.

Default min_length is 3 to avoid 1-char names.

value_gen(sequence: int | None = None) Any

Low-level value synthesis. Creates length and a string of the desired length in Title Case.
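A minimal sketch of this step, assuming the random string is drawn from lowercase ASCII letters before title-casing (the source does not state the alphabet). sketch_name is a hypothetical function name.

```python
import random
import string


def sketch_name(rng: random.Random, min_length: int = 3, max_length: int = 32) -> str:
    """Hypothetical helper: pick a length, then title-case random letters."""
    size = rng.randint(min_length, max_length)
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(size))
    return letters.title()
```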

classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool

Requires Annotated[str, ...] and json_schema_extra with {"domain": "name"}

class synthdata.synths.SynthesizeNumber(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)

Bases: Synthesizer

Abstract number synthesizer.

Uses Field attributes

  • ge (default 0)

  • le (default 2**32 - 1, that is, $2^{32} - 1$).

Uses json_schema_extra values

  • "distribution" – can be "normal" or "uniform".

initialize() None

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

class synthdata.synths.SynthesizeInteger(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)

Bases: SynthesizeNumber

Extends synthdata.synths.SynthesizeNumber to offer only integer values.

initialize() None

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

value_gen(sequence: int | None = None) Any

Creates integer in range with given distribution.
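The two distributions might be realized along these lines. This is a sketch under assumptions: for "normal", the mean is placed at the midpoint of the range, the bounds are taken to sit three standard deviations out, and stray draws are clamped. None of those choices is confirmed by the source, and sketch_integer is a hypothetical name.

```python
import random


def sketch_integer(
    rng: random.Random, ge: int = 0, le: int = 2**32 - 1, distribution: str = "uniform"
) -> int:
    """Hypothetical helper: an integer in [ge, le] with the named distribution."""
    if distribution == "uniform":
        return rng.randint(ge, le)
    if distribution == "normal":
        mean = (ge + le) / 2
        sigma = (le - ge) / 6  # assumption: bounds are three sigma from the mean
        # Clamp the rare draw that lands outside the bounds.
        return min(le, max(ge, round(rng.gauss(mean, sigma))))
    raise ValueError(f"unknown distribution: {distribution!r}")
```

Clamping slightly inflates the frequency of the two boundary values; redrawing until the value is in range is an alternative that avoids this at the cost of a loop.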

classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool

Requires Annotated[int, ...].

class synthdata.synths.SynthesizeFloat(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)

Bases: SynthesizeNumber

Extends synthdata.synths.SynthesizeNumber to offer only float values.

initialize() None

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

value_gen(sequence: int | None = None) Any

Creates float in range with given distribution.

classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool

Requires Annotated[float, ...].

class synthdata.synths.SynthesizeDate(model: ~synthdata.base.ModelSynthesizer, field: ~pydantic.fields.FieldInfo, behavior: type[~synthdata.base.Behavior] = <class 'synthdata.base.Independent'>)

Bases: SynthesizeNumber

Extends synthdata.synths.SynthesizeNumber to offer only datetime.datetime values.

Default range is 1970-Jan-1 to 2099-Dec-31.

initialize() None

Parse the field information and prepare the synthesizer. This is the last step of __init__().

This also prepares the noise generator. If bounds or patterns are provided, the noise is an out-of-bounds or anti-pattern value of the right type. Otherwise, the noise is a short, random string of punctuation characters, unlikely to be valid.

value_gen(sequence: int | None = None) Any

Creates date in range with given distribution.
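Since this class builds on the number synthesizer, one plausible approach is to pick a numeric offset within the range and add it to the lower bound. The sketch below assumes whole-second resolution, naive datetimes, and a uniform draw; sketch_datetime and the START and END constants are hypothetical names.

```python
import datetime
import random

# Default range stated above: 1970-Jan-1 to 2099-Dec-31.
START = datetime.datetime(1970, 1, 1)
END = datetime.datetime(2099, 12, 31)


def sketch_datetime(
    rng: random.Random,
    ge: datetime.datetime = START,
    le: datetime.datetime = END,
) -> datetime.datetime:
    """Hypothetical helper: a uniformly chosen datetime in [ge, le]."""
    span = int((le - ge).total_seconds())
    return ge + datetime.timedelta(seconds=rng.randint(0, span))
```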

classmethod match(field_type: type, json_schema_extra: dict[str, Any]) bool

Requires Annotated[datetime.datetime, ...]