Future Directions

Strategic

See Model Definition for a list of features.

  • Additional string formats and patterns.

  • Additional numeric distributions.

  • Additional date, datetime, and time features.

  • Enumerated value with a distribution histogram.

  • Optional values are a more subtle aspect of a domain definition.

    • A domain-independent null is the SQL NULL or Python None value. This can be done with a JSONSchema oneOf and a json_schema_extra entry that provides the probability of a null.

      More generally, it requires a oneOf with probabilities for each alternative. This leads to an Annotated[Union[int, None, ...], etc.] with probabilities for each of the alternatives.

    • A domain-specific null is a coded value, like social security number 999-99-9999, that indicates some sort of missing or not-applicable value. This is also a complicated Union. This leads to a Union[Annotated[int, ...], Annotated[Literal[n], ...], etc.] with probabilities for each choice.
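The two kinds of null described above could be sketched with the standard typing module alone. This is only an illustration under stated assumptions: the Weight marker, the alternatives() and synthesize() helpers, and the specific probabilities are all hypothetical and are not part of synthdata or Pydantic.

```python
import random
from dataclasses import dataclass
from typing import Annotated, Literal, Union, get_args, get_origin


@dataclass(frozen=True)
class Weight:
    """Hypothetical marker: the probability of one Union alternative."""
    p: float


# Domain-independent null: an int most of the time, None otherwise.
OptionalCount = Union[
    Annotated[int, Weight(0.9)],
    Annotated[None, Weight(0.1)],
]

# Domain-specific null: a real value or a coded "missing" value.
CodedSSN = Union[
    Annotated[int, Weight(0.95)],
    Annotated[Literal[999999999], Weight(0.05)],
]


def alternatives(union_type):
    """Return (base_type, probability) pairs, one per Annotated alternative."""
    pairs = []
    for alt in get_args(union_type):
        base, *metadata = get_args(alt)
        weight = next(m.p for m in metadata if isinstance(m, Weight))
        pairs.append((base, weight))
    return pairs


def synthesize(union_type, rng):
    """Pick one alternative by probability, then produce a value for it."""
    pairs = alternatives(union_type)
    bases = [base for base, _ in pairs]
    weights = [weight for _, weight in pairs]
    base = rng.choices(bases, weights=weights, k=1)[0]
    if base is None or base is type(None):
        return None  # the domain-independent null
    if get_origin(base) is Literal:
        return get_args(base)[0]  # the coded, domain-specific null
    # Placeholder for a real int synthesizer; the range is arbitrary.
    return rng.randint(100_000_000, 899_999_999)
```

Calling synthesize(CodedSSN, random.Random(42)) repeatedly yields mostly ordinary integers with an occasional 999999999; the same function applied to OptionalCount yields an occasional None instead.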

Tactical

Todo

Reduce reliance on the Pydantic FieldInfo and annotation classes.

(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base, line 48.)

Todo

Handle recursive structures here.

(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base.DataIter, line 7.)

Todo

Handle recursive structures here.

(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base.BaseModelSynthesizer, line 6.)

Todo

Handle nested models.

(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base.BaseModelSynthesizer.__init__, line 11.)

Todo

Foreign keys (FKs) may have optionality rules.

The SynthesizeReference instance may need a subdomain distribution.

(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base.BaseModelSynthesizer.sql_rule, line 4.)

Todo

What kind of error for invalid values?

For now, we simply create an Independent behavior. Perhaps a ValueError is better? Or a warning?

(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base.BaseModelSynthesizer.sql_rule, line 8.)
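The choice between the two options in the entry above might look like the following sketch. The function name choose_behavior, the set of rule names, and the strict flag are all hypothetical; they are not taken from sql_rule and only illustrate the trade-off between falling back with a warning and raising a ValueError.

```python
import warnings

# Hypothetical rule names, for illustration only.
KNOWN_RULES = {"independent", "pk", "fk"}


def choose_behavior(rule, strict=False):
    """Return the rule to apply, handling an invalid value one of two ways."""
    if rule in KNOWN_RULES:
        return rule
    if strict:
        # Option 1: refuse to continue with an invalid rule.
        raise ValueError(f"unknown rule {rule!r}")
    # Option 2: keep the current fallback, but tell the user about it.
    warnings.warn(f"unknown rule {rule!r}; using 'independent'", stacklevel=2)
    return "independent"
```

A warning keeps existing models working while making the fallback visible; a ValueError stops a misconfigured model before any data is generated.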

Todo

Confirm all alternatives defined in synth_class_map.

A None in synth_class_map means an unknown synth, possibly buried in a Union.

(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base.BaseModelSynthesizer.make_field_synth, line 16.)

Todo

Improve the name generator with better patterns (and anti-patterns).

Options

  1. Get first names from census data; get digraph frequency from last names.

  2. Use NLTK digraph frequencies to generate plausible English-like words.

(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/synths.py:docstring of synthdata.synths.SynthesizeName, line 3.)
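The digraph idea in the options above can be sketched with the standard library. This is a minimal illustration, not the SynthesizeName implementation: SAMPLE_NAMES is a hypothetical stand-in for census-derived last names, and digraph_table() and make_name() are names invented for this sketch.

```python
import random
from collections import Counter, defaultdict

# Hypothetical stand-in for a census-derived list of last names.
SAMPLE_NAMES = [
    "smith", "johnson", "williams", "brown", "jones",
    "miller", "davis", "wilson", "anderson", "taylor",
]

START, END = "^", "$"


def digraph_table(names):
    """Count how often each character follows another, with start/end markers."""
    table = defaultdict(Counter)
    for name in names:
        chars = [START, *name.lower(), END]
        for first, second in zip(chars, chars[1:]):
            table[first][second] += 1
    return table


def make_name(table, rng, max_len=12):
    """Walk the digraph table from START until END or max_len letters."""
    letters = []
    current = START
    while len(letters) < max_len:
        followers = table[current]
        current = rng.choices(
            list(followers), weights=list(followers.values()), k=1
        )[0]
        if current == END:
            break
        letters.append(current)
    return "".join(letters).capitalize()
```

With a larger source list the generated names become more plausible; an anti-pattern filter (for example, rejecting names that match real entries) could be applied to the output of make_name().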