Future Directions
Strategic
See Model Definition for a list of features.
Additional string formats and patterns.
Additional numeric distributions.
Additional date, datetime, and time features.
Enumerated value with a distribution histogram.
Optional values are a more subtle aspect of a domain definition.
A domain-independent null is the SQL NULL or Python None value. This can be done with a JSONSchema oneOf and a json_schema_extra to provide the probability of a null. More generally, it requires a oneOf with probabilities for each alternative. This leads to an Annotated[Union[int, None, ...], etc.] with probabilities for the two alternatives.
A domain-specific null is a coded value, like social security number 999-99-9999, that indicates some sort of missing or not-applicable value. This is also a complicated Union. This leads to a Union[Annotated[int, ...], Annotated[Literal[n], ...], etc.] with probabilities for each choice.
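The two kinds of null described above could be expressed as a Union whose alternatives each carry a probability. The following is a minimal sketch using only the standard library, not the synthdata API: the Weight marker, the OptionalCount and CodedSSN aliases, and both helper functions are hypothetical names introduced for illustration. It also attaches the probability to each alternative rather than to the Union as a whole, which is one possible reading of the idea.

```python
import random
from dataclasses import dataclass
from typing import Annotated, Literal, Union, get_args, get_origin


@dataclass(frozen=True)
class Weight:
    """Hypothetical marker: probability of choosing one Union alternative."""
    p: float


# Domain-independent null: an int 90% of the time, None 10% of the time.
OptionalCount = Union[Annotated[int, Weight(0.9)], Annotated[None, Weight(0.1)]]

# Domain-specific null: the coded value 999-99-9999 means "missing".
CodedSSN = Union[
    Annotated[str, Weight(0.95)],
    Annotated[Literal["999-99-9999"], Weight(0.05)],
]


def weighted_alternatives(union_type):
    """Return [(alternative, probability), ...] for a Union of Annotated types."""
    pairs = []
    for alt in get_args(union_type):
        if get_origin(alt) is not Annotated:
            raise TypeError(f"{alt!r} carries no Weight annotation")
        base, *metadata = get_args(alt)
        weights = [m.p for m in metadata if isinstance(m, Weight)]
        if len(weights) != 1:
            raise TypeError(f"{alt!r} needs exactly one Weight")
        pairs.append((base, weights[0]))
    return pairs


def pick_alternative(union_type, rng):
    """Choose one alternative at random, honoring the declared probabilities."""
    pairs = weighted_alternatives(union_type)
    bases = [base for base, _ in pairs]
    probabilities = [p for _, p in pairs]
    return rng.choices(bases, weights=probabilities, k=1)[0]
```

A synthesizer would first call pick_alternative() with a seeded random.Random instance and then generate a value for whichever alternative was chosen: the literal itself for a coded null, None for a domain-independent null, or an ordinary value otherwise.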
Tactical
Todo
Reduce reliance on the Pydantic FieldInfo and annotation classes.
(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base, line 48.)
Todo
Handle recursive structures here.
(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base.DataIter, line 7.)
Todo
Handle recursive structures here.
(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base.BaseModelSynthesizer, line 6.)
Todo
Handle nested models.
(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base.BaseModelSynthesizer.__init__, line 11.)
Todo
FKs may have optionality rules.
The SynthesizeReference instance may need a subdomain distribution.
(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base.BaseModelSynthesizer.sql_rule, line 4.)
Todo
What kind of error for invalid values?
For now, we simply create an Independent behavior. Perhaps a ValueError is better? Or a warning?
(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base.BaseModelSynthesizer.sql_rule, line 8.)
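The three options in this entry (silent fallback, a warning, or a ValueError) can be compared side by side. This is a sketch only: the Behavior enum, its members, the rule strings, and the resolve_behavior function are hypothetical stand-ins and do not reflect how sql_rule is actually written.

```python
import warnings
from enum import Enum


class Behavior(Enum):
    """Hypothetical stand-in for the behaviors a SQL rule can select."""
    INDEPENDENT = "independent"
    PRIMARY_KEY = "primary"
    FOREIGN_KEY = "foreign"


def resolve_behavior(rule, strict=False):
    """Map a rule string to a Behavior.

    With strict=False an unknown rule falls back to INDEPENDENT and
    emits a warning; with strict=True it raises ValueError instead.
    """
    try:
        return Behavior(rule)
    except ValueError:
        if strict:
            raise ValueError(f"unknown sql rule: {rule!r}") from None
        warnings.warn(
            f"unknown sql rule {rule!r}; using independent behavior",
            stacklevel=2,
        )
        return Behavior.INDEPENDENT
```

A warning keeps the current fallback while making the problem visible. The strict flag lets callers opt in to a hard failure.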
Todo
Confirm all alternatives defined in synth_class_map.
A None in synth_class_map means an unknown synth, possibly buried in a Union.
(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/base.py:docstring of synthdata.base.BaseModelSynthesizer.make_field_synth, line 16.)
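One way to make this confirmation explicit is a small check that reports every entry still mapped to None. This is a minimal sketch. It assumes synth_class_map is a dict from a type name to a synthesizer class or None, which may not match its real shape. The unknown_synths function, FakeIntSynth, and example_map are hypothetical names used for illustration.

```python
from typing import Optional


def unknown_synths(synth_class_map: dict[str, Optional[type]]) -> list[str]:
    """Return the keys whose synthesizer is still unknown (mapped to None)."""
    return sorted(name for name, cls in synth_class_map.items() if cls is None)


class FakeIntSynth:
    """Placeholder class standing in for a real synthesizer."""


# Assumed shape of the map; the real keys and values may differ.
example_map = {"int": FakeIntSynth, "Decimal": None, "str": None}

missing = unknown_synths(example_map)
if missing:
    print(f"no synthesizer defined for: {', '.join(missing)}")
```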
Todo
Improve the name generator with better pattern (and anti-pattern).
Options:
Get first names from census data; get digraph frequency from last names.
Use NLTK digraph frequencies to generate plausible English-like words.
(The original entry is located in /Users/slott/github/local/DataSynthTool/docs/../src/synthdata/synths.py:docstring of synthdata.synths.SynthesizeName, line 3.)
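The digraph-frequency option can be sketched as a first-order Markov chain over letters: count which letter follows which in a list of real names, then walk those counts to build new ones. This is an illustrative sketch, not the SynthesizeName implementation. It uses only the standard library rather than NLTK, and SAMPLE_NAMES is a tiny made-up stand-in for census data. The function names and the START and END markers are also assumptions.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # markers for the beginning and end of a word

# Illustrative sample only; a real source would be far larger.
SAMPLE_NAMES = ["Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia", "Miller", "Davis"]


def digraph_counts(words):
    """Count how often each letter follows another across the given words."""
    counts = defaultdict(Counter)
    for word in words:
        letters = [START, *word.lower(), END]
        for first, second in zip(letters, letters[1:]):
            counts[first][second] += 1
    return counts


def make_name(counts, rng, max_len=12):
    """Generate a name by following digraph frequencies until END or max_len."""
    letters = []
    current = START
    while len(letters) < max_len:
        followers = counts[current]
        choice = rng.choices(list(followers), weights=list(followers.values()), k=1)[0]
        if choice == END:
            break
        letters.append(choice)
        current = choice
    return "".join(letters).capitalize()


counts = digraph_counts(SAMPLE_NAMES)
rng = random.Random(1)
print([make_name(counts, rng) for _ in range(5)])
```

Every adjacent pair of letters in a generated name occurs somewhere in the source names, which is what makes the output look plausible. An anti-pattern filter, such as rejecting names that match a list of unwanted strings, could be applied to the result.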