language.py¶
Implements a specific ConLang for a series of books. This includes a few grammar rules and lexicon rules to create words.
Background¶
This is for a specific Epic Fantasy series. It implements an imperial trade language shared by a handful of kingdoms, all under the control of the “Western Empire.”
The essential language has a relatively simple structure.
Sentences have forms like VP NP NP, VP NP PP, VP NP PP NP PP. The verb is always first. Nouns follow, generally decorated with prepositions.
There are some additional grammar rules for noun phrases to inject determiners to help keep the subjects and objects straight.
NP -> Det Nominal ;
Nominal -> Nominal Noun ;
A transliteration of “Kill the mage” would look like this:
“go-Kill you mage-the”
There’s a verb, “go-Kill”, and two nouns, “you” and “mage-the”.
Verbs:
Present tense is the base form of verbs.
Imperative gets a “go-” prefix.
Present participle (“-ing” in English) gets a “now-” prefix.
Past (“-ed” in English) gets a “did-” prefix.
There’s no person conjugation. The person is a separate word that comes right after the verb.
Generally, the ConLang lacks intransitive verbs. “did-Bark the dog at something.” “did-See myself the man.” “did-Give the dog to a man.” “did-Say the man did-Bark the dog at something.”
Noun determiners (this, that, the, any, all, etc.) are suffixed onto the noun.
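The verb prefixes and determiner suffixes listed above can be sketched at the string level. This is a minimal illustration, not the module's code: the names `TENSE_PREFIX`, `conjugate`, and `decorate` are hypothetical, and capitalizing the stem is an assumption read off examples like “go-Kill” and “did-Bark”.

```python
# Hypothetical helpers; names and tense keys are assumptions for illustration.
TENSE_PREFIX = {
    "present": "",         # base form, no prefix
    "imperative": "go-",
    "participle": "now-",  # English "-ing"
    "past": "did-",        # English "-ed"
}


def conjugate(stem: str, tense: str) -> str:
    """Prefix the capitalized verb stem, as in "go-Kill"."""
    return TENSE_PREFIX[tense] + stem.capitalize()


def decorate(noun: str, determiner: str) -> str:
    """Suffix the determiner onto the noun, as in "mage-the"."""
    return f"{noun}-{determiner}"


# Verb first, then the person as its own word, then the decorated noun.
sentence = " ".join([conjugate("kill", "imperative"), "you", decorate("mage", "the")])
print(sentence)  # go-Kill you mage-the
```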
Implementation¶
ConLang Grammar and Lexicon.
See https://www.nltk.org/book/ch05.html for the Parts-of-Speech Tags
Also, see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html for a more complete set.
See https://www.cis.upenn.edu/~bies/manuals/tagguide.pdf
Also https://universaldependencies.org/u/pos/all.html
The tricky part is – of course – the grammar.
Our goal is to move from a tagged English parse tree to the ConLang tree through a number of transformations.
Input
(S (VP (TV kill) (NP (DET the) (N mage))))
In English, this Transitive Verb example has no subject, only an object.
English permits S -> VP NP as well as the more conventional S -> NP VP structures.
See https://ucrel.lancs.ac.uk/bnc2/bnc2guide.htm where VVI/VVZ is used for imperative.
The target ConLang in this implementation has sentence forms like VP NP NP, VP NP PP, and VP NP PP NP PP. Note the rules NP -> Det Nominal and the recursive Nominal -> Nominal Noun, which permit constructing complicated relationships.
This produces constructs like “go-Kill you the mage”. Present tense is the base form. Imperative is a “go-” prefix. Present participle (“-ing” in English) is a “now-” prefix. Past (“-ed” in English) is a “did-” prefix. There is no person conjugation; the person is a separate word after the verb.
Symbol | Meaning | Example
IV | intransitive verb | barked
TV | transitive verb | saw a man
DatV | dative verb | gave a dog to a man
SV | sentential verb | said that a dog barked
Generally, this example ConLang lacks Intransitive verbs. “did-Bark the dog at something.” “did-See myself the man.” “did-Give the dog to a man.” “did-Say the man did-Bark the dog at something.”
Rules needed¶
- Pivot Declarative to VP first. (S (NP $n) (VP $v) $x) -> (S (VP $v) (NP $n) $x). Imperative (S (VP)) is left alone. Interrogative (S (Aux NP VP)), (S (Wh-NP VP)), and (S (Wh-NP Aux NP VP)) should also be swapped around to verb first.
- Adverbs are split out with a helper. (VP $x (ADVP (NP $y))) -> (VP $x (PP with (NP $y))).
- The Verb Subcategories rules, above.
- Apply Tense Prefix to the base Verb’s stem word. (VP (MD $m) (VP $v)) -> (VP ${v|prefix(m)}) rewrites the verb root with the mode prefix.
- Apply the noun determiner suffix to nouns. (NP (DET $d) (N $n)) -> (NP ${n|suffix(d)}).
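As an illustration of the first rule, here is a minimal sketch of the declarative pivot on plain nested tuples. The tuple representation and the name `pivot_declarative` are assumptions made for this example. The module itself works on `Tag` trees with `TransformRule` patterns, and the interrogative forms are not handled here.

```python
from typing import Tuple, Union

Tree = Union[str, Tuple]  # a word, or (tag, child, child, ...)


def pivot_declarative(tree: Tree) -> Tree:
    """Rewrite (S (NP ...) (VP ...) rest...) as (S (VP ...) (NP ...) rest...)."""
    if isinstance(tree, str):
        return tree
    tag, *children = tree
    if (
        tag == "S"
        and len(children) >= 2
        and isinstance(children[0], tuple) and children[0][0] == "NP"
        and isinstance(children[1], tuple) and children[1][0] == "VP"
    ):
        children[0], children[1] = children[1], children[0]
    # Recurse so sentences embedded under sentential verbs are pivoted too.
    return (tag, *(pivot_declarative(c) for c in children))


declarative = ("S", ("NP", ("DET", "the"), ("N", "dog")), ("VP", ("IV", "barked")))
imperative = ("S", ("VP", ("TV", "kill"), ("NP", ("DET", "the"), ("N", "mage"))))

print(pivot_declarative(declarative))
print(pivot_declarative(imperative) == imperative)  # True: already verb first
```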
Part 1 – Lexicon¶
To create words, we use a simple weighted choice random selection.
- language.weighted_choice(source: List[Tuple[CT, int]]) -> CT
Given [(string, int), …] weighted strings, pick a string.
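One way such a weighted pick could work is sketched below. The name `weighted_choice_sketch` and the explicit `rng` parameter are assumptions for illustration, since the actual function body is not shown here. The standard library's `random.choices(population, weights=...)` offers the same behavior.

```python
import random
from typing import List, Tuple, TypeVar

CT = TypeVar("CT")


def weighted_choice_sketch(source: List[Tuple[CT, int]], rng: random.Random) -> CT:
    """Pick one item with probability proportional to its integer weight."""
    total = sum(weight for _, weight in source)
    point = rng.randrange(total)  # 0 <= point < total
    for item, weight in source:
        if point < weight:
            return item
        point -= weight  # step past this item's share of the range
    raise ValueError("weights must be positive")


rng = random.Random(42)
lengths = [(2, 200), (3, 182), (4, 164)]
picks = [weighted_choice_sketch(lengths, rng) for _ in range(1000)]
print({n: picks.count(n) for n, _ in lengths})
```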
A language.WordMaker builds words from a specific set of digraph frequencies.
- class language.WordMaker(make_seed: Callable[[str], int])[source]¶
Create words from a few rules.
Markov chains based on digraphs.
Initial Letter thrown on kind of randomly.
Seeded RNG from source word.
- digraph_text = 'Digraph\tCount\t \tDigraph\tFrequency\nth\t5532\t \tth\t1.52\nhe\t4657\t \the\t1.28\nin\t3429\t \tin\t0.94\ner\t3420\t \ter\t0.94\nan\t3005\t \tan\t0.82\nre\t2465\t \tre\t0.68\nnd\t2281\t \tnd\t0.63\nat\t2155\t \tat\t0.59\non\t2086\t \ton\t0.57\nnt\t2058\t \tnt\t0.56\nha\t2040\t \tha\t0.56\nes\t2033\t \tes\t0.56\nst\t2009\t \tst\t0.55\nen\t2005\t \ten\t0.55\ned\t1942\t \ted\t0.53\nto\t1904\t \tto\t0.52\nit\t1822\t \tit\t0.50\nou\t1820\t \tou\t0.50\nea\t1720\t \tea\t0.47\nhi\t1690\t \thi\t0.46\nis\t1660\t \tis\t0.46\nor\t1556\t \tor\t0.43\nti\t1231\t \tti\t0.34\nas\t1211\t \tas\t0.33\nte\t985\t \tte\t0.27\net\t704\t \tet\t0.19\nng\t668\t \tng\t0.18\nof\t569\t \tof\t0.16\nal\t341\t \tal\t0.09\nde\t332\t \tde\t0.09\nse\t300\t \tse\t0.08\nle\t298\t \tle\t0.08\nsa\t215\t \tsa\t0.06\nsi\t186\t \tsi\t0.05\nar\t157\t \tar\t0.04\nve\t148\t \tve\t0.04\nra\t137\t \tra\t0.04\nld\t64\t \tld\t0.02\nur\t60\t \tur\t0.02\n'¶
- first_letter_text = 'Letter\tFrequency\nz\t0.034%\ny\t1.620%\nx\t0.017%\nw\t6.753%\nv\t0.649%\nu\t1.487%\nt\t16.671%\ns\t7.755%\nr\t1.653%\nq\t0.173%\np\t2.545%\no\t6.264%\nn\t2.365%\nm\t4.383%\nl\t2.705%\nk\t0.590%\nj\t0.597%\ni\t6.286%\nh\t7.232%\ng\t1.950%\nf\t3.779%\ne\t2.007%\nd\t2.670%\nc\t3.511%\nb\t4.702%\na\t11.602%\n'¶
- lengths = [(2, 200), (3, 182), (4, 164), (5, 146), (6, 128), (7, 110), (8, 92), (9, 74), (10, 56), (11, 38), (12, 20)]¶
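The sketch below shows how a digraph table in this tab-separated layout could drive a small Markov chain seeded from a source word. It is an assumption-laden illustration, not the `WordMaker` implementation: `parse_digraphs`, `crc_seed`, and `make_word` are hypothetical names, the table is a short excerpt, the word length is fixed, and the initial letter is picked uniformly rather than from `first_letter_text`.

```python
import random
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

# A short excerpt in the same tab-separated layout as digraph_text.
DIGRAPH_EXCERPT = (
    "Digraph\tCount\t \tDigraph\tFrequency\n"
    "th\t5532\t \tth\t1.52\n"
    "he\t4657\t \the\t1.28\n"
    "in\t3429\t \tin\t0.94\n"
    "er\t3420\t \ter\t0.94\n"
    "an\t3005\t \tan\t0.82\n"
    "re\t2465\t \tre\t0.68\n"
    "nd\t2281\t \tnd\t0.63\n"
    "ea\t1720\t \tea\t0.47\n"
    "hi\t1690\t \thi\t0.46\n"
    "de\t332\t \tde\t0.09\n"
)

Followers = Dict[str, List[Tuple[str, int]]]


def parse_digraphs(text: str) -> Followers:
    """Map each first letter to [(next letter, count), ...]."""
    followers: Followers = defaultdict(list)
    for line in text.splitlines()[1:]:  # skip the header row
        digraph, count = line.split("\t")[:2]
        followers[digraph[0]].append((digraph[1], int(count)))
    return dict(followers)


def crc_seed(word: str) -> int:
    """One possible make_seed: stable across runs, unlike built-in hash()."""
    return zlib.crc32(word.encode("utf-8"))


def make_word(source_word: str, followers: Followers, length: int = 5) -> str:
    """Walk the digraph chain with an RNG seeded from the source word."""
    rng = random.Random(crc_seed(source_word))
    letters = [rng.choice(sorted(followers))]  # initial letter, uniform here
    while len(letters) < length:
        options = followers.get(letters[-1])
        if not options:  # dead end: restart from any known letter
            letters.append(rng.choice(sorted(followers)))
            continue
        nexts, weights = zip(*options)
        letters.append(rng.choices(nexts, weights=weights)[0])
    return "".join(letters)


followers = parse_digraphs(DIGRAPH_EXCERPT)
print(make_word("mage", followers))
```

Because the seed is derived from the source word, the same English word always maps to the same generated word.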
Two seed-generating functions to build a language.WordMaker instance.
Part 2 – Grammar¶
- class language.Tag(pos: str, words: list[str | Tag])[source]¶
Tagged text that forms a tree. (Similar to NLTK.Tree.)
- pos: str¶
Alias for field number 0
Example of tagged input:
>>> from language import Tag
>>> src = '(S (VP (TV kill) (NP (DET the) (NP mage))))'
>>> t = Tag.from_text(src)
>>> t.clean()
'kill the mage'
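For readers curious how bracketed text like this can become a tree, here is a minimal sketch using nested tuples. The names `parse_tagged` and `words` are hypothetical and this is not how `Tag.from_text` or `Tag.clean` are necessarily written. Unbalanced input is not handled.

```python
import re
from typing import List, Tuple, Union

Node = Union[str, Tuple]  # a word, or (pos, child, child, ...)


def parse_tagged(text: str) -> Node:
    """Parse '(S (VP (TV kill) ...))' into nested (pos, *children) tuples."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def node() -> Node:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a bare word or tag name
        children: List[Node] = []
        while tokens[pos] != ")":
            children.append(node())
        pos += 1  # consume ")"
        return tuple(children)  # first element is the tag name

    return node()


def words(tree: Node) -> List[str]:
    """Collect leaf words left to right, skipping the tag names."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in words(child)]


tree = parse_tagged("(S (VP (TV kill) (NP (DET the) (N mage))))")
print(tree)
print(" ".join(words(tree)))  # kill the mage
```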
- class language.TransformRule(source: str, target: str)[source]¶
Find a structure like (TAG $x $y) and transform to (TAG $y $x).
This has a number of limitations, of course. Principally, it doesn’t use Chomsky Normal Form. In CNF, every production has either two non-terminals or one terminal on the right-hand side. This seems similar to the way lambda calculus rewrites higher-arity operators as single-operand lambdas.
- match(tag_source: Tag, some_content: Tag) -> bool
Match the entire tag pattern and the content being examined.
- placeholders(tag_source: Tag, some_content: Tag, variables: Dict[str, Tag | str]) -> None
Given a pattern and tagged content, locate and assign values to $ placeholders. This does a recursive depth-first search.
Example of a transformation:
>>> from language import TransformRule
>>> rule1 = TransformRule("(S (NP $n) (VP $v $n2))", "(S (VP $v) (NP $n) (PP a $n2))")
>>> s1 = Tag.from_text("(S (NP I) (VP am (NP groot)))")
>>> xform_s1.clean()
'am I a groot'
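The placeholder idea can be sketched on plain nested tuples: match a source pattern depth-first, collect the `$` bindings, then rebuild the target pattern from them. The names `bind`, `substitute`, and `leaves` are hypothetical, and the tuples stand in for `Tag` objects. This shows the technique under those assumptions rather than the `TransformRule` code itself.

```python
from typing import Dict, List, Tuple, Union

Node = Union[str, Tuple]  # a word, a "$name" placeholder, or (pos, child, ...)


def bind(pattern: Node, content: Node, variables: Dict[str, Node]) -> bool:
    """Depth-first match of pattern against content, filling variables."""
    if isinstance(pattern, str):
        if pattern.startswith("$"):
            variables[pattern[1:]] = content  # a placeholder matches anything
            return True
        return pattern == content
    if isinstance(content, str) or len(pattern) != len(content):
        return False
    return all(bind(p, c, variables) for p, c in zip(pattern, content))


def substitute(target: Node, variables: Dict[str, Node]) -> Node:
    """Rebuild the target pattern with placeholders replaced by bound values."""
    if isinstance(target, str):
        return variables[target[1:]] if target.startswith("$") else target
    return tuple(substitute(t, variables) for t in target)


def leaves(tree: Node) -> List[str]:
    """Collect leaf words left to right, skipping the tag names."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in leaves(child)]


source = ("S", ("NP", "$n"), ("VP", "$v", "$n2"))
target = ("S", ("VP", "$v"), ("NP", "$n"), ("PP", "a", "$n2"))
content = ("S", ("NP", "I"), ("VP", "am", ("NP", "groot")))

variables: Dict[str, Node] = {}
if bind(source, content, variables):
    result = substitute(target, variables)
    print(" ".join(leaves(result)))  # am I a groot
```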