language.py

Implements a specific ConLang for a series of books. This includes a few grammar rules and lexicon rules to create words.

Background

This is for a specific epic fantasy series. It implements an imperial trade language shared by a handful of kingdoms, all under the control of the “Western Empire.”

The essential language has a relatively simple structure.

Sentences have forms like VP NP NP, VP NP PP, VP NP PP NP PP. The verb is always first. Nouns follow, generally decorated with prepositions.

There are some additional grammar rules for noun phrases that inject determiners to help keep the subjects and objects straight.

NP -> Det Nominal ;
Nominal -> Nominal Noun ;

A transliteration of “Kill the mage” would look like this:

“go-Kill you mage-the”

There’s a verb, “go-Kill”, and two nouns, “you” and “mage-the”.

Verbs:

  • Present tense is the base form of verbs.

  • Imperative gets a “go-” prefix.

  • Present participle (“-ing” in English) gets a “now-” prefix.

  • Past (“-ed” in English) gets a “did-” prefix.
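The prefix rules above can be sketched as a small lookup table. This is only an illustration: the names TENSE_PREFIX and prefix_verb are hypothetical and do not come from language.py, and capitalizing the stem is an assumption based on transliterations such as “go-Kill”.

```python
# Hypothetical helper; these names are illustrative, not part of language.py.
TENSE_PREFIX = {
    "present": "",         # base form, no prefix
    "imperative": "go-",
    "participle": "now-",  # English "-ing"
    "past": "did-",        # English "-ed"
}


def prefix_verb(stem: str, tense: str) -> str:
    """Return the ConLang verb form for an English stem and a tense name."""
    # Capitalizing the stem is an assumption drawn from the transliterations.
    return f"{TENSE_PREFIX[tense]}{stem.capitalize()}"


print(prefix_verb("kill", "imperative"))  # go-Kill
print(prefix_verb("bark", "past"))        # did-Bark
```

An unknown tense name raises KeyError, which seems preferable to silently emitting an unprefixed verb.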

There’s no person conjugation. The person is a separate word that comes right after the verb.

Generally, the ConLang lacks intransitive verbs. “did-Bark the dog at something.” “did-See myself the man.” “did-Give the dog to a man.” “did-Say the man did-Bark the dog at something.”

Noun determiners (this, that, the, any, all, etc.) are suffixed onto the noun.
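A minimal sketch of the determiner suffix, assuming a hyphen joins the noun and its determiner as in “mage-the”. The function name suffix_determiner is hypothetical.

```python
def suffix_determiner(noun: str, determiner: str) -> str:
    """Attach an English determiner to a noun as a hyphenated suffix."""
    # The hyphen separator is an assumption taken from the "mage-the" example.
    return f"{noun}-{determiner}"


print(suffix_determiner("mage", "the"))  # mage-the
print(suffix_determiner("dog", "that"))  # dog-that
```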

Implementation

ConLang Grammar and Lexicon.

See https://www.nltk.org/book/ch05.html for the Parts-of-Speech Tags

Also, see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html for a more complete set.

See https://www.cis.upenn.edu/~bies/manuals/tagguide.pdf

Also https://universaldependencies.org/u/pos/all.html

The tricky part is – of course – the grammar.

Our goal is to move from a tagged English parse tree to the ConLang tree through a number of transformations.

Input

(S (VP (TV kill) (NP (DET the) (N mage))))

In English, this transitive-verb example has an object but no subject. English permits S -> VP NP as well as the more conventional S -> NP VP structure.

See https://ucrel.lancs.ac.uk/bnc2/bnc2guide.htm where VVI/VVZ is used for imperative.

The target ConLang in this implementation uses sentence forms such as VP NP NP, VP NP PP, and VP NP PP NP PP. Note the recursive NP -> Det Nominal and Nominal -> Nominal Noun rules, which permit constructing complicated relationships.

This produces constructs such as “go-Kill you the mage”. Present is the base form. Imperative is a “go-” prefix. Present participle (“-ing” in English) is a “now-” prefix. Past (“-ed” in English) is a “did-” prefix. There is no person conjugation; the person is a separate word after the verb.

Verb Subcategories

Symbol   Meaning             Example
IV       intransitive verb   barked (VP -> IV Adj)
TV       transitive verb     saw a man (VP -> TV NP)
DatV     dative verb         gave a dog to a man (VP -> NP PP)
SV       sentential verb     said that a dog barked (V -> SV S)

Generally, this example ConLang lacks Intransitive verbs. “did-Bark the dog at something.” “did-See myself the man.” “did-Give the dog to a man.” “did-Say the man did-Bark the dog at something.”

Rules needed

  • Pivot Declarative to VP first. (S (NP $n) (VP $v) $x) -> (S (VP $v) (NP $n) $x). Imperative (S (VP)) left alone. Interrogative (S (Aux NP VP)), (S (Wh-NP VP)), (S (Wh-NP Aux NP VP)) should also be swapped around to verb first.

  • Adverbs are split out with a helper. (VP $x (ADVP (NP $y))) -> (VP $x (PP with (NP $y))).

  • The Verb Subcategories rules, above.

  • Apply the tense prefix to the base verb’s stem word. (VP (MD $m) (VP $v)) -> (VP ${v|prefix(m)}). This rewrites the verb root with the mode prefix.

  • Apply the noun determiner suffix to nouns. (NP (DET $d) (N $n)) -> (NP ${n|suffix(d)}).
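As an illustration of the first rule only, here is a sketch of the declarative pivot. It assumes trees are plain nested tuples of the form ("TAG", child, ...) rather than the module's Tag class, and the function name pivot_verb_first is hypothetical.

```python
def pivot_verb_first(tree):
    """Rewrite (S (NP n) (VP v) x...) as (S (VP v) (NP n) x...)."""
    if isinstance(tree, tuple) and tree[0] == "S" and len(tree) >= 3:
        first, second, *rest = tree[1:]
        if (isinstance(first, tuple) and first[0] == "NP"
                and isinstance(second, tuple) and second[0] == "VP"):
            return ("S", second, first, *rest)
    # Anything else, including an imperative (S (VP ...)), is left alone.
    return tree


declarative = ("S", ("NP", "dog"), ("VP", "barked"))
imperative = ("S", ("VP", ("TV", "kill"), ("NP", "mage")))
print(pivot_verb_first(declarative))  # ('S', ('VP', 'barked'), ('NP', 'dog'))
print(pivot_verb_first(imperative) == imperative)  # True
```

The interrogative forms would need their own patterns; they are not covered here.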

Part 1 – Lexicon

To create words, we use a simple weighted choice random selection.

language.weighted_choice(source: List[Tuple[CT, int]]) -> CT

Given [(string, int), …] weighted strings, pick a string.
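One way such a weighted pick could be written with the standard library is shown below. This is a sketch, not the module's code: the name weighted_choice_sketch and the explicit rng parameter are assumptions added so the example is reproducible.

```python
import random
from typing import List, Tuple, TypeVar

CT = TypeVar("CT")


def weighted_choice_sketch(source: List[Tuple[CT, int]], rng: random.Random) -> CT:
    """Pick one value from (value, weight) pairs, favoring larger weights."""
    values = [value for value, _ in source]
    weights = [weight for _, weight in source]
    # random.Random.choices returns a list of k picks; take the single pick.
    return rng.choices(values, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so repeated runs give the same picks
lengths = [(2, 200), (3, 182), (4, 164)]
picks = [weighted_choice_sketch(lengths, rng) for _ in range(5)]
print(picks)
```

Passing the generator in explicitly, rather than using the module-level functions in random, keeps word generation repeatable for a given seed.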

A language.WordMaker builds words from a specific set of digraph frequencies.

class language.WordMaker(make_seed: Callable[[str], int])

Create words from a few rules.

  1. Markov chains based on digraphs.

  2. Initial Letter thrown on kind of randomly.

  3. Seeded RNG from source word.

digraph_text = 'Digraph\tCount\t \tDigraph\tFrequency\nth\t5532\t \tth\t1.52\nhe\t4657\t \the\t1.28\nin\t3429\t \tin\t0.94\ner\t3420\t \ter\t0.94\nan\t3005\t \tan\t0.82\nre\t2465\t \tre\t0.68\nnd\t2281\t \tnd\t0.63\nat\t2155\t \tat\t0.59\non\t2086\t \ton\t0.57\nnt\t2058\t \tnt\t0.56\nha\t2040\t \tha\t0.56\nes\t2033\t \tes\t0.56\nst\t2009\t \tst\t0.55\nen\t2005\t \ten\t0.55\ned\t1942\t \ted\t0.53\nto\t1904\t \tto\t0.52\nit\t1822\t \tit\t0.50\nou\t1820\t \tou\t0.50\nea\t1720\t \tea\t0.47\nhi\t1690\t \thi\t0.46\nis\t1660\t \tis\t0.46\nor\t1556\t \tor\t0.43\nti\t1231\t \tti\t0.34\nas\t1211\t \tas\t0.33\nte\t985\t \tte\t0.27\net\t704\t \tet\t0.19\nng\t668\t \tng\t0.18\nof\t569\t \tof\t0.16\nal\t341\t \tal\t0.09\nde\t332\t \tde\t0.09\nse\t300\t \tse\t0.08\nle\t298\t \tle\t0.08\nsa\t215\t \tsa\t0.06\nsi\t186\t \tsi\t0.05\nar\t157\t \tar\t0.04\nve\t148\t \tve\t0.04\nra\t137\t \tra\t0.04\nld\t64\t \tld\t0.02\nur\t60\t \tur\t0.02\n'
first_letter_text = 'Letter\tFrequency\nz\t0.034%\ny\t1.620%\nx\t0.017%\nw\t6.753%\nv\t0.649%\nu\t1.487%\nt\t16.671%\ns\t7.755%\nr\t1.653%\nq\t0.173%\np\t2.545%\no\t6.264%\nn\t2.365%\nm\t4.383%\nl\t2.705%\nk\t0.590%\nj\t0.597%\ni\t6.286%\nh\t7.232%\ng\t1.950%\nf\t3.779%\ne\t2.007%\nd\t2.670%\nc\t3.511%\nb\t4.702%\na\t11.602%\n'
lengths = [(2, 200), (3, 182), (4, 164), (5, 146), (6, 128), (7, 110), (8, 92), (9, 74), (10, 56), (11, 38), (12, 20)]
__init__(make_seed: Callable[[str], int]) -> None

Prepare the generator using a seed-creating function and loading the frequency tables.

Parameters:

make_seed – a function to transform English word to a seed for generating a ConLang word.

load_digraph_markov() -> None
load_first_letters() -> None
word(seed: str = '') -> str

Expand weighted Markov chains through the digraphs.
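The digraph chain can be sketched roughly as follows. This is an assumption about the general technique, not a copy of WordMaker: the names DIGRAPHS, build_chain, and make_word are hypothetical, and only a subset of the digraph counts from digraph_text is used.

```python
import random
from collections import defaultdict

# A small subset of the digraph counts listed in digraph_text.
DIGRAPHS = [("th", 5532), ("he", 4657), ("in", 3429), ("er", 3420),
            ("an", 3005), ("re", 2465), ("nd", 2281), ("ha", 2040),
            ("es", 2033), ("st", 2009), ("en", 2005), ("ea", 1720)]


def build_chain(digraphs):
    """Map each first letter to its weighted candidate successor letters."""
    chain = defaultdict(list)
    for (first, second), count in digraphs:
        chain[first].append((second, count))
    return chain


def make_word(chain, first_letter, length, rng):
    """Extend a word one letter at a time, weighted by digraph counts."""
    word = first_letter
    # Stop at the target length or when the last letter has no successors.
    while len(word) < length and word[-1] in chain:
        letters, counts = zip(*chain[word[-1]])
        word += rng.choices(letters, weights=counts, k=1)[0]
    return word


chain = build_chain(DIGRAPHS)
print(make_word(chain, "t", 6, random.Random(7)))
```

With this reduced table some letters, such as “d”, have no successors, so a word may come out shorter than the requested length.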

Two seed-generating functions are available for building a language.WordMaker instance.

language.naive_seed(original: str) -> int

Transform source word into RNG seed. Seems to have too many collisions.

language.hash_seed(original: str) -> int

Transform source word into RNG seed.
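The source does not show how either function is written. The sketch below is a guess at the kind of difference involved: a sum of character codes collides for anagrams, while a digest-based seed does not. Both function names and both bodies are assumptions.

```python
import hashlib


def naive_seed_sketch(original: str) -> int:
    """Sum the character codes; anagrams produce the same seed."""
    return sum(ord(c) for c in original)


def hash_seed_sketch(original: str) -> int:
    """Derive a seed from the first 8 bytes of a SHA-256 digest."""
    digest = hashlib.sha256(original.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


print(naive_seed_sketch("dog") == naive_seed_sketch("god"))  # True
print(hash_seed_sketch("dog") == hash_seed_sketch("god"))    # False
```

hashlib is used here instead of the built-in hash() because string hashes from hash() vary between interpreter runs unless PYTHONHASHSEED is fixed, which would make generated words unstable.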

Part 2 – Grammar

class language.Tag(pos: str, words: list[str | Tag])

Tagged text that forms a tree. (Similar to NLTK.Tree.)

pos: str

Alias for field number 0

words: list[str | Tag]

Alias for field number 1

static from_text(source: str) -> Tag

Parse (tag (tag word) (tag word)) kinds of structures.

static from_symbols(symbols: List[str]) -> Tag
clean() -> str

Example of tagged input:

>>> from language import Tag
>>> src = '(S (VP (TV kill) (NP (DET the) (NP mage))))'
>>> t = Tag.from_text(src)
>>> t.clean()
'kill the mage'
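For readers without the module at hand, parsing this bracketed notation can be sketched with a small recursive-descent parser. This is not the Tag implementation: it returns plain (pos, children) tuples, and the names parse_tagged and leaves are hypothetical.

```python
import re


def parse_tagged(source: str):
    """Parse '(TAG child child ...)' text into (tag, [children]) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", source)
    position = 0

    def parse_node():
        nonlocal position
        if tokens[position] != "(":
            raise ValueError(f"expected '(' but found {tokens[position]!r}")
        position += 1
        pos_tag = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(parse_node())  # nested tagged structure
            else:
                children.append(tokens[position])  # bare word
                position += 1
        position += 1  # consume the closing ")"
        return (pos_tag, children)

    return parse_node()


def leaves(node):
    """Collect the words of a parsed tree from left to right."""
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


tree = parse_tagged("(S (VP (TV kill) (NP (DET the) (N mage))))")
print(" ".join(leaves(tree)))  # kill the mage
```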

class language.TransformRule(source: str, target: str)

Find a structure like (TAG $x $y) and transform to (TAG $y $x).

This has a number of limitations, of course. Principally, it doesn’t use Chomsky Normal Form. In CNF, every production has either two non-terminals or one terminal on the right-hand side. This seems similar to the way lambda calculus rewrites higher-arity operators as single-operand lambdas.

match(tag_source: Tag, some_content: Tag) -> bool

Match the entire tag pattern and the content being examined.

placeholders(tag_source: Tag, some_content: Tag, variables: Dict[str, Tag | str]) -> None

Given a pattern and tagged content, locate and assign values to $ placeholders. This does a recursive depth-first search.

emit(tag_target: Tag, variables: Dict[str, Tag | str]) -> Tag

Replace $ placeholders with values. Apply functions like ${x|stem}.

apply(some_content: Tag) -> Tag

Apply this rule to some tagged content.
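The match, bind, and emit steps can be sketched on plain nested tuples. This is a simplified stand-in, not the TransformRule code: it matches only at the root of the tree, it does not support ${x|function} filters, and the names bind, emit, apply_rule, and words are hypothetical.

```python
def bind(pattern, content, variables):
    """Bind $names in the pattern to parts of content; True on a match."""
    if isinstance(pattern, str):
        if pattern.startswith("$"):
            variables[pattern[1:]] = content  # a placeholder matches anything
            return True
        return pattern == content
    if not isinstance(content, tuple) or len(pattern) != len(content):
        return False
    return all(bind(p, c, variables) for p, c in zip(pattern, content))


def emit(target, variables):
    """Copy the target pattern, replacing each $name with its bound value."""
    if isinstance(target, str):
        return variables[target[1:]] if target.startswith("$") else target
    return tuple(emit(part, variables) for part in target)


def apply_rule(pattern, target, content):
    """Rewrite content when the pattern matches; otherwise return it as is."""
    variables = {}
    return emit(target, variables) if bind(pattern, content, variables) else content


def words(tree):
    """Flatten a ("TAG", child, ...) tree to its words, skipping the tags."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in words(child)]


pattern = ("S", ("NP", "$n"), ("VP", "$v", "$n2"))
target = ("S", ("VP", "$v"), ("NP", "$n"), ("PP", "a", "$n2"))
sentence = ("S", ("NP", "I"), ("VP", "am", ("NP", "groot")))
result = apply_rule(pattern, target, sentence)
print(" ".join(words(result)))  # am I a groot
```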

Example of a transformation:

>>> from language import Tag, TransformRule
>>> rule1 = TransformRule("(S (NP $n) (VP $v $n2))", "(S (VP $v) (NP $n) (PP a $n2))")
>>> s1 = Tag.from_text("(S (NP I) (VP am (NP groot)))")
>>> xform_s1 = rule1.apply(s1)
>>> xform_s1.clean()
'am I a groot'