stingray.cobol_parser¶
cobol_parser – COBOL DDE Parser and JSONSchema Builder.
Translate COBOL to JSON Schema. This involves the following kinds of transformations:
Group level items become “type”: “object”.
Elementary items become one of the atomic types: “string”, “integer”, or “number”. If an extended vocabulary is used, “decimal” is also available.
Occurs items become "type": "array". There are additional special cases.
An item with both OCCURS and a PICTURE becomes an anonymous array that contains the elementary item.
OCCURS DEPENDING ON builds a "$ref": "#name" to a "$anchor": "name" item in the schema.
REDEFINES refers to another item under this parent. While this is similar to a “oneOf” definition, it’s a bit more complex because the alternatives each have separate names. The structure is not simply {"name": {"type": {"oneOf": [base-A, redefine-B, redefine-C, etc.]}}}. The “redefines” property is effectively anonymous and each of the subtypes has a distinct name. It’s {"redefine-A-B-C": {"type": {"oneOf": [{"type": "object", "properties": {"A": base}}, {"type": "object", "properties": {"B": redefined}}, etc.]}}}. This is cumbersome, but is required to capture the COBOL semantics accurately in JSONSchema.
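The REDEFINES shape described above can be sketched as a small Python helper. This is only an illustration: the helper name `redefines_schema`, the field names A and B, and the wrapper-name format are hypothetical. The sketch also places oneOf directly in the wrapper subschema, as standard JSON Schema validators expect; whether stingray emits exactly this layout is an assumption.

```python
from typing import Any


def redefines_schema(alternatives: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Wrap named alternatives in an anonymous oneOf (hypothetical helper)."""
    # Assumed naming style: "redefine-" plus the alternative names.
    wrapper_name = "redefine-" + "-".join(alternatives)
    return {
        wrapper_name: {
            "oneOf": [
                # Each alternative is an object with exactly one named property.
                {"type": "object", "properties": {name: sub}}
                for name, sub in alternatives.items()
            ]
        }
    }


schema = redefines_schema(
    {
        "A": {"type": "string"},   # the base item
        "B": {"type": "integer"},  # B REDEFINES A
    }
)
```

Each alternative keeps its own name (A or B) inside a one-property object, which is why the wrapper itself has to be anonymous.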
We require some extensions or adaptations to cover two COBOL issues:
COBOL encoded data (Packed-Decimal, Binary, etc.). JSON Schema presumes delimited files with a parser’s conversions of data. For COBOL, the parsing is driven from the JSON Schema, therefore additional details are required. This includes the contentEncoding and a conversion function.
Occurs Depending On reference. JSON Schema limits maxItems to an unsigned integer. We have to provide an alternative keyword for this.
It is also handy to have a “cobol” keyword with the original source text.
Because COBOL flattens the namespace of records, we define an $anchor for each individual field to make them easier to search for.
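As a rough illustration of why anchors help, the sketch below walks a hand-written schema fragment and indexes every subschema by its $anchor, then resolves an OCCURS DEPENDING ON reference through that index. The function name `iter_anchors`, the sample field names, and the exact value layout of the maxItemsDependsOn keyword are assumptions made for this example, not taken from the stingray source.

```python
from typing import Any, Iterator


def iter_anchors(schema: Any) -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield (anchor, subschema) pairs found anywhere in a schema document."""
    if isinstance(schema, dict):
        if "$anchor" in schema:
            yield schema["$anchor"], schema
        for value in schema.values():
            yield from iter_anchors(value)
    elif isinstance(schema, list):
        for item in schema:
            yield from iter_anchors(item)


# Hand-written fragment; the keyword layout is an assumption.
record = {
    "type": "object",
    "$anchor": "REC",
    "properties": {
        "ITEM-COUNT": {"type": "integer", "$anchor": "ITEM-COUNT"},
        "ITEM": {
            "type": "array",
            "$anchor": "ITEM",
            "maxItemsDependsOn": {"$ref": "#ITEM-COUNT"},
            "items": {"type": "string"},
        },
    },
}

anchors = dict(iter_anchors(record))
reference = record["properties"]["ITEM"]["maxItemsDependsOn"]["$ref"]
counter_schema = anchors[reference.removeprefix("#")]
```

Because every field carries an anchor, a flat COBOL name such as ITEM-COUNT can be found without knowing its path through the nested objects.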
Approach¶
We use a (long) regular expression to parse the various clauses. This defines the entire DDE syntax.
Beyond the essential language syntax, there’s a “reference format” for source code. For this format, we need to remove positions 1-6 and 73-80. Position 7 may hold a comment indicator, “*”, or a continuation character, “-”. See https://www.ibm.com/docs/en/cobol-zos/4.2?topic=structure-reference-format.
COBOL Language¶
A COBOL “Copybook” is a group-level DDE. See https://www.ibm.com/docs/en/cobol-zos/4.2?topic=division-data-data-description-entry
There are three formats for DDEs. We only really care about one of them.
Format 1 is the useful one: a DDE with level numbers 01 to 49 and 77.
Format 2 is a RENAMES clause, level 66. We don’t support this.
Format 3 is a CONDITION, level 88. This is a kind of enumeration of values; we tolerate it, but don’t do anything with it.
Here’s the railroad diagram for each sentence. Clauses after level-number and data-name-1 can appear in any order.
>>-level-number--+-------------+--+------------------+---------->
                 +-data-name-1-+  '-redefines-clause-'
                 '-FILLER------'
>--+------------------------+--+-----------------+-------------->
   '-blank-when-zero-clause-'  '-external-clause-'
>--+---------------+--+--------------------+-------------------->
   '-global-clause-'  '-group-usage-clause-'
>--+------------------+--+---------------+---------------------->
   '-justified-clause-'  '-occurs-clause-'
>--+----------------+--+-------------+-------------------------->
   '-picture-clause-'  '-sign-clause-'
>--+---------------------+--+--------------+-------------------->
   '-synchronized-clause-'  '-usage-clause-'
>--+--------------+--+--------------------+--------------------><
   '-value-clause-'  '-date-format-clause-'
A separator period occurs at the end of the sentence. (It’s described elsewhere in the COBOL language reference.)
For simple examples, see https://github.com/rradclif/mortgagesample/tree/master/MortgageApplication/copybook
For comprehensive, complex examples, see https://github.com/royopa/cb2xml/tree/ec83af657b781afd0dad9cc263623faa2549f738/source/cb2xml_tests/src/common/cobolCopybook
These cover a large number of COBOL-to-XML cases.
reference_format function¶
- stingray.cobol_parser.reference_format(source: TextIO, replacing: list[tuple[str, str]] | None = None) -> Iterator[str] ¶
Extract source from files that have sequence numbers in 1-6, indicator in 7, and code in 8-72. Zero-based, these slices are [0:6], [6], [7:72].
This can be extended to handle COPY statements that include other copybooks into a copybook.
- Parameters:
source – The source file.
replacing – A sequence of two-tuples with (“‘old’”, “new”) strings. The apostrophes on the old are required here to replace the apostrophes that are present in the COBOL source.
- Yields:
Strings with the sequence number and indicator removed.
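The slicing described above can be sketched as follows. This is a simplified stand-in, not the stingray implementation: the name `strip_reference_format` is hypothetical, the `replacing` parameter is not modelled, and continuation lines are passed through without being joined.

```python
from typing import Iterable, Iterator


def strip_reference_format(lines: Iterable[str]) -> Iterator[str]:
    """Yield the code area of each line, skipping comment lines (a sketch)."""
    for line in lines:
        # Pad short lines so the fixed slices below are always valid.
        text = line.rstrip("\n").ljust(72)
        indicator, code = text[6], text[7:72]
        if indicator in "*/":  # comment line
            continue
        # A "-" indicator marks a continuation line; joining is not handled here.
        yield code.rstrip()


sample = [
    "000100 01  CUSTOMER-RECORD.".ljust(72) + "IDENT001\n",
    "000200* A comment line\n",
    "000300     05  CUSTOMER-NAME  PIC X(20).\n",
]
result = list(strip_reference_format(sample))
```

The sequence number, the indicator, and anything from position 73 onward (here `IDENT001`) are dropped, leaving only the code area.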
Parsing¶
- stingray.cobol_parser.dde_sentences(source: Iterable[str]) -> Iterator[Sequence[str]] ¶
Decompose the source into separate sentences by looking for the trailing period-space. The pattern will produce a sequence of (level, source text) 2-tuples. Since we simply collect all the matching groups, it’s technically a Sequence[str].
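A minimal sketch of the period-space idea follows. The regular expression and the name `split_sentences` are assumptions for illustration; the actual pattern used by stingray is not shown in this document.

```python
import re
from typing import Iterable, Iterator

# Assumed pattern: a level number, then everything up to a period that is
# followed by whitespace or the end of the text.
SENTENCE = re.compile(r"\s*(\d+)\s+(.*?)\.(?:\s|$)", re.DOTALL)


def split_sentences(lines: Iterable[str]) -> Iterator[tuple[str, ...]]:
    """Yield (level, source text) groups, one per DDE sentence."""
    text = " ".join(line.strip() for line in lines)
    for match in SENTENCE.finditer(text):
        yield match.groups()


lines = [
    "01  REC.",
    "    05  AMOUNT PIC S9(5)V99",
    "        USAGE COMP-3.",
    "    05  RATE PIC 9V99 VALUE 1.25.",
]
sentences = list(split_sentences(lines))
```

Requiring whitespace (or end of text) after the period keeps the decimal point in `VALUE 1.25` from ending the sentence early.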
- stingray.cobol_parser.expand_repeat(group_dict: dict[str, str]) -> dict[str, str] ¶
Replace {“repeat”: “x(y)”} with {“digit”: “xxx…x”}
>>> expand_repeat({'repeat': '9(5)'})
{'digit': '99999'}
>>> expand_repeat({'repeat': '9(0005)'})
{'digit': '99999'}
- stingray.cobol_parser.pass_non_empty(group_dict: dict[str, str]) -> dict[str, str] ¶
Pass dictionary items with non-empty values; reject items with empty values.
>>> pass_non_empty({"a": "b", "empty": None})
{'a': 'b'}
- stingray.cobol_parser.normalize_picture(source: str) -> list[dict[str, str]] ¶
Parse PICTURE clause into component pieces to make it easier to work with. This breaks down a complex mask into individual pieces.
>>> normalize_picture("9(5)")
[{'digit': '99999'}]
>>> normalize_picture("S9(5)V99")
[{'sign': 'S'}, {'digit': '99999'}, {'decimal': 'V'}, {'digit': '99'}]
>>> normalize_picture("S9(0005)V9(0002)")
[{'sign': 'S'}, {'digit': '99999'}, {'decimal': 'V'}, {'digit': '99'}]
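One way such a breakdown could work is a tokenizing regular expression. The sketch below is an assumption-laden stand-in (the name `normalize_numeric_picture` is hypothetical) that covers only the numeric symbols S, 9, V and 9(n) repeats; the token names follow the doctest output above, and any other PICTURE symbol is silently skipped.

```python
import re

# One named alternative per token kind; a repeat is tried before a plain digit run.
TOKEN = re.compile(
    r"(?P<sign>S)|(?P<decimal>V)|(?P<repeat>9\(\d+\))|(?P<digit>9+(?!\())"
)


def normalize_numeric_picture(picture: str) -> list[dict[str, str]]:
    """Split a numeric PICTURE into sign, digit, and decimal pieces (a sketch)."""
    parts: list[dict[str, str]] = []
    for match in TOKEN.finditer(picture.upper()):
        kind = match.lastgroup or ""
        text = match.group()
        if kind == "repeat":
            # "9(0005)" becomes five literal 9s.
            count = int(text[2:-1])
            kind, text = "digit", "9" * count
        parts.append({kind: text})
    return parts


pieces = normalize_numeric_picture("S9(5)V99")
```

Expanding repeats up front means later steps only need to count characters to find the number of digits on each side of the implied decimal point.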
- stingray.cobol_parser.clause_dict(source: str) -> dict[str, str | list[dict[str, str]]] ¶
Expand a COBOL DDE sentence into a dict of clauses and values. This tends to preserve much (but not all) of the source syntax.
Non-space separators (, or ;) are dropped.
Some productions don’t have all the values captured. The ASCENDING/DESCENDING KEY options in OCCURS, for example, are stripped away.
High Level Parsing¶
- stingray.cobol_parser.structure(sentences: Iterable[Sequence[str]]) -> list[DDE] ¶
Create a list of DDE trees from a sequence of lines. Each DDE contains zero or more children.
The start of a REDEFINES union is updated retroactively: we don’t know X will be redefined until we encounter “Y REDEFINES X”.
The sentence regular expression produces two-tuples. Since we use the simple groups() function, however, it’s technically a Sequence[str].
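Turning level numbers into trees is commonly done with a stack. The sketch below shows that general technique under stated assumptions: `Node` is a hypothetical stand-in for the DDE class, and REDEFINES handling and the special level numbers (66, 77, 88) are not modelled.

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class Node:  # hypothetical stand-in for the DDE class
    level: int
    text: str
    children: list["Node"] = field(default_factory=list)


def build_trees(sentences: Iterable[tuple[str, str]]) -> list[Node]:
    """Nest (level, text) pairs by level number using a stack."""
    roots: list[Node] = []
    stack: list[Node] = []
    for level_text, text in sentences:
        node = Node(int(level_text), text)
        # Pop until the top of the stack is a lower-numbered (enclosing) level.
        while stack and stack[-1].level >= node.level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


trees = build_trees(
    [("01", "REC"), ("05", "A"), ("10", "A1"), ("05", "B"), ("01", "REC2")]
)
```

Here REC receives children A and B, A receives A1, and REC2 starts a second tree.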
- stingray.cobol_parser.schema_iter(source: TextIO, deformat: Callable[[TextIO, list[tuple[str, str]] | None], Iterator[str]] = <function reference_format>) -> Iterator[None | bool | int | float | str | list[Any] | dict[str, Any]] ¶
DDE Class¶
JSONSchemaMaker Class¶
- class stingray.cobol_parser.JSONSchemaMaker(unpacker: type[Unpacker[NDInstance]] = <class 'stingray.schema_instance.EBCDIC'>)¶
Translate COBOL DDE to JSONSchema.
This handles REDEFINES and OCCURS DEPENDING ON.
REDEFINES becomes a oneOf with the alternatives.
OCCURS DEPENDING ON uses a maxItemsDependsOn vocabulary extension.
COBOL “flattens” the namespace so an elementary name implies the path to that name. This is done with the “$anchor” keyword to mark the visible names.
The default unpacker is EBCDIC.
- build_json_schema(node: DDE, path: tuple[str, ...] = (), ignore_redefines: bool = False) -> None | bool | int | float | str | list[Any] | dict[str, Any] ¶
Emit a JSON schema that reflects a COBOL DDE and all nested DDEs within it.
Computes maxLength and minLength from the "contentEncoding" and "cobol" fields. Uses the supplied schema_instance.Unpacker.
For contentEncoding reflecting EBCDIC, this will use estruct.calcsize().
For contentEncoding reflecting ASCII or Unicode, this will use struct.calcsize().
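To give a feel for how a byte length follows from USAGE and PICTURE, here is a small sketch based on the usual IBM COBOL storage conventions. It is not the estruct implementation; the function names are hypothetical, and only the standard-library `struct.calcsize()` call is a real API.

```python
import struct


def packed_decimal_size(digits: int) -> int:
    """COMP-3: two digits per byte, plus a trailing sign nibble."""
    return digits // 2 + 1


def binary_size(digits: int) -> int:
    """COMP/BINARY: halfword, fullword, or doubleword by digit count."""
    if digits <= 4:
        return 2
    if digits <= 9:
        return 4
    return 8


# DISPLAY text occupies one byte per position: PIC X(20) is 20 bytes.
display_size = struct.calcsize("20s")

# PIC S999V99 USAGE COMP-3 has five digits, so it needs three bytes.
comp3_size = packed_decimal_size(5)
```

For fixed-size COBOL fields the minimum and maximum lengths coincide, which is why a single computed size can feed both maxLength and minLength.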
- json_type(node: DDE) -> None | bool | int | float | str | list[Any] | dict[str, Any] ¶
The JSON Schema type for an elementary (or atomic) field.
If we’re using a standard JSON Schema validator (without decimal as part of the vocabulary), the following mappings are used:
COBOL encoded numeric can be "type": "string" with additional "contentEncoding" details.
The "contentEncoding" describes COBOL Packed Decimal and Binary as strings of bytes.
A "cobol" keyword gets Usage and Picture values required to decode EBCDIC.
A "conversion" keyword converts to a more useful Python type from raw file strings.
Here’s an example:
{"title": "SOME-FIELD",
 "$anchor": "SOME-FIELD",
 "cobol": "05 SOME-FIELD USAGE COMP-3 PIC S999V99",
 "type": "string",
 "contentEncoding": "packed-decimal",
 "conversion": "decimal"
}
The title, anchor, and cobol are defined separately. This function provides the type, encoding, and conversion.
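For a field like the example above, the "decimal" conversion has to turn packed-decimal bytes into a number. The sketch below shows the general decoding technique; the function name `unpack_packed_decimal` and its `scale` parameter are hypothetical and do not describe stingray's own conversion function.

```python
from decimal import Decimal


def unpack_packed_decimal(data: bytes, scale: int) -> Decimal:
    """Decode COMP-3 bytes; scale is the number of digits after the V."""
    nibbles: list[int] = []
    for byte in data:
        # Each byte holds two 4-bit values: high nibble, then low nibble.
        nibbles.extend((byte >> 4, byte & 0x0F))
    *digits, sign = nibbles  # the last nibble is the sign
    value = int("".join(str(d) for d in digits))
    if sign in (0x0B, 0x0D):  # 0xD (and 0xB) mark negative values
        value = -value
    return Decimal(value).scaleb(-scale)


# PIC S999V99 COMP-3: five digits and a sign nibble in three bytes.
amount = unpack_packed_decimal(b"\x12\x34\x5c", scale=2)
```

The implied decimal point from the V in the PICTURE is applied only at the end, by shifting the integer value two places.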
Other "contentEncoding" values include “bigendian-int”. Also, “bigendian-float” and “bigendian-double”. And, of course, “CP037” or “EBCDIC” to decode ordinary strings from EBCDIC to native text.
If USAGE DISPLAY and PIC has only SVP9: zoned decimal: “type”: “string”, “contentEncoding”: “cp037”, “conversion”: “decimal”
If USAGE DISPLAY: “string”, “contentEncoding”: “cp037”
If USAGE COMP-3, COMPUTATIONAL-3, PACKED-DECIMAL: “type”: “string”, “contentEncoding”: “packed-decimal”, “conversion”: “decimal”
If USAGE COMP-4, COMPUTATIONAL-4, COMP, COMPUTATIONAL, BINARY: “integer”, “contentEncoding”: “bigendian-int”
If USAGE COMP-1, COMPUTATIONAL-1, COMP-2, COMPUTATIONAL-2: “number”, “contentEncoding”: “bigendian-float” or “bigendian-double”
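The mapping rules above can be read as a simple decision function. The sketch below restates them in Python; `json_type_sketch` is a hypothetical name, it takes plain strings rather than a DDE node, and it assumes the PICTURE has already had its repeat counts expanded (for example "S99999V99" rather than "S9(5)V99").

```python
from typing import Any


def json_type_sketch(usage: str, picture: str) -> dict[str, Any]:
    """Map USAGE and an expanded PICTURE to type keywords (a sketch)."""
    usage = usage.upper()
    if usage in {"COMP-3", "COMPUTATIONAL-3", "PACKED-DECIMAL"}:
        return {
            "type": "string",
            "contentEncoding": "packed-decimal",
            "conversion": "decimal",
        }
    if usage in {"COMP-4", "COMPUTATIONAL-4", "COMP", "COMPUTATIONAL", "BINARY"}:
        return {"type": "integer", "contentEncoding": "bigendian-int"}
    if usage in {"COMP-1", "COMPUTATIONAL-1"}:  # single precision
        return {"type": "number", "contentEncoding": "bigendian-float"}
    if usage in {"COMP-2", "COMPUTATIONAL-2"}:  # double precision
        return {"type": "number", "contentEncoding": "bigendian-double"}
    # Everything else is treated as USAGE DISPLAY.
    if picture and set(picture.upper()) <= set("SVP9"):
        # Zoned decimal: numeric text that converts to a decimal value.
        return {"type": "string", "contentEncoding": "cp037", "conversion": "decimal"}
    return {"type": "string", "contentEncoding": "cp037"}


packed = json_type_sketch("COMP-3", "S999V99")
zoned = json_type_sketch("DISPLAY", "S999V99")
text = json_type_sketch("DISPLAY", "XXXX")
```

The binary and packed cases are checked first, so the PICTURE only decides between zoned decimal and plain text for DISPLAY items.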
- class stingray.cobol_parser.JSONSchemaMakerExtendedVocabulary(unpacker: type[Unpacker[NDInstance]] = <class 'stingray.schema_instance.EBCDIC'>)¶
A JSONSchemaMaker with an extended, non-standard vocabulary.
- VOCABULARY: None | bool | int | float | str | list[Any] | dict[str, Any] = {}¶
- json_type(node: DDE) -> None | bool | int | float | str | list[Any] | dict[str, Any] ¶
If we’re using an extended vocabulary including decimal, the following mappings can be used.
If USAGE DISPLAY and PIC has only SVP9: zoned “decimal”.
If USAGE DISPLAY: “string”.
If USAGE COMP-3, COMPUTATIONAL-3, PACKED-DECIMAL: “decimal”.
If USAGE COMP-4, COMPUTATIONAL-4, COMP, COMPUTATIONAL, BINARY: “integer”.
If USAGE COMP-1, COMPUTATIONAL-1, COMP-2, COMPUTATIONAL-2: “number”.
Example:
{
    "type": json_type(node),
    "cobol": f"{node.level} {node.name} {node.source}",
}
Exceptions¶
- exception stingray.cobol_parser.DesignError¶
This is a catastrophic design problem. A common root cause is a named REGEX capture clause that’s not properly handled by a class, method, or function.