stingray.cobol_parser

cobol_parser – COBOL DDE Parser and JSONSchema Builder.

Translate COBOL to JSON Schema. This involves the following kinds of transformations:

  • Group level items become “type”: “object”.

  • Elementary items become one of the atomic types, “string”, “integer”, “number”. If an extended vocabulary is used, then “decimal” can also be used.

  • Occurs items become "type": "array". There are additional special cases.

    • An item with both OCCURS and a PICTURE becomes an anonymous array that contains the elementary item.

    • OCCURS DEPENDING ON builds a "$ref": "#name" to a "$anchor": "name" item in the schema.

  • REDEFINES refers to another item under this parent. While this is similar to a “oneOf” definition, it’s a bit more complex because the alternatives each have separate names. The structure is not simply {"name": {"type": {"oneOf": [base-A, redefine-B, redefine-C, etc.]}}}. The “redefines” property is effectively anonymous and each of the subtypes has a distinct name. It’s {"redefine-A-B-C": {"type": {"oneOf": [{"type": "object", "properties": {"A": base}}, {"type": "object", "properties": {"B": redefined}}, etc.]}}}. This is cumbersome, but is required to capture the COBOL semantics accurately in JSONSchema.
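The REDEFINES shape described above can be sketched as a small Python helper. This is only an illustration: redefines_union is a hypothetical name, and placing "oneOf" directly under the anonymous property (rather than under a nested "type") is an assumption about the exact nesting, not a statement about what the module emits.

```python
def redefines_union(alternatives: dict[str, dict]) -> dict:
    """Hypothetical helper: wrap each named alternative in its own object.

    The union property gets a synthetic name built from the alternative
    names; each alternative keeps its own distinct name inside "properties".
    """
    union_name = "redefine-" + "-".join(alternatives)
    return {
        union_name: {
            "oneOf": [
                {"type": "object", "properties": {name: schema}}
                for name, schema in alternatives.items()
            ]
        }
    }


# A is the base item; B redefines A (both schemas are placeholders).
schema = redefines_union({
    "A": {"type": "string"},
    "B": {"type": "object", "properties": {}},
})
```

Each alternative is an object with exactly one named property, which is what lets A and B keep separate names even though they share the same bytes.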

We require some extensions or adaptations to cover two COBOL issues:

  • COBOL encoded data (Packed-Decimal, Binary, etc.). JSON Schema presumes delimited files with a parser’s conversions of data. For COBOL, the parsing is driven from the JSON Schema, therefore additional details are required. This includes the contentEncoding and a conversion function.

  • Occurs Depending On reference. JSON Schema limits the maxItems to an unsigned integer. We have to provide an alternative keyword for this.

It is also handy to have a “cobol” keyword with the original source text.

Because COBOL flattens the namespace of records, we define an $anchor for each individual field to make them easier to search for.
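As a rough illustration of how these extensions might fit together, here is a hand-written schema fragment expressed as a Python dict. The COBOL source in the comment, the field names, and the exact placement of the maxItemsDependsOn keyword are assumptions made for this sketch; only the keyword names themselves come from this page.

```python
import json

# Hypothetical copybook fragment this sketch is modelled on:
#   05 ITEM-COUNT PIC 99.
#   05 ITEM       PIC X(4) OCCURS 1 TO 10 TIMES DEPENDING ON ITEM-COUNT.
record_properties = {
    "ITEM-COUNT": {
        "$anchor": "ITEM-COUNT",          # flat name, easy to search for
        "cobol": "05 ITEM-COUNT PIC 99",  # original source text
        "type": "string",
        "contentEncoding": "cp037",
        "conversion": "decimal",
    },
    "ITEM": {
        "type": "array",
        # Assumed shape: the extension keyword refers to the counter's anchor.
        "maxItemsDependsOn": {"$ref": "#ITEM-COUNT"},
        "items": {
            "$anchor": "ITEM",
            "cobol": "05 ITEM PIC X(4)",
            "type": "string",
            "contentEncoding": "cp037",
        },
    },
}

print(json.dumps(record_properties, indent=2))
```

The "$ref" value is simply "#" followed by the "$anchor" of the counting field, which is why every field carries an anchor.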

Approach

We use a (long) regular expression to parse the various clauses. This defines the entire DDE syntax.

Beyond the essential language syntax, there’s a “reference format” for source code. For this format, we need to remove positions 1-6 and 73-80. Position 7 may contain a comment indicator, “*”, or a continuation character, “-”. See https://www.ibm.com/docs/en/cobol-zos/4.2?topic=structure-reference-format.

COBOL Language

A COBOL “Copybook” is a group-level DDE. See https://www.ibm.com/docs/en/cobol-zos/4.2?topic=division-data-data-description-entry

There are three formats for DDE’s. We only really care about one of them.

  • Format 1 is the useful DDE level numbers 01 to 49 and 77.

  • Format 2 is a RENAMES clause, level 66. We don’t support this.

  • Format 3 is a CONDITION, level 88. This is a kind of enumeration of values; we tolerate it, but don’t do anything with it.

Here’s the railroad diagram for each sentence. Clauses after level-number and data-name-1 can appear in any order.

>>-level-number--+-------------+--+------------------+---------->
                 +-data-name-1-+  '-redefines-clause-'
                 '-FILLER------'

>--+------------------------+--+-----------------+-------------->
   '-blank-when-zero-clause-'  '-external-clause-'

>--+---------------+--+--------------------+-------------------->
   '-global-clause-'  '-group-usage-clause-'

>--+------------------+--+---------------+---------------------->
   '-justified-clause-'  '-occurs-clause-'

>--+----------------+--+-------------+-------------------------->
   '-picture-clause-'  '-sign-clause-'

>--+---------------------+--+--------------+-------------------->
   '-synchronized-clause-'  '-usage-clause-'

>--+--------------+--+--------------------+--------------------><
   '-value-clause-'  '-date-format-clause-'

A separator period occurs at the end of the sentence. (It’s described elsewhere in the COBOL language reference.)

For simple examples, see https://github.com/rradclif/mortgagesample/tree/master/MortgageApplication/copybook

For comprehensive, complex examples, see https://github.com/royopa/cb2xml/tree/ec83af657b781afd0dad9cc263623faa2549f738/source/cb2xml_tests/src/common/cobolCopybook

These cover a large number of COBOL-to-XML cases.

reference_format function

stingray.cobol_parser.reference_format(source: TextIO, replacing: list[tuple[str, str]] | None = None) → Iterator[str]

Extract source from files that have sequence numbers in positions 1-6, an indicator in position 7, and code in positions 8-72. Zero-based, these slices are [0:6], [6], and [7:72].

This can be extended to handle COPY statements that include other copybooks into a copybook.

Parameters:
  • source – The source file

  • replacing – A sequence of two-tuples with (“‘old’”, “new”) strings. The apostrophes on the old are required here to replace the apostrophes that are present in the COBOL source.

Yields:

strings with the sequence and indicator removed.
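The slicing described above can be sketched in a few lines. This is a simplified stand-in, not the module's reference_format: the name strip_reference_format is hypothetical, the replacing parameter is left out, and continuation lines (“-” in position 7) are passed through without being joined.

```python
from collections.abc import Iterable, Iterator


def strip_reference_format(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the code area (positions 8-72) of each non-comment line."""
    for line in lines:
        # Pad short lines so the indicator position always exists.
        padded = line.rstrip("\r\n").ljust(7)
        indicator = padded[6]
        if indicator == "*":
            continue  # comment line: nothing to parse
        # [7:72] drops the sequence number, the indicator, and positions 73-80.
        yield padded[7:72]


sample = [
    "000100 " + "01  CUSTOMER-RECORD.".ljust(65) + "IDENT001\n",
    "000200*" + " THIS IS A COMMENT".ljust(65) + "IDENT002\n",
    "000300     05  CUSTOMER-NAME  PIC X(20).\n",
]
for code in strip_reference_format(sample):
    print(repr(code.rstrip()))
```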

Parsing

stingray.cobol_parser.dde_sentences(source: Iterable[str]) → Iterator[Sequence[str]]

Decompose the source into separate sentences by looking for the trailing period-space. The pattern produces a sequence of (level, source text) 2-tuples. Because we simply collect all the matching groups, each item is technically a Sequence[str].
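One way to picture the period-space split is the sketch below. The pattern and the name split_sentences are assumptions made for illustration; the module's own regular expression is considerably longer and is not reproduced here.

```python
import re
from collections.abc import Iterable, Iterator, Sequence

# Hypothetical pattern: a one- or two-digit level number, then everything up
# to a period that is followed by whitespace or the end of the text.
SENTENCE = re.compile(r"\s*(\d\d?)\s+(.*?)\.(?:\s|$)", re.DOTALL)


def split_sentences(source: Iterable[str]) -> Iterator[Sequence[str]]:
    """Yield (level, source text) groups for each DDE sentence."""
    text = "\n".join(line.rstrip() for line in source)
    for match in SENTENCE.finditer(text):
        yield match.groups()


copybook = [
    "01  REC.",
    "    05  A PIC X(5).",
    "    05  B PIC 9(3)V99.",
]
print(list(split_sentences(copybook)))
```

Requiring whitespace (or end of text) after the period keeps a period inside a clause value from being mistaken for the sentence terminator.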

stingray.cobol_parser.expand_repeat(group_dict: dict[str, str]) → dict[str, str]

Replace {“repeat”: “x(y)”} with {“digit”: “xxx…x”}

>>> expand_repeat({'repeat': '9(5)'})
{'digit': '99999'}
>>> expand_repeat({'repeat': '9(0005)'})
{'digit': '99999'}

stingray.cobol_parser.pass_non_empty(group_dict: dict[str, str]) → dict[str, str]

Pass dictionary items with non-empty values; reject items with empty values.

>>> pass_non_empty({"a": "b", "empty": None})
{'a': 'b'}

stingray.cobol_parser.normalize_picture(source: str) → list[dict[str, str]]

Parse PICTURE clause into component pieces to make it easier to work with. This breaks down a complex mask into individual pieces.

>>> normalize_picture("9(5)")
[{'digit': '99999'}]
>>> normalize_picture("S9(5)V99")
[{'sign': 'S'}, {'digit': '99999'}, {'decimal': 'V'}, {'digit': '99'}]
>>> normalize_picture("S9(0005)V9(0002)")
[{'sign': 'S'}, {'digit': '99999'}, {'decimal': 'V'}, {'digit': '99'}]
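
The decomposition shown in these examples can be approximated with a small tokenizer. This is a sketch under stated assumptions: sketch_normalize_picture and PIC_TOKEN are hypothetical names, and only the symbols S, V, 9, and 9(n) are handled, whereas real PICTURE strings allow many more.

```python
import re

# Hypothetical token pattern covering only S, V, 9(n), and runs of 9.
# The lookahead keeps a 9 that is followed by "(" for the repeat alternative.
PIC_TOKEN = re.compile(
    r"(?P<sign>S)|(?P<decimal>V)|(?P<repeat>9\(\d+\))|(?P<digit>(?:9(?!\())+)"
)


def sketch_normalize_picture(source: str) -> list[dict[str, str]]:
    """Break a numeric PICTURE string into one-entry dicts, expanding 9(n)."""
    pieces: list[dict[str, str]] = []
    pos = 0
    while pos < len(source):
        match = PIC_TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unsupported PICTURE symbol at {pos}: {source!r}")
        name = match.lastgroup
        text = match.group(name)
        if name == "repeat":
            # "9(0005)" -> "99999": repeat the symbol by the parenthesized count.
            pieces.append({"digit": text[0] * int(text[2:-1])})
        else:
            pieces.append({name: text})
        pos = match.end()
    return pieces


print(sketch_normalize_picture("S9(5)V99"))
```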

stingray.cobol_parser.clause_dict(source: str) → dict[str, str | list[dict[str, str]]]

Expand a COBOL DDE sentence into a dict of clauses and values. This tends to preserve much (but not all) of the source syntax.

  1. Non-space separators (, or ;) are dropped.

  2. Some productions don’t have all the values captured. The ASCENDING/DESCENDING KEY options in OCCURS, for example, are stripped away.

High Level Parsing

stingray.cobol_parser.structure(sentences: Iterable[Sequence[str]]) → list[DDE]

Create a list of DDE trees from a sequence of sentences. Each DDE contains zero or more children.

We update the start of a REDEFINES union. We don’t know X will be redefined until we encounter “Y REDEFINES X”.

The sentence regular expression produces two-tuples. Since we use the simple groups() function, however, it’s technically a Sequence[str].
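Turning level numbers into trees is commonly done with a stack of open groups. The sketch below shows that general technique with a hypothetical Node class; it is not the module's implementation, it ignores the special levels 66, 77, and 88, and it leaves out the REDEFINES bookkeeping entirely.

```python
from collections.abc import Iterable, Sequence
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DDE: a level, a name, and children."""
    level: int
    name: str
    children: list["Node"] = field(default_factory=list)


def build_trees(sentences: Iterable[Sequence[str]]) -> list[Node]:
    roots: list[Node] = []
    stack: list[Node] = []  # the chain of currently open groups
    for level_text, name in sentences:
        node = Node(int(level_text), name)
        # Close every open item at the same or a deeper level.
        while stack and stack[-1].level >= node.level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


trees = build_trees([("01", "REC"), ("05", "A"), ("10", "B"), ("05", "C")])
print([child.name for child in trees[0].children])
```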

stingray.cobol_parser.schema_iter(source: TextIO, deformat: Callable[[TextIO, list[tuple[str, str]] | None], Iterator[str]] = <function reference_format>) → Iterator[None | bool | int | float | str | list[Any] | dict[str, Any]]

DDE Class

class stingray.cobol_parser.DDE(*sentence: str, clauses: dict[str, Any] | None = None)

An instance of a COBOL DDE.

For name == “FILLER”, this assigns a unique internal name. A class-level counter is used.

Note that level of “01” resets the counter.

append(child: DDE) → None

static display(node: DDE, indent: int = 0) → None

filler_count = 0
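
The FILLER numbering described for this class might look roughly like the following. SketchItem and the "FILLER-n" name format are assumptions for illustration; only the class-level counter and the reset on level “01” come from the description.

```python
class SketchItem:
    """Hypothetical stand-in for DDE showing only the FILLER naming rule."""

    filler_count = 0  # shared by all instances, like DDE.filler_count

    def __init__(self, level: str, name: str) -> None:
        if level == "01":
            # A new record starts; numbering begins again.
            SketchItem.filler_count = 0
        if name == "FILLER":
            SketchItem.filler_count += 1
            name = f"FILLER-{SketchItem.filler_count}"  # assumed name format
        self.level = level
        self.name = name


items = [
    ("01", "REC-A"), ("05", "FILLER"), ("05", "FILLER"),
    ("01", "REC-B"), ("05", "FILLER"),
]
names = [SketchItem(level, name).name for level, name in items]
print(names)
```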

JSONSchemaMaker Class

class stingray.cobol_parser.JSONSchemaMaker(unpacker: type[Unpacker[NDInstance]] = <class 'stingray.schema_instance.EBCDIC'>)

Translate COBOL DDE to JSONSchema.

This handles REDEFINES and OCCURS DEPENDING ON.

  • REDEFINES becomes a oneOf with the alternatives.

  • OCCURS DEPENDING ON uses a maxItemsDependsOn vocabulary extension.

COBOL “flattens” the namespace so an elementary name implies the path to that name. This is done with the “$anchor” keyword to mark the visible names.

The default is EBCDIC encodings.

build_json_schema(node: DDE, path: tuple[str, ...] = (), ignore_redefines: bool = False) → None | bool | int | float | str | list[Any] | dict[str, Any]

Emit a JSON schema that reflects a COBOL DDE and all nested DDE’s within it.

Computes maxLength and minLength from the "contentEncoding" and "cobol" fields. Uses the supplied schema_instance.Unpacker.

  • For contentEncoding reflecting EBCDIC, this will use estruct.calcsize().

  • For contentEncoding reflecting ASCII or Unicode, this will use struct.calcsize().

json_type(node: DDE) → None | bool | int | float | str | list[Any] | dict[str, Any]

The JSON Schema type for an elementary (or atomic) field.

If we’re using a standard JSON Schema validator (without decimal as part of the vocabulary), the following mappings are used:

COBOL encoded numeric can be "type": "string" with additional "contentEncoding" details.

The "contentEncoding" describes COBOL Packed Decimal and Binary as strings of bytes.

A "cobol" keyword gets Usage and Picture values required to decode EBCDIC.

A "conversion" keyword converts to a more useful Python type from raw file strings.

Here’s an example:

{"title": "SOME-FIELD",
 "$anchor": "SOME-FIELD",
 "cobol": "05 SOME-FIELD USAGE COMP-3 PIC S999V99",

 "type": "string",
 "contentEncoding": "packed-decimal",
 "conversion": "decimal"
}

The title, anchor, and cobol are defined separately. This function provides the type, encoding, and conversion.
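To see why a packed-decimal field is described as a string of bytes plus a conversion, here is a sketch of decoding the conventional COMP-3 layout: two digits per byte, with the final half-byte holding the sign. The function unpack_comp3 is hypothetical and is not the conversion function the module uses; the scale (digits after the V) is passed in by hand.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int) -> Decimal:
    """Decode packed-decimal bytes; scale is the number of digits after V."""
    nibbles: list[int] = []
    for byte in raw:
        nibbles.append(byte >> 4)    # high half-byte
        nibbles.append(byte & 0x0F)  # low half-byte
    *digits, sign = nibbles
    value = int("".join(str(d) for d in digits))
    if sign in (0x0B, 0x0D):  # conventional negative sign nibbles
        value = -value
    return Decimal(value).scaleb(-scale)


# PIC S999V99 COMP-3 holds five digits plus a sign in three bytes.
print(unpack_comp3(b"\x12\x34\x5c", scale=2))
```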

Other "contentEncoding" values include “bigendian-int”, “bigendian-float”, and “bigendian-double”, as well as “CP037” or “EBCDIC” to decode ordinary strings from EBCDIC to native text.

  • If USAGE DISPLAY and PIC has only SVP9: zoned decimal: “type”: “string”, “contentEncoding”: “cp037”, “conversion”: “decimal”.

  • If USAGE DISPLAY: “string”, “contentEncoding”: “cp037”.

  • If USAGE COMP-3, COMPUTATIONAL-3, PACKED-DECIMAL: “type”: “string”, “contentEncoding”: “packed-decimal”, “conversion”: “decimal”.

  • If USAGE COMP-4, COMPUTATIONAL-4, COMP, COMPUTATIONAL, BINARY: “integer”, “contentEncoding”: “bigendian-int”.

  • If USAGE COMP-1, COMPUTATIONAL-1, COMP-2, COMPUTATIONAL-2: “number”, “contentEncoding”: “bigendian-float” or “bigendian-double”.
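
The USAGE rules above amount to a lookup, sketched here as a plain function. The name sketch_json_type and its (usage, picture) parameters are assumptions; the real method takes a DDE node. Pairing COMP-1 with “bigendian-float” and COMP-2 with “bigendian-double” follows the usual single and double precision meaning of those usages.

```python
def sketch_json_type(usage: str, picture: str = "") -> dict[str, str]:
    """Return the type, encoding, and conversion for an elementary item."""
    usage = usage.upper()
    # "Only SVP9", also allowing repeat counts such as 9(5).
    numeric = bool(picture) and set(picture.upper()) <= set("SVP9()0123456789")
    if usage in {"COMP-3", "COMPUTATIONAL-3", "PACKED-DECIMAL"}:
        return {"type": "string", "contentEncoding": "packed-decimal",
                "conversion": "decimal"}
    if usage in {"COMP-4", "COMPUTATIONAL-4", "COMP", "COMPUTATIONAL", "BINARY"}:
        return {"type": "integer", "contentEncoding": "bigendian-int"}
    if usage in {"COMP-1", "COMPUTATIONAL-1"}:
        return {"type": "number", "contentEncoding": "bigendian-float"}
    if usage in {"COMP-2", "COMPUTATIONAL-2"}:
        return {"type": "number", "contentEncoding": "bigendian-double"}
    if usage == "DISPLAY" and numeric:
        # Zoned decimal: text digits that convert to a decimal value.
        return {"type": "string", "contentEncoding": "cp037",
                "conversion": "decimal"}
    if usage == "DISPLAY":
        return {"type": "string", "contentEncoding": "cp037"}
    raise ValueError(f"unknown USAGE {usage!r}")


print(sketch_json_type("COMP-3", "S999V99"))
```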

jsonschema(source: DDE) → None | bool | int | float | str | list[Any] | dict[str, Any]

class stingray.cobol_parser.JSONSchemaMakerExtendedVocabulary(unpacker: type[Unpacker[NDInstance]] = <class 'stingray.schema_instance.EBCDIC'>)

A JSONSchemaMaker with an extended, non-standard vocabulary.

VOCABULARY: None | bool | int | float | str | list[Any] | dict[str, Any] = {}

json_type(node: DDE) → None | bool | int | float | str | list[Any] | dict[str, Any]

If we’re using an extended vocabulary including decimal, the following mappings can be used.

  • If USAGE DISPLAY and PIC has only SVP9: zoned “decimal”.

  • If USAGE DISPLAY: “string”.

  • If USAGE COMP-3, COMPUTATIONAL-3, PACKED-DECIMAL: “decimal”.

  • If USAGE COMP-4, COMPUTATIONAL-4, COMP, COMPUTATIONAL, BINARY: “integer”.

  • If USAGE COMP-1, COMPUTATIONAL-1, COMP-2, COMPUTATIONAL-2: “number”.

Example:

{
    "type": json_type(node),
    "cobol": f"{node.level} {node.name} {node.source}",
}

Exceptions

exception stingray.cobol_parser.DesignError

This is a catastrophic design problem. A common root cause is a named REGEX capture clause that’s not properly handled by a class, method, or function.