
cobol_parser – COBOL DDE Parser and JSONSchema Builder.

Translate COBOL to JSON Schema. This involves the following kinds of transformations:

  • Group level items become “type”: “object”.

  • Elementary items become one of the atomic types, “string”, “integer”, “number”. If an extended vocabulary is used, then “decimal” can be used, also.

  • Occurs items become "type": "array". There are additional special cases.

    • An item with both OCCURS and a PICTURE becomes an anonymous array that contains the elementary item.

    • OCCURS DEPENDING ON builds a "$ref": "#name" to a "$anchor": "name" item in the schema.

  • REDEFINES refers to another item under this parent. While this is similar to a “oneOf” definition, it’s a bit more complex because the alternatives each have separate names. The structure is not simply {"name": {"type": {"oneOf": [base-A, redefine-B, redefine-C, etc.]}}}. The “redefines” property is effectively anonymous and each of the subtypes has a distinct name. It’s {"redefine-A-B-C": {"type": {"oneOf": [{"type": "object", "properties": {"A": base}}, {"type": "object", "properties": {"B": redefined}}, etc.]}}}. This is cumbersome, but is required to capture the COBOL semantics accurately in JSONSchema.
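The REDEFINES wrapping described above can be sketched in a few lines of Python. This is only an illustration: redefines_schema is a hypothetical helper, not part of cobol_parser, and placing "oneOf" directly under the union name is an assumption about the emitted form.

```python
def redefines_schema(alternatives: dict[str, dict]) -> dict:
    """Wrap named alternatives in one anonymous union (hypothetical helper).

    Each alternative becomes its own single-property object, so the
    distinct COBOL names survive inside the oneOf.
    """
    union_name = "redefine-" + "-".join(alternatives)
    return {
        union_name: {
            "oneOf": [
                {"type": "object", "properties": {name: schema}}
                for name, schema in alternatives.items()
            ]
        }
    }


example = redefines_schema(
    {"A": {"type": "string"}, "B": {"type": "integer"}}
)
print(example)
```

The union name is built from the member names only so that it is predictable in the example; the real naming rule may differ.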

We require some extensions or adaptations to cover two COBOL issues:

  • COBOL encoded data (Packed-Decimal, Binary, etc.). JSON Schema presumes delimited files with a parser’s conversions of data. For COBOL, the parsing is driven from the JSON Schema, therefore additional details are required. This includes the contentEncoding and a conversion function.

  • Occurs Depending On reference. JSON Schema limits the maxItems to an unsigned integer. We have to provide an alternative keyword for this.

It is also handy to have a “cobol” keyword with the original source text.

Because COBOL flattens the namespace of records, we define an $anchor for each individual field to make them easier to search for.


We use a (long) regular expression to parse the various clauses. This defines the entire DDE syntax.

Beyond the essential language syntax, there’s a “reference format” for source code. For this format, we need to remove positions 1-6 and 73-80. Position 7 may contain a comment indicator, “*”, or a continuation character, “-”. See

COBOL Language

A COBOL “Copybook” is a group-level DDE. See

There are three formats for DDE’s. We only really care about one of them.

  • Format 1 is the useful DDE level numbers 01 to 49 and 77.

  • Format 2 is a RENAMES clause, level 66. We don’t support this.

  • Format 3 is a CONDITION, level 88. This is a kind of enumeration of values; we tolerate it, but don’t do anything with it.

Here’s the railroad diagram for each sentence. Clauses after level-number and data-name-1 can appear in any order.

   level-number  [data-name-1]  [redefines-clause]

   [blank-when-zero-clause]  [external-clause]

   [global-clause]  [group-usage-clause]

   [justified-clause]  [occurs-clause]

   [picture-clause]  [sign-clause]

   [synchronized-clause]  [usage-clause]

   [value-clause]  [date-format-clause]

A separator period occurs at the end of the sentence. (It’s described elsewhere in the COBOL language reference.)

For simple examples, see

For comprehensive, complex examples, see

These cover a large number of COBOL-to-XML cases.

reference_format function

stingray.cobol_parser.reference_format(source: TextIO, replacing: list[tuple[str, str]] | None = None) → Iterator[str]

Extract source from files that have sequence numbers in 1-6, indicator in 7, and code in 8-72. Zero-based, these slices are [0:6], [6], and [7:72].

This can be extended to handle COPY statements that include other copybooks into a copybook.

Parameters:

  • source – The source file.

  • replacing – A sequence of two-tuples with ("'old'", "new") strings. The apostrophes in the old string are required so that it matches the apostrophes present in the COBOL source.

Yields:

strings with the sequence and indicator removed.
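The column slicing described above can be illustrated with a minimal sketch. It is not the reference_format implementation: strip_reference_format is a hypothetical name, the replacing argument is left out, and continuation lines (indicator "-") are passed through without being joined.

```python
from typing import Iterable, Iterator


def strip_reference_format(lines: Iterable[str]) -> Iterator[str]:
    """Yield the code area (columns 8-72) of each non-comment line."""
    for line in lines:
        # Pad short lines so the fixed-column slices are always valid.
        padded = line.rstrip("\n").ljust(72)
        indicator, code = padded[6], padded[7:72]
        if indicator in "*/":  # "*" and "/" both mark comment lines
            continue
        yield code.rstrip()


source = [
    "000100 01  REC.".ljust(72) + "IDENT001\n",
    "000200* a comment\n",
    "000300     05  A PIC X(5).\n",
]
for text in strip_reference_format(source):
    print(repr(text))
```

Padding every line to 72 characters keeps the slices simple, at the cost of a final rstrip() to remove the padding again.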


stingray.cobol_parser.dde_sentences(source: Iterable[str]) → Iterator[Sequence[str]]

Decompose the source into separate sentences by looking for the trailing period-space. The pattern will produce a sequence of (level, source text) 2-tuples. Since we simply collect all the matching groups, it’s technically a Sequence[str].
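As a rough illustration of splitting on the trailing period-space, the sketch below uses a much shorter pattern than the one the module relies on. SENTENCE and split_sentences are hypothetical names, and the pattern makes no attempt to handle a period followed by a space inside a quoted VALUE literal.

```python
import re
from typing import Iterable, Iterator

# Level number, then the clause text, lazily, up to a period that is
# followed by whitespace or the end of the text. A period inside a
# picture such as 999.99 is followed by a digit, so it does not match.
SENTENCE = re.compile(r"\s*(\d\d)\s+(.*?)\.(?:\s+|$)", re.DOTALL)


def split_sentences(lines: Iterable[str]) -> Iterator[tuple[str, ...]]:
    """Yield (level, source text) groups, one per DDE sentence."""
    text = " ".join(line.strip() for line in lines)
    for match in SENTENCE.finditer(text):
        yield match.groups()


copybook = [
    "01  REC.",
    "    05  AMT PIC 999.99",
    "        USAGE DISPLAY.",
    "    05  B PIC X.",
]
for sentence in split_sentences(copybook):
    print(sentence)
```

Because the result comes from match.groups(), each item is a tuple of strings, which is why the documented return type is a Sequence[str] rather than a dedicated two-tuple type.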

stingray.cobol_parser.expand_repeat(group_dict: dict[str, str]) → dict[str, str]

Replace {“repeat”: “x(y)”} with {“digit”: “xxx…x”}

>>> expand_repeat({'repeat': '9(5)'})
{'digit': '99999'}
>>> expand_repeat({'repeat': '9(0005)'})
{'digit': '99999'}

stingray.cobol_parser.pass_non_empty(group_dict: dict[str, str]) → dict[str, str]

Pass dictionary items with non-empty values; reject items with empty values.

>>> pass_non_empty({"a": "b", "empty": None})
{'a': 'b'}

stingray.cobol_parser.normalize_picture(source: str) → list[dict[str, str]]

Parse PICTURE clause into component pieces to make it easier to work with. This breaks down a complex mask into individual pieces.

>>> normalize_picture("9(5)")
[{'digit': '99999'}]
>>> normalize_picture("S9(5)V99")
[{'sign': 'S'}, {'digit': '99999'}, {'decimal': 'V'}, {'digit': '99'}]
>>> normalize_picture("S9(0005)V9(0002)")
[{'sign': 'S'}, {'digit': '99999'}, {'decimal': 'V'}, {'digit': '99'}]

stingray.cobol_parser.clause_dict(source: str) → dict[str, str | list[dict[str, str]]]

Expand a COBOL DDE sentence into a dict of clauses and values. This tends to preserve much (but not all) of the source syntax.

  1. Non-space separators (, or ;) are dropped.

  2. Some productions don’t have all the values captured. The ASCENDING/DESCENDING KEY options in OCCURS, for example, are stripped away.

High Level Parsing

stingray.cobol_parser.structure(sentences: Iterable[Sequence[str]]) → list[DDE]

Create a list of DDE trees from a sequence of lines. Each DDE contains zero or more children.

We update the start of a REDEFINES union. We don’t know X will be redefined until we encounter “Y REDEFINES X”.

The sentence regular expression produces two-tuples. Since we use the simple groups() function, however, it’s technically a Sequence[str].
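One common way to turn level numbers into a tree is a stack of open ancestors. The sketch below shows that general technique under simplifying assumptions; Node and build_tree are hypothetical names, and the special handling of REDEFINES and of levels 66, 77, and 88 is not attempted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    level: int
    name: str
    children: list["Node"] = field(default_factory=list)


def build_tree(items: list[tuple[str, str]]) -> list[Node]:
    """Nest (level, name) pairs by comparing level numbers."""
    roots: list[Node] = []
    stack: list[Node] = []  # currently open ancestors, outermost first
    for level_text, name in items:
        node = Node(int(level_text), name)
        # Close every open item that is not a proper ancestor of this one.
        while stack and stack[-1].level >= node.level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


tree = build_tree(
    [("01", "REC"), ("05", "A"), ("10", "A1"), ("10", "A2"), ("05", "B")]
)
print([child.name for child in tree[0].children])
```

The comparison uses only relative order, so copybooks that number levels 01, 03, 05 nest the same way as ones that use 01, 05, 10.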

stingray.cobol_parser.schema_iter(source: TextIO, deformat: Callable[[TextIO, list[tuple[str, str]] | None], Iterator[str]] = <function reference_format>) → Iterator[None | bool | int | float | str | list[Any] | dict[str, Any]]

DDE Class

class stingray.cobol_parser.DDE(*sentence: str, clauses: dict[str, Any] | None = None)

An instance of a COBOL DDE.

For name == “FILLER”, this assigns a unique internal name. A class-level counter is used.

Note that level of “01” resets the counter.

append(child: DDE) → None
static display(node: DDE, indent: int = 0) → None
filler_count = 0
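
The FILLER numbering rule can be sketched with a class-level counter. Field is a hypothetical stand-in for DDE, and the "FILLER-n" name format is an assumption made for the example; only the counter and the reset at level 01 come from the description above.

```python
class Field:
    filler_count = 0  # class-level counter shared by all instances

    def __init__(self, level: str, name: str) -> None:
        if level == "01":
            # A new record starts the FILLER numbering over.
            Field.filler_count = 0
        if name == "FILLER":
            Field.filler_count += 1
            name = f"FILLER-{Field.filler_count}"  # assumed name format
        self.level = level
        self.name = name


names = [
    Field("01", "REC").name,
    Field("05", "FILLER").name,
    Field("05", "FILLER").name,
    Field("01", "OTHER").name,
    Field("05", "FILLER").name,
]
print(names)
```

Because the counter lives on the class, the unique names depend on the order in which items are created, so items should be built in source order.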

JSONSchemaMaker Class

class stingray.cobol_parser.JSONSchemaMaker(unpacker: type[~stingray.schema_instance.Unpacker[~stingray.schema_instance.NDInstance]] = <class 'stingray.schema_instance.EBCDIC'>)

Translate COBOL DDE to JSONSchema.


  • REDEFINES becomes a oneOf with the alternatives.

  • OCCURS DEPENDING ON uses a maxItemsDependsOn vocabulary extension.

COBOL “flattens” the namespace so an elementary name implies the path to that name. This is done with the “$anchor” keyword to mark the visible names.

The default is EBCDIC encodings.

build_json_schema(node: DDE, path: tuple[str, ...] = (), ignore_redefines: bool = False) → None | bool | int | float | str | list[Any] | dict[str, Any]

Emit a JSON schema that reflects a COBOL DDE and all nested DDE’s within it.

Computes maxLength and minLength from the "contentEncoding" and "cobol" fields. Uses the supplied schema_instance.Unpacker.

  • For contentEncoding reflecting EBCDIC, this will use estruct.calcsize().

  • For contentEncoding reflecting ASCII or Unicode, this will use struct.calcsize().
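
The size arithmetic behind a length computation can be shown on its own. The sketch below does not call estruct.calcsize() or struct.calcsize(); storage_size is a hypothetical helper that applies the usual IBM storage rules to a picture whose repeat counts are already expanded, and it ignores SIGN SEPARATE and editing characters.

```python
def storage_size(expanded_picture: str, usage: str = "DISPLAY") -> int:
    """Bytes occupied by an elementary item (simplified rules)."""
    digits = expanded_picture.count("9")
    if usage in {"COMP-3", "PACKED-DECIMAL"}:
        # Two digits per byte, plus a half byte for the sign.
        return digits // 2 + 1
    if usage in {"COMP", "COMP-4", "BINARY"}:
        # Halfword, fullword, or doubleword, by digit count.
        return 2 if digits <= 4 else 4 if digits <= 9 else 8
    # DISPLAY: one byte per position; S, V, and P occupy no storage.
    return sum(1 for ch in expanded_picture if ch not in "SVP")


print(storage_size("S99999V99", "COMP-3"))
print(storage_size("S9999", "COMP"))
print(storage_size("S99999V99"))
print(storage_size("XXXXX"))
```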

json_type(node: DDE) → None | bool | int | float | str | list[Any] | dict[str, Any]

The JSON Schema type for an elementary (or atomic) field.

If we’re using a standard JSON Schema validator (without decimal as part of the vocabulary), the following mappings are used:

COBOL encoded numeric can be "type": "string" with additional "contentEncoding" details.

The "contentEncoding" describes COBOL Packed Decimal and Binary as strings of bytes.

A "cobol" keyword gets Usage and Picture values required to decode EBCDIC.

A "conversion" keyword converts to a more useful Python type from raw file strings.

Here’s an example:

{"title": "SOME-FIELD",
 "$anchor": "SOME-FIELD",
 "cobol": "05 SOME-FIELD USAGE COMP-3 PIC S999V99",
 "type": "string",
 "contentEncoding": "packed-decimal",
 "conversion": "decimal"
}
The title, anchor, and cobol are defined separately. This function provides the type, encoding, and conversion.
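
To make the "packed-decimal" encoding and "decimal" conversion concrete, here is a minimal decoding sketch. unpack_comp3 is a hypothetical function, not the library's Unpacker; it assumes well-formed input, and the scale (digits after the implied V) is passed in by the caller rather than read from the "cobol" keyword.

```python
from decimal import Decimal


def unpack_comp3(data: bytes, scale: int = 0) -> Decimal:
    """Decode packed-decimal bytes: two digits per byte, sign in the last nibble."""
    nibbles: list[int] = []
    for byte in data:
        nibbles.extend((byte >> 4, byte & 0x0F))
    *digits, sign = nibbles
    value = Decimal(int("".join(str(d) for d in digits)))
    if sign in (0x0B, 0x0D):  # B and D mark negative values
        value = -value
    # Shift the implied decimal point left by `scale` digits.
    return value.scaleb(-scale)


# PIC S999V99 COMP-3 holds five digits in three bytes.
print(unpack_comp3(bytes([0x12, 0x34, 0x5C]), scale=2))
```

Returning a Decimal rather than a float keeps the value exact, which is the usual reason for a "decimal" conversion on currency fields.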

Other "contentEncoding" values include “bigendian-int”, “bigendian-float”, and “bigendian-double”, as well as “CP037” or “EBCDIC” to decode ordinary strings from EBCDIC to native text.

  • If USAGE DISPLAY and PIC has only SVP9 (zoned decimal): "type": "string", "contentEncoding": "cp037", "conversion": "decimal"

  • If USAGE DISPLAY: "type": "string", "contentEncoding": "cp037"

  • If USAGE COMP-3, COMPUTATIONAL-3, PACKED-DECIMAL: "type": "string", "contentEncoding": "packed-decimal", "conversion": "decimal"

  • If USAGE COMP-4, COMPUTATIONAL-4, COMP, COMPUTATIONAL, BINARY: "type": "integer", "contentEncoding": "bigendian-int"

  • If USAGE COMP-1, COMPUTATIONAL-1, COMP-2, COMPUTATIONAL-2: "type": "number", "contentEncoding": "bigendian-float" or "bigendian-double"
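
The mapping above can be written as a small lookup function. This is a sketch of the rules as listed, not the json_type method itself: json_type_sketch is a hypothetical name, it takes the usage and picture as plain strings instead of a DDE node, and pairing COMP-1 with float and COMP-2 with double follows the usual COBOL definitions of single and double precision.

```python
import re


def json_type_sketch(usage: str, picture: str) -> dict[str, str]:
    """Map a USAGE and PICTURE to type, contentEncoding, and conversion."""
    usage = usage.upper()
    if usage in {"COMP-3", "COMPUTATIONAL-3", "PACKED-DECIMAL"}:
        return {"type": "string", "contentEncoding": "packed-decimal",
                "conversion": "decimal"}
    if usage in {"COMP-4", "COMPUTATIONAL-4", "COMP", "COMPUTATIONAL", "BINARY"}:
        return {"type": "integer", "contentEncoding": "bigendian-int"}
    if usage in {"COMP-1", "COMPUTATIONAL-1"}:
        return {"type": "number", "contentEncoding": "bigendian-float"}
    if usage in {"COMP-2", "COMPUTATIONAL-2"}:
        return {"type": "number", "contentEncoding": "bigendian-double"}
    # USAGE DISPLAY: a picture made only of S, V, P, 9 (and repeat
    # counts) is zoned decimal; anything else is ordinary text.
    if re.fullmatch(r"[SVP9()0-9]+", picture.upper()):
        return {"type": "string", "contentEncoding": "cp037",
                "conversion": "decimal"}
    return {"type": "string", "contentEncoding": "cp037"}


print(json_type_sketch("COMP-3", "S999V99"))
print(json_type_sketch("DISPLAY", "X(10)"))
```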

jsonschema(source: DDE) → None | bool | int | float | str | list[Any] | dict[str, Any]

class stingray.cobol_parser.JSONSchemaMakerExtendedVocabulary(unpacker: type[~stingray.schema_instance.Unpacker[~stingray.schema_instance.NDInstance]] = <class 'stingray.schema_instance.EBCDIC'>)

A JSONSchemaMaker with an extended, non-standard vocabulary.

VOCABULARY: None | bool | int | float | str | list[Any] | dict[str, Any] = {}
json_type(node: DDE) → None | bool | int | float | str | list[Any] | dict[str, Any]

If we’re using an extended vocabulary including decimal, the following mappings can be used.



    "type": json_type(node),
    "cobol": f"{node.level} {} {node.source}",


exception stingray.cobol_parser.DesignError

This is a catastrophic design problem. A common root cause is a named REGEX capture clause that’s not properly handled by a class, method, or function.