stingray.workbook

workbook – Row, Sheet, and Workbook Definitions.

The Stingray Reader treats all workbooks alike. This means wrapping CSV files, Numbers_13, ODS, and XLSX files with a Facade of Workbook, Sheet, and Row These classes describe the relevant features of these file formats.

This includes the following built-in workbook formats:

  • CSV via built-in csv module.

  • NDJSON a/k/a JSON Newline via the built-in json module.

  • COBOL files in EBCDIC via the stingray.estruct module.

  • COBOL files in native Text

  • TAB via CSV reader with dialetc options

  • Pure Text via External Schema. See workbook.simple and the metadata in simple.csv.

Other formats are in the implementations module.

COBOL File Implementation – EBCDIC and Text

The EBCDIC files require specific physical “Record Format” (RECFM) assistance. These classes define a number of Z/OS RECFM conversion. The estruct module has classes for four actual RECFM’s plus an additional special case.

  • estruct.RECFM_F - Fixed.

  • estruct.RECFM_FB - Fixed Blocked.

  • estruct.RECFM_V - Variable, data has RDW preceeding each record.

  • estruct.RECFM_VB - Variable Blocked, data must have BDW and RDW words.

  • estruct.RECFM_N - Variable, but no BDW or RDW words. This involves some buffer management magic to recover the records properly. This is required to handle Occurs Depending On cases where there’s no V or VB header. This requires the consumer of bytes to announce how many bytes where needed so the reader can advance an appropriate amount.

The definitions are mostly protocols to handle non-delimited (i.e., COBOL files), delimited files, and workbooks. This depends on the lower-level Schema, Instance, Nav, and Location constructs.

@startuml
    abstract class Schema
    abstract class Unpacker
    class Location
    Location --> Schema
    'abstract class NDInstance
    'Location -> NDInstance
    class LocationMaker {
        from_instance(instance): Location
    }
    Unpacker --> LocationMaker
    LocationMaker --> "n" Location : creates
    /'Details...
    abstract class Nav
    class NDNav
    class DNav
    class WBNav
    Nav <|-- NDNav
    Nav <|-- DNav
    Nav <|-- WBNav

    Unpacker --> NDNav : creates
    Unpacker --> DNav : creates

    NDNav --> Location

    class JSON
    DNav --> JSON
    '/

    class File
    abstract class Workbook ##[bold]blue

    Workbook --> Unpacker
    Workbook --> File : opens
    Unpacker --> File : reads

    class Sheet ##[bold]blue {
        row_iter(): Row
    }
    Workbook *-- "1:n" Sheet
    Sheet --> Schema
    class Row  ##[bold]blue {
        name(): Any
    }
    Sheet *-- "n" Row
    Row --> NDNav : "[non-delimited]"
    Row --> DNav : "[delimited]"
    Row --> WBNav : "[workbook]"
    class EmbeddedSchemaSheet
    class ExternalSchemaSheet
    Sheet <|-- EmbeddedSchemaSheet
    Sheet <|-- ExternalSchemaSheet
    class SchemaLoader
    EmbeddedSchemaSheet --> SchemaLoader
    ExternalSchemaSheet --> SchemaLoader
    SchemaLoader --> Schema : creates
@enduml

Here are some additional side-bar considerations for other formats that depend on external modules.

  • CSV is built-in. We permit kwargs to provide additional dialect details.

  • JSON is built-in. We’ll treate newline delimited JSON like CSV.

  • XLS is a weird proprietary thing. The xlrd project (https://pypi.org/project/xlrd/) supports it.

  • ODS and XLSX are XML files. Incremental parsing is helpful here because they can be large. See https://openpyxl.readthedocs.io/en/stable/ and http://docs.pyexcel.org/en/v0.0.6-rc2/.

  • Numbers uses protobuf. The legacy version of Stingray Reader had protobuf definitions which appear to work and a snappy decoder. See https://pypi.org/project/numbers-parser/ for the better solution currently in use.

  • TOML requires an external library. An Unpacker subclass can decompose a “one big list” TOML document into individual rows.

  • YAML requires an external library. We’ll use the iterative parser as a default. An Unpacker subclass can decompose a “one big list” YAML document into individual rows.

  • XML is built-in. A schema can drive navgiation through the XML document, name the various tags of interest. Other tags which may be present would be ignored.

A given workbook has two possible sources for a schema: internal and external. An internal schema might be the first row or it might require more sophisticated parsing. An external schema might be hard-coded in the application, or might be a separate document with its own meta-schema.

Generally, the schema applies to a sheet (or a Table in a Workspace for Numbers.)

API Concept

A “fluent” interface is used to open a sheet, extract a header, and process rows. The alternative is to open a sheet, apply the externally loaded schema, and use this to process the sheet’s rows. Once a sheet has been bound to a schema, the rows can be processed.

Internal, Embedded Schema:

>>> from stingray import open_workbook, HeadingRowSchemaLoader, Row
    >>> from pathlib import Path
    >>> import os
    >>> from typing import Iterable
    >>> source_path = Path(os.environ.get("SAMPLES", "sample")) / "Anscombe_quartet_data.csv"

    >>> def process_sheet(rows: Iterable[Row]) -> None:
    ...     for row in rows:
    ...         row.dump()
    ...         break  # Stop after 1 row.

    >>> with open_workbook(source_path) as workbook:
    ...    sheet = workbook.sheet('Sheet1')
    ...    _ = sheet.set_schema_loader(HeadingRowSchemaLoader())
    ...    process_sheet(sheet.rows())
    Field                                                 Value
    object
      x123                                                '10.0'
      y1                                                  '8.04'
      y2                                                  '9.14'
      y3                                                  '7.46'
      x4                                                  '8.0'
      y4                                                  '6.58'


In this case, the :py
>>> from pathlib import Path
>>> import os
>>> from typing import Iterable
>>> source_path = Path(os.environ.get("SAMPLES", "sample")) / "Anscombe_quartet_data.csv"

>>> def process_sheet(rows: Iterable[Row]) -> None:
...     for row in rows:
...         row.dump()
...         break  # Stop after 1 row.

>>> with open_workbook(source_path) as workbook:
...    sheet = workbook.sheet('Sheet1')
...    _ = sheet.set_schema_loader(HeadingRowSchemaLoader())
...    process_sheet(sheet.rows())
Field                                                 Value
object
  x123                                                '10.0'
  y1                                                  '8.04'
  y2                                                  '9.14'
  y3                                                  '7.46'
  x4                                                  '8.0'
  y4                                                  '6.58'

In this case, the rows() method of the Sheet instance will exclude the header rows consumed by the HeadingRowSchemaLoader.

External Schema:

>>> from stingray import open_workbook, ExternalSchemaLoader, Row, SchemaMaker
>>> from pathlib import Path
>>> import os
>>> from typing import Iterable
>>> source_path = Path(os.environ.get("SAMPLES", "sample")) / "Anscombe_quartet_data.csv"
>>> schema_path = Path(os.environ.get("SAMPLES", "sample")) / "Anscombe_schema.csv"

>>> with open_workbook(schema_path) as metaschema_workbook:
...     schema_sheet = metaschema_workbook.sheet('Sheet1')
...     _ = schema_sheet.set_schema(SchemaMaker().from_json(ExternalSchemaLoader.META_SCHEMA))
...     json_schema = ExternalSchemaLoader(schema_sheet).load()
...     schema = SchemaMaker().from_json(json_schema)

>>> with open_workbook(source_path) as workbook:
...     sheet = workbook.sheet('Sheet1').set_schema(schema)
...     process_sheet(sheet.rows())
Field                                                 Value
object
  x123                                                'x123'
  y1                                                  'y1'
  y2                                                  'y2'
  y3                                                  'y3'
  x4                                                  'x4'
  y4                                                  'y4'

In this case, the :py
>>> from pathlib import Path
>>> import os
>>> from typing import Iterable
>>> source_path = Path(os.environ.get("SAMPLES", "sample")) / "Anscombe_quartet_data.csv"
>>> schema_path = Path(os.environ.get("SAMPLES", "sample")) / "Anscombe_schema.csv"

>>> with open_workbook(schema_path) as metaschema_workbook:
...     schema_sheet = metaschema_workbook.sheet('Sheet1')
...     _ = schema_sheet.set_schema(SchemaMaker().from_json(ExternalSchemaLoader.META_SCHEMA))
...     json_schema = ExternalSchemaLoader(schema_sheet).load()
...     schema = SchemaMaker().from_json(json_schema)

>>> with open_workbook(source_path) as workbook:
...     sheet = workbook.sheet('Sheet1').set_schema(schema)
...     process_sheet(sheet.rows())
Field                                                 Value
object
  x123                                                'x123'
  y1                                                  'y1'
  y2                                                  'y2'
  y3                                                  'y3'
  x4                                                  'x4'
  y4                                                  'y4'

In this case, the rows() method of the Sheet instance will include all rows. This means the header row is treated like data when an external schema is applied.

The process_sheet() method:

def process_sheet(rows: Iterator[Row]) -> None:
    for row in rows:
        print(f'{row.name("field").value()=}')

The name() method returns a Nav object. This has a value() method that will extract the value for a given attribute.

Sheet and Row

The Sheet class contained metadata about the sheet, and a row iterator. It’s generally used like this:

def process_sheet(sheet: Sheet) -> None:
    for row in sheet.rows():
        process_row(row)

A Row contains an instance that’s bound to the :py:class:Unpacker and the Schema. This will build Nav objects for navigation. Where necessary, this may involve creating Location objects as part of NDNav navigation.

open_workbook function

stingray.workbook.open_workbook(source: Path) Workbook[NDInstance] | Workbook[WBInstance] | Workbook[DInstance]

CSV Implementation

name_cleaner function

stingray.workbook.name_cleaner(name: str) str

JSON Schema names for anchors and properties are validated against patterns defined in places like core/$defs/anchorString.

Anchor names must match the following pattern.

r"^[A-Za-z_][-A-Za-z0-9._]*$"

This function replaces illegal characters with “_”

>>> name_cleaner("TYPICAL-COBOL")
'TYPICAL-COBOL'
>>> name_cleaner("common_python")
'common_python'
>>> name_cleaner("Not a 'good' name")
'Not_a_good_name'
>>> name_cleaner("!@#$%^")
'_'
>>> name_cleaner('')
''

Core Workbook, Sheet, and Row Model

class stingray.workbook.Workbook(name: Path | str, **kwargs: Any)

The Facade over all workbooks and COBOL Files.

This provides uniform access to a variety of physical formats.

Each subclass binds in an Unpacker and the unique features of that Unpacker.

Generally, a Workbook can be opened one of three ways

  • Specifically. wb = CSV_Workbook(path). This expects an explicit wb.close() to release resources.

  • As a context manager with CSV_Workbook(path) as wb:. The context manager release resources.

  • Via the open_workbook() function.

close() None

Closes the workbook instance.

sheet(name: str) Sheet[Instance]

Builds a Sheet instance for the given sheet name.

Parameters:

name – Name of the sheet

Returns:

Sheet instance

sheet_iter() Iterator[Sheet[Instance]]

Yields all Sheet instances in this Workbook.

Yields:

Sheet instance

class stingray.workbook.Sheet(workbook: Workbook[Instance], sheet_name: str)

A Sheet of a Workbook. This can also be a table on a page of a Numbers workbook.

It needs a schema binding to describe the content of each row.

Either a Schema is loaded from an external source. OR an internal Schema loader is used to extract a schema from the first few rows of the sheet.

These can be built manually, but the expectation is that they’re created by the containing Workbook object.

row_iter() Iterator[Row[Instance]]

Extracts rows from the unpacker’s instance_iter() method. Applies the schema to create Row instances.

This leverages the two-phase processing of HeadingRowSchemaLoader class. The initial rows pass through the loader’s header() method. After the header has built a schema, the rows are consumed via the header() of the loader.

Yields:

Row instances.

rows() Iterator[Row[Instance]]

Extracts rows from the unpacker’s instance_iter() method. Applies the schema to create Row instances.

This leverages the two-phase processing of HeadingRowSchemaLoader class. The initial rows pass through the loader’s header() method. After the header has built a schema, the rows are consumed via the header() of the loader.

Yields:

Row instances.

set_schema(schema: Schema) Sheet[Instance]

Provides an externally-supplied schema.

This also sets a “do nothing” schema loader to assure that all rows are processed.

Parameters:

schema – The Schema to use.

set_schema_loader(loader: SchemaLoader[Instance]) Sheet[Instance]

Provides a SchemaLoader that builds a schema by extracting some of the Instances.

Parameters:

loader – A subclass of HeadingRowSchemaLoader that will load schema and filter rows.

class stingray.workbook.Row(sheet: Sheet[Instance], instance: Instance)

Wrapper around a Nav object bound to an Instance.

Note that both Instance and Sheet have the schema. While can be excessive, it’s also possible the sheet suffers from many schema and the assignment is on an instance-by-instance basis.

dump() None

Print a formatted dump of this row.

get(name: str) Nav

Creates a helper Nav and navigates to the given name.

Parameters:

name – An $anchor name.

Returns:

A Nav subclass appropriate to the Workbook’s Unpacker.

name(name: str) Nav

Creates a helper Nav and navigates to the given name.

Parameters:

name – An $anchor name.

Returns:

A Nav subclass appropriate to the Workbook’s Unpacker.

property schema: Schema

Extracts the schema from the associated sheet.

Returns:

Schema instance.

values() list[Any]

Creates a helper Nav and returns the instances values for each name in the schema.

Returns:

A list of cell values.

Schema Loaders

class stingray.workbook.SchemaLoader

Loads a schema.

There are two general cases:

  • Load an external schema from a Workbook.

  • Extract the schema from within a sheet of a workbook. Often this is a heading row, but it can be more complex.

The first case is simpler. The ExternalSchemaLoader subclass implements a with a ExternalSchemaLoader.load() method. This builds a JSONSchema from the rows it finds. All other external loaders should subclass this.

The HeadingRowSchemaLoader subclass extracts the top row of a sheet and creates a JSONSchema. The rows are presented as an iterable that’s comsumed in two phases:

  1. The header() method consumes rows from an iterator until it has the required header.

  2. The body() method can consume rows from an iterator to filter empty rows, footers, or other irrelevancies from a workbook sheet. An alternative is to simply return the iterator if no filtering will be done.

This superclass specifically does nothing. This is used when an external schema has been bound to a Sheet and no futher processing is required. For example, when processing COBOL files where the schema is separate.

A schema is returned as a JSONSchema document. This is almost always converted to a Schema instance. If, however, changes need to be made, or the schema is used outside this module, it’s already in a standard format.

body(source: Iterator[Instance]) Iterator[Instance]

Consume the non-header rows, applying an additional filter criteria.

Parameters:

source – The iterator over instances. This is the same iterator provided to header(); it will be in the state left at the end of header() processing.

Returns:

Instance instances for each row after the header.

header(source: Iterator[Instance]) None | bool | int | float | str | list[Any] | dict[str, Any]

Consume the first row or rows to build a schema.

Parameters:

source – An iterator over instances.

Returns:

A Schema or None if no schema can be parsed.

class stingray.workbook.COBOLSchemaLoader(source: Path)

The most common case is a single COBOL Schema. For other, more complex situations, the single schema assumption may not be appropriate.

load() None | bool | int | float | str | list[Any] | dict[str, Any]

Loads the given COBOL schema from a copybook file.

If more than one are present, they’re saved as the schemas attribute.

The first 01 level record is returned as a JSONSchema.

Returns:

JSONSchema from a COBOL DDE.

class stingray.workbook.ExternalSchemaLoader(sheet: Sheet[Instance])

Read an external source of data to prepare a schema.

A default schema for extrnal metadata is ExternalSchemaLoader.META_SCHEMA. The odds of this being correct are low. The metaschema has three attributes:

  • name. This becomes a property name.

  • description. This becomes a property description.

  • dataType. This becomes a property type.

A subclass can do more sophisticated parsing of complex metadata. For example, COBOL DDE parsing, is an example of complex schema loading.

A subclass could also read JSONSchema directly, if that’s available.

META_SCHEMA = {'properties': {'dataType': {'description': 'field data type', 'position': 2, 'type': 'string'}, 'description': {'description': 'field description', 'position': 1, 'type': 'string'}, 'name': {'description': 'field name', 'position': 0, 'type': 'string'}}, 'title': 'generic meta schema for external schema documents', 'type': 'object'}
load() None | bool | int | float | str | list[Any] | dict[str, Any]

Loads the sheet’s rows and builds a JSONSchema.

Returns:

JSONSchema instance built from the sbeet.

class stingray.workbook.HeadingRowSchemaLoader

Create a schema from the first row of a workbook sheet. The “Instance” is expected to be a WBInstance, which is a Sequence of column values, parsed by a WBUnpacker.

header(source: Iterator[Instance]) None | bool | int | float | str | list[Any] | dict[str, Any]

Creates a JSONSchema from the first row of a sheet.

Parameters:

source – An iterator over instances.

Returns:

A Schema or None if no schema can be parsed.

COBOL Files

class stingray.workbook.COBOL_EBCDIC_File(name: str | Path, file_object: IO | None = None, recfm_class: Type[RECFM_Reader] | None = None, lrecl: int | None = None, **kwargs: str)

EBCDIC-encoded files. No newline delimiters are used.

The NDInstance is AnyStr, which is a union of str | bytes. This is narrower than that, and could be bytes only, not NDInstance.

close() None

Closes the workbook

sheet(name: str) Sheet[NDInstance]

Builds a Sheet instance for the given sheet name.

Parameters:

name – Name of the sheet (Not used)

Returns:

COBOL_EBCDIC_Sheet instance

sheet_iter() Iterator[Sheet[NDInstance]]

Yields all Sheet instances in this Workbook.

Yields:

COBOL_EBCDIC_Sheet instance

class stingray.workbook.COBOL_EBCDIC_Sheet(workbook: Workbook[Instance], sheet_name: str)

The single “Sheet” in a file. This is a container for the rows. It handles the RECFM and LRECL complexities of variable-length non-delimited records.

row_iter() Iterator[Row[NDInstance]]

Extracts rows from the unpacker. Applies the schema to create a Row instances.

Yields:

Row instances.

rows() Iterator[Row[NDInstance]]

Extracts rows from the unpacker. Applies the schema to create a Row instances.

Yields:

Row instances.

set_schema(schema: Schema) Sheet[NDInstance]

Sets the schema for this COBOL file.

If the workbook has an LRECL alread set, this is used to process records.

If the workbook did not have an LRECL set, the Schema is examined to see if an LRECL can be computed.

This extracts the LRECL eagerly to make the it visible as soon as possible.

Parameters:

schemaSchema instance built from the COBOL copybook.

class stingray.workbook.COBOL_Text_File(name: str | Path, file_object: IO | None = None, **kwargs: str)

Facade for COBOL data in native text with in a fixed format, newline delimited.

This uses the TextUnpacker unpacker.

close() None

Closes the Workbook.

CSV Workbooks

class stingray.workbook.CSVUnpacker

Upacker that wraps the csv module.

close() None

Close the CSV file.

instance_iter(sheet: str, **kwargs: Any) Iterator[WBInstance]

Yields rows from the CSV File.

Parameters:
  • sheet – name of the sheet (not used)

  • kwargs – Additional arguments provided to csv.reader().

Yields:

rows from the CSV file.

open(name: Path, file_object: IO | None = None) None

Opens a CSV file for unpacking.

Parameters:
  • name – Path to the CSV file.

  • file_object – Optional Open IO object. If omitted the name is opened.

sheet_iter() Iterator[str]

Yields all sheet names. For CSV files there’s only one sheet, and the name is “”.

Yields:

Literal[“”]

class stingray.workbook.CSV_Workbook(name: str | Path, file_object: IO | None = None, **kwargs: Any)

Facade for CSV and TAB Workbooks.

close() None

Closes the workbook.

JSON Workbooks

class stingray.workbook.JSONUnpacker

Unpacker that wraps the json module..

close() None

Closes the NDJSON file.

instance_iter(name: str, **kwargs: Any) Iterator[DInstance]

Yields items from the NDJSON File. Documents are delimited with newlines.

Parameters:
  • sheet – name of the sheet (not used)

  • kwargs – Additional arguments provided to json.loads().

Yields:

rows from the NDJSON file.

open(name: Path, file_object: IO | None = None, **kwargs: Any) None

Opens the NDJSON file.

Parameters:
  • name – Path to the file.

  • file_object – Optional Open IO object. If omitted the name is opened.

  • kwargs – KW arguments (not used)

sheet_iter() Iterator[str]

Yields all sheet names. For NDJSON files there’s only one sheet, and the name is “”.

Yields:

Literal[“”]

class stingray.workbook.JSON_Workbook(name: str | Path, file_object: IO | None = None, **kwargs: Any)

Facade for NDJSON Workbooks. Also called JSON Newline.

close() None

Closees the Workbook file.

Workbook File Registry

class stingray.workbook.WBFileRegistry

A global registry for Workbook classes and their associated file suffixes. It makes an open_workbook() function that can open a wide variety of file types.

>>> file_registry = WBFileRegistry()
>>> @file_registry.file_suffix(".xyz")
... class XYZ_Workbook(Workbook):
...     pass
>>> file_registry.suffix_map[".xyz"] == XYZ_Workbook
True
file_suffix(*name_list: str) Callable[[Type[Workbook[NDInstance] | Workbook[WBInstance] | Workbook[DInstance]]], Type[Workbook[NDInstance] | Workbook[WBInstance] | Workbook[DInstance]]]

Decorator to adds a file suffix and associated class to the registry.

Parameters:

name_list – String suffix or sequence of suffices.

Returns:

Decorator that’s applied to a class definition.

open_workbook(source: Path) Workbook[NDInstance] | Workbook[WBInstance] | Workbook[DInstance]

Opens an appropriate subclass of Workbook for the given file.

Parameters:

source – Path to the Workbook

Returns:

A Workbook subclass.

file_registry.file_suffix()

A decorator used to mark a class with the file extensions it handles.

Legacy API Concept

The version 4.0 concept for the API looked like this:

Internal, Embedded Schema:

from stingray.workbook import open_workbook, EmbeddedSchemaSheet, HeadingRowSchemaLoader

with open_workbook(path) as workbook:
    sheet = EmbeddedSchemaSheet(workbook, 'Sheet1', HeadingRowSchemaLoader)
    process_sheet(sheet)

External Schema:

from stingray.workbook import open_workbook, ExternalSchemaSheet, ExternalSchemaLoader

with open_workbook(path) as schema_wb:
    esl = ExternalSchemaLoader(schema_wb, sheet_name='Schema')
    schema = esl.load()
with open_workbook(path) as workbook:
    sheet = ExternalSchemaSheet(workbook, 'Sheet1', schema)
    process_sheet(sheet)

If necessary, the external schema can have a meta-schema. It may be necessary to define a conversion function to create a useful JSON Schema from a schema workbook.