stingray.workbook¶
workbook – Row, Sheet, and Workbook Definitions.
The Stingray Reader treats all workbooks alike.
This means wrapping CSV files, Numbers_13, ODS, and XLSX files
with a Facade of Workbook, Sheet, and Row.
These classes describe the relevant features of these file formats.
This includes the following built-in workbook formats:
- CSV via the built-in csv module.
- NDJSON (a/k/a JSON Newline) via the built-in json module.
- COBOL files in EBCDIC via the stingray.estruct module.
- COBOL files in native text.
- TAB via the CSV reader with dialect options.
- Pure text via an external schema. See workbook.simple and the metadata in simple.csv.
Other formats are in the implementations module.
COBOL File Implementation – EBCDIC and Text¶
The EBCDIC files require specific physical “Record Format” (RECFM) assistance.
These classes define a number of Z/OS RECFM conversions.
The estruct module has classes for four actual RECFM's plus an additional special case.
- estruct.RECFM_F - Fixed.
- estruct.RECFM_FB - Fixed Blocked.
- estruct.RECFM_V - Variable; data has an RDW preceding each record.
- estruct.RECFM_VB - Variable Blocked; data must have BDW and RDW words.
- estruct.RECFM_N - Variable, but with no BDW or RDW words. This involves some buffer management magic to recover the records properly. It is required to handle Occurs Depending On cases where there's no V or VB header, and it requires the consumer of bytes to announce how many bytes were needed so the reader can advance an appropriate amount.
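For EBCDIC data, the RECFM class and (where known) the LRECL are supplied when the file is opened. Here is a minimal sketch, assuming a fixed-blocked file with 80-byte records; the path and record length are hypothetical, and the full schema workflow is shown with COBOLSchemaLoader further below.
from pathlib import Path

from stingray.estruct import RECFM_FB
from stingray.workbook import COBOL_EBCDIC_File

# Hypothetical EBCDIC data file; RECFM_FB means fixed-length, blocked records.
data_path = Path("data/sample.dat")

with COBOL_EBCDIC_File(data_path, recfm_class=RECFM_FB, lrecl=80) as workbook:
    sheet = workbook.sheet("")
    # A schema built from the COBOL copybook must still be applied before
    # rows can be navigated; see the COBOLSchemaLoader example below.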
The definitions are mostly protocols to handle non-delimited (i.e., COBOL files), delimited files, and workbooks.
This depends on the lower-level Schema, Instance, Nav, and Location constructs.
Here are some additional side-bar considerations for other formats that depend on external modules.
- CSV is built-in. We permit kwargs to provide additional dialect details.
- JSON is built-in. We'll treat newline-delimited JSON like CSV.
- XLS is a weird proprietary thing. The xlrd project (https://pypi.org/project/xlrd/) supports it.
- ODS and XLSX are XML files. Incremental parsing is helpful here because they can be large. See https://openpyxl.readthedocs.io/en/stable/ and http://docs.pyexcel.org/en/v0.0.6-rc2/.
- Numbers uses protobuf. The legacy version of Stingray Reader had protobuf definitions which appear to work, and a snappy decoder. See https://pypi.org/project/numbers-parser/ for the better solution currently in use.
- TOML requires an external library. An Unpacker subclass can decompose a “one big list” TOML document into individual rows.
- YAML requires an external library. We'll use the iterative parser as a default. An Unpacker subclass can decompose a “one big list” YAML document into individual rows.
- XML is built-in. A schema can drive navigation through the XML document, naming the various tags of interest. Other tags which may be present would be ignored.
A given workbook has two possible sources for a schema: internal and external. An internal schema might be the first row or it might require more sophisticated parsing. An external schema might be hard-coded in the application, or might be a separate document with its own meta-schema.
Generally, the schema applies to a sheet (or, for Numbers, a Table in a Workspace).
API Concept¶
A “fluent” interface is used to open a sheet, extract a header, and process rows. The alternative is to open a sheet, apply the externally loaded schema, and use this to process the sheet’s rows. Once a sheet has been bound to a schema, the rows can be processed.
Internal, Embedded Schema:
>>> from stingray import open_workbook, HeadingRowSchemaLoader, Row
>>> from pathlib import Path
>>> import os
>>> from typing import Iterable
>>> source_path = Path(os.environ.get("SAMPLES", "sample")) / "Anscombe_quartet_data.csv"
>>> def process_sheet(rows: Iterable[Row]) -> None:
...     for row in rows:
...         row.dump()
...         break  # Stop after 1 row.
>>> with open_workbook(source_path) as workbook:
...     sheet = workbook.sheet('Sheet1')
...     _ = sheet.set_schema_loader(HeadingRowSchemaLoader())
...     process_sheet(sheet.rows())
Field Value
object
x123 '10.0'
y1 '8.04'
y2 '9.14'
y3 '7.46'
x4 '8.0'
y4 '6.58'
In this case, the rows() method of the Sheet instance will exclude the header rows consumed by the HeadingRowSchemaLoader.
External Schema:
>>> from stingray import open_workbook, ExternalSchemaLoader, Row, SchemaMaker
>>> from pathlib import Path
>>> import os
>>> from typing import Iterable
>>> source_path = Path(os.environ.get("SAMPLES", "sample")) / "Anscombe_quartet_data.csv"
>>> schema_path = Path(os.environ.get("SAMPLES", "sample")) / "Anscombe_schema.csv"
>>> with open_workbook(schema_path) as metaschema_workbook:
...     schema_sheet = metaschema_workbook.sheet('Sheet1')
...     _ = schema_sheet.set_schema(SchemaMaker().from_json(ExternalSchemaLoader.META_SCHEMA))
...     json_schema = ExternalSchemaLoader(schema_sheet).load()
...     schema = SchemaMaker().from_json(json_schema)
>>> with open_workbook(source_path) as workbook:
...     sheet = workbook.sheet('Sheet1').set_schema(schema)
...     process_sheet(sheet.rows())
Field Value
object
x123 'x123'
y1 'y1'
y2 'y2'
y3 'y3'
x4 'x4'
y4 'y4'
In this case, the rows() method of the Sheet instance will include all rows.
This means the header row is treated like data when an external schema is applied.
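If the heading row should not be processed as data, one workaround (a sketch using the standard library, not a documented feature of Stingray) is to skip the first row before handing the iterator to process_sheet():
import itertools

with open_workbook(source_path) as workbook:
    sheet = workbook.sheet('Sheet1').set_schema(schema)
    # Drop the first physical row, which holds the column headings.
    process_sheet(itertools.islice(sheet.rows(), 1, None))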
The process_sheet() function:
def process_sheet(rows: Iterator[Row]) -> None:
    for row in rows:
        print(f'{row.name("field").value()=}')
The name() method returns a Nav object. This has a value() method that will extract the value for a given attribute.
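For example, with the Anscombe sample used above, individual attributes can be pulled out by the property names that came from the heading row (a small sketch; "x123" and "y1" are names from that sample schema):
def show_x_y(rows: Iterable[Row]) -> None:
    for row in rows:
        # Navigate to two attributes by name and extract their values.
        x = row.name("x123").value()
        y = row.name("y1").value()
        print(f"{x=} {y=}")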
Sheet and Row¶
The Sheet class contains metadata about the sheet, and a row iterator.
It’s generally used like this:
def process_sheet(sheet: Sheet) -> None:
    for row in sheet.rows():
        process_row(row)
A Row contains an instance that's bound to the Unpacker and the Schema.
This will build Nav objects for navigation. Where necessary, this may involve creating Location objects as part of NDNav navigation.
open_workbook function¶
- stingray.workbook.open_workbook(source: Path) Workbook[NDInstance] | Workbook[WBInstance] | Workbook[DInstance] ¶
name_cleaner function¶
- stingray.workbook.name_cleaner(name: str) str ¶
JSON Schema names for anchors and properties are validated against patterns defined in places like core/$defs/anchorString.
Anchor names must match the following pattern:
r"^[A-Za-z_][-A-Za-z0-9._]*$"
This function replaces illegal characters with “_”.
>>> name_cleaner("TYPICAL-COBOL")
'TYPICAL-COBOL'
>>> name_cleaner("common_python")
'common_python'
>>> name_cleaner("Not a 'good' name")
'Not_a_good_name'
>>> name_cleaner("!@#$%^")
'_'
>>> name_cleaner('')
''
Core Workbook, Sheet, and Row Model¶
- class stingray.workbook.Workbook(name: Path | str, **kwargs: Any)¶
The Facade over all workbooks and COBOL Files.
This provides uniform access to a variety of physical formats.
Each subclass binds in an Unpacker and the unique features of that Unpacker.
Generally, a Workbook can be opened in one of three ways:
- Directly, with wb = CSV_Workbook(path). This expects an explicit wb.close() to release resources.
- As a context manager, with CSV_Workbook(path) as wb:. The context manager releases resources.
- Via the open_workbook() function.
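A short sketch of the three styles, assuming CSV_Workbook can be imported from stingray.workbook as the inline examples suggest, and using a hypothetical path:
from pathlib import Path

from stingray.workbook import CSV_Workbook, open_workbook

source_path = Path("data/example.csv")  # hypothetical CSV file

# 1. Explicit close.
wb = CSV_Workbook(source_path)
try:
    sheet = wb.sheet('Sheet1')
finally:
    wb.close()

# 2. Context manager.
with CSV_Workbook(source_path) as wb:
    sheet = wb.sheet('Sheet1')

# 3. The open_workbook() function picks a Workbook subclass by file suffix.
with open_workbook(source_path) as wb:
    sheet = wb.sheet('Sheet1')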
- close() None ¶
Closes the workbook instance.
- class stingray.workbook.Sheet(workbook: Workbook[Instance], sheet_name: str)¶
A Sheet of a Workbook. This can also be a table on a page of a Numbers workbook.
It needs a schema binding to describe the content of each row.
Either a Schema is loaded from an external source, or an internal schema loader is used to extract a schema from the first few rows of the sheet.
These can be built manually, but the expectation is that they're created by the containing Workbook object.
- row_iter() Iterator[Row[Instance]] ¶
Extracts rows from the unpacker's instance_iter() method. Applies the schema to create Row instances.
This leverages the two-phase processing of the HeadingRowSchemaLoader class. The initial rows pass through the loader's header() method. After the header has built a schema, the remaining rows are consumed via the loader's body() method.
- Yields:
Row instances.
- rows() Iterator[Row[Instance]] ¶
Extracts rows from the unpacker's instance_iter() method. Applies the schema to create Row instances.
This leverages the two-phase processing of the HeadingRowSchemaLoader class. The initial rows pass through the loader's header() method. After the header has built a schema, the remaining rows are consumed via the loader's body() method.
- Yields:
Row instances.
- set_schema(schema: Schema) Sheet[Instance] ¶
Provides an externally-supplied schema.
This also sets a “do nothing” schema loader to assure that all rows are processed.
- Parameters:
schema – The Schema to use.
- set_schema_loader(loader: SchemaLoader[Instance]) Sheet[Instance] ¶
Provides a SchemaLoader that builds a schema by extracting some of the Instances.
- Parameters:
loader – A subclass of HeadingRowSchemaLoader that will load the schema and filter rows.
- class stingray.workbook.Row(sheet: Sheet[Instance], instance: Instance)¶
Wrapper around a Nav object bound to an Instance.
Note that both Instance and Sheet have the schema. While this can be redundant, it's also possible for a sheet to have many schemas, with the assignment made on an instance-by-instance basis.
- dump() None ¶
Print a formatted dump of this row.
- get(name: str) Nav ¶
Creates a helper Nav and navigates to the given name.
- Parameters:
name – An $anchor name.
- Returns:
A Nav subclass appropriate to the Workbook's Unpacker.
- name(name: str) Nav ¶
Creates a helper Nav and navigates to the given name.
- Parameters:
name – An $anchor name.
- Returns:
A Nav subclass appropriate to the Workbook's Unpacker.
- values() list[Any] ¶
Creates a helper Nav and returns the instance's values for each name in the schema.
- Returns:
A list of cell values.
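A small sketch combining the Row helpers: dump() for a quick look, name() (or get(), which is documented identically) for one attribute, and values() for every cell. The "x123" name comes from the Anscombe sample schema used earlier.
def inspect_row(row: Row) -> None:
    row.dump()                  # formatted Field/Value listing
    nav = row.name("x123")      # equivalent to row.get("x123")
    print(nav.value())          # one attribute's value
    print(row.values())         # all cell values, one per schema name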
Schema Loaders¶
- class stingray.workbook.SchemaLoader¶
Loads a schema.
There are two general cases:
- Load an external schema from a Workbook.
- Extract the schema from within a sheet of a workbook. Often this is a heading row, but it can be more complex.
The first case is simpler. The ExternalSchemaLoader subclass implements an ExternalSchemaLoader.load() method. This builds a JSONSchema from the rows it finds. All other external loaders should subclass this.
The HeadingRowSchemaLoader subclass extracts the top row of a sheet and creates a JSONSchema. The rows are presented as an iterable that's consumed in two phases:
- The header() method consumes rows from an iterator until it has the required header.
- The body() method can consume rows from an iterator to filter empty rows, footers, or other irrelevancies from a workbook sheet. An alternative is to simply return the iterator if no filtering will be done.
This superclass specifically does nothing. It is used when an external schema has been bound to a Sheet and no further processing is required, for example when processing COBOL files where the schema is separate.
A schema is returned as a JSONSchema document. This is almost always converted to a Schema instance. If, however, changes need to be made, or the schema is used outside this module, it's already in a standard format.
- body(source: Iterator[Instance]) Iterator[Instance] ¶
Consume the non-header rows, applying any additional filter criteria.
- header(source: Iterator[Instance]) None | bool | int | float | str | list[Any] | dict[str, Any] ¶
Consume the first row or rows to build a schema.
- Parameters:
source – An iterator over instances.
- Returns:
A Schema or None if no schema can be parsed.
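The body() hook is the natural place for row filtering. Below is a minimal sketch of a loader subclass that drops rows with no cell values; this is an illustration, not a class provided by the library.
from typing import Any, Iterator

from stingray import HeadingRowSchemaLoader


class SkipBlankRowsLoader(HeadingRowSchemaLoader):
    """Heading-row loader whose body() drops rows with no cell values."""

    def body(self, source: Iterator[Any]) -> Iterator[Any]:
        for instance in source:
            # For workbook sources, an instance is a sequence of cell values.
            if any(cell is not None and str(cell).strip() for cell in instance):
                yield instance

Such a subclass is given to Sheet.set_schema_loader() in place of the plain HeadingRowSchemaLoader.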
- class stingray.workbook.COBOLSchemaLoader(source: Path)¶
The most common case is a single COBOL Schema. For other, more complex situations, the single schema assumption may not be appropriate.
- load() None | bool | int | float | str | list[Any] | dict[str, Any] ¶
Loads the given COBOL schema from a copybook file.
If more than one schema is present, they're saved as the schemas attribute. The first 01-level record is returned as a JSONSchema.
- Returns:
JSONSchema from a COBOL DDE.
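A sketch of the common single-copybook case, tying this loader to the COBOL file classes documented below; the copybook and data paths are hypothetical.
from pathlib import Path

from stingray import SchemaMaker
from stingray.workbook import COBOL_EBCDIC_File, COBOLSchemaLoader

copybook_path = Path("schema/sample.cpy")   # hypothetical COBOL copybook
data_path = Path("data/sample.dat")         # hypothetical EBCDIC data file

json_schema = COBOLSchemaLoader(copybook_path).load()
schema = SchemaMaker().from_json(json_schema)

with COBOL_EBCDIC_File(data_path) as workbook:
    sheet = workbook.sheet("").set_schema(schema)
    for row in sheet.rows():
        row.dump()
        break   # stop after the first record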
- class stingray.workbook.ExternalSchemaLoader(sheet: Sheet[Instance])¶
Read an external source of data to prepare a schema.
A default schema for external metadata is ExternalSchemaLoader.META_SCHEMA. The odds of this being correct are low. The metaschema has three attributes:
- name. This becomes a property name.
- description. This becomes a property description.
- dataType. This becomes a property type.
A subclass can do more sophisticated parsing of complex metadata; COBOL DDE parsing, for example, is a more complex schema loader.
A subclass could also read JSONSchema directly, if that’s available.
- META_SCHEMA = {'properties': {'dataType': {'description': 'field data type', 'position': 2, 'type': 'string'}, 'description': {'description': 'field description', 'position': 1, 'type': 'string'}, 'name': {'description': 'field name', 'position': 0, 'type': 'string'}}, 'title': 'generic meta schema for external schema documents', 'type': 'object'}¶
- load() None | bool | int | float | str | list[Any] | dict[str, Any] ¶
Loads the sheet’s rows and builds a JSONSchema.
- Returns:
JSONSchema instance built from the sheet.
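When a JSON Schema document already exists, a loader subclass may not be needed at all; the document can be converted with SchemaMaker and applied directly. A sketch, assuming a hypothetical anscombe.json that describes the same columns as the CSV examples above:
import json
from pathlib import Path

from stingray import SchemaMaker, open_workbook

schema_path = Path("schema/anscombe.json")   # hypothetical JSON Schema document
schema = SchemaMaker().from_json(json.loads(schema_path.read_text()))

with open_workbook(Path("data/Anscombe_quartet_data.csv")) as workbook:   # hypothetical location
    sheet = workbook.sheet('Sheet1').set_schema(schema)
    for row in sheet.rows():
        row.dump()
        break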
- class stingray.workbook.HeadingRowSchemaLoader¶
Create a schema from the first row of a workbook sheet. The “Instance” is expected to be a WBInstance, which is a Sequence of column values, parsed by a WBUnpacker.
- header(source: Iterator[Instance]) None | bool | int | float | str | list[Any] | dict[str, Any] ¶
Creates a JSONSchema from the first row of a sheet.
- Parameters:
source – An iterator over instances.
- Returns:
A Schema or None if no schema can be parsed.
COBOL Files¶
- class stingray.workbook.COBOL_EBCDIC_File(name: str | Path, file_object: IO | None = None, recfm_class: Type[RECFM_Reader] | None = None, lrecl: int | None = None, **kwargs: str)¶
EBCDIC-encoded files. No newline delimiters are used.
The NDInstance type is AnyStr, a union of str and bytes. This class is narrower than that: the instances could be bytes only, rather than the full NDInstance.
- close() None ¶
Closes the workbook.
- sheet(name: str) Sheet[NDInstance] ¶
Builds a Sheet instance for the given sheet name.
- Parameters:
name – Name of the sheet (not used)
- Returns:
A COBOL_EBCDIC_Sheet instance.
- sheet_iter() Iterator[Sheet[NDInstance]] ¶
Yields all Sheet instances in this Workbook.
- Yields:
COBOL_EBCDIC_Sheet instances.
- class stingray.workbook.COBOL_EBCDIC_Sheet(workbook: Workbook[Instance], sheet_name: str)¶
The single “Sheet” in a file. This is a container for the rows. It handles the RECFM and LRECL complexities of variable-length non-delimited records.
- row_iter() Iterator[Row[NDInstance]] ¶
Extracts rows from the unpacker. Applies the schema to create Row instances.
- Yields:
Row instances.
- rows() Iterator[Row[NDInstance]] ¶
Extracts rows from the unpacker. Applies the schema to create Row instances.
- Yields:
Row instances.
- set_schema(schema: Schema) Sheet[NDInstance] ¶
Sets the schema for this COBOL file.
If the workbook has an LRECL already set, this is used to process records.
If the workbook did not have an LRECL set, the Schema is examined to see if an LRECL can be computed.
This extracts the LRECL eagerly to make it visible as soon as possible.
- Parameters:
schema – Schema instance built from the COBOL copybook.
CSV Workbooks¶
- class stingray.workbook.CSVUnpacker¶
Unpacker that wraps the csv module.
- close() None ¶
Close the CSV file.
- instance_iter(sheet: str, **kwargs: Any) Iterator[WBInstance] ¶
Yields rows from the CSV File.
- Parameters:
sheet – name of the sheet (not used)
kwargs – Additional arguments provided to csv.reader().
- Yields:
rows from the CSV file.
- open(name: Path, file_object: IO | None = None) None ¶
Opens a CSV file for unpacking.
- Parameters:
name – Path to the CSV file.
file_object – Optional open IO object. If omitted, the named file is opened.
- sheet_iter() Iterator[str] ¶
Yields all sheet names. For CSV files there’s only one sheet, and the name is “”.
- Yields:
Literal[“”]
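As noted in the side-bar above, keyword arguments are meant to carry CSV dialect details, which makes TAB-delimited files readable through the same machinery. A hedged sketch, assuming the keyword arguments given to the workbook are forwarded to csv.reader():
from stingray import HeadingRowSchemaLoader
from stingray.workbook import CSV_Workbook

# Assumption: dialect keyword arguments flow through to csv.reader().
with CSV_Workbook("data/sample.tab", delimiter="\t") as workbook:
    sheet = workbook.sheet("")
    _ = sheet.set_schema_loader(HeadingRowSchemaLoader())
    for row in sheet.rows():
        row.dump()
        break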
JSON Workbooks¶
- class stingray.workbook.JSONUnpacker¶
Unpacker that wraps the json module.
- close() None ¶
Closes the NDJSON file.
- instance_iter(name: str, **kwargs: Any) Iterator[DInstance] ¶
Yields items from the NDJSON File. Documents are delimited with newlines.
- Parameters:
name – Name of the sheet (not used)
kwargs – Additional arguments provided to json.loads().
- Yields:
rows from the NDJSON file.
- open(name: Path, file_object: IO | None = None, **kwargs: Any) None ¶
Opens the NDJSON file.
- Parameters:
name – Path to the file.
file_object – Optional open IO object. If omitted, the named file is opened.
kwargs – KW arguments (not used)
- sheet_iter() Iterator[str] ¶
Yields all sheet names. For NDJSON files there’s only one sheet, and the name is “”.
- Yields:
Literal[“”]
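Processing an NDJSON source follows the same pattern, with each newline-delimited document becoming one row. A sketch, assuming the file suffix maps to the JSON workbook in the registry below and using a hand-written JSON Schema with hypothetical field names:
from pathlib import Path

from stingray import SchemaMaker, open_workbook

# Hand-written JSON Schema for each NDJSON document (hypothetical fields).
doc_schema = SchemaMaker().from_json(
    {
        "type": "object",
        "properties": {
            "x123": {"type": "number"},
            "y1": {"type": "number"},
        },
    }
)

with open_workbook(Path("data/sample.ndjson")) as workbook:   # hypothetical path
    sheet = workbook.sheet("").set_schema(doc_schema)
    for row in sheet.rows():
        print(row.name("x123").value())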
Workbook File Registry¶
- class stingray.workbook.WBFileRegistry¶
A global registry for Workbook classes and their associated file suffixes. It provides an open_workbook() function that can open a wide variety of file types.
>>> file_registry = WBFileRegistry()
>>> @file_registry.file_suffix(".xyz")
... class XYZ_Workbook(Workbook):
...     pass
>>> file_registry.suffix_map[".xyz"] == XYZ_Workbook
True
- file_suffix(*name_list: str) Callable[[Type[Workbook[NDInstance] | Workbook[WBInstance] | Workbook[DInstance]]], Type[Workbook[NDInstance] | Workbook[WBInstance] | Workbook[DInstance]]] ¶
Decorator that adds a file suffix and its associated class to the registry.
- Parameters:
name_list – String suffix or sequence of suffixes.
- Returns:
Decorator that’s applied to a class definition.
- open_workbook(source: Path) Workbook[NDInstance] | Workbook[WBInstance] | Workbook[DInstance] ¶
Opens an appropriate subclass of Workbook for the given file.
- Parameters:
source – Path to the Workbook.
- Returns:
A Workbook subclass.
- file_registry.file_suffix()¶
A decorator used to mark a class with the file extensions it handles.
Legacy API Concept¶
The version 4.0 concept for the API looked like this:
Internal, Embedded Schema:
from stingray.workbook import open_workbook, EmbeddedSchemaSheet, HeadingRowSchemaLoader

with open_workbook(path) as workbook:
    sheet = EmbeddedSchemaSheet(workbook, 'Sheet1', HeadingRowSchemaLoader)
    process_sheet(sheet)
External Schema:
from stingray.workbook import open_workbook, ExternalSchemaSheet, ExternalSchemaLoader

with open_workbook(path) as schema_wb:
    esl = ExternalSchemaLoader(schema_wb, sheet_name='Schema')
    schema = esl.load()

with open_workbook(path) as workbook:
    sheet = ExternalSchemaSheet(workbook, 'Sheet1', schema)
    process_sheet(sheet)
If necessary, the external schema can have a meta-schema. It may be necessary to define a conversion function to create a useful JSON Schema from a schema workbook.