stingray.workbook
=================

.. automodule:: stingray.workbook

The definitions are mostly protocols to handle non-delimited files (i.e., COBOL files), delimited files, and workbooks.
This depends on the lower-level :py:class:`Schema`, :py:class:`Instance`, :py:class:`Nav`, and :py:class:`Location` constructs.

.. uml::

    @startuml
    abstract class Schema

    abstract class Unpacker

    class Location

    Location --> Schema

    'abstract class NDInstance
    'Location -> NDInstance

    class LocationMaker {
        from_instance(instance): Location
    }

    Unpacker --> LocationMaker
    LocationMaker --> "n" Location : creates

    /'Details...
    abstract class Nav

    class NDNav
    class DNav
    class WBNav

    Nav <|-- NDNav
    Nav <|-- DNav
    Nav <|-- WBNav

    Unpacker --> NDNav : creates
    Unpacker --> DNav : creates

    NDNav --> Location

    class JSON
    DNav --> JSON
    '/

    class File

    abstract class Workbook ##[bold]blue

    Workbook --> Unpacker
    Workbook --> File : opens
    Unpacker --> File : reads

    class Sheet ##[bold]blue {
        row_iter(): Row
    }

    Workbook *-- "1:n" Sheet

    Sheet --> Schema

    class Row ##[bold]blue {
        name(): Any
    }

    Sheet *-- "n" Row

    Row --> NDNav : "[non-delimited]"
    Row --> DNav : "[delimited]"
    Row --> WBNav : "[workbook]"

    class EmbeddedSchemaSheet
    class ExternalSchemaSheet

    Sheet <|-- EmbeddedSchemaSheet
    Sheet <|-- ExternalSchemaSheet

    class SchemaLoader

    EmbeddedSchemaSheet --> SchemaLoader
    ExternalSchemaSheet --> SchemaLoader
    SchemaLoader --> Schema : creates
    @enduml

Here are some additional side-bar considerations for other formats that depend on external modules.

- CSV is built-in. We permit kwargs to provide additional dialect details.

- JSON is built-in. We'll treat newline-delimited JSON like CSV.

- XLS is a weird proprietary thing. The ``xlrd`` project (https://pypi.org/project/xlrd/) supports it.

- ODS and XLSX are XML files. Incremental parsing is helpful here because they can be large.
  See https://openpyxl.readthedocs.io/en/stable/ and http://docs.pyexcel.org/en/v0.0.6-rc2/.

- Numbers uses protobuf. The legacy version of Stingray Reader had protobuf definitions which appear to work, and a snappy decoder.
  See https://pypi.org/project/numbers-parser/ for the better solution currently in use.

- TOML requires an external library. An :py:class:`Unpacker` subclass can decompose a "one big list" TOML document into individual rows.

- YAML requires an external library. We'll use the iterative parser as a default.
  An :py:class:`Unpacker` subclass can decompose a "one big list" YAML document into individual rows.

- XML is built-in. A schema can drive navigation through the XML document, naming the various tags of interest.
  Other tags which may be present are ignored.

A given workbook has two possible sources for a schema: internal and external.
An internal schema might be the first row, or it might require more sophisticated parsing.
An external schema might be hard-coded in the application, or it might be a separate document with its own meta-schema.
Generally, the schema applies to a sheet (or a Table in a Workspace for Numbers.)

API Concept
------------------

A "fluent" interface is used to open a sheet, extract a header, and process rows.
The alternative is to open a sheet, apply an externally loaded schema, and use this to process the sheet's rows.

Once a sheet has been bound to a schema, the rows can be processed.
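The complete, working doctests appear in the next two examples.
The sketch below reuses the names defined in those examples and shows only the overall shape of each call chain:

::

    # Fluent path: a HeadingRowSchemaLoader consumes the heading row
    # and builds the schema from it.
    with open_workbook(source_path) as workbook:
        sheet = workbook.sheet('Sheet1')
        sheet.set_schema_loader(HeadingRowSchemaLoader())
        process_sheet(sheet.rows())

    # External-schema path: bind a previously built Schema object instead.
    with open_workbook(source_path) as workbook:
        sheet = workbook.sheet('Sheet1').set_schema(schema)
        process_sheet(sheet.rows())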
**Internal, Embedded Schema**:

::

    >>> from stingray import open_workbook, HeadingRowSchemaLoader, Row
    >>> from pathlib import Path
    >>> import os
    >>> from typing import Iterable
    >>> source_path = Path(os.environ.get("SAMPLES", "sample")) / "Anscombe_quartet_data.csv"

    >>> def process_sheet(rows: Iterable[Row]) -> None:
    ...     for row in rows:
    ...         row.dump()
    ...         break  # Stop after 1 row.

    >>> with open_workbook(source_path) as workbook:
    ...     sheet = workbook.sheet('Sheet1')
    ...     _ = sheet.set_schema_loader(HeadingRowSchemaLoader())
    ...     process_sheet(sheet.rows())
    Field                                            Value
    object
      x123                                           '10.0'
      y1                                             '8.04'
      y2                                             '9.14'
      y3                                             '7.46'
      x4                                             '8.0'
      y4                                             '6.58'

In this case, the :py:meth:`rows` method of the :py:class:`Sheet` instance will exclude the header rows
consumed by the :py:class:`HeadingRowSchemaLoader`.

**External Schema**:

::

    >>> from stingray import open_workbook, ExternalSchemaLoader, Row, SchemaMaker
    >>> from pathlib import Path
    >>> import os
    >>> from typing import Iterable
    >>> source_path = Path(os.environ.get("SAMPLES", "sample")) / "Anscombe_quartet_data.csv"
    >>> schema_path = Path(os.environ.get("SAMPLES", "sample")) / "Anscombe_schema.csv"

    >>> with open_workbook(schema_path) as metaschema_workbook:
    ...     schema_sheet = metaschema_workbook.sheet('Sheet1')
    ...     _ = schema_sheet.set_schema(SchemaMaker().from_json(ExternalSchemaLoader.META_SCHEMA))
    ...     json_schema = ExternalSchemaLoader(schema_sheet).load()
    ...     schema = SchemaMaker().from_json(json_schema)

    >>> with open_workbook(source_path) as workbook:
    ...     sheet = workbook.sheet('Sheet1').set_schema(schema)
    ...     process_sheet(sheet.rows())
    Field                                            Value
    object
      x123                                           'x123'
      y1                                             'y1'
      y2                                             'y2'
      y3                                             'y3'
      x4                                             'x4'
      y4                                             'y4'

In this case, the :py:meth:`rows` method of the :py:class:`Sheet` instance will include all rows.
This means the header row is treated like data when an external schema is applied.

The **process_sheet()** method:

::

    def process_sheet(rows: Iterator[Row]) -> None:
        for row in rows:
            print(f'{row.name("field").value()=}')

The :py:meth:`name` method returns a :py:class:`Nav` object.
This has a :py:meth:`value` method that will extract the value for a given attribute.
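As a concrete sketch of that navigation, the following variant extracts individual attribute values.
The ``x123`` and ``y1`` names come from the Anscombe sample above; the ``float()`` conversions are an
assumption here, since the CSV unpacker delivers values as strings:

::

    def process_rows(rows: Iterable[Row]) -> None:
        for row in rows:
            # name() returns a Nav; value() extracts the underlying cell value.
            x = float(row.name("x123").value())
            y = float(row.name("y1").value())
            print(f"{x=}, {y=}")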
Sheet and Row
-------------

The :py:class:`Sheet` class contains metadata about the sheet, and a row iterator.
It's generally used like this:

::

    def process_sheet(sheet: Sheet) -> None:
        for row in sheet.rows():
            process_row(row)

A :py:class:`Row` contains an instance that's bound to the :py:class:`Unpacker` and the :py:class:`Schema`.
This will build :py:class:`Nav` objects for navigation.
Where necessary, this may involve creating :py:class:`Location` objects as part of :py:class:`NDNav` navigation.

open_workbook function
----------------------

.. autofunction:: open_workbook

name_cleaner function
---------------------

.. autofunction:: name_cleaner

Core Workbook, Sheet, and Row Model
------------------------------------

.. autoclass:: Workbook
    :members:
    :undoc-members:

.. autoclass:: Sheet
    :members:
    :undoc-members:

.. autoclass:: Row
    :members:
    :undoc-members:

Schema Loaders
--------------

.. autoclass:: SchemaLoader
    :members:
    :undoc-members:

.. autoclass:: COBOLSchemaLoader
    :members:
    :undoc-members:

.. autoclass:: ExternalSchemaLoader
    :members:
    :undoc-members:

.. autoclass:: HeadingRowSchemaLoader
    :members:
    :undoc-members:

COBOL Files
-----------

.. autoclass:: COBOL_EBCDIC_File
    :members:
    :undoc-members:

.. autoclass:: COBOL_EBCDIC_Sheet
    :members:
    :undoc-members:

.. autoclass:: COBOL_Text_File
    :members:
    :undoc-members:

CSV Workbooks
-------------

.. autoclass:: CSVUnpacker
    :members:
    :undoc-members:

.. autoclass:: CSV_Workbook
    :members:
    :undoc-members:

JSON Workbooks
--------------

.. autoclass:: JSONUnpacker
    :members:
    :undoc-members:

.. autoclass:: JSON_Workbook
    :members:
    :undoc-members:

Workbook File Registry
----------------------

.. autoclass:: WBFileRegistry
    :members:
    :undoc-members:

.. function:: file_registry.file_suffix

    A decorator used to mark a class with the file extensions it handles.

Legacy API Concept
------------------

The version 4.0 concept for the API looked like this:

**Internal, Embedded Schema**:

::

    from stingray.workbook import open_workbook, EmbeddedSchemaSheet, HeadingRowSchemaLoader

    with open_workbook(path) as workbook:
        sheet = EmbeddedSchemaSheet(workbook, 'Sheet1', HeadingRowSchemaLoader)
        process_sheet(sheet)

**External Schema**:

::

    from stingray.workbook import open_workbook, ExternalSchemaSheet, ExternalSchemaLoader

    with open_workbook(path) as schema_wb:
        esl = ExternalSchemaLoader(schema_wb, sheet_name='Schema')
        schema = esl.load()

    with open_workbook(path) as workbook:
        sheet = ExternalSchemaSheet(workbook, 'Sheet1', schema)
        process_sheet(sheet)

If necessary, the external schema can have a meta-schema.
It may be necessary to define a conversion function to create a useful JSON Schema from a schema workbook.
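One possible shape for that conversion function is sketched below.
This is only an illustration: the ``name`` and ``description`` column names are assumptions about the
schema workbook's layout, not part of the published meta-schema, and every field is typed as a string.

::

    from typing import Any, Iterable

    from stingray import Row

    def schema_rows_to_json_schema(rows: Iterable[Row]) -> dict[str, Any]:
        """Build a JSON Schema document from a schema workbook's rows.

        The "name" and "description" column names are illustrative;
        adjust them to match the schema workbook's actual meta-schema.
        """
        properties: dict[str, Any] = {}
        for row in rows:
            field_name = row.name("name").value()
            properties[field_name] = {
                "title": field_name,
                "description": row.name("description").value(),
                "type": "string",
            }
        return {"type": "object", "properties": properties}

The resulting dictionary can then be turned into a :py:class:`Schema` with ``SchemaMaker().from_json()``
and bound to a sheet with :py:meth:`set_schema`, exactly as in the External Schema example earlier.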