#####################
Software Architecture
#####################

In the :ref:`approach` section, we identified three steps. When we have
multiple processing steps, we need to define multiple functions to implement
those steps. The mapping isn't exact because we're refining abstractions
into concrete expressions.

1.  The core RSS reader function. Pragmatically, this tends to expand into
    more than one function so we can separate the core XML parsing from
    add-on parsing that's unique to the data we're working with. For
    example, decomposing the title isn't about RSS in general; it's about
    this specific problem domain.

2.  A saved state getter. There's not much to this, but it's important to
    recover the information from the chalky slate where each day's updates
    are recorded.

3.  A saved state writer. This can be pressed into service to write all of
    the files, since they're structurally identical.

Let's rough out the overall plan as conceptual pseudo-code showing the
original approach steps and how they've become functions that map arguments
to results.

::

    yesterdays_path, todays_path = path_maker("Some_RSS_URL")

    # 1. Get Daily RSS Feed
    todays_data = xml_reader("Some_RSS_URL")
    saved_data = csv_load(yesterdays_path / "save.csv")

    # 2. Compare with saved history
    new_data = set(todays_data) - set(saved_data)

    # 3. Update saved history
    all_data = set(todays_data) | set(saved_data)
    csv_dump(new_data, todays_path / "new.csv")
    csv_dump(todays_data, todays_path / "daily.csv")
    csv_dump(all_data, todays_path / "save.csv")

The above is purely conceptual. The idea is to outline a possible approach.
As we delve into details, we'll uncover places where this might be less than
perfect. The final code will differ based on the things we learn along the
way.

This core processing sequence can be performed for any number of RSS
channels. The idea is to capture as many as necessary.

Some Design Patterns
--------------------

One overall idea that is helpful is the **Extract-Transform-Load** (ETL)
pipeline. This application involves ETL in miniature:

-   Extract raw data from its XML representation to build Python objects.

-   Transform the raw data from one class of Python objects to another
    class of objects.

-   Load the transformed data into a "database". In this case, a bunch of
    directories and files.

The conceptual "database load" is implemented as a function to "dump" the
data in CSV notation. The terminology change is awkward, but Python uses
"dump" and "load" as the verbs of choice for dumping Python objects into an
external file and loading Python objects from an external file.

Additionally, we're working with several examples of **Serialization**. The
concept is to create a series of bytes that represent a Python object. The
source data was serialized in XML notation; we parse that to recover Python
objects. The working data is serialized in CSV notation; we load and dump
those files.

.. py:module:: rss_status

Core Data Structures
--------------------

We have to address some technical nuance before going forward.

.. important::

    Python sets work with immutable objects. The ``csv`` module works with
    mutable List or Dict objects for each row. This will be aggravating.

.. sidebar:: Mutability

    There's a firm and abiding distinction between mutable and immutable
    data structures in Python.

    -   Immutable: strings, numbers, tuples. These are objects with an
        unchanging internal state. We can't change the value of the
        integer 13.

    -   Mutable: Lists, Dicts, instances of classes we define. These are
        objects with an internal state we can adjust.

    ::

        >>> some_list = [1, 2, 3]
        >>> some_list.append(42)
        >>> some_list
        [1, 2, 3, 42]

    The list object, ``some_list``, was mutated by the :meth:`append`
    method.

    "But wait," you cry out. "What about this?"

    ::

        >>> some_number = 13
        >>> some_number = some_number + 1
        >>> some_number
        14

    "We mutated ``some_number``!"

    Well, actually... The object, ``13``, did not mutate. The expression
    ``some_number + 1`` is working with two immutable objects. A new
    immutable object, ``14``, was created by the expression. This new
    immutable object was then assigned the name ``some_number``. The old
    immutable ``13`` object is no longer being used.

We have two choices for working with the mutable results of reading CSV
files.

-   Implement our own versions of set subtraction and set union that work
    with mutable objects. This has the advantage of enabling the huge
    flexibility inherent in working with the :class:`csv.DictReader`: each
    row becomes a dictionary with an indefinite collection of keys.

-   Transmogrify the mutable objects into immutable data structures. (The
    technical term is "coerce", but I like transmogrification.) While this
    tends to limit our flexibility somewhat, it saves us from implementing
    set operations.

The transmogrification approach leads us to building a named tuple object
for each CSV row.

.. autoclass:: rss_status.SourceRSS
    :members:

.. autoclass:: rss_status.ExpandedRSS
    :members:
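To make the transmogrification concrete, here's a rough sketch of how these
classes might be defined with :class:`typing.NamedTuple`. The field names
are illustrative guesses, not the authoritative definitions documented
above.

::

    from typing import NamedTuple

    class SourceRSS(NamedTuple):
        """One raw item from the RSS feed. Field names are assumptions."""
        title: str
        link: str
        description: str
        pub_date: str

    class ExpandedRSS(NamedTuple):
        """A SourceRSS row with the title decomposed into parts."""
        docket: str          # assumed result of decomposing the title
        parties_name: str    # also an assumption
        link: str
        description: str
        pub_date: str

Because named tuples are immutable and hashable, instances can be collected
into sets, which is exactly what the comparison step requires.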
XML Parsing
-----------

XML parsing is handled gracefully by Python's :mod:`xml` package. There are
several parsers; the :mod:`xml.etree` package is particularly nice, and
provides a wealth of features.

.. autofunction:: rss_status.xml_reader

Creating Working Paths
----------------------

We have two dimensions to the paths. The :func:`path_maker` function will
honor this by using a name from the URL and today's date.

.. autofunction:: rss_status.path_maker

What about yesterday's files? We can't simply subtract one day from today's
date. For that to work, we'd have to religiously run this **every single
day**. That's unreasonable. What's easier is locating the most recent date
which contains files for a given channel.

.. autofunction:: rss_status.find_yesterday

Saving and Recovering CSV State
-------------------------------

These two functions will save and restore collections of
:class:`ExpandedRSS` objects. We use the names "dump" and "load" to be
consistent with other serialization packages. We could use ``json.dump()``
or ``yaml.dump()`` instead of writing in CSV notation.

.. autofunction:: rss_status.csv_dump

.. autofunction:: rss_status.csv_load

Transformations
---------------

Currently, we only have one transformation, from :class:`SourceRSS` to
:class:`ExpandedRSS`.

.. autofunction:: rss_status.title_transform

In the long run, there will be additional transformations. Adding
transformations means this single function does too many things. The
current design combines two elements:

1.  Building a new list from individually transformed rows.

2.  Applying a single transformation process to create
    :class:`ExpandedRSS` objects.

In the longer run (see :ref:`expansions`), this needs to be refactored to
allow multiple transformation processes to be combined by an over-arching
transformation pipeline.
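Before assembling these pieces into the composite processing, let's make
the preceding sections concrete with rough sketches of the helpers. These
are sketches under stated assumptions, not the implementations documented
above. First, :func:`xml_reader`, assuming an RSS 2.0 feed whose ``<item>``
elements carry the fields sketched for :class:`SourceRSS`:

::

    import xml.etree.ElementTree as ET
    from typing import List
    from urllib.request import urlopen

    from rss_status import SourceRSS

    def xml_reader(url: str) -> List[SourceRSS]:
        """Parse the <item> elements of an RSS feed into SourceRSS tuples."""
        with urlopen(url) as source:
            tree = ET.parse(source)
        return [
            SourceRSS(
                title=item.findtext("title", default=""),
                link=item.findtext("link", default=""),
                description=item.findtext("description", default=""),
                pub_date=item.findtext("pubDate", default=""),
            )
            for item in tree.iter("item")
        ]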
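Next, the most-recent-date search behind :func:`find_yesterday`. This
sketch assumes :func:`path_maker` names each day's directory with an ISO
``YYYY-MM-DD`` date, so the names sort correctly as plain strings:

::

    import datetime
    from pathlib import Path
    from typing import Optional

    def find_yesterday(
        channel_path: Path, today: datetime.date
    ) -> Optional[Path]:
        """Find the most recent dated directory strictly before today."""
        candidates = sorted(
            d
            for d in channel_path.iterdir()
            if d.is_dir() and d.name < today.isoformat()
        )
        return candidates[-1] if candidates else None

Returning ``None`` when there's no previous run lets a caller treat the
first run for a channel as "no saved history".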
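The dump and load pair can be quite small because a named tuple is already
a row. This sketch assumes a header row built from the tuple's ``_fields``:

::

    import csv
    from pathlib import Path
    from typing import Iterable, List

    from rss_status import ExpandedRSS

    def csv_dump(data: Iterable[ExpandedRSS], path: Path) -> None:
        """Serialize rows into CSV notation, header row first."""
        with path.open("w", newline="") as target:
            writer = csv.writer(target)
            writer.writerow(ExpandedRSS._fields)
            writer.writerows(data)

    def csv_load(path: Path) -> List[ExpandedRSS]:
        """Recover rows as immutable ExpandedRSS tuples."""
        with path.open(newline="") as source:
            reader = csv.reader(source)
            next(reader, None)  # skip the header row
            return [ExpandedRSS(*row) for row in reader]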
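Finally, the one transformation we have so far. The split rule (first word
as the docket, the rest as the parties' names) is purely an assumption
about this feed's title layout, not a property of RSS in general:

::

    from typing import List

    from rss_status import ExpandedRSS, SourceRSS

    def title_transform(source_rows: List[SourceRSS]) -> List[ExpandedRSS]:
        """Build a new list of rows, each with a decomposed title."""
        result = []
        for row in source_rows:
            docket, _, parties = row.title.partition(" ")
            result.append(
                ExpandedRSS(
                    docket=docket,
                    parties_name=parties,
                    link=row.link,
                    description=row.description,
                    pub_date=row.pub_date,
                )
            )
        return result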
Composite Processing for One Channel
------------------------------------

The core channel processing is a function that captures data into CSV
files.

.. autofunction:: rss_status.channel_processing

Processing for All Channels
---------------------------

We can have a main program like this:

::

    from pathlib import Path

    from rss_status import channel_processing

    def main():
        working_directory = Path.home() / "rss_feed" / "data"
        for channel_url in (
            "https://ecf.dcd.uscourts.gov/cgi-bin/rss_outside.pl",
            "https://ecf.nyed.uscourts.gov/cgi-bin/readyDockets.pl",
            # More channels here.
        ):
            channel_processing(channel_url, working_directory)

    if __name__ == "__main__":
        main()

The configuration is relatively simple and easy to see because it's right
there in the script. In some cases, we might want more elaborate
command-line processing. In that case, we can use the `click
<http://click.pocoo.org/5/>`_ package to build a more sophisticated
command-line interface.
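For reference, here's one way :func:`channel_processing` might knit the
helpers together, restating the conceptual pseudo-code from the top of
this chapter. The signatures are assumptions for illustration; in
particular, the real :func:`path_maker` may return both paths, as the
pseudo-code suggested:

::

    import datetime
    from pathlib import Path

    from rss_status import (
        csv_dump, csv_load, find_yesterday, path_maker,
        title_transform, xml_reader,
    )

    def channel_processing(channel_url: str, working_directory: Path) -> None:
        """Capture one channel's feed: read, compare, update history."""
        # Assumed signature: today's path for this channel's directory.
        todays_path = path_maker(channel_url, working_directory)
        yesterdays_path = find_yesterday(
            todays_path.parent, datetime.date.today()
        )

        # 1. Get Daily RSS Feed.
        todays_data = title_transform(xml_reader(channel_url))
        saved_data = (
            csv_load(yesterdays_path / "save.csv") if yesterdays_path else []
        )

        # 2. Compare with saved history.
        new_data = set(todays_data) - set(saved_data)

        # 3. Update saved history.
        all_data = set(todays_data) | set(saved_data)
        todays_path.mkdir(parents=True, exist_ok=True)
        csv_dump(new_data, todays_path / "new.csv")
        csv_dump(todays_data, todays_path / "daily.csv")
        csv_dump(all_data, todays_path / "save.csv")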