Software Architecture¶
In the Approach section, we identified three steps. When we have multiple processing steps, we need to define multiple functions to implement those steps. The mapping isn’t exact because we’re refining abstractions into concrete expressions.
- The core RSS reader function. Pragmatically, this tends to expand into more than one function so we can separate the core XML parsing from add-on parsing that’s unique to the data we’re working with. For example, decomposing the title isn’t about RSS in general, it’s about this specific problem domain.
- A saved state getter. There’s not much to this, but it’s important to recover the information from the chalky slate where each day’s updates are recorded.
- A saved state writer. This can be pressed into service to write all of the files, since they’re structurally identical.
Let’s rough out the overall plan as conceptual pseudo-code showing the original approach steps and how they’ve become functions that map arguments to results.
yesterdays_path, todays_path = path_maker("Some_RSS_URL")
# 1. Get Daily RSS Feed
todays_data = xml_reader("Some_RSS_URL")
saved_data = csv_load(yesterdays_path / "save.csv")
# 2. Compare with saved history
new_data = set(todays_data) - set(saved_data)
# 3. Update saved history
all_data = set(todays_data) | set(saved_data)
csv_dump(new_data, todays_path / "new.csv")
csv_dump(todays_data, todays_path / "daily.csv")
csv_dump(all_data, todays_path / "save.csv")
The above is purely conceptual. The idea is to outline a possible approach. As we delve into details, we’ll uncover places where this might be less than perfect. The final code will differ based on the things we learn along the way.
This core processing sequence can be performed for any number of RSS channels. The idea is to capture as many as necessary.
Some Design Patterns¶
One overall idea that is helpful is the Extract-Transform-Load (ETL) pipeline. This application involves ETL in miniature:
Extract raw data from its XML representation to build Python objects.
Transform the raw data from one class of Python objects to another class of objects.
Load the transformed data into a “database”. In this case, a bunch of directories and files. The conceptual “database load” is implemented as a function to “dump” the data in CSV notation.
The terminology change is awkward, but Python uses “dump” and “load” as the verbs of choice for dumping Python objects into an external file and loading Python objects from an external file.
Additionally, we’re working with several examples of Serialization. The concept is to create a series of bytes that represent a Python object. The source data was serialized in XML notation; we parse that to recover Python objects. The working data is serialized in CSV notation; we load and dump those files.
Core Data Structures¶
We have to address some technical nuance before going forward.
Important

Python sets work with immutable objects. The csv module works with mutable list or dict objects for each row. This will be aggravating.
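To make the aggravation concrete, here is a tiny, hypothetical illustration (the Row class is invented for this example): dictionary rows, like those from csv.DictReader, can’t be put into a set, while named-tuple rows can.

from typing import NamedTuple

class Row(NamedTuple):
    """A stand-in for one CSV row; purely illustrative."""
    title: str

try:
    set([{"title": "a"}])          # mutable dict rows, as csv.DictReader produces them
except TypeError as error:
    print(error)                   # "unhashable type: 'dict'"

print(set([Row(title="a")]))       # immutable named-tuple rows work: {Row(title='a')}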
We have two choices for working with the mutable results of reading CSV files.
- Implement our own versions of set subtraction and set union that work with mutable objects. This has the advantage of enabling the huge flexibility inherent in working with csv.DictReader: each row becomes a dictionary with an indefinite collection of keys.
- Transmogrify the mutable objects into immutable data structures. (The technical term is “coerce”, but I like transmogrification.) While this tends to limit our flexibility somewhat, it saves us from implementing set operations.
The transmogrification approach leads us to building a named tuple object for each CSV row.
class rss_status.SourceRSS

    Extract of raw RSS data.

    - description – The description
    - link – The link
    - pubDate – The publication date
    - title – The title

class rss_status.ExpandedRSS

    Data expanded by the title_transform() function.

    Note that the names of the fields in this class will be the column titles on saved CSV files. Any change here will be reflected in the files created.

    - description – The description
    - docket – The parsed docket from the title
    - link – The link
    - parties_title – The parsed parties from the title
    - pubDate – The publication date
    - title – The title
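Here is one way these two classes might be declared with typing.NamedTuple. The field order follows the doctest output shown later in this section, and the str types are an assumption since the feed delivers everything as text.

from typing import NamedTuple

class SourceRSS(NamedTuple):
    """Extract of raw RSS data."""
    title: str
    link: str
    description: str
    pubDate: str

class ExpandedRSS(NamedTuple):
    """Data expanded by title_transform(); the field names become CSV column titles."""
    title: str
    link: str
    description: str
    pubDate: str
    docket: str
    parties_title: str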
XML Parsing¶
XML parsing is handled gracefully by Python’s xml package. There are several parsers; xml.etree is particularly nice and provides a wealth of features.
rss_status.xml_reader(url: str) -> List[rss_status.SourceRSS]

    Extract RSS data given a URL to read.

    The root document is the <rss> tag, which has a single <channel> tag. The <channel> has some overall attributes, but contains a sequence of <item> tags. This will gather “title”, “link”, “description”, and “pubDate” from each item and build a SourceRSS object.

    It might be helpful to return the overall channel properties along with the list of items.

    Parameters: url – URL to read.
    Returns: All of the SourceRSS items from the channel of the feed, List[SourceRSS].
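A minimal sketch of how xml_reader() might be written with xml.etree.ElementTree; the use of urllib.request and the lack of error handling are assumptions, not the final implementation.

import urllib.request
import xml.etree.ElementTree as ET
from typing import List

from rss_status import SourceRSS

def xml_reader(url: str) -> List[SourceRSS]:
    """Read an RSS feed and build a SourceRSS item from each <item> tag."""
    with urllib.request.urlopen(url) as response:
        document = ET.parse(response)               # the <rss> root document
    channel = document.find("channel")              # the single <channel> tag
    return [
        SourceRSS(
            title=item.findtext("title"),
            link=item.findtext("link"),
            description=item.findtext("description"),
            pubDate=item.findtext("pubDate"),
        )
        for item in channel.findall("item")         # each <item> in the channel
    ]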
Creating Working Paths¶
We have two dimensions to the paths: the date and the channel’s base name. The path_maker() function honors this by using a name from the URL and today’s date.
rss_status.path_maker(url: str, now: datetime.datetime = None, format: str = '%Y%m%d') -> pathlib.Path

    Builds a Path from today’s date and the base name of the URL.

    The default format is “%Y%m%d” to transform the date to a YYYYmmdd string. An alternative is “%Y%W%w” to create a YYYYWWw string, where WW is the week of the year and w is the day of the week.

    >>> from rss_status import path_maker
    >>> import datetime
    >>> now = datetime.datetime(2018, 9, 10)
    >>> str(path_maker("https://ecf.dcd.uscourts.gov/cgi-bin/rss_outside.pl", now))
    '20180910/rss_outside'

    Parameters:
    - url – An RSS-feed URL.
    - now – Optional date/time object. Defaults to datetime.datetime.now().
    Returns: A Path with the date string / base name from the URL.
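A sketch of how path_maker() could produce that result; extracting the base name with urllib.parse and pathlib is an assumption consistent with the doctest above.

import datetime
from pathlib import Path
from urllib.parse import urlparse

def path_maker(url: str, now: datetime.datetime = None, format: str = "%Y%m%d") -> Path:
    """Build a Path of the form <date string>/<base name of the URL>."""
    if now is None:
        now = datetime.datetime.now()
    date_string = now.strftime(format)            # e.g., '20180910'
    base_name = Path(urlparse(url).path).stem     # e.g., 'rss_outside' from rss_outside.pl
    return Path(date_string) / base_name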
What about yesterday’s files? We can’t simply subtract one day from today’s date. For that to work, we’d have to religiously run this every single day. That’s unreasonable. What’s easier is locating the most recent date which contains files for a given channel.
rss_status.find_yesterday(directory: pathlib.Path, url: str, date_pattern: str = '[0-9]*') -> pathlib.Path

    We need to search for the most recent previous entry. While we can hope to run this dependably every day, that’s a difficult thing to guarantee. It’s much more reliable to look for the most recent date which contains files for a given channel. This means scanning the dated directories for the latest one that has files for this channel’s base name.

    Example. Here are two dates. One date has one channel, the other has two channels.

    20180630/one_channel/daily.csv
    20180630/one_channel/new.csv
    20180630/one_channel/save.csv
    20180701/one_channel/daily.csv
    20180701/one_channel/new.csv
    20180701/one_channel/save.csv
    20180701/another_channel/daily.csv
    20180701/another_channel/new.csv
    20180701/another_channel/save.csv

    If there’s nothing available, returns None.

    Parameters:
    - directory – The base directory to search.
    - url – The full URL, from which we can get the base name.
    - date_pattern – Most of the time, the interesting filenames will begin with a digit. If the file-name pattern is changed, however, this can be used to match dates and exclude non-date files that might be confusing.
    Returns: A Path with the date string / base name from the URL, or None.
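Here is one possible shape for find_yesterday(), assuming the directory layout shown above; because the date strings sort lexically, taking the last match finds the most recent one. A fuller version would likely skip today’s own directory.

from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

def find_yesterday(directory: Path, url: str, date_pattern: str = "[0-9]*") -> Optional[Path]:
    """Locate the most recent dated directory that holds files for this channel."""
    base_name = Path(urlparse(url).path).stem
    candidates = sorted(
        dated / base_name
        for dated in directory.glob(date_pattern)   # dated directories, e.g. 20180701
        if (dated / base_name).exists()             # only dates that include this channel
    )
    return candidates[-1] if candidates else None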
Saving and Recovering CSV State¶
These two functions will save and restore collections of ExpandedRSS objects. We use the names “dump” and “load” to be consistent with other serialization packages. We could use json.dump() or yaml.dump() instead of writing in CSV notation.
rss_status.csv_dump(data: List[rss_status.ExpandedRSS], output_path: pathlib.Path) -> None

    Save expanded data to a file, given the Path.

    Note that the headers are the field names from the ExpandedRSS class definition. This assures us that all fields will be written properly.

    Parameters:
    - data – List of ExpandedRSS items, built by title_transform().
    - output_path – Path to which to write the file.
rss_status.csv_load(input_path: pathlib.Path) -> List[rss_status.ExpandedRSS]

    Recover expanded data from a file, given a Path.

    Note that the headers must be the field names from the ExpandedRSS class definition. If there isn’t a trivial match, then this won’t read properly.

    Parameters: input_path – Path from which to read the file.
    Returns: List of ExpandedRSS objects used to compare the previous day’s feed with today’s feed.
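One way csv_dump() and csv_load() could be written with csv.DictWriter and csv.DictReader; details like the open() arguments are assumptions, but the column titles come from ExpandedRSS._fields, as the notes above require.

import csv
from pathlib import Path
from typing import List

from rss_status import ExpandedRSS

def csv_dump(data: List[ExpandedRSS], output_path: Path) -> None:
    """Write ExpandedRSS rows; the header row is the class's field names."""
    with output_path.open("w", newline="") as output_file:
        writer = csv.DictWriter(output_file, fieldnames=ExpandedRSS._fields)
        writer.writeheader()
        for item in data:
            writer.writerow(item._asdict())

def csv_load(input_path: Path) -> List[ExpandedRSS]:
    """Rebuild immutable ExpandedRSS rows from a previously dumped CSV file."""
    with input_path.open("r", newline="") as input_file:
        reader = csv.DictReader(input_file)
        return [ExpandedRSS(**row) for row in reader]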
Transformations¶
Currently, we only have one transformation, from SourceRSS to ExpandedRSS.
rss_status.title_transform(items: List[rss_status.SourceRSS]) -> List[rss_status.ExpandedRSS]

    A “transformation”: this will parse titles for court docket RSS feeds.

    >>> from rss_status import title_transform, SourceRSS, ExpandedRSS

    The data is a list with a single document, [SourceRSS()].

    >>> data = [
    ...     SourceRSS(
    ...         title='1:15-cv-00791 SAVAGE v. BURWELL et al',
    ...         link='https://ecf.dcd.uscourts.gov/cgi-bin/DktRpt.pl?172013',
    ...         description='[Reply to opposition to motion] (<a href="https://ecf.dcd.uscourts.gov/doc1/04516660233?caseid=172013&de_seq_num=555" >137</a>)',
    ...         pubDate='Thu, 05 Jul 2018 06:26:07 GMT'
    ...     ),
    ... ]
    >>> title_transform(data)
    [ExpandedRSS(title='1:15-cv-00791 SAVAGE v. BURWELL et al', link='https://ecf.dcd.uscourts.gov/cgi-bin/DktRpt.pl?172013', description='[Reply to opposition to motion] (<a href="https://ecf.dcd.uscourts.gov/doc1/04516660233?caseid=172013&de_seq_num=555" >137</a>)', pubDate='Thu, 05 Jul 2018 06:26:07 GMT', docket='15-cv-00791', parties_title='SAVAGE v. BURWELL et al')]

    Parameters: items – A list of SourceRSS items built by xml_reader().
    Returns: A new list of ExpandedRSS, with some additional attributes for each item.
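A sketch of title_transform(); the docket regular expression is an assumption inferred from the doctest above, where '1:15-cv-00791 SAVAGE v. BURWELL et al' yields docket '15-cv-00791' and parties_title 'SAVAGE v. BURWELL et al'.

import re
from typing import List

from rss_status import ExpandedRSS, SourceRSS

# Title shape assumed from the doctest: "<office>:<docket> <parties>".
TITLE_PATTERN = re.compile(r"^\d+:(?P<docket>\S+)\s+(?P<parties>.*)$")

def title_transform(items: List[SourceRSS]) -> List[ExpandedRSS]:
    """Build ExpandedRSS rows, adding docket and parties_title parsed from each title."""
    expanded = []
    for item in items:
        match = TITLE_PATTERN.match(item.title)
        expanded.append(
            ExpandedRSS(
                title=item.title,
                link=item.link,
                description=item.description,
                pubDate=item.pubDate,
                docket=match.group("docket") if match else "",
                parties_title=match.group("parties") if match else item.title,
            )
        )
    return expanded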
In the long run, there will be additional transformations.
Adding transformations means this single function does too many things. The current design combines two elements:
- Building a new list from individually transformed rows.
- Applying a single transformation process to create ExpandedRSS objects.
In the longer run (see Expansions) this needs to be refactored to allow multiple transformation processes to be combined by an over-arching transformation pipeline.
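One possible shape for that refactoring, sketched under the assumption that each transformation works on a single row; the pipeline() helper and its names are hypothetical.

from typing import Callable, List, Sequence, TypeVar

Row = TypeVar("Row")

def pipeline(rows: Sequence[Row], transforms: Sequence[Callable[[Row], Row]]) -> List[Row]:
    """Build a new list by applying each row-level transformation, in order, to every row."""
    result = list(rows)
    for transform in transforms:
        result = [transform(row) for row in result]
    return result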
Composite Processing for One Channel¶
The core channel processing is a function that captures data into CSV files.
rss_status.channel_processing(url: str, directory: pathlib.Path = None, date: datetime.datetime = None)

    The daily process for a given channel.

    Ideally there’s a “yesterday” directory. Pragmatically, this may not exist. We use find_yesterday() to track down the most recent file and work with that. If there’s no recent file, this is all new. Welcome.

    Parameters:
    - url – The URL for the channel.
    - directory – The working directory; the default is the current working directory.
    - date – The date to assign to the files; by default, it’s datetime.datetime.now().
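A hedged sketch of channel_processing(), assembled from the conceptual pseudo-code at the start of this section; the exact defaults and error handling are assumptions, not the final implementation.

import datetime
from pathlib import Path

from rss_status import (
    csv_dump, csv_load, find_yesterday, path_maker, title_transform, xml_reader,
)

def channel_processing(url: str, directory: Path = None, date: datetime.datetime = None) -> None:
    """Read today's feed, compare it with the saved history, and rewrite the history."""
    directory = directory or Path.cwd()
    todays_path = directory / path_maker(url, date)
    todays_path.mkdir(parents=True, exist_ok=True)

    # 1. Get Daily RSS Feed.
    todays_data = title_transform(xml_reader(url))

    # 2. Compare with saved history from the most recent previous run, if any.
    yesterdays_path = find_yesterday(directory, url)
    saved_data = csv_load(yesterdays_path / "save.csv") if yesterdays_path else []
    new_data = set(todays_data) - set(saved_data)

    # 3. Update saved history.
    all_data = set(todays_data) | set(saved_data)
    csv_dump(list(new_data), todays_path / "new.csv")
    csv_dump(todays_data, todays_path / "daily.csv")
    csv_dump(list(all_data), todays_path / "save.csv")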
Processing for All Channels¶
We can have a main program like this:
from pathlib import Path

from rss_status import channel_processing

def main():
    working_directory = Path.home() / "rss_feed" / "data"
    for channel_url in (
        "https://ecf.dcd.uscourts.gov/cgi-bin/rss_outside.pl",
        "https://ecf.nyed.uscourts.gov/cgi-bin/readyDockets.pl",
        # More channels here.
    ):
        channel_processing(channel_url, working_directory)

if __name__ == "__main__":
    main()
The configuration is relatively simple and easy-to-see because it’s right there in the script.
In some cases, we might want more elaborate command-line processing. In that case, we can use the Click package (http://click.pocoo.org/5/) to build a more sophisticated command-line interface.
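A minimal sketch of what a Click-based entry point might look like; the option and argument names here are illustrative, not part of the original design.

from pathlib import Path

import click

from rss_status import channel_processing

@click.command()
@click.option("--directory", default=None, help="Working directory for the CSV files.")
@click.argument("urls", nargs=-1)
def main(directory, urls):
    """Process each RSS channel URL given on the command line."""
    working_directory = Path(directory) if directory else Path.home() / "rss_feed" / "data"
    for channel_url in urls:
        channel_processing(channel_url, working_directory)

if __name__ == "__main__":
    main()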