Recently, we talked about extracting data from complex relational databases. This is -- in a way -- another case study for my Unlearning SQL book. This is a description of what comes next after the "low-level" conversion. Warning: it's complicated.

BLUF: Take the time to get rid of SQL processing.

In Part I, we loaded a database and queried the metadata. In Part II, we extracted the raw tables and loaded up a TAR Archive with NDJSON documents. In Part III, we prepared native Python objects that had a complete representation for the various kinds of tree structures. These include assets, categories, forum topics, image galleries, amongst other things. In Part IV, we talked about some applications to examine the converted data, looking for useful values, keys, and relationships.

We're going to skip a lot of the icky Joomla! details and focus on how to create something potentially useful.

The Goal

Recall from Part IV, we thought we had several steaming heaps of content on the legacy site. After exploration, we think we have the following:

  • A home page with a few articles.
  • A right sidebar with two articles.
  • A few content pages, each of which has links to a dozen or so narrowly-focused articles in a few categories.
  • The master collection of articles, neatly organized by category. There's a hierarchy here, a SQL nightmare we've avoided by restructuring the data.
  • The Kunena forums collection: categories, topics, and messages. There's a hierarchy here, another SQL nightmare.
  • The JoomGallery collection of images. Hierarchy.
  • The Phoca collection of download files. You guessed it, another hierarchy.

These aren't the only hierarchies. Menus, modules, and assets also have very tangled relationships. These are a SQL-query nightmare that we've turned into simple Python references among objects.

There's more, of course.

  • A hoard of Yahoo message-board posts which are not first-class parts of Joomla! but are first-class content.
  • Scans of the old paper newsletters. These, too, are not first-class parts of Joomla!, but are clearly very important content.

We'd like to dump all of this into a form that the Hugo tool can use to approximate the original site's content and structure. We're not going to spend too much time on the original look and feel; we can fuss with CSS to maybe match the color scheme.

What we've got are two separate kinds of things in the resulting site:

  • The "pages" which are Hugo Page Bundles with an _index.md and maybe some image resources. Each article becomes a page. In the case of the Home page -- which has multiple articles plastered onto it -- we will need a special-case template to include the bodies of multiple articles in one place. The Yahoo! messages are -- essentially -- articles that require some extra effort to convert.
  • The "collections" which are Hugo Sections, using an empty _index.md and a section-index generated by the template. The old newsletters are little more than downloads; these should be handled gracefully as a collection of Page Bundles.

We've also got some things we're going to set aside. Specifically, the right sidebar for articles is a waste of screen real estate. It's not present for Forum or Photo Gallery.

Many of the Hugo themes have a 3-column look: the top-level menu is on the left, and the page table-of-contents is on the right. This seems to be somewhat more useful. One very spare Hugo theme is the Book theme, which seems like a good place to start.

The Processes

There are two separate kinds of migration processes:

  • Bulk migration of the collections. We have four separate, distinct subclasses.

    The Converter Class Hierarchy

  • Create the top-level pages that match the various pages of articles on the legacy site.

    The MakePage Class Hierarchy

Note that the pages depend on the bulk-conversion results. The new path structure, the new file names, and other details are (more or less) encapsulated in the converter classes.
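The dependency can be sketched roughly like this. Every class and method name here is hypothetical (the real hierarchies are only shown in the diagrams), and the four section names are my guess at how the four collections might map to Hugo sections. The point is only that a page builder asks a converter for a link instead of building the path itself.

```python
from pathlib import Path


class Converter:
    """Hypothetical base class; owns the naming rules for one collection."""

    section = ""  # Hugo section directory; each subclass overrides this

    def __init__(self, content_root: Path) -> None:
        self.content_root = content_root

    def bundle_dir(self, slug: str) -> Path:
        """Directory on disk where one converted item is written."""
        return self.content_root / self.section / slug

    def url(self, slug: str) -> str:
        """Site-relative link that other pages should use for the item."""
        return f"/{self.section}/{slug}/"


# Four assumed subclasses, one per bulk collection.
class ArticleConverter(Converter):
    section = "articles"


class ForumConverter(Converter):
    section = "forum"


class GalleryConverter(Converter):
    section = "gallery"


class DownloadConverter(Converter):
    section = "downloads"


class MakePage:
    """Hypothetical page builder; it never assembles a path on its own."""

    def __init__(self, converter: Converter) -> None:
        self.converter = converter

    def link_list(self, slugs: list[str]) -> str:
        """Markdown bullet list of links to already-converted items."""
        return "\n".join(
            f"- [{slug}]({self.converter.url(slug)})" for slug in slugs
        )
```

If the path rules change, only the converter changes; the top-level pages pick up the new links the next time they're generated.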

Things That Aren't Easy

While Hugo handles a large number of special cases and exceptions gracefully, we have legacy content that's a bit of a mess. Some of the mess may be my inability to ferret out all of the details of the Joomla! data model. Other aspects of the mess also seem to be a result of the way Joomla! decides what's "published" and what's not "published."

First, and most obvious, we have HTML content. We can -- if we want -- generate HTML pages and leave the details to Hugo. In the long run, we'd like to move away from HTML. We'd really like to emphasize Markdown and make HTML an exception.

To do this, we declare that each page uses Markdown, and wrap the HTML in {{<html>}}...{{</html>}} shortcodes. This is -- well -- ugly. But it sequesters the HTML in the few places where it's used:

  • Descriptions for galleries and downloads.
  • Articles.
  • Forum messages.

It means a lot of code like this:

print()
print("{{<html>}}", article.fulltext, "{{</html>}}")

This gets us started with "safe" HTML everywhere. We can see a great deal of the site with this hack.
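To keep those print() calls from multiplying, the wrapping can live in one small helper. This is a minimal sketch, not the code from the converters: the function names are made up, and it assumes the site or theme provides an html shortcode that passes its inner content through untouched. The title is quoted with json.dumps() because a JSON string is also a valid double-quoted YAML scalar.

```python
import json


def html_shortcode(fragment: str) -> str:
    """Wrap a raw HTML fragment in the {{<html>}} shortcode pair."""
    # Plain concatenation avoids escaping the braces inside an f-string.
    return "{{<html>}}\n" + fragment.strip() + "\n{{</html>}}"


def page_text(title: str, body_html: str) -> str:
    """Build the text of a Markdown page whose body is legacy HTML."""
    front_matter = "---\ntitle: " + json.dumps(title) + "\n---\n"
    return front_matter + "\n" + html_shortcode(body_html) + "\n"


if __name__ == "__main__":
    print(page_text("Welcome", "<p>Hello, <b>world</b>.</p>"))
```

With this in place, a page is written by saving page_text(article.title, article.fulltext) to the bundle's Markdown file, assuming the article object has those two attributes.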

Hugo replaces any raw HTML it considers unsafe with an HTML comment. We can look for <!-- raw HTML omitted --> in the generated HTML and add the needed wrappers.
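Hunting for those comments by hand gets old quickly. Here's a small sketch of a scanner; the function name is made up, and it assumes the generated site is a directory tree of .html files (Hugo's default output directory is public).

```python
from pathlib import Path

MARKER = "<!-- raw HTML omitted -->"


def find_omitted_html(public_dir: Path) -> dict[Path, int]:
    """Map each generated page to the number of omitted-HTML markers in it."""
    hits: dict[Path, int] = {}
    for page in sorted(public_dir.rglob("*.html")):
        text = page.read_text(encoding="utf-8", errors="replace")
        count = text.count(MARKER)
        if count:
            hits[page.relative_to(public_dir)] = count
    return hits


if __name__ == "__main__":
    for path, count in find_omitted_html(Path("public")).items():
        print(f"{count:4d}  {path}")
```

Each path in the report points back to a source page that still has unwrapped HTML.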

What Else? Oh, Right, Section Index

The Book theme doesn't include section indexes by default.

If there's a layouts/_default/section.html, this is used for those _index.md pages that are clearly the top of a section tree.

We don't need to do anything more than define the template for the index. Here's what we started with:

{{ define "main" }}
  <main>
    {{ .Content }}

    {{ $pages := .Sections }}
    {{ $paginator := .Paginate $pages 25 }}

    <ul>
      {{ range $paginator.Pages }}
        <li><a href="{{ .RelPermalink }}">{{ .LinkTitle }}</a></li>
      {{ end }}
    </ul>

    {{ template "_internal/pagination.html" . }}
  </main>
{{ end }}

This doesn't sort things properly, so we need to add metadata with weighting.
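The weighting is easy to generate during conversion. This is a sketch under a couple of assumptions: the function names are invented, and the legacy ordering is already available as a Python list of titles. Hugo's default page order puts weight first, and pages with no weight (or a weight of zero) sort after the weighted ones, so the weights start at a non-zero value.

```python
import json


def weighted(titles: list[str], step: int = 10) -> list[tuple[str, int]]:
    """Pair each title with a weight that preserves the list order."""
    # Start at `step`, not 0: Hugo treats a zero weight as "no weight".
    return [(title, (i + 1) * step) for i, title in enumerate(titles)]


def index_front_matter(title: str, weight: int) -> str:
    """YAML front matter for a section's _index.md."""
    return f"---\ntitle: {json.dumps(title)}\nweight: {weight}\n---\n"


if __name__ == "__main__":
    for title, weight in weighted(["Announcements", "General", "Archive"]):
        print(index_front_matter(title, weight))
```

Stepping by ten leaves room to slot a new section between two old ones without renumbering everything.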

Conclusion

Note the complexity of the migration.

There's not much that can be done to magically simplify all the special cases.

The time spent getting the database out of SQL and into Python objects gave us pleasantly simple Python objects to work with.

The class hierarchies evolved slowly. While it seems clear from the UML diagrams that these are "logical" designs, they didn't happen first. The initial design was not so clear and simple, leading to lots of redundant and inter-dependent code.

There's still a fair number of ultra-long methods that need to be decomposed into shorter, easier-to-understand methods. The remaining bugs involve two lost files and three index.php?... references that the link rewriter didn't handle correctly.