See "We have an ancient Python2 CGI script -- what do we do?" The previous post in this series provides an overview of the process of getting rid of legacy code.

Here's some code. I know it's painfully long; the point is to provide a super-specific, very concrete example of what to keep and what to discard. (I've omitted the module docstring and the imports.)

try:
    os.mkdir("data")
except OSError:
    pass

path_elements = os.environ["PATH_INFO"].split("/")
if path_elements[0] == "" and path_elements[1] == "resources":
    if os.environ["REQUEST_METHOD"] == "POST":
        type_name = path_elements[2]
        base = os.path.join("data", type_name)
        try:
            os.mkdir(base)
        except OSError:
            pass
        name = str(uuid.uuid4())
        full_name = os.path.join(base, name)
        data = cgi.parse(sys.stdin)
        output_file = open(full_name, 'w')
        output_file.write(repr(data))
        output_file.write('\n')
        output_file.close()

        print "Status: 201 CREATED"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Created New %s</title></head>" % type_name
        print "<body>"
        print "<h1>Created New %s</h1>" % type_name
        print "<p>Path: %s/%s</p>" % (type_name, name)
        print "<p>Content: </p><pre>"
        print data
        print "</pre>"
        print "</body>"
        # cgi.print_environ()
        print "</html>"
    elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 3:
        type_name = path_elements[2]
        print "Status: 200 OK"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Query %s</title></head>" % (type_name,)
        print "<body><h1>Create new instance of <tt>%s</tt></h1>" % type_name
        print '<form action="/cgi-bin/example.py/resources/%s" method="POST">' % (type_name,)
        print """
          <label for="fname">First name:</label>
          <input type="text" id="fname" name="fname"><br><br>
          <label for="lname">Last name:</label>
          <input type="text" id="lname" name="lname"><br><br>
          <input type="submit" value="Submit">
        """
        print "</form>"
        # cgi.print_environ()
        print "</body>"
        print "</html>"
    elif os.environ["REQUEST_METHOD"] == "GET" and len(path_elements) == 4:
        type_name = path_elements[2]
        resource_name = path_elements[3]
        full_name = os.path.join("data", type_name, resource_name)
        input_file = open(full_name, 'r')
        content = input_file.read()
        input_file.close()

        print "Status: 200 OK"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Document %s -- %s</title></head>" % (type_name, resource_name)
        print "<body><h1>Instance of <tt>%s</tt></h1>" % type_name
        print "<p>Path: %s/%s</p>" % (type_name, resource_name)
        print "<p>Content: </p><pre>"
        print content
        print "</pre>"
        print "</body>"
        # cgi.print_environ()
        print "</html>"
    else:
        print "Status: 403 Forbidden"
        print "Content-Type: text/html"
        print
        print "<!DOCTYPE html>"
        print "<html>"
        print "<head><title>Forbidden: %s to %s</title></head>"  % (os.environ["REQUEST_METHOD"], path_elements)
        cgi.print_environ()
        print "</html>"
else:
    print "Status: 404 Not Found"
    print "Content-Type: text/html"
    print                               # blank line, end of headers
    print "<!DOCTYPE html>"
    print "<html>"
    print "<head><title>Not Found: %s</title></head>" % (os.environ["PATH_INFO"], )
    print "<h1>Error</h1>"
    print "<b>Resource <tt>%s</tt> not found</b>" % (os.environ["PATH_INFO"], )
    cgi.print_environ()
    print "</html>"

At first glance you might notice (1) there are several resource types located on the URL path, and (2) there are several HTTP methods, also. These features aren't always obvious in a CGI script, and it's one of the reasons why CGI is simply horrible.

It's not clear from this what -- exactly -- the underlying data model is and what processing is done and what parts are merely CGI and HTML overheads.

This is why refactoring this code is absolutely essential to replacing it.

And.

We can't refactor without test cases.

And (bonus).

We can't have test cases without some vague idea of what this thing purports to do.

Let's tackle this in order. Starting with test cases.

Unit Test Cases

We can't unit test this.

As written, it's a top-level script without so much as as single def or class. This style of programming -- while legitimate Python -- is an epic fail when it comes to testing.

Step 1, then, is to refactor a script file into a module with function(s) or class(es) that can be tested.

def main():
    ... the original script ...

if __name__ == "__main__":  # pragma: no cover
    main()

For proper testability, there can be at most these two lines of code that are not easily tested. These two (and only these two) are marked with a special comment (# pragma: no cover) so the coverage tool can politely ignore the fact that we won't try to test these two lines.

We can now provide a os.environ values that look like a CGI requests, and exercise this script with concrete unit test cases.

How many things does it do?

Reading the code is headache-inducing, so, a fall-back plan is to count the number of logic paths. Look at if/elif blocks and count those without thinking too deeply about why the code looks the way it looks.

There appear to be five distinct behaviors. Since there are possibilities of unhandled exceptions, there may be as many as 10 things this will do in production.

This leads to a unit test that looks like the following:

import unittest
import urllib
import example_2
import os
import io
import sys

class MyTestCase(unittest.TestCase):
    def setUp(self):
        self.cwd = os.getcwd()
        try:
            os.mkdir("test_path")
        except OSError:
            pass
        os.chdir("test_path")
        self.output = io.BytesIO()
        sys.stdout = self.output
    def tearDown(self):
        sys.stdout = sys.__stdout__
        sys.stdin = sys.__stdin__
        os.chdir(self.cwd)
    def test_path_1(self):
        """No /resources in path"""
        os.environ["PATH_INFO"] = "/not/valid"
        os.environ["REQUEST_METHOD"] = "invalid"
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 404 Not Found")
    def test_path_2(self):
        """Path /resources but bad method"""
        os.environ["PATH_INFO"] = "/resources/example"
        os.environ["REQUEST_METHOD"] = "invalid"
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 403 Forbidden")
    def test_path_3(self):
        os.environ["PATH_INFO"] = "/resources/example"
        os.environ["REQUEST_METHOD"] = "GET"
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 200 OK")
        self.assertIn("<form ", out)
    def test_path_5(self):
        os.environ["PATH_INFO"] = "/resources/example"
        os.environ["REQUEST_METHOD"] = "POST"
        os.environ["CONTENT_TYPE"] = "application/x-www-form-urlencoded"
        content = urllib.urlencode({"field1": "value1", "field2": "value2"})
        form_data = io.BytesIO(content)
        os.environ["CONTENT_LENGTH"] = str(len(content))
        sys.stdin = form_data
        example_2.main()
        out = self.output.getvalue()
        first_line = out.splitlines()[0]
        self.assertEqual(first_line, "Status: 201 CREATED")
        self.assertIn("'field2': ['value2']", out)
        self.assertIn("'field1': ['value1']", out)


if __name__ == '__main__':
    unittest.main()

Does this have 100% code coverage? I'll leave it to the reader to copy-and-paste, add the coverage run command and look at the output. What else is required?

Integration Test Case

We can (barely) do an integration test on this. It's tricky because we don't want to run Apache httpd (or some other server.) We want to run a small Python script to be sure this works.

This means we need to (1) start a server as a separate process, and (2) use urllib to send requests to that separate process. This isn't too difficult. Right now, it's not obviously required. The test cases above run the entire script from end to end, providing what we think are appropriate mock values. Emphasis on "what we think." To be sure, we'll need to actually fire up a separate process.

As with the unit tests, we need to enumerate all of the expected behaviors.

Unlike the unit tests, there are (generally) fewer edge cases.

It looks like this.

import unittest
import subprocess
import time
import urllib2

class TestExample_2(unittest.TestCase):
    def setUp(self):
        self.proc = subprocess.Popen(
            ["python2.7", "mock_httpd.py"],
            cwd="previous"
        )
        time.sleep(0.25)
    def tearDown(self):
        self.proc.kill()
        time.sleep(0.1)
    def test(self):
        req = urllib2.Request("http://localhost:8000/cgi-bin/example.py/resources/example")
        result = urllib2.urlopen(req)
        self.assertEqual(result.getcode(), 200)
        self.assertEqual(set(result.info().keys()), set(['date', 'status', 'content-type', 'server']))
        content = result.read()
        self.assertEqual(content.splitlines()[0], "<!DOCTYPE html>")
        self.assertIn("<form ", content)

if __name__ == '__main__':
    unittest.main()

This will start a separate process and then make a request from that process. After the request, it kills the subprocess.

We've only covered one of the behaviors. A bunch more test cases are required. They're all going to be reasonably similar to the test() method.

Note the mock_httpd.py script. It's a tiny thing that invokes CGI's.

import CGIHTTPServer
import BaseHTTPServer

server_class = BaseHTTPServer.HTTPServer
handler_class = CGIHTTPServer.CGIHTTPRequestHandler

server_address = ('', 8000)
httpd = server_class(server_address, handler_class)
httpd.serve_forever()

This will run any script file in the cgi-bin directory, acting as a kind of mock for Apache httpd or other CGI servers.

Tests Pass, Now What?

We need to formalize our knowledge with a some diagrams. This is a Context diagram in PlantUML. It draws a picture that we can use to discuss what this app does and who actually uses it.

@startuml
actor user
usecase post
usecase query
usecase retrieve
user --> post
user --> query
user --> retrieve

usecase 404_not_found
usecase 403_not_permitted
user --> 404_not_found
user --> 403_not_permitted

retrieve <|-- 404_not_found
@enduml

We can also update the Container diagram. There's an "as-is" version and a "to-be" version.

Here's the as-is diagram of any CGI.

@startuml
interface HTTP

node "web server" {
    component httpd as  "Apache httpd"
    interface cgi
    component app
    component python
    python --> app
    folder data
    app --> data
}

HTTP --> httpd
httpd -> cgi
cgi -> python
@enduml

Here's a to-be diagram of a typical (small) Flask application.

@startuml
interface HTTP

node "web server" {
    component httpd as  "nginx"
    component uwsgi
    interface wsgi
    component python
    component app
    component model
    component flask
    component jinja
    folder data
    folder static
    httpd --> static
    python --> wsgi
    wsgi --> app
    app --> flask
    app --> jinja
    app -> model
    model --> data
}

HTTP --> httpd
httpd -> uwsgi
uwsgi -> python
@enduml

These diagrams can help to clarify how the CGI will be restructured. A complex CGI might have a database or external web services involved. These should be correctly depicted.

The previous post on this subject said we can now refactor this code. The unit tests are required before making any real changes. (Yes, we made one change to promote testability by repackaging a script to be a function.)

We're aimed to start disentangling the HTML and CGI overheads from the application and narrowing our focus onto the useful things it does.