At this point, we have clex.py, which uses sqlpreproc.py to create a proper lexer for C source code. We use PLY's ANSI C parser (cparse.py) as the backbone of our own analysis of C.
The C Language.
Separate from the lexical structure of C (the spelling and punctuation part) and separate from the semantic structure of C (what it does at run time), there's the syntactic structure of C: the constructs we "mean" when we write C source. These constructs have a structure at compile time, which generates behavior at run time.
Principally, a C source file is a series of declarations. We declare variables and functions. One of those functions, main, is special.
To model this, we'll define a module, language, which has class definitions for various syntax structures. We'll use this language module in our ANSI C parser (cparse) to build useful objects that we can use for reporting and analysis.
The Declaration.
Here's a simplistic superclass for the family tree of C declarations. This covers just enough structure to capture the essence of the declaration. We're not compiling, merely analyzing.
Since PLY makes it so easy to simply use tuples, many of the syntax rules will create normalized tuples with None filling in for missing or optional elements of the language.
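For example, a declaration like int x = f(); might reach our Declaration class as a tuple along these lines. The nesting follows the comments in the class below ( ( 'decl', pointer, directDecl ) and ( 'init', expr ) ), but the exact shapes are illustrative, not copied from cparse:

# Illustrative only -- the precise nesting depends on the cparse rules.
( 'decl', 'int',
  [ ( '=',
      ( 'decl', None, 'x' ),                # declarator: no pointer, name "x"
      ( 'init', ( 'call', 'f', None ) ) )   # initializer containing a call
  ] )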
class Declaration( object ):
    """Built from a 3-tuple: ( 'decl', type, declList )"""
    def __init__( self, declStr, typeSpec, declList ):
        self.typeSpec= typeSpec
        self.declList= declList
        self.body= []   # Empty = not a function declaration
        if self.declList is None:
            return
        for d in self.declList:
            if d[0] == '=':
                # 3-tuple: ( '=', decl, init )
                # analyze this declaration to get the variable name
                # analyze the initializer to locate function calls
                _, decl, init = d
                # Dig into the initializer and build an Expression instance
                _, initBody= init
            elif d[0] == 'decl':
                # analyze this declaration to get the variable name
                decl= d
                init= None
            else:
                raise Exception( d )
            # decl is 3-tuple( 'decl', pointer, directDecl )
            # init is 2-tuple( 'init', expr )
    def __str__( self ):
        return "%s %s" % ( self.typeSpec, self.declList )
    def symbol( self ):
        return None
    def references( self ):
        # TODO: initializers can involve function calls!
        # However, in C they have to be "constant" expressions which can be
        # evaluated at compile time, so the cross-references are rarely very
        # interesting. In C++ (and Java) they aren't restricted in this way.
        return []
Within cparse, we use this as follows. We'll extend the p_external_declaration_2 rule with a call to build a Declaration from the syntax accumulated in the parser state tuple, t.
import language as C

def p_external_declaration_2(t):
    'external_declaration : declaration'
    #print 'declaration', t[1]
    t[0]= C.Declaration( *t[1] )
The Function Declaration.
Really, we're interested in function declarations more than variables or references. A function declaration is a subclass of declaration that introduces some additional details, like a body, and some analysis methods.
class Function( Declaration ):
    """Built from a 5-tuple:
    ( 'def', specifiers, declarator, declaration_list, statement )
    """
    def __init__( self, defStr, declaration_specifiers, declarator,
                  declaration_list, compound_statement ):
        self.spec= declaration_specifiers
        _, self.ptr, direct_declarator = declarator
        if direct_declarator[0] == 'ddecl()':
            # New-style function declaration
            _, direct_declarator, self.args = direct_declarator
            _, self.name= direct_declarator
        else:
            self.name= declarator   # actually a TUPLE of stuff
            self.args= declaration_list
        self.defs= compound_statement.decl
        self.body= compound_statement.body
    def __str__( self ):
        if self.defs:
            defsTxt= ";\n".join( map( str, self.defs ) )
        else:
            defsTxt= ""
        bodyTxt= ";\n".join( map( str, self.body ) )
        return "def %s %s (%s) {\n%s\n%s\n}" % (
            self.spec, self.name, self.args, defsTxt, bodyTxt )
    def symbol( self ):
        return self.name
    def references( self ):
        # TODO: The declarations in self.defs could involve function calls!
        refs= set()
        for stmt in self.body:
            #print " ", stmt
            refNames= [ r[1] for r in stmt.references() ]
            refs |= set(refNames)
        return refs
Within cparse, we use this as follows. We'll extend the p_external_declaration_1 rule with a call to build a Function from the syntax accumulated in the parser state tuple, t.
import language as C

def p_external_declaration_1(t):
    'external_declaration : function_definition'
    t[0]= C.Function( *t[1] )
Expressions.
The body of a function declaration is a sequence of statements. A statement either is an expression, or contains expressions. The expression is the lowest-level unit of grammar that we're interested in. Here's a declaration for an Expression class to support analysis of expressions in C.
class Expression( object ):
    def __init__( self, tree ):
        self.tree= tree
    def __str__( self ):
        return str( self.tree )
    def refsList( self ):
        # any calls? must dig recursively into the expression's syntax tree
        return self.walkTree( self.tree )
    def walkTree( self, aTree ):
        refs= []
        if isinstance(aTree,tuple) and aTree[0] == 'call':
            # AHA! - a function call
            refs.append( aTree )
        # Even if we found a call, descend into the arguments, also.
        if isinstance(aTree,tuple):
            for subExpr in aTree[1:]:
                if subExpr and isinstance(subExpr,tuple):
                    sub= self.walkTree( subExpr )
                    if sub:
                        refs.extend( sub )
        return refs
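As a quick check, here's refsList applied to a hand-built tree. The tuple shapes are assumptions for illustration (a call as ( 'call', name, arguments )); the real shapes come from the cparse rules:

e= Expression( ( 'call', 'log_error',
                 ( 'call', 'strerror', 'errno' ) ) )
for ref in e.refsList():
    print ref[1]
# prints log_error, then strerror -- both calls, including the nested one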
While this could be used in cparse as each expression is parsed, we're too lazy to do that properly. Instead, we'll build expressions as part of assembling each Statement. The idea is to build a small syntax tree with only the parts we're going to analyze, ignoring numerous other details of the C language.
Statements.
C has a large number of statement types. We won't dig into each type, but will show a few representative types and how they are built by our parser.
The Statement superclass has the following definition.
class Statement( object ):
    def __init__( self, tree ):
        self.tree= tree
    def __str__( self ):
        return str( self.tree )
    def references( self ):
        # any calls? must dig recursively into the statement's syntax tree
        raise NotImplementedError( repr(self.tree) )
When we recognize a statement in the parser, we use the following factory function to map the syntax into a useful subclass of Statement. The global stmtFactory dictionary isn't complete, but it handles the statements in the 10,000 lines of source we're analyzing. Whenever we fail to find an appropriate subclass of Statement, we use the superclass, which (eventually) raises a NotImplementedError, and we can then define the needed Statement subclass.
stmtFactory = {
    '{': CompoundStatement,
    'return': Return,
    'for': For,
    'while': While,
    'do': While,    # Same structure, different semantics
    'if': If,
    'switch': Switch,
    'case': Case,
    'default': Default,
    'break': Empty,
    'continue': Empty,
    'goto': Empty,
    'cast': Cast,
    'call': Call,
    '+=': Assignment,
    '-=': Assignment,
    '=': Assignment,
    '--': IncDec,
    '++': IncDec,
    'expr': ExprStmt,
}

def makeStatement( *args ):
    # Factory for subclasses of Statement
    try:
        cn= stmtFactory.get( args[0], Statement )
        return cn( args )
    except TypeError, e:
        print "***"
        print e
        print repr(args)
        raise
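A quick sanity check of the factory, using the ExprStmt subclass shown below (the expression tuple here is illustrative):

s= makeStatement( 'expr', ( 'call', 'commit_work', None ) )
print s.__class__.__name__   # ExprStmt
print s.references()         # [('call', 'commit_work', None)]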
Here are two typical cparse rules for recognizing statements and using makeStatement to create a Statement instance.
# iteration_statement:
def p_iteration_statement_2(t):
    'iteration_statement : FOR LPAREN expression_opt SEMI expression_opt SEMI expression_opt RPAREN statement'
    t[0]= C.makeStatement( 'for', (t[3], t[5], t[7]), t[9] )

# expression-statement:
def p_expression_statement(t):
    'expression_statement : expression_opt SEMI'
    t[0]= C.makeStatement( 'expr', t[1] )
Statement Subclasses.
Rather than present all of the subclasses of Statement, here are two that match the iteration_statement and expression_statement syntax categories.
class For( Statement ):
    def __init__( self, tree ):
        super( For, self ).__init__( tree )
        _, exprTuple, self.body = self.tree
        ex1, ex2, ex3 = exprTuple
        self.ex1= Expression( ex1 )
        self.ex2= Expression( ex2 )
        self.ex3= Expression( ex3 )
    def references( self ):
        refs= self.ex1.refsList() + self.ex2.refsList() + self.ex3.refsList()
        refs.extend( self.body.references() )
        return refs

class ExprStmt( Statement ):
    def __init__( self, tree ):
        super( ExprStmt, self ).__init__( tree )
        _, expr= self.tree
        self.expr= Expression( expr )
    def references( self ):
        return self.expr.refsList()
How It Fits.
Here's a quick review of how the whole process fits together. Essentially, the main function is the parser, inside cparse.py. The parser is called yacc.parse, and is built secretly when cparse is imported. The parser consumes a sequence of tokens, produced by the lexer. When the parser recognizes a specific syntax construct, it executes the body of the function tied to that syntax rule. This function may create a Declaration, a Function, or call C.makeStatement to create an appropriate subclass of Statement. Some parser functions accumulate tuples of other syntax elements, saving them until the higher-level constructs get created.
The lexer, clex, is used by the parser to break C language source into individual tokens: keywords, identifiers, punctuation marks. The lexer, in turn, relies on sqlpreproc to handle the embedded SQL and CPP constructs mashed into the C source code.
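If you want to watch the token stream itself, something like the following works, assuming clex exposes a standard PLY lexer object named lexer (an assumption about clex's internals; adjust the name to match yours):

import clex

# Assumes clex.lexer is a PLY lexer; the C fragment is just a sample input.
clex.lexer.input( 'if (x != NULL) { f(x); }' )
while True:
    tok= clex.lexer.token()   # next token, or None at end of input
    if not tok:
        break
    print tok.type, tok.value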
Analytical Programs.
Here's an analytical program which examines the C source files the client gave us. Note the extension to the typedef handling described in Part 3. As the parser trips over typedefs, we accumulate the list manually, rather than correctly handing new type names from the parser back to the lexer. The parse function parses a single file and returns the sequence of declarations (the syntax tree) in that file. The analyzeDefCall function examines each declaration, looking for function definitions and function calls.
"""Parse and Analyze the legacy source.""" import clex import cparse import sqlpreproc import language import pprint, os.path # HACK: rather than examine typedef statements, we simply force the type names # into the lexical scanner. clex.typedefs.extend( [ 'AccountType', 'PaymentType', 'SettleType', 'LogLevel', 'Condition', 'MethodType', 'ConditionPtr', 'VISIT', 'Parameter', # subtle issue here--- this is a typedef in the .c file :-( 'DocCombo', 'InvoiceType', 'ModeType', 'StatusType', 'FlagType' ] ) def parse(fileName,debug=0): source= file(fileName,"r").read() makeFlags=['DYNAMICSQL','REREAD','SUB_COMMIT','MACRO_LOCK','MATCH_PATH'] headerDefs=['USE_HIGHEST','PARAMETER_FILE'] sqlText, statements = sqlpreproc.sqlpreproc(source) cppText, definitions = sqlpreproc.cpp(sqlText,set(headerDefs+makeFlags)) print "SQL: ", len(statements) pprint.pprint( statements ) tree= cparse.yacc.parse(cppText, debug=debug ) return tree def analyzeDefCall( tree ): print "----Analysis----" symbols= {} for decl in tree: if decl.symbol(): symbols[decl.symbol()]= decl pprint.pprint( symbols ) print "XREF" for decl in tree: if decl.body: print "%s\t%s" % ( decl.name, "\t".join(decl.references()) ) else: print "Calls in", decl def analyzeSource( dir, debug=0 ): import glob files= glob.glob(os.path.join(dir,'*.c')) for f in files: print print f tree= parse( f, debug ) analyzeDefCall( tree )
The analyzeSource function is the top-level entry point for the analysis.
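A top-level driver can be as small as this; the directory name is, of course, hypothetical:

if __name__ == "__main__":
    analyzeSource( "client_source" )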
Results.
We know that there are 162 distinct functions defined. We can, based on source file and other hints, narrow this to about 93 functions that are truly relevant. Within these functions, we can extract cross reference information that serves as a checklist to be sure that each function is completely analyzed.
Additionally, we can partition the function definitions into "primitive", "moderate", and "complex". About one third of the definitions appear to be primitive functions that call two or fewer other functions and seem to have a simple, clear purpose. Another third are complex functions that reference seven or more other functions and are difficult to characterize. The remaining third call between three and six other functions and are of moderate complexity.
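The partitioning itself is a few lines on top of the references sets. Here's a sketch using the thresholds above; the complexity helper is ours, not part of the tool as shown:

def complexity( fn ):
    # Bucket a Function by how many distinct functions it calls.
    n= len( fn.references() )
    if n <= 2:
        return "primitive"
    elif n <= 6:
        return "moderate"
    else:
        return "complex"

# Given a tree returned by parse():
for decl in tree:
    if decl.body:   # function definitions only
        print decl.name, complexity( decl )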
This analysis helps us narrow our focus to about 20% of the function definitions (31 out of 162) which seem to do all the work. This is not a subjective evaluation, but is based on simple scanning of the syntax using extremely powerful analytical tools. Python allows us to rapidly modify, extend and adapt these tools, producing useful, relevant outputs with minimal effort.