pylit.py¶
Literate programming with reStructuredText¶
Date: $Date$
Revision: $Revision$
URL: $URL$
Copyright: © 2005, 2007 Günter Milde. Released without warranty under the terms of the GNU General Public License (v. 2 or later)
"""pylit: bidirectional text <-> code converter
Convert between a *text source* with embedded computer code and a *code source*
with embedded documentation.
"""
Frontmatter¶
Changelog¶
0.1 | 2005-06-29 | Initial version. |
0.1.1 | 2005-06-30 | First literate version. |
0.1.2 | 2005-07-01 | Object orientated script using generators. |
0.1.3 | 2005-07-10 | Two state machine (later added ‘header’ state). |
0.2b | 2006-12-04 | Start of work on version 0.2 (code restructuring). |
0.2 | 2007-01-23 | Published at http://pylit.berlios.de. |
0.2.1 | 2007-01-25 | Outsourced non-core documentation to the PyLit pages. |
0.2.2 | 2007-01-26 | New behaviour of diff function. |
0.2.3 | 2007-01-29 | New header methods after suggestion by Riccardo Murri. |
0.2.4 | 2007-01-31 | Raise Error if code indent is too small. |
0.2.5 | 2007-02-05 | New command line option --comment-string. |
0.2.6 | 2007-02-09 | Add section with open questions, Code2Text: let only blank lines (no comment str) separate text and code, fix Code2Text.header. |
0.2.7 | 2007-02-19 | Simplify Code2Text.header, new iter_strip method replacing a lot of if-s. |
0.2.8 | 2007-02-22 | Set mtime of outfile to the one of infile. |
0.3 | 2007-02-27 | New Code2Text converter after an idea by Riccardo Murri, explicit option_defaults dict for easier customisation. |
0.3.1 | 2007-03-02 | Expand hard-tabs to prevent errors in indentation, Text2Code now also works on blocks, removed dependency on SimpleStates module. |
0.3.2 | 2007-03-06 | Bug fix: do not set language in option_defaults renamed code_languages to languages. |
0.3.3 | 2007-03-16 | New language css, option_defaults -> defaults = optparse.Values(), simpler PylitOptions: don’t store parsed values, don’t parse at initialisation, OptionValues: return None for non-existing attributes, removed -infile and -outfile, use positional arguments. |
0.3.4 | 2007-03-19, 2007-03-21 | Documentation update, separate execute function. Code cleanup in Text2Code.__iter__. |
0.3.5 | 2007-03-23 | Removed “css” from known languages after learning that there is no C++ style “// ” comment string in CSS2. |
0.3.6 | 2007-04-24 | Documentation update. |
0.4 | 2007-05-18 | Implement Converter.__iter__ as stack of iterator generators. Iterating over a converter instance now yields lines instead of blocks. Provide “hooks” for pre- and postprocessing filters. Rename states to reduce confusion with formats: “text” -> “documentation”, “code” -> “code_block”. |
0.4.1 | 2007-05-22 | Converter.__iter__: cleanup and reorganisation, rename parent class Converter -> TextCodeConverter. |
0.4.2 | 2007-05-23 | Merged Text2Code.converter and Code2Text.converter into TextCodeConverter.converter. |
0.4.3 | 2007-05-30 | Replaced use of defaults.code_extensions with values.languages.keys(). Removed spurious print statement in code_block_handler. Added basic support for ‘c’ and ‘css’ languages with dumb_c_preprocessor and dumb_c_postprocessor. |
0.5 | 2007-06-06 | Moved collect_blocks out of TextCodeConverter, bug fix: collect all trailing blank lines into a block. Expand tabs with expandtabs_filter. |
0.6 | 2007-06-20 | Configurable code-block marker (default :: ) |
0.6.1 | 2007-06-28 | Bug fix: reset self.code_block_marker_missing. |
0.7 | 2007-12-12 | prepending an empty string to sys.path in run_doctest() to allow imports from the current working dir. |
0.7.1 | 2008-01-07 | If outfile does not exist, do a round-trip conversion and report differences (as with outfile=='-'). |
0.7.2 | 2008-01-28 | Do not add missing code-block separators with doctest_run on the code source. Keeps lines consistent. |
0.7.3 | 2008-04-07 | Use value of code_block_marker for insertion of missing transition marker in Code2Text.code_block_handler Add “shell” to defaults.languages |
0.7.4 | 2008-06-23 | Add “latex” to defaults.languages |
0.7.5 | 2009-05-14 | Bugfix: ignore blank lines in test for end of code block |
0.7.6 | 2009-12-15 | language-dependent code-block markers (after a feature request and patch by jrioux), use DefaultDict for language-dependent defaults, new defaults setting add_missing_marker. |
0.7.7 | 2010-06-23 | New command line option --codeindent. |
0.7.8 | 2011-03-30 | bugfix: do not overwrite custom add_missing_marker value, allow directive options following the ‘code’ directive. |
0.7.9 | 2011-04-05 | Decode doctest string if ‘magic comment’ gives encoding. |
3.1 | 2013-09-16 | Change to Python3: print and exec statements removed. String formatting % changed to .format(). Replaced raise statements and except statements. Upgrade for dict method changes. Cleanup DefaultDict to be more easily replaced. Adjust trailing space handling to match unit tests better. Provide with statements for all file contexts. |
3.1.1 | 2019-04-05 | Add tox version checking. Change distribution slightly to fit better with PyPA standards. |
To Do List¶
- Replace home-brewed DefaultDict with collections.defaultdict.
- Replace optparse with argparse.
- Use pathlib.
_version = "3.1.1"
__docformat__ = 'restructuredtext'
Introduction¶
PyLit is a bidirectional converter between two formats of a computer program source:
- a (reStructured) text document with program code embedded in code blocks, and
- a compilable (or executable) code source with documentation embedded in comment blocks
Requirements¶
import os, sys
import re
import contextlib
import warnings
#from collections import defaultdict # TODO
import optparse
#import argparse # TODO
DefaultDict¶
As collections.defaultdict was only introduced in Python 2.5, we define a Python 3 compatible dictionary with a default value, after http://code.activestate.com/recipes/389639/:
class DefaultDict(dict):
"""Minimalistic Dictionary with default value."""
def __init__(self, default_factory=lambda:None, *args, **kwargs):
self.update(dict(*args, **kwargs))
self.default_factory = default_factory
def __getitem__(self, key):
return self.get(key, self.default_factory() )
TODO: Replace with proper Python3 collections.defaultdict
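For illustration, a minimal sketch of the planned replacement (not part of the module): collections.defaultdict exposes the same default_factory attribute used below, but inserts missing keys on look-up, whereas DefaultDict leaves the mapping unchanged:
>>> from collections import defaultdict
>>> languages = defaultdict(lambda: "python", {".py": "python", ".c": "c"})
>>> languages[".c"]
'c'
>>> languages[".xyz"]       # fallback value; note ".xyz" is inserted as a side effect
'python'
>>> languages.default_factory()
'python'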
Defaults¶
The defaults object provides a central repository for default values and their customisation.
defaults = optparse.Values()
It is used for
- the initialisation of data arguments in TextCodeConverter and PylitOptions
- completion of command line options in PylitOptions.complete_values.
This allows the easy creation of back-ends that customise the defaults and then call main e.g.:
>>> import pylit
>>> pylit.defaults.comment_string = "## "
>>> pylit.defaults.codeindent = 4
>>> pylit.main()
The following default values are defined in pylit.py:
languages¶
Mapping of code file extensions to code language:
defaults.languages = DefaultDict(lambda:"python", # fallback language
{".c": "c",
".cc": "c++",
".css": "css",
".py": "python",
".sh": "shell",
".sl": "slang",
".sty": "latex",
".tex": "latex"
})
Will be overridden by the --language command line option.
The first argument is the fallback language, used if there is no matching extension (e.g. if pylit is used as filter) and no --language is specified. It can be changed programmatically by assignment to the .default_factory attribute, e.g.
>>> defaults.languages.default_factory = lambda: 'c++'
text_extensions¶
List of known extensions of (reStructured) text files. The first extension in this list is used by the _get_outfile_name method to generate a text output filename:
defaults.text_extensions = [".txt", ".rst"]
comment_strings¶
Comment strings for known languages. Used in Code2Text to recognise text blocks and in Text2Code to format text blocks as comments. Defaults to '# '.
Comment strings include trailing whitespace.
defaults.comment_strings = DefaultDict(lambda:'# ',
{"css": '// ',
"c": '// ',
"c++": '// ',
"latex": '% ',
"python": '# ',
"shell": '# ',
"slang": '% '
})
header_string¶
Marker string for a header code block in the text source. No trailing
whitespace needed as indented code follows.
Must be a valid rst directive that accepts code on the same line, e.g. '..admonition::'.
Default is a comment marker:
defaults.header_string = '..'
code_block_markers¶
Markup at the end of a documentation block. Default is Docutils’ marker for a literal block:
defaults.code_block_markers = DefaultDict(lambda:'::')
The code_block_marker string is inserted into a regular expression. Language-specific markers can be defined programmatically, e.g. in a wrapper script.
In a document where code examples are only one of several uses of literal blocks, it is more appropriate to single out the source code, e.g. with the double colon on a separate line (“expanded form”)
defaults.code_block_markers.default_factory = lambda: ':: *'
or a dedicated .. code-block:: directive [1]
defaults.code_block_markers['c++'] = '.. code-block:: *c++'
The latter form also allows code in different languages to be kept together in one literate source file.
[1] The .. code-block:: directive is not (yet) supported by standard Docutils. It is provided by several add-ons, including the code-block directive project in the Docutils Sandbox and Sphinx.
strip_marker¶
Strip the literal marker from the end of documentation blocks when converting to code format. Makes the code more concise but loses the synchronisation of line numbers in text and code formats. Can also be used (together with the auto-completion of the code-text conversion) to change the code_block_marker:
defaults.strip_marker = False
add_missing_marker¶
When converting from code format to text format, add a code_block_marker at the end of documentation blocks if it is missing:
defaults.add_missing_marker = True
Keep this at True if you want to re-convert to code format later!
preprocessors, postprocessors¶
Preprocess and postprocess the data with language-specific filters. Both dictionaries are filled in the Filters section below (the postprocessors dictionary is required there by the filter registration):
defaults.preprocessors = {}
defaults.postprocessors = {}
codeindent¶
Number of spaces to indent code blocks in Code2Text.code_block_handler:
defaults.codeindent = 2
In Text2Code.code_block_handler, the codeindent is determined by the first recognised code line (header or first indented literal block of the text source).
Converter Classes¶
The converter classes implement a simple state machine to separate and transform documentation and code blocks. For this task, only a very limited parsing is needed. PyLit’s parser assumes:
- indented literal blocks in a text source are code blocks.
- comment blocks in a code source where every line starts with a matching comment string are documentation blocks.
TextCodeConverter¶
class TextCodeConverter:
"""Parent class for the converters `Text2Code` and `Code2Text`.
"""
The parent class defines data attributes and functions used in both Text2Code converting a text source to executable code source, and Code2Text converting commented code to a text source.
Data attributes¶
Class default values are fetched from the defaults object and can be overridden by matching keyword arguments during class instantiation. This also works with keyword arguments to get_converter and main, as these functions pass on unused keyword args to the instantiation of a converter class.
language = defaults.languages.default_factory()
comment_strings = defaults.comment_strings
comment_string = "" # set in __init__ (if empty)
codeindent = defaults.codeindent
header_string = defaults.header_string
code_block_markers = defaults.code_block_markers
code_block_marker = "" # set in __init__ (if empty)
strip = defaults.strip
strip_marker = defaults.strip_marker
add_missing_marker = defaults.add_missing_marker
directive_option_regexp = re.compile(r' +:(\w|[-._+:])+:( |$)')
state = "" # type of current block, see `TextCodeConverter.convert`_
Interface methods¶
__init__¶
Initialising sets the data attribute, an iterable object yielding lines of the source to convert. [2]
[2] The most common choice of data is a file object with the text or code source. To convert a string into a suitable object, use its splitlines(True) method.
Additional keyword arguments are stored as instance variables, overwriting the class defaults:
def __init__(self, data, **keyw):
"""data -- iterable data object
(list, file, generator, string, ...)
**keyw -- remaining keyword arguments are
stored as data-attributes
"""
self.data = data
self.__dict__.update(keyw)
If empty, code_block_marker and comment_string are set according to the language:
if not self.code_block_marker:
self.code_block_marker = self.code_block_markers[self.language]
if not self.comment_string:
self.comment_string = self.comment_strings[self.language]
self.stripped_comment_string = self.comment_string.rstrip()
Pre- and post-processing filters are set (with TextCodeConverter.get_filter):
self.preprocessor = self.get_filter("preprocessors", self.language)
self.postprocessor = self.get_filter("postprocessors", self.language)
Finally, a regular expression for the code_block_marker is compiled to find valid cases of code_block_marker in a given line and return the groups: \1 prefix, \2 code_block_marker, \3 remainder:
marker = self.code_block_marker
if marker == '::':
# the default marker may occur at the end of a text line
self.marker_regexp = re.compile(r'^( *(?!\.\.).*)(::)([ \n]*)$')
else:
# marker must be on a separate line
self.marker_regexp = re.compile('^( *)(%s)(.*\n?)$' % marker)
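For illustration, a small doctest-style sketch (hypothetical input lines, not from the module) of the groups matched for the default marker:
>>> import re
>>> marker_regexp = re.compile(r'^( *(?!\.\.).*)(::)([ \n]*)$')
>>> marker_regexp.search("The code follows::\n").groups()
('The code follows', '::', '\n')
>>> marker_regexp.search(".. image:: foo.png\n") is None
True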
__iter__¶
Return an iterator for the instance. Iteration yields lines of converted data.
The iterator is a chain of iterators acting on self.data that does
- preprocessing
- text<->code format conversion
- postprocessing
Pre- and post-processing are only performed if filters for the current language are registered in defaults.preprocessors and/or defaults.postprocessors. The filters must accept an iterable as first argument and yield the processed input data line-wise.
def __iter__(self):
"""Iterate over input data source and yield converted lines
"""
return self.postprocessor(self.convert(self.preprocessor(self.data)))
Helpers and convenience methods¶
convert¶
The convert method generates an iterator that does the actual code <-> text format conversion. The converted data is yielded line-wise and the instance's state attribute indicates whether the current line is “header”, “documentation”, or “code_block”:
def convert(self, lines):
"""Iterate over lines of a program document and convert
between "text" and "code" format
"""
Initialise internal data arguments. (Done here, so that every new iteration re-initialises them.)
- state
the “type” of the currently processed block of lines. One of:
- “”: initial state: check for header,
- “header”: leading code block: strip header_string,
- “documentation”: documentation part: comment out,
- “code_block”: literal blocks containing source code: unindent.
self.state = ""
- _codeindent
- Do not confuse the internal attribute _codeindent with the configurable codeindent (without the leading underscore).
- _codeindent is set in Text2Code.code_block_handler to the indent of first non-blank “code_block” line and stripped from all “code_block” lines in the text-to-code conversion,
- codeindent is set in __init__ to defaults.codeindent and added to “code_block” lines in the code-to-text conversion.
self._codeindent = 0
- _textindent
- set by Text2Code.documentation_handler to the minimal indent of a documentation block,
- used in Text2Code.set_state to find the end of a code block.
self._textindent = 0
- _add_code_block_marker
If the last paragraph of a documentation block does not end with a code_block_marker, it should be added (otherwise, the back-conversion fails).
_add_code_block_marker is set by Code2Text.documentation_handler and evaluated by Code2Text.code_block_handler, because the documentation_handler does not know whether the next block will be documentation (with no need for a code_block_marker) or a code block.
self._add_code_block_marker = False
Determine the state of the block and convert with the matching “handler”:
for block in collect_blocks(expandtabs_filter(lines)):
try:
self.set_state(block)
except StopIteration:
return
for line in getattr(self, self.state+"_handler")(block):
yield line
get_filter¶
def get_filter(self, filter_set, language):
"""Return language specific filter"""
if self.__class__ == Text2Code:
key = "text2"+language
elif self.__class__ == Code2Text:
key = language+"2text"
else:
key = ""
try:
return getattr(defaults, filter_set)[key]
except (AttributeError, KeyError):
# print( "there is no {0!r} filter in {1!r}".format(key, filter_set) )
pass
return identity_filter
get_indent¶
Return the number of leading spaces in line:
def get_indent(self, line):
"""Return the indentation of `string`.
"""
return len(line) - len(line.lstrip())
Text2Code¶
The Text2Code converter separates code-blocks [3] from documentation. Code blocks are unindented, documentation is commented (or filtered, if the strip option is True).
[3] Only indented literal blocks are considered code-blocks. Quoted literal blocks, parsed-literal blocks, and doctest blocks are treated as part of the documentation. This allows the inclusion of examples:
>>> 23 + 3
26
Note that there is no double colon before the doctest block in the text source.
The class inherits the interface and helper functions from TextCodeConverter and adds functions specific to the text-to-code format conversion:
class Text2Code(TextCodeConverter):
"""Convert a (reStructured) text source to code source
"""
set_state¶
def set_state(self, block):
"""Determine state of `block`. Set `self.state`.
"""
set_state() is used inside an iteration. Hence, if we are out of data, a StopIteration exception should be raised:
if not block:
raise StopIteration
The new state depends on the active state (from the last block) and features of the current block. It is either “header”, “documentation”, or “code_block”.
If the current state is “” (first block), check for the header_string indicating a leading code block:
if self.state == "":
# print( "set state for {0!r}".format(block) )
if block[0].startswith(self.header_string):
self.state = "header"
else:
self.state = "documentation"
If the current state is “documentation”, the next block is also documentation. The end of a documentation part is detected in the Text2Code.documentation_handler:
# elif self.state == "documentation":
# self.state = "documentation"
A “code_block” ends with the first less indented, non-blank line. _textindent is set by the documentation handler to the indent of the preceding documentation block:
elif self.state in ["code_block", "header"]:
indents = [self.get_indent(line) for line in block
if line.rstrip()]
# print "set_state:", indents, self._textindent
if indents and min(indents) <= self._textindent:
self.state = 'documentation'
else:
self.state = 'code_block'
TODO: (or not to do?) insert blank line before the first line with too-small codeindent using self.ensure_trailing_blank_line(lines, line) (would need split and push-back of the documentation part)?
header_handler¶
Sometimes code needs to remain on the first line(s) of the document to be valid. The most common example is the “shebang” line that tells a POSIX shell how to process an executable file:
#!/usr/bin/env python
In Python, the special comment to indicate the encoding, e.g. # -*- coding: iso-8859-1 -*-, must occur before any other comment or code too.
If we want to keep the line numbers in sync for text and code source, the reStructured Text markup for these header lines must start at the same line as the first header line. Therefore, header lines cannot be marked as a literal block (this would require the :: and an empty line above the code_block).
OTOH, a comment may start at the same line as the comment marker and it includes subsequent indented lines. Comments are visible in the reStructured Text source but hidden in the pretty-printed output.
With a header converted to comment in the text source, everything before the first documentation block (i.e. before the first paragraph using the matching comment string) will be hidden away (in HTML or PDF output).
This seems a good compromise; the advantages
- line numbers are kept
- the “normal” code_block conversion rules (indent/unindent by codeindent) apply
- greater flexibility: you can hide a repeating header in a project consisting of many source files
set off the disadvantages
- it may come as a surprise if a part of the file is not “printed”,
- one more syntax element to learn for rst newbies starting with pylit (however, starting from the code source, this will be auto-generated).
In the case that there is no matching comment at all, the complete code source will become a comment – however, in this case it is not very likely the source is a literate document anyway.
If needed for the documentation, it is possible to quote the header in (or after) the first documentation block, e.g. as parsed literal.
def header_handler(self, lines):
"""Format leading code block"""
# strip header string from first line
lines[0] = lines[0].replace(self.header_string, "", 1)
# yield remaining lines formatted as code-block
for line in self.code_block_handler(lines):
yield line
documentation_handler¶
The ‘documentation’ handler processes everything that is not recognised as “code_block”. Documentation is quoted with self.comment_string (or filtered with --strip=True).
If the end-of-documentation marker is detected,
- set state to ‘code_block’
- set self._textindent (needed by Text2Code.set_state to find the next “documentation” block)
def documentation_handler(self, lines):
"""Convert documentation blocks from text to code format
"""
for line in lines:
# test lines following the code-block marker for false positives
if (self.state == "code_block" and line.rstrip()
and not self.directive_option_regexp.search(line)):
self.state = "documentation"
# test for end of documentation block
if self.marker_regexp.search(line):
self.state = "code_block"
self._textindent = self.get_indent(line)
# yield lines
if self.strip:
continue
# do not comment blank lines preceding a code block
if self.state == "code_block" and not line.rstrip():
yield line
else:
# Avoid a trailing space on a comment-only line.
if line.rstrip():
yield self.comment_string + line
else:
yield self.stripped_comment_string + line
code_block_handler¶
The “code_block” handler is called with an indented literal block. It removes leading whitespace up to the indentation of the first code line in the file (this deviation from Docutils behaviour allows indented blocks of Python code).
def code_block_handler(self, block):
"""Convert indented literal blocks to source code format
"""
If still unset, determine the indentation of code blocks from first non-blank code line:
if self._codeindent == 0:
self._codeindent = self.get_indent(block[0])
Yield unindented lines after checking whether we can safely unindent. If a line is less indented than _codeindent, something went wrong.
for line in block:
if line.lstrip() and self.get_indent(line) < self._codeindent:
raise ValueError("code block contains line less indented " \
"than {0:d} spaces \n{1!r}".format(self._codeindent, block))
yield line.replace(" "*self._codeindent, "", 1)
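A minimal usage sketch with a hypothetical text snippet (the result assumes the Python defaults defined above):
>>> text = ["A short example::\n", "\n", "  print('hello')\n", "\n"]
>>> "".join(Text2Code(text, language="python"))
"# A short example::\n\nprint('hello')\n\n"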
Code2Text¶
The Code2Text converter does the opposite of Text2Code – it processes a source in “code format” (i.e. in a programming language), extracts documentation from comment blocks, and puts program code in literal blocks.
The class inherits the interface and helper functions from TextCodeConverter and adds functions specific to the code-to-text format conversion:
class Code2Text(TextCodeConverter):
"""Convert code source to text source
"""
set_state¶
Check if block is “header”, “documentation”, or “code_block”:
A paragraph is “documentation” if every non-blank line starts with a matching comment string (including whitespace except for commented blank lines).
def set_state(self, block):
"""Determine state of `block`."""
for line in block:
# skip documentation lines (commented, blank or blank comment)
if (line.startswith(self.comment_string)
or not line.rstrip()
or line.rstrip() == self.comment_string.rstrip()
):
continue
# non-commented line found:
if self.state == "":
self.state = "header"
else:
self.state = "code_block"
break
else:
# no code line found
# keep state if the block is just a blank line
# if len(block) == 1 and self._is_blank_codeline(line):
# return
self.state = "documentation"
header_handler¶
Handle a leading code block. (See Text2Code.header_handler for a discussion of the “header” state.)
def header_handler(self, lines):
"""Format leading code block"""
if self.strip == True:
return
# get iterator over the lines that formats them as code-block
lines = iter(self.code_block_handler(lines))
# prepend header string to first line
yield self.header_string + next(lines)
# yield remaining lines
for line in lines:
yield line
documentation_handler¶
The documentation state handler converts a comment to a documentation block by stripping the leading comment string from every line:
def documentation_handler(self, block):
"""Uncomment documentation blocks in source code
"""
Strip comment strings:
lines = [self.uncomment_line(line) for line in block]
If the code block is stripped, the literal marker would lead to an error when the text is converted with Docutils. Strip it as well.
if self.strip or self.strip_marker:
self.strip_code_block_marker(lines)
Otherwise, check for the code_block_marker at the end of the documentation block (skipping directive options that might follow it):
elif self.add_missing_marker:
for line in lines[::-1]:
if self.marker_regexp.search(line):
self._add_code_block_marker = False
break
if (line.rstrip() and
not self.directive_option_regexp.search(line)):
self._add_code_block_marker = True
break
else:
self._add_code_block_marker = True
Yield lines:
for line in lines:
yield line
uncomment_line¶
Return documentation line after stripping comment string. Consider the case that a blank line has a comment string without trailing whitespace:
def uncomment_line(self, line):
"""Return uncommented documentation line"""
line = line.replace(self.comment_string, "", 1)
if line.rstrip() == self.stripped_comment_string:
line = line.replace(self.stripped_comment_string, "", 1)
return line
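For example, a small sketch assuming the default Python comment string '# ':
>>> c2t = Code2Text([], language="python")
>>> c2t.uncomment_line("# some documentation\n")
'some documentation\n'
>>> c2t.uncomment_line("#\n")
'\n'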
code_block_handler¶
The code_block handler returns the code block as an indented literal block (or filters it, if self.strip == True). The amount of the code indentation is controlled by self.codeindent (default 2).
def code_block_handler(self, lines):
"""Covert code blocks to text format (indent or strip)
"""
if self.strip == True:
return
# insert transition marker if needed
if self._add_code_block_marker:
self.state = "documentation"
yield self.code_block_marker + "\n"
yield "\n"
self._add_code_block_marker = False
self.state = "code_block"
for line in lines:
# Avoid a trailing space on a blank line.
if line.rstrip():
yield " "*self.codeindent + line
else:
yield line
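A minimal usage sketch, the inverse of the Text2Code example above (hypothetical input, assuming the Python defaults):
>>> code = ["# A short example::\n", "\n", "print('hello')\n", "\n"]
>>> "".join(Code2Text(code, language="python"))
"A short example::\n\n  print('hello')\n\n"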
strip_code_block_marker¶
Replace the literal marker with the equivalent of Docutils replace rules:
- strip the :: line (and preceding blank line) if it is on a line of its own,
- strip the :: if it is preceded by whitespace,
- convert the :: to a single colon if preceded by text.
lines is a list of documentation lines (with a trailing blank line). It is modified in-place:
def strip_code_block_marker(self, lines):
try:
line = lines[-2]
except IndexError:
return # just one line (no trailing blank line)
# match with regexp: `match` is None or has groups
# \1 leading text, \2 code_block_marker, \3 remainder
match = self.marker_regexp.search(line)
if not match: # no code_block_marker present
return
if not match.group(1): # `code_block_marker` on an extra line
del(lines[-2])
# delete preceding line if it is blank
if len(lines) >= 2 and not lines[-2].lstrip():
del(lines[-2])
elif match.group(1).rstrip() < match.group(1):
# '::' follows whitespace
lines[-2] = match.group(1).rstrip() + match.group(3)
else: # '::' follows text
lines[-2] = match.group(1).rstrip() + ':' + match.group(3)
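For example, a hypothetical sketch (a documentation block always carries a trailing blank line):
>>> c2t = Code2Text([], language="python")
>>> lines = ["The code follows::\n", "\n"]
>>> c2t.strip_code_block_marker(lines)
>>> lines
['The code follows:\n', '\n']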
Filters¶
Filters allow pre- and post-processing of the data to bring it into a format suitable for the “normal” text<->code conversion. An example is the conversion of C /* */ comments into C++ // comments (and back).
Filters are generator functions that return an iterator acting on a data iterable and yielding processed data lines.
identity_filter¶
The most basic filter is the identity filter, which returns its argument as an iterator:
def identity_filter(data):
"""Return data iterator without any processing"""
return iter(data)
expandtabs_filter¶
Expand hard-tabs in every line of data (cf. str.expandtabs).
This filter is applied to the input data by TextCodeConverter.convert as hard tabs can lead to errors when the indentation is changed.
def expandtabs_filter(data):
"""Yield data tokens with hard-tabs expanded"""
for line in data:
yield line.expandtabs()
collect_blocks¶
A filter to aggregate “paragraphs” (blocks separated by blank lines). Yields lists of lines:
def collect_blocks(lines):
"""collect lines in a list
yield list for each paragraph, i.e. block of lines separated by a
blank line (whitespace only).
Trailing blank lines are collected as well.
"""
blank_line_reached = False
block = []
for line in lines:
if blank_line_reached and line.rstrip():
yield block
blank_line_reached = False
block = [line]
continue
if not line.rstrip():
blank_line_reached = True
block.append(line)
yield block
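For example, with a hypothetical input:
>>> list(collect_blocks(["one\n", "two\n", "\n", "three\n"]))
[['one\n', 'two\n', '\n'], ['three\n']]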
dumb_c_preprocessor¶
This is a basic filter to convert C to C++ comments. Works line-wise and only converts lines that
- start with “/* ” and end with ” */” (followed by whitespace only)
A more sophisticated version would also
- convert multi-line comments
- Keep indentation or strip 3 leading spaces?
- account for nested comments
- only convert comments that are separated from code by a blank line
def dumb_c_preprocessor(data):
"""change `C` ``/* `` `` */`` comments into C++ ``// `` comments"""
comment_string = defaults.comment_strings["c++"]
boc_string = "/* "
eoc_string = " */"
for line in data:
if (line.startswith(boc_string)
and line.rstrip().endswith(eoc_string)
):
line = line.replace(boc_string, comment_string, 1)
line = "".join(line.rsplit(eoc_string, 1))
yield line
Unfortunately, the replace method of strings does not support negative numbers for the count argument:
>>> "foo */ baz */ bar".replace(" */", "", -1) == "foo */ baz bar"
False
However, there is the rsplit method, that can be used together with join:
>>> "".join("foo */ baz */ bar".rsplit(" */", 1)) == "foo */ baz bar"
True
dumb_c_postprocessor¶
Undo the preparations by the dumb_c_preprocessor and re-insert valid comment delimiters:
def dumb_c_postprocessor(data):
"""change C++ ``// `` comments into `C` ``/* `` `` */`` comments"""
comment_string = defaults.comment_strings["c++"]
boc_string = "/* "
eoc_string = " */"
for line in data:
if line.rstrip() == comment_string.rstrip():
line = line.replace(comment_string, "", 1)
elif line.startswith(comment_string):
line = line.replace(comment_string, boc_string, 1)
line = line.rstrip() + eoc_string + "\n"
yield line
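A round-trip sketch with hypothetical C source lines:
>>> c_lines = ["/* A comment */\n", "int i = 0;\n"]
>>> list(dumb_c_preprocessor(c_lines))
['// A comment\n', 'int i = 0;\n']
>>> list(dumb_c_postprocessor(['// A comment\n', 'int i = 0;\n']))
['/* A comment */\n', 'int i = 0;\n']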
register filters¶
defaults.preprocessors['c2text'] = dumb_c_preprocessor
defaults.preprocessors['css2text'] = dumb_c_preprocessor
defaults.postprocessors['text2c'] = dumb_c_postprocessor
defaults.postprocessors['text2css'] = dumb_c_postprocessor
Command line use¶
Using this script from the command line will convert a file according to its extension. This default can be overridden by a couple of options.
Dual source handling¶
How to determine which source is up-to-date?¶
- set modification date of outfile to the one of infile
  Points out that the source files are ‘synchronised’.
  - Are there problems to expect from “backdating” a file? Which?
    Looking at http://www.unix.com/showthread.php?t=20526, it seems perfectly legal to set mtime (while leaving ctime) as mtime is a description of the “actuality” of the data in the file.
  - Should this become a default or an option?
- alternatively move the input file to a backup copy (with option: --replace)
- check modification date before overwriting (with option: --overwrite=update)
- check modification date before editing (implemented as Jed editor function pylit_check() in pylit.sl)
Recognised Filename Extensions¶
Instead of defining a new extension for “pylit” literate programs, by default .txt will be appended for the text source and stripped by the conversion to the code source. I.e. for a Python program foo:
- the code source is called foo.py
- the text source is called foo.py.txt
- the html rendering is called foo.py.html
OptionValues¶
The following class adds as_dict, complete and __getattr__ methods to optparse.Values:
class OptionValues(optparse.Values):
as_dict¶
For use as keyword arguments, it is handy to have the options in a dictionary. as_dict returns a copy of the instance’s object dictionary:
def as_dict(self):
"""Return options as dictionary object"""
return self.__dict__.copy()
complete¶
def complete(self, **keyw):
"""
Complete the option values with keyword arguments.
Do not overwrite existing values. Only use arguments that do not
have a corresponding attribute in `self`,
"""
for key in keyw:
if key not in self.__dict__:
setattr(self, key, keyw[key])
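For example, a small sketch with hypothetical values:
>>> values = OptionValues({'infile': 'foo.py'})
>>> values.complete(infile='bar.py', txt2code=False)
>>> values.infile                   # existing value is kept
'foo.py'
>>> values.txt2code                 # missing value is completed
False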
__getattr__¶
To replace calls using options.ensure_value("OPTION", None) with the more concise options.OPTION, we define __getattr__ [4]:
def __getattr__(self, name):
"""Return default value for non existing options"""
return None
[4] The special method __getattr__ is only called when an attribute look-up has not found the attribute in the usual places (i.e. it is not an instance attribute nor is it found in the class tree for self).
PylitOptions¶
The PylitOptions class comprises an option parser and methods for parsing and completion of command line options:
class PylitOptions:
"""Storage and handling of command line options for pylit"""
Instantiation¶
def __init__(self):
"""Set up an `OptionParser` instance for pylit command line options
"""
p = optparse.OptionParser(usage=main.__doc__, version=_version)
# Conversion settings
p.add_option("-c", "--code2txt", dest="txt2code", action="store_false",
help="convert code source to text source")
p.add_option("-t", "--txt2code", action="store_true",
help="convert text source to code source")
p.add_option("--language",
choices = list(defaults.languages.values()),
help="use LANGUAGE native comment style")
p.add_option("--comment-string", dest="comment_string",
help="documentation block marker in code source "
"(including trailing whitespace, "
"default: language dependent)")
p.add_option("-m", "--code-block-marker", dest="code_block_marker",
help="syntax token starting a code block. (default '::')")
p.add_option("--codeindent", type="int",
help="Number of spaces to indent code blocks with "
"text2code (default {0:d})".format(defaults.codeindent))
# Output file handling
p.add_option("--overwrite", action="store",
choices = ["yes", "update", "no"],
help="overwrite output file (default 'update')")
p.add_option("--replace", action="store_true",
help="move infile to a backup copy (appending '~')")
p.add_option("-s", "--strip", action="store_true",
help='"export" by stripping documentation or code')
# Special actions
p.add_option("-d", "--diff", action="store_true",
help="test for differences to existing file")
p.add_option("--doctest", action="store_true",
help="run doctest.testfile() on the text version")
p.add_option("-e", "--execute", action="store_true",
help="execute code (Python only)")
self.parser = p
parse_args¶
The parse_args method calls the optparse.OptionParser on command line or provided args and returns the result as an OptionValues instance. Defaults can be provided as keyword arguments:
def parse_args(self, args=sys.argv[1:], **keyw):
"""parse command line arguments using `optparse.OptionParser`
parse_args(args, **keyw) -> OptionValues instance
args -- list of command line arguments.
keyw -- keyword arguments or dictionary of option defaults
"""
# parse arguments
(values, args) = self.parser.parse_args(args, OptionValues(keyw))
# Convert FILE and OUTFILE positional args to option values
# (other positional arguments are ignored)
try:
values.infile = args[0]
values.outfile = args[1]
except IndexError:
pass
return values
complete_values¶
Complete an OptionValues instance values. Use module-level defaults and context information to set missing option values to sensible defaults (if possible)
def complete_values(self, values):
"""complete option values with module and context sensible defaults
x.complete_values(values) -> values
values -- OptionValues instance
"""
Complete with module-level defaults:
values.complete(**defaults.__dict__)
Ensure infile is a string:
values.ensure_value("infile", "")
Guess conversion direction from infile filename:
if values.txt2code is None:
in_extension = os.path.splitext(values.infile)[1]
if in_extension in values.text_extensions:
values.txt2code = True
elif in_extension in values.languages.keys():
values.txt2code = False
Auto-determine the output file name:
values.ensure_value("outfile", self._get_outfile_name(values))
Second try: Guess conversion direction from outfile filename:
if values.txt2code is None:
out_extension = os.path.splitext(values.outfile)[1]
values.txt2code = not (out_extension in values.text_extensions)
Set the language of the code:
if values.txt2code is True:
code_extension = os.path.splitext(values.outfile)[1]
elif values.txt2code is False:
code_extension = os.path.splitext(values.infile)[1]
values.ensure_value("language", values.languages[code_extension])
return values
_get_outfile_name¶
Construct a matching filename for the output file. The output filename is constructed from infile by the following rules:
- ‘-‘ (stdin) results in ‘-‘ (stdout)
- strip the text_extension (txt2code) or
- add the text_extension (code2txt)
- fallback: if no guess can be made, add “.out”
def _get_outfile_name(self, values):
"""Return a matching output filename for `infile`
"""
# if input is stdin, default output is stdout
if values.infile == '-':
return '-'
# Derive from `infile` name: strip or add text extension
(base, ext) = os.path.splitext(values.infile)
if ext in values.text_extensions:
return base # strip
if ext in values.languages.keys() or values.txt2code == False:
return values.infile + values.text_extensions[0] # add
# give up
return values.infile + ".out"
__call__¶
The special __call__ method allows using PylitOptions instances as callables: calling an instance parses the argument list to extract option values and completes them based on “context-sensitive defaults”. Keyword arguments are passed to PylitOptions.parse_args as default values.
def __call__(self, args=sys.argv[1:], **keyw):
"""parse and complete command line args return option values
"""
values = self.parse_args(args, **keyw)
return self.complete_values(values)
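A usage sketch with a hypothetical file name, following the completion rules described above:
>>> options = PylitOptions()(["foo.py.txt"])
>>> options.infile
'foo.py.txt'
>>> options.txt2code                # guessed from the ".txt" extension
True
>>> options.outfile                 # text extension stripped
'foo.py'
>>> options.language                # guessed from the ".py" extension
'python'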
Helper functions¶
open_streams¶
Return file objects for in- and output. If the input path is missing, write usage and abort. (An alternative would be to use stdin as default. However, this leaves the uninitiated user with a non-responding application if (s)he just tries the script without any arguments)
Note that the output file encoding is important in Python3. Taking the default may not match the encoding in the file comments.
@contextlib.contextmanager
def open_streams(infile = '-', outfile = '-', overwrite='update', **keyw):
"""Open and return the input and output stream
open_streams(infile, outfile) -> (in_stream, out_stream)
in_stream -- open(infile) or sys.stdin
out_stream -- open(outfile) or sys.stdout
overwrite -- 'yes': overwrite `outfile` if it exists,
'update': fail if the `outfile` is newer than `infile`,
'no': fail if `outfile` exists.
Irrelevant if `outfile` == '-'.
"""
if not infile:
strerror = "Missing input file name ('-' for stdin; -h for help)"
raise IOError(2, strerror, infile)
if infile == '-':
in_stream = sys.stdin
else:
in_stream = open(infile, 'r')
if outfile == '-':
out_stream = sys.stdout
elif overwrite == 'no' and os.path.exists(outfile):
raise IOError(1, "Output file exists!", outfile)
elif overwrite == 'update' and is_newer(outfile, infile):
raise IOError(1, "Output file is newer than input file!", outfile)
else:
out_stream = open(outfile, 'w')
yield (in_stream, out_stream)
if infile != '-': in_stream.close()
if outfile != '-': out_stream.close()
The previous version was unpleasant since two open file contexts can’t easily be wrapped in a simple with statement. Here we’ve used @contextlib.contextmanager to turn this function into a proper context manager. It properly closes the files, eliminating the explicit out_stream.close() below.
is_newer¶
def is_newer(path1, path2):
"""Check if `path1` is newer than `path2` (using mtime)
Compare modification time of files at path1 and path2.
Non-existing files are considered oldest: Return False if path1 does not
exist and True if path2 does not exist.
Return None for equal modification time. (This evaluates to False in a
Boolean context but allows a test for equality.)
"""
try:
mtime1 = os.path.getmtime(path1)
except OSError:
mtime1 = -1
try:
mtime2 = os.path.getmtime(path2)
except OSError:
mtime2 = -1
# print "mtime1", mtime1, path1, "\n", "mtime2", mtime2, path2
if mtime1 == mtime2:
return None
return mtime1 > mtime2
get_converter¶
Get an instance of the converter state machine:
def get_converter(data, txt2code=True, **keyw):
if txt2code:
return Text2Code(data, **keyw)
else:
return Code2Text(data, **keyw)
Use cases¶
run_doctest¶
def run_doctest(infile="-", txt2code=True,
globs={}, verbose=False, optionflags=0, **keyw):
"""run doctest on the text source
"""
Allow imports from the current working dir by prepending an empty string to sys.path (see the documentation of sys.path):
sys.path.insert(0, '')
Import classes from the doctest module:
from doctest import DocTestParser, DocTestRunner
Read in source. Make sure it is in text format, as tests in comments are not found by doctest:
with open_streams(infile, "-") as files:
(data, out_stream) = files
if txt2code is False:
keyw.update({'add_missing_marker': False})
converter = Code2Text(data, **keyw)
docstring = str(converter)
else:
docstring = data.read()
The issue here is that older Python files may be in non-text format, and we must read the coding comment and then decode the bytes. This means that we (retroactively) should have known to open the file in “b” mode and decode the bytes.
Newer Python files are properly handled by the OS and the io module; the coding comment is redundant.
Decode the doc string if there is a “magic comment” in the first or second line (http://docs.python.org/reference/lexical_analysis.html#encoding-declarations). If the magic comment matches the encoding, there’s nothing else to do. If the magic comment does not match the encoding, we have issues.
firstlines = ' '.join(docstring.splitlines()[:2])
match = re.search(r'coding[=:]\s*([-\w.]+)', firstlines)
if match:
docencoding = match.group(1)
if data.encoding == docencoding:
pass
else:
warnings.warn( "data encoding={0} != docencoding={1}".format(data.encoding, docencoding) )
b_string = docstring.encode(data.encoding)
docstring = b_string.decode(docencoding)
Use the doctest Advanced API to run all doctests in the source text:
test = DocTestParser().get_doctest(docstring, globs, name="",
filename=infile, lineno=0)
runner = DocTestRunner(verbose, optionflags)
runner.run(test)
runner.summarize()
# give feedback also if no failures occurred
if not runner.failures:
print( "{0:d} failures in {1:d} tests".format(runner.failures, runner.tries) )
return runner.failures, runner.tries
diff¶
def diff(infile='-', outfile='-', txt2code=True, **keyw):
"""Report differences between converted infile and existing outfile
If outfile does not exist or is '-', do a round-trip conversion and
report differences.
"""
import difflib
instream = open(infile)
# for diffing, we need a copy of the data as list::
data = instream.readlines()
# convert
converter = get_converter(data, txt2code, **keyw)
new = converter()
if outfile != '-' and os.path.exists(outfile):
outstream = open(outfile)
old = outstream.readlines()
oldname = outfile
newname = "<conversion of {0!s}>".format(infile)
else:
old = data
oldname = infile
# back-convert the output data
converter = get_converter(new, not txt2code)
new = converter()
newname = "<round-conversion of {0!s}>".format(infile)
# find and print the differences
is_different = False
# print type(old), old
# print type(new), new
delta = difflib.unified_diff(old, new,
# delta = difflib.unified_diff(["heute\n", "schon\n"], ["heute\n", "noch\n"],
fromfile=oldname, tofile=newname)
for line in delta:
is_different = True
print( line, end='' )
if not is_different:
print( oldname )
print( newname )
print( "no differences found")
return is_different
TODO: Use proper with statements for the open file contexts.
execute¶
Works only for Python code.
Does not work with eval, as the code is not just one expression.
def execute(infile="-", txt2code=True, **keyw):
"""Execute the input file. Convert first, if it is a text source.
"""
with open(infile) as source:
if txt2code:
data = '\n'.join(Text2Code(source, **keyw))
else:
data = source.read()
# print( "executing " + infile, "data=", repr(data) )
exec( data )
TODO: Use a proper with statement for the open file context.
main¶
If this script is called from the command line, the main function will convert the input (file or stdin) between text and code formats.
Option default values for the conversion can be given as keyword arguments to main. The option defaults will be updated by command line options and extended with “intelligent guesses” by PylitOptions and passed on to helper functions and the converter instantiation.
This allows easy customisation for programmatic use: just call main with the appropriate keyword options, e.g. pylit.main(comment_string="## ").
def main(args=sys.argv[1:], **defaults):
"""%prog [options] INFILE [OUTFILE]
Convert between (reStructured) text source with embedded code,
and code source with embedded documentation (comment blocks)
The special filename '-' stands for standard in and output.
"""
Parse and complete the options:
options = PylitOptions()(args, **defaults)
# print "infile", repr(options.infile)
Special actions with early return:
if options.doctest:
return run_doctest(**options.as_dict())
if options.diff:
return diff(**options.as_dict())
if options.execute:
return execute(**options.as_dict())
Open in- and output streams for processing, get the converter instance, and convert and write to out_stream.
The str(converter) uses the special method __str__, which does the real work of the conversion.
try:
with open_streams(**options.as_dict()) as files:
(data, out_stream) = files
converter = get_converter(data, **options.as_dict())
out_stream.write(str(converter))
if out_stream is not sys.stdout:
print( "extract written to", out_stream.name )
except IOError as ex:
print( "IOError: {0:s} {1:s}".format(ex.filename, ex.strerror) )
sys.exit(ex.errno)
If input and output are from files, set the modification time (mtime) of the output file to the one of the input file to indicate that the contained information is equal. [5]
try:
os.utime(options.outfile, (os.path.getatime(options.outfile),
os.path.getmtime(options.infile))
)
except OSError:
pass
## print "mtime", os.path.getmtime(options.infile), options.infile
## print "mtime", os.path.getmtime(options.outfile), options.outfile
[5] Make sure the corresponding file object (here out_stream) is closed, as otherwise the change will be overwritten when close is called afterwards (either explicitly or at program exit).
Rename the infile to a backup copy if --replace is set:
if options.replace:
os.rename(options.infile, options.infile + "~")
Run main, if called from the command line:
if __name__ == '__main__':
main()
Open questions¶
Open questions and ideas for further development
Clean code¶
- can we gain from using “shutil” over “os.path” and “os”?
- use pylint or pyChecker to enforce a consistent style?
Options¶
- Use templates for the “intelligent guesses” (with Python syntax for string replacement with dicts: "hello {what:s}".format(**{'what': 'world'}))
- Is it sensible to offer the header_string option also as command line option?
treatment of blank lines¶
Alternatives: Keep blank lines blank
- “never” (current setting) -> “visually merges” all documentation if there is no interjacent code,
- “always” -> disrupts documentation blocks,
- “if empty” (no whitespace; comment if there is whitespace) -> this would allow non-obstructing markup, but unfortunately it is (in most editors) also non-visible markup,
- “if double” (if there is more than one consecutive blank line) -> with this handling, the “visual gap” remains in both text and code source.
Parsing Problems¶
- Ignore “matching comments” in literal strings? Too complicated: would need a specific detection algorithm for every language that supports multi-line literal strings (C++, PHP, Python).
- Warn if a comment in code will become documentation after round-trip?
docstrings in code blocks¶
- How to handle docstrings in code blocks? (It would be nice to convert them to rst-text if __docformat__ == 'restructuredtext'.)
TODO: Ask at Docutils users/developers.
Plug-ins¶
Specify a path for user additions and plug-ins. This would require converting PyLit from a pure module to a package…
6.4.3 Packages in Multiple Directories
Packages support one more special attribute, __path__. This is initialized to be a list containing the name of the directory holding the package’s __init__.py before the code in that file is executed. This variable can be modified; doing so affects future searches for modules and subpackages contained in the package.
While this feature is not often needed, it can be used to extend the set of modules found in a package.