Parsing in Python

I want a TAP parser in Python, so I tried yeanpypa:

from yeanpypa import *

non_zero_number = AnyOf('123456789') + ZeroOrMore(digit)
rest_of_line    = OneOrMore(NoneOf('\n'))

plan = Literal('1..') + non_zero_number

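todo_directive = Optional(' ') + rest_of_line
skip_reason    = (ZeroOrMore(NoneOf(' ')) + Literal(' ')
                  + rest_of_line)
# The leading '# ' belongs to skip_directive and directive, not skip_body,
# so that directive does not require it twice.
skip_body      = Literal('skip') + Optional(skip_reason)
skip_directive = Optional(' ') + Literal('# ') + skip_body
directive      = (Optional(' ') + Literal('# ')
                  + (skip_body | todo_directive))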

description = (Optional(' ') + Optional('- ')
               + ZeroOrMore(NoneOf('#\n')))
ok_not_ok   = Literal('ok') | Literal('not ok')
test_num    = Optional(' ') + non_zero_number

test = (ok_not_ok + Optional(test_num)
        + Optional(description) + Optional(directive))

plan_skipped   = Literal('1..0') + skip_directive
plan_first_tap = plan + ZeroOrMore(Literal('\n') + test)
plan_last_tap  = (test + ZeroOrMore(Literal('\n') + test)
                  + Optional(Literal('\n') + plan))

tap = plan_skipped | plan_first_tap | plan_last_tap

I guess it works, but it feels like writing a regular expression longhand. I would probably also make it more lax if I wrote it as an re, as it would be easier to write \s*-\s* in the description rule, say.
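
For comparison, here is a minimal sketch of what the test rule might look like as an re, with the laxer \s*-\s* separator. The names TEST_RE and parse_test_line are made up for illustration, and the pattern only covers test lines with SKIP/TODO directives, not the whole of TAP.

```python
import re

# Hypothetical pattern for a single TAP test line (not the full TAP grammar).
TEST_RE = re.compile(
    r"""^(?P<status>not\ ok|ok)\b
        (?:\s+(?P<number>[1-9]\d*))?    # optional test number
        (?:\s*-\s*|\s+)?                # lax separator before the description
        (?P<description>[^#\n]*?)       # description, up to any directive
        \s*
        (?:\#\s*(?P<directive>skip|todo)\b\s*(?P<reason>.*))?
        $""",
    re.IGNORECASE | re.VERBOSE,
)


def parse_test_line(line):
    """Return a dict of fields for a TAP test line, or None if it does not match."""
    m = TEST_RE.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["passed"] = fields.pop("status").lower() == "ok"
    fields["number"] = int(fields["number"]) if fields["number"] else None
    return fields


if __name__ == "__main__":
    print(parse_test_line("not ok 4 - Summarized correctly # TODO Not written yet"))
```

The named groups play the role of the separate rules in the yeanpypa version, and re.VERBOSE keeps the pattern from collapsing into one unreadable line.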


Comments

untitled

I've been dinking with some parsing lately, and yeah, I think PyParsing looks like regexes with extra noise too. I don't know what to do about it, though.

untitled

In retrospect, and it's probably even mentioned as a design goal, PyParsing lets you define the parser entirely in... Python. So there's no "run this through X tool" step per se, which means the syntax is constrained to what Python allows, rather than being one better suited to writing grammars.

untitled

Yeah, I used YAPPS back in the day at school, which has a run-through-this-tool step to compile a separate grammar spec into Python code.

Defined in Python like this, it is just a longhand RE. It’ll be helpful if (like I say below) I can use yeanpypa’s CallbackParser to get a SAX-like view of the document, since I can’t do that with the re module (though I guess I could kind of build that out of re.finditer).
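
A rough sketch of that re.finditer idea, assuming one event per recognised line. LINE_RE and tap_events are hypothetical names, and the pattern deliberately leaves the rest of each test line unparsed.

```python
import re

# Hypothetical line pattern: one alternative per kind of TAP line.
LINE_RE = re.compile(
    r"""^(?:
          1\.\.(?P<plan>\d+)
        | (?P<status>not\ ok|ok)\b(?P<rest>.*)
        )$""",
    re.MULTILINE | re.VERBOSE,
)


def tap_events(text):
    """Yield (event_name, payload) tuples as each line is recognised."""
    for m in LINE_RE.finditer(text):
        if m.group("plan") is not None:
            yield ("plan", int(m.group("plan")))
        else:
            passed = m.group("status") == "ok"
            yield ("test", (passed, m.group("rest").strip()))


if __name__ == "__main__":
    sample = "1..2\nok 1 - first\nnot ok 2 # TODO later\n"
    for name, payload in tap_events(sample):
        print(name, payload)
```

Because tap_events is a generator, the caller sees each event as soon as its line matches, which is the SAX-like part. Lines that match neither alternative are silently skipped here, which a real parser would probably want to report.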

untitled

Looks like (F)Lex to me. Which, granted, is regular expressions with extra line noise.

untitled

Yeah, seems like all these Python parsing libraries are influenced by the old school C tools for parsing.

untitled

See, when I read that code, in my head it just compiles to an RE anyway.

untitled

Yeah, exactly. One or more of these parsing kits actually compiles your grammar into a real RE. So far it’s really not more powerful at all; all I get out of the posted code is the equivalent of a Python re match object anyhow.

I haven’t looked into yeanpypa’s CallbackParser yet; hopefully it is more like SAX parsing, and will let me collect the parts as it goes. That would be worthwhile.
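
To illustrate how a combinator-style grammar can compile down to a real RE, here is a toy sketch. The Rule class and the helper functions are stand-ins that merely borrow yeanpypa's names; this is not how yeanpypa or any particular library is implemented.

```python
import re


class Rule:
    """A grammar fragment that just carries its regular-expression source."""

    def __init__(self, source):
        self.source = source

    def __add__(self, other):
        # Sequence: concatenate the two sub-patterns.
        return Rule(self.source + other.source)

    def __or__(self, other):
        # Alternation: wrap in a non-capturing group.
        return Rule("(?:%s|%s)" % (self.source, other.source))

    def compile(self):
        return re.compile(self.source)


# Toy constructors (hypothetical; named after the rules used in the post).
def Literal(text):
    return Rule(re.escape(text))


def AnyOf(chars):
    return Rule("[%s]" % re.escape(chars))


def Optional(rule):
    return Rule("(?:%s)?" % rule.source)


def ZeroOrMore(rule):
    return Rule("(?:%s)*" % rule.source)


non_zero_number = AnyOf('123456789') + ZeroOrMore(AnyOf('0123456789'))
ok_not_ok = Literal('ok') | Literal('not ok')
test = ok_not_ok + Optional(Literal(' ') + non_zero_number)

pattern = test.compile()

if __name__ == "__main__":
    print(test.source)
    print(bool(pattern.fullmatch("not ok 12")))
```

Each operator only builds a string, so the grammar ends up being exactly as powerful as the re module underneath it.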

Pyparsing rendition

Pyparsing differs from yeanpypa and regexen in that it assumes whitespace as an implicit delimiter, and leaves it out of the parsed results. Here is my pyparsing version of your TAP parser (you can find the full script at the pyparsing Wiki, http://pyparsing.wikispaces.com/space/showimage/TAP.py):
from pyparsing import *

# newlines are significant whitespace in this parser, so set
# default skippable whitespace to just spaces and tabs
ParserElement.setDefaultWhitespaceChars(" \t")
NL = LineEnd().suppress()

integer = Word(nums)
plan = '1..' + integer("ubound")

OK, NOT_OK = map(Literal, ['ok', 'not ok'])
testStatus = (OK | NOT_OK)

# the description runs up to a '#' or the end of the line;
# the parse action strips the optional leading "- "
description = Regex("[^#\n]+")
description.setParseAction(lambda t: t[0].lstrip('- '))

TODO, SKIP = map(CaselessLiteral, 'TODO SKIP'.split())
directive = Group(Suppress('#') + (TODO + restOfLine |
    FollowedBy(SKIP) +
    restOfLine.copy().setParseAction(lambda t: ['SKIP', t[0]])))

testLine = Group(testStatus("passed") +
    Optional(integer)("testNumber") +
    Optional(description)("description") +
    Optional(directive)("directive")
    )
bailLine = Group(Literal("Bail out!")("BAIL") +
    empty + Optional(restOfLine)("reason"))

# the plan may come before or after the tests, hence '&'
tapOutput = Optional(Group(plan)("plan") + NL) & \
    Group(OneOrMore((testLine | bailLine) + NL))("tests")


The quoted strings attached to expressions in call syntax, such as integer("ubound") or Group(plan)("plan"), are results names: they let you access the matching parts of the parsed results directly, as attributes. Here is the code that tests and processes the TAP output:
def tallyResults(results):
    failedTests = []
    skippedTests = []
    todoTests = []
    bonusTests = []
    if results.plan:
        expected = list(range(1, int(results.plan.ubound) + 1))
    else:
        expected = list(range(1, len(results.tests) + 1))
    for i, res in enumerate(results.tests):
        # test for bail out: every remaining expected test counts as failed
        if res.BAIL:
            print("Test suite aborted: " + res.reason)
            failedTests += expected[i:]
            break

        #~ print(res.dump())
        testnum = i + 1
        if res.testNumber != "":
            if testnum != int(res.testNumber):
                print("ERROR! test %(testNumber)s out of sequence" % res)
            testnum = int(res.testNumber)
        passed = (res.passed == "ok")
        skipped = todo = False
        if res.directive:
            skipped = (res.directive[0][0] == 'SKIP')
            todo = (res.directive[0][0] == 'TODO')
        if not passed:
            failedTests.append(testnum)
        if skipped:
            skippedTests.append(testnum)
        if todo:
            todoTests.append(testnum)
        if todo and passed:
            bonusTests.append(testnum)

    if failedTests:
        print("Failed tests:", failedTests)
    if skippedTests:
        print("SKIPPED:", skippedTests)
    if todoTests:
        print("TODO:", todoTests)
    if bonusTests:
        print("BONUS:", bonusTests)
    # failures in TODO tests do not fail the suite
    if set(failedTests) - set(todoTests) == set():
        print("PASSED")
    else:
        print("FAILED")

if __name__ == "__main__":
    test1 = """\
    1..4
    ok 1 - Input file opened
    not ok 2 - First line of the input valid
    ok 3 - Read the rest of the file
    not ok 4 - Summarized correctly # TODO Not written yet
    """
    test2 = """\
    ok 1
    not ok 2 some description # TODO with a directive
    ok 3 a description only, no directive
    ok 4 # TODO directive only
    ok a description only, no directive
    ok # SKIP only a directive, no description
    ok
    """

    for test in (test1, test2):
        print(test)
        testResults = tapOutput.parseString(test)
        tallyResults(testResults)
        print()

These tests print out:
    
	1..4
	ok 1 - Input file opened
	not ok 2 - First line of the input valid
	ok 3 - Read the rest of the file
	not ok 4 - Summarized correctly # TODO Not written yet
	
Failed tests: [2, 4]
TODO: [4]
FAILED

	ok 1
	not ok 2 some description # TODO with a directive
	ok 3 a description only, no directive
	ok 4 # TODO directive only
	ok a description only, no directive
	ok # SKIP only a directive, no description
	ok
	
Failed tests: [2]
SKIPPED: [6]
TODO: [2, 4]
BONUS: [4]
PASSED

You are absolutely right, pyparsing is mostly an uber-verbose regex. But that verbosity really helps when the time comes to go back and make changes or enhancements to a parser that you wrote months ago.