A compiler implemented by
TermRewriting of an initial graph constituting the input program, into some lower level language (ultimately into
MachineCode).
[See
RethinkingCompilerDesign for additional context.]
Rules are written `<pattern>`-><replacement> where replacement must result in a new pattern, but may contain guards etc. (everything, that has been defined up to that point). Placeholder use the prefix ^.
Is parsing not a special case of this? How would some
BackusNaurForm grammar rules look in this format? For example does
function_declaration := type function_name ( arglist )
become
`^type ^function_name "(" ^arglist ")"` -> `function_declaration( ^type, ^function_name, ^arglist )`
No: The <pattern> can be only well-formed (sub-)expressions (=terms). But the syntax used in these examples may look quite unstructured. See MinimalParsing for details.
Examples:
`assignType(int)` -> `type:int`;
`int+int` -> `int`;
`assignType(^a+^b)` -> `^(a.type+b.type):(assignType(^a)+assignType(^b))`;
`assignType(^t:^(var:a))` -> a.name.type=t, `^(a.type):^a`;
`assignType(^(var:a))` -> `^(a.name.type):^a`;
`assignType(^a((^p:^b))<-(^r:^x))` -> `^(func(x,b,p,r)):^a`;
`assignType((type:^a):^b)` -> `(^a:assignType(^b))`;
`(type:^a):^b` -> b.type=a, `(^a:^b)`;
`compile(((func(^x,^f,^p,^r):^a)(^p:^b))` -> `^r:compile(replace(^x,^f,^b))`;
`replace(^a,^a,^b)` -> `^b`;
`replace(^a,^b,^c)` -> `^a`;
`compile(^a;^b)` -> `compile(^a),compile(^b)`;
`compile(^(var:a))` -> `asmLoad(a.name.address, a.name.reg)`;
`asmLoad(^addr,^reg)` -> `^(0x800|addr<<10|reg)`
''These are just some sketches for possible rules copied out of my notebook. Some further explanations:
- There are no futher special literals. But there must be a special notation to match any identifier or any number. This is here done with a token name (as used in '^(var:a)'). Oh, it might be possible to represent identifier, strings and number as tokens of bytes and then bits. This would be crude at first, but once appropriate rules are present could be the cleanest way.
- 'a:b' declares 'b' to be of type 'a' (binary operator with strong precedence).
- 'a,b' and 'a;b' evaluate a and then b, return the latter, ',' with moderate precedence, ';' with low (therefore used between rules). There is no such thing as an end-of-statement. Everything is just one large expression.
- 'a(b)' a binary operator, used to denote function call in the above example, that uses mandatory brackets around the second operand.
- 'a=b' assignment operator. This is used as a primitive operator e.g. in 'a.name.type=t' to assign an attribute of the variable 'a'.
I will explain one example in some detail:''
`assignType(^t:^(var:a))` -> a.name.type=t, `^(a.type):^a`
I assume, that the expression representing the whole program (after
MinimalParsing) fed to the rewrite engine is implicitly sorrounded by one extra operator 'doItAll(<expr>)', which is then rewritten to e.g. 'postprocess(compile(assignTypes(preproc(<expr>))))'. 'assignType' is one of these stages, which is reponsible for assigning types to partial expressions.
The rule `assignType(^t:^(var:a))` matches any sub-term, that has the binary operator 'a(b)' as the top opertor with a being the constant 'assignType' and a being the term with the binary operator ':' with any term designated 't' as its left operand and any programm identifier (captured with the special notation 'var') designated 'a' as its right operand.
If this rule matches AND the right hand side succeeds (i.e. evaluates to a pattern), then it is fully replaced with that pattern (with placeholders replaced with the corresponding sub-terms.
In this case the right hand side consists of an assignment and a pattern. The pattern just discared the 'assignType' and completes this stage on this sub-term. The assignment is just a sketch of what should be done to attach the type not to the sub-term/graph, but to the actual 'variable'. In this case 'a.name' should indicate, that we fetch the actual name of the term (which is already known to be an identifier). And 'a.name.type' refers to the type of this name. This glosses of details of
LexicalScoping and incompatible assignments of types to variables. An other example might have read
'symbolTable.lookup(a, a.name).bind(type)'
but that implies just too much about an implementation and could be done any other way anyway.
I am not quite clear that this ad-hoc syntax is really well-suited for the task of defining rewrite rules for the kind of compiler described in
RethinkingCompilerDesign. But it fulfills the following requirements:
The
LanguageMachine relates quite closely to this approach - it applies grammatical rewriting rules which can recognise and substitute arbitrary grammatical patterns that include variable bindings - you can see how it works by looking at the
LmDiagram. The central algorithm is an online algorithm which operates symbol by symbol, and the
LmDiagram can be read as showing what happens as successive symbols are dealt with, and (for a completed analysis) as showing how the substituted material relates to the material that was recognised and to the nesting of recognition phases (the 'ghost of the Parse Tree'). The Parse Tree is sometimes referred to as the
AbstractSyntaxTree. See also
VisitTheParseTree to see what happens if you have to deal with the tree itself instead of a nice helpful ghost.
Relevant papers:
The
CleanLanguage is a "term graph rewriting" language; also pure, lazy, strongly-typed (Milner / Hindley / Mycroft), etc.
CategoryCompilers