# DMS Patterns and Rewrite Rules

The DMS Software Reengineering Toolkit allows a "domain" (language) engineer describe/specify languages of interest quickly and accurately, so that she may spend most of her attention on a specific program analysis or transformation of interest.

At DMS Domain Specifications, we give some background on the necessary elements to build a DMS language processing domain. We have already described how to easily specify a grammar to DMS, which automatically generates a parser and syntax tree builder, and tree-to-text prettyprinter.

On this page, we show how DMS can work with parameterized code fragments,
written, for readability, using the *(concrete) syntax of such a specified grammar*
In particular, we show

*patterns*(for matching existing, or creating (*instantiating*) new code)- and
*rewrite rules*(for transforming code)

*Rule Specification Language (RSL)*.

DMS tools are implemented as *metaprograms* that
invoke various aspects of DMS, such as pattern matching
or rewriting. *It is important to remember that these activities all occur on
compiler data structures and not on raw text;
for patterns and rewriting, this is largely (Abstract) Syntax Trees.
This ensures that analyses and transforms are easier to build, run faster at scale,
and are much more reliable than some ad hoc text-string manipulation.*

We use Nicholas Wirth's Oberon language for examples, as it is a real, practical language yet simple enough so we can provide interesting pattern and rewrite examples.

# Pattern Concepts

A DMS pattern allows the domain engineer to specify a code fragment to be used for either
pattern-matching against DMS ASTs (acting like *if you see this...*), or
to be used to instantiate ASTs that match the fragment (acting like *instantiate this*), or both.
Sophisticated tools built using DMS may use many patterns, each representing
a code structure of interest. (Patterns are also used to build rules; see below).

Such patterns are written using the surface syntax (e.g, the lexical and grammar rules)
of a specifically chosen *domain* (notation) of interest (e.g., Java, SQL, Verilog). Because
DMS is multi-lingual, in order for such a pattern to be interpreted, the domain must
be explicitly specified (usually as a *language~dialect*):

default pattern domain PHP~PHP5;

A single DMS tool may have patterns (and rewrite rules) written for several languages that the tool has been configured to handle; the domain for each pattern (or rule) must be specified before the pattern (or rule) is defined.

A DMS pattern has the following form (free format):

patternpattern_name(pattern_variable1:syntax_category1, ...pattern_variableN:syntax_categoryN) :pattern_syntax_category= "pattern_body" ifadditional_condition_on_pattern_match;

The elements of the pattern are:

- keyword
**pattern** - the
*pattern_name*, referenceable by other patterns, rules, or by the rest of a DMS application - optional
*pattern_variable*s, specifying pattern variable names and their corresponding*syntax_category*. Syntax categories are defined by the left-hand-side tokens of the DMS BNF grammar rules for the currently selected pattern domain. - the
*pattern_syntax_category*that the entire pattern (metaquoted text) satisifies *metaquoted*(double-quoted) pattern text (*pattern_body*). The*pattern_body*is a valid fragment of text from the currently specified domain, that is an instance of the specified pattern syntax category. The pattern body may include references to pattern variables*metaescaped*with a prefix**\**, and other metaescaped pattern expressions.- an optional
**if***additional_condition_on_pattern_match*which provides additional conditions based on the values of the matched parameters.

The *pattern_body* is essentially what will be matched or instantiated,
depending on the pattern usage, and normally contains a code fragment,
written in the syntax from the target language of interest ("surface syntax").
The metaquotes around the *pattern_body*
are needed to delimit the pattern code fragment in the target language
from the RSL pattern-language definition syntax, because
the pattern (rule) language generally has very different syntax from the target domain.

The *pattern_body* is interpreted as specification for a tree fragment with pattern-variable leaves
using the same parsing engine that reads the target source code.
The DMS pattern matcher can find where this tree fragment matches a parse tree; this is not a string-match.
The DMS instantiation engine can construct an instance of this tree given the AST subtree
bindings for the pattern variables,
thereby providing a way to build an arbitrary tree fragments from (possibly multiple) patterns.

A *meta-reference* to another pattern may be written in the *pattern_body*.
Such a reference looks like:

\pattern_name\(pattern_arg1\,...pattern_argN\)

This is in effect a subpattern function call with parameters consisting of additional

*pattern_arg*s. (Note the metaescaped

**\(**,

**\,**and

**\)**syntax, to distinguish from similar syntax in the target domain). One may write domain-text or other (recursively!) additional meta-references in the

*pattern_arg*s. Such meta-references act as if the other named pattern were incorpated into the referencing pattern at the point of reference. Such metareferences can pass metavariables to enable matching with consistent bindings. This allows one to write extremely complex patterns in chunks which are easy to understand, and for a set of complex patterns to share building-block subpatterns.

## Pattern Conditionals

Conditions are boolean expressions that allow checking
whether bound pattern variables have fixed syntactic properties, or represent
something that can be transformed to a canonical value by a set of rewrites (see below),
or some arbitrarily complex semantic property possibly
derived by inspecting other parts of the tree matched by a pattern,
or inspecting auxiliary data structures (e.g., symbol tables, control flow, data flow,
special inference procedures). *A key idea is that conditionals
connect analyzers implemented in other parts of DMS to the patterns and rewrite rules*.

One might write a series of increasingly sophisticated patterns to match
useless **while** statements. This pattern:

pattern useless_while(b:body) : statement = " while false do \b"

is equivalent to:

pattern useless_while(c:condition, b:body) : statement = " while \c do \b" if c == "false";

However, this second pattern actually succeeds in finding a tree with any conditional in a while statement,
but then checks that (pattern-match bound) condition is equal to a literal constant **false**.
A generalization of the (above purely syntactic) check) which is more "semantic" would be

pattern useless_while(c:condition, b:body) : statement = " while \c do \b" if c\|/ AlgebraSimplifications == false ;

This pattern if the matched condition **c** can be rewritten (**\|/**) using
a set of rewrite rules called AlgebraSimplifications (not shown here),
to match **false**. Even more "semantic" would be:

pattern useless_while(c:condition,t:then_clause) : statement = " while \c then \t" if can_determine_is_false(c);

which uses a custom (PARLANSE-coded) predicate **can_determine_is_false** that consults other DMS-derivable results to decide the instantiated condition is always **false** at the point where it is found (the matched tree c carries its source position with it, so the context is implied). One might build control and data flow for the desired language, and use the resulting dataflow to determine values that arrive at the condition c, which may then always turn out to be "false" in a nontrivial way.

## Some Patterns for Oberon

Here we provide some example patterns that could be useful in matching/generating Oberon code (grammar for reference).

default pattern domain Oberon~Oberon07; default condition domain Oberon~Oberon07; pattern float_declaration(x: ident): VariableDeclarationList = " \x: FLOAT ; "; pattern double(f: Factor): Term = " \f * 2 "; pattern optimizable_divide(f: Factor): Term = " \f / \f " if is_not_zero(f); pattern assignment_plus_1(lhs: ident, e: SimpleExpression): Statement = " \lhs := \e + 1 "; pattern if_variable_equals_conditional_statement(i: ident, t: Term, s:Statement) :Statement = " IF \i = \t THEN \s END "; pattern procedure(i: ident, p: FormalParameterList, b: StatementSequence) : ProcedureDeclaration = " PROCEDURE \i ( \p ) ; BEGIN \b END \i "; pattern special_algorithm(x: ident, y: ident): StatementSequence = " WHILE \y > 0 DO \y := \y-1; \x := \x + \double\(\y\); END \assignment_plus_1\(\y\,\x\) "; external condition is_not_zero(f: Factor) = 'Misc/Oberon_IsNotZero';

Pattern **float_declaration** specifies a declaration statement
that declares some identifier as a float-type.
**double** defines an expression multiplied by the constant two.
**optimizable_divide** finds a quotient that might be simplified (to one),
because the numerator is *identical* to the denominator,
and verifies that the divisor is not a zero value
by inspecting the tree "around" the bound pattern variable.
The fact that the same pattern variable **\f** is mentioned twice in the pattern,
causes the pattern matcher to check that all instances have the same exact tree shape.
**assignment_plus1** describes a statement that assigns a value, plus one, to a target variable.
**if_variable_equals_conditional_statement** describes a statement executed under
a condition that checks a variable for equality to some value.
**procedure** describes a rather abstract form of a procedure; all the pieces need to be filled in.
Patterns can be composed; **special_algorithm** shows a complex set of statements,
that mix surface syntax with invocations of other patterns. Condition **is_not_zero**
is a custom analzer that inspects its tree to verify that the value of that tree
is "not zero" in the context from which the tree was taken.

A DMS tool may use a pattern to locate code of interest in any AST it has
acquired. The tool metaprogram may call a DMS library function,
**Pattern:Match** with a reference to the
**optimizable_divide** pattern, and a reference to the AST for this code:

y := 1; x := 3; q := (y-1)/(y-1); p := (x-4)/(x-4);

The matcher will attempt to match the

**(y-1)/(y-1)**term, but the additional semantic check

**is_not_zero**(implemented by the domain engineer) will determine that

**y-1==0**, violating the semantic condition on the pattern. The matcher will find the

**(x-4)/(x-4)**term, and the analyzer will decide that

**x-4<0**, so it is not zero, and the pattern will match, binding the metavariable

**t**to the subtree for

**x-4**. The DMS tool will obtain a reference to the

**divide**node as the location of the match, and a reference to the

**-**("subtract") node as the binding of the pattern variable. (Since there are two matching subtrees, the first (leftmost) one is returned).

A DMS tool may use a pattern to synthesize code, too. If the tool
metaprogram calls the DMS library function
**Pattern:Instantiate** with the pattern **special_algorithm**
and trees for **p** and **q** for pattern variables **x** and **y**
respectively, the instantior will build an DMS AST that would prettyprint as:

WHILE q > 0 DO q := q-1; p := p + q * 2 END q := p + 1

That result tree in turn may be used as a parameter to another pattern, to assemble a larger tree.
In this way an entire program can be generated from "templates" that is syntactically correct.
If the metaprogram then instantiates pattern **procedure** with that tree for the parameter
**b**, with trees **foobar** and **( q: integer )** for parameters **i** and **f**
respectively, the instantiator will build an AST that would prettyprint as:

PROCEDURE foobar(q: integer); BEGIN WHILE q > 0 DO q := q-1; p := p + q * 2 END q := p + 1 END

Assembling trees in this manner allows the code generator to build them in any order that makes sense, not just left-to-right. Futhermore, building ASTs allow the DMS Rewriting machinery to be applied to assembled trees, enabling post-tree-composition optimization.

# Rewrite Rule Concepts

DMS also provides the ability to encode source-to-source transformations ("rewrites") using the surface syntax of the source and target (if different than source) languages, of the form, *if you see this, replace it by that*. These rewrites are applied to the parse trees to provide revised trees; a prettyprinter ("anti-parser") provided by DMS can then regenerate source code for the target language.

DMS rewrite rules take the form (free format):

rulerule_name(pattern_variable1:syntax_category1, ...pattern_variableN:syntax_categoryN) :pattern_syntax_category->replacement_syntax_category= "pattern_body" -> "replacement_body" ifadditional_condition_on_pattern_match;

The first part of the rule is essentially a pattern (*if you see this*),
and have virtually identical syntax and semantics. Rewrite rules add a replacement part
(*... then replace it by that*). Rewrite rule elements are:

- keyword
**rule** - the
*rule_name*, referenceable by other patterns, rules, or by the rest of a DMS application - syntax
*pattern_variable*s, specifying pattern variable names and a corresponding*syntax_category*. - the
*pattern_syntax_category*that the*pattern_body*satisifies - The token
**->**seperating*pattern_syntax_category*from*replacement_syntax_category*. This token is read as "(category) will be replaced by", and occurs here in sympathy with a second**->**token in the rule. - the
*replacement_syntax_category*that the*replacement_body*satisifies *metaquoted*pattern text (*pattern_body*), including references to pattern variables*metaescaped*with a prefix**\**, and other metaescaped pattern expressions- A second
**->**token between*pattern_body*and*replacement_body*, read as "(content) will be replaced by". *metaquoted*pattern text (*replacement_body*), including references to pattern variables*metaescaped*with a prefix**\**, and other metaescaped pattern expressions.- a optional
*additional_condition_on_pattern_match*which provides an additional check based on the value of matched parameters.

When transforming from one langauge to the same language, the *pattern_syntax_category*
and the *replement_syntax_category* are almost always identical; you can't replace
a "declaration" by an "expression" and end up with a syntactically well-formed program.
When transforming form one language to a different language, the syntax categories
(and their semantics) across the two languages can vary wildly.

What a DMS rewrite rule does, when applied to a specific place on an AST, is pattern-match
against that AST. If the match fails, the rewrite fails (and tells this to the tool metaprogram;
it may be able to use that knowledge). If the match succeeds, the metavariable are bound
to appropriate places in the subtrees of the AST exactly as plain pattern matching would do,
and the the *replacement_body* is instantiated, as described under pattern matching above.

As with patterns, meta-pattern references found in the *replacement_body* cause the specified pattern
to be instantiated with the arguments provided, at the point referenced.
Again this allows construction of very complex rewrite rules, using other subpatterns as
building blocks.

Individual rules can also be grouped into a **ruleset**, which can then
be used as a group.

## Applications of Rewrite rules

Since the rewrites transform one tree into another tree, one can
apply an arbitrarily long/complex set of transformation rules
(usually one or more **ruleset**s!)
repeatedly to the entire tree. This gives DMS rewrites the power of a
Post (string [generalized to tree]) rewriting system, thus Turing
capable. *You can technically produce any arbitrary transformation of
the original tree with sufficient transforms.* Usually one uses other
features of DMS to make writing the transforms a bit easier; for
instance, a rewrite rule may consult the result of a particular
attribute grammar computation in order to "easily" use information
from "far away in the tree" (individual rewrites rules always have a
fixed, maximum "radius").

DMS Rewrite rules can also apply to associative and commutative operators. This is extremely hard to do with pure source-to-source rewrites; as one issue, you would have to specify these laws, and then somehow prevent the rewrite engine from getting stuck repeatedly applying the commutative law. Instead, the DMS Rewrite engine has built-in associative and commutative application machinery that handle this. You write the basic algebraic law, you decorate the grammer rules which have these properties, and DMS applies your rules automatically. This is very handy when writing expression simplification procedures.

DMS provides a lot of additional support machinery, to help one construct control flow graphs and/or compute dataflow with efficient parallel solvers; DMS rewrites can consult these artifacts also. We note that one can also fall back when needed on procedural metaprogramming to implement transforms not easily expressed by rewrites, such as LR table generation. DMS allows both source-to-source and procedural rewrites to mix easily and arbitrarily. (Most program transformation systems offer some type of source-to-source rewriting capability based on pattern (tree) matching/replacement somewhat like DMS, but most of them have no fall back for procedural rewrites. We think this is a fundamental mistake. There are other source-to-source systems such as Clang, that offer only procedural rewrites, but not pattern matching/rewrites. We this this is also a mistake.)

If one limits oneself to rewrites that exactly tile the original AST one gets "syntax-directed translation".

Here's a set of possible applications of rewrite rules:

- Expression simplification (both numeric and boolean)
- Dead code removal
- Code optimization
- Aspect insertion
- Refactoring tools
- Code instrumentation
- Refining abstract concepts into concrete code
- Code generation from models
- Language translation/Migration
- Modernization: Replacement of legacy techology constructs or APIs by modern constructs
- Fixing field-width errors (Y2K or similar) or other pervasive broken coding idiom
- Semantically deep code obfuscation (by scrambling control and data flow)
- Reverse engineering concrete code to abstractions

## Some Rewrite Rules for Oberon

The following rules can be used to simplify Oberon code (grammar for reference):default base domain Oberon~Oberon07; -------------------------------------------------------------------------------------- rule simplify_boolean_not_not(e: Factor): Factor -> Factor = " ~ ~ \e " -> " \e "; rule simplify_boolean_and_true(e: Factor): Term -> Term = " \e & TRUE " -> " \e "; rule simplify_boolean_and_false(e: Factor): Term -> Term = " \e & FALSE " -> " FALSE " if no_side_effects("\e"); rule simplify_boolean_and_self(e: Factor): Term -> Term = " \e & \e " -> " \e " if no_side_effects("\e"); rule simplify_boolean_or_true(e: Term): SimpleExpression -> SimpleExpression = " \e OR TRUE " -> " TRUE " if no_side_effects("\e"); rule simplify_boolean_or_false(e: Factor): SimpleExpression -> SimpleExpression = " \e OR FALSE " -> " \e "; rule simplify_boolean_or_self(e: Term): SimpleExpression -> SimpleExpression = " \e OR \e " -> " \e " if no_side_effects("\e"); -------------------------------------------------------------------------------------- rule reduce_strength_exponentation(e: Expression):Factor -> Term = " power(\e,2) " -> " (\e) * (\e) " if no_side_effects(e); rule reduce_strength_multiplication(e: Factor):Term -> SimpleExpression = " \e * 2 " -> " \e + \e " if no_side_effects("\e"); -------------------------------------------------------------------------------------- rule simplify_if_true_stmt(ss: StatementSequence) : StatementSequence -> StatementSequence = " IF TRUE THEN \ss END" -> " \ss "; rule simplify_if_false_stmt(ss: StatementSequence): StatementSequence -> StatementSequence = " IF FALSE THEN \ss END " -> " ; "; rule simplify_if_true_stmt_stmt(ss1: StatementSequence, ss2: StatementSequence) : StatementSequence -> StatementSequence = " IF TRUE THEN \ss1 ELSE \ss2 END " -> " \ss2 "; rule simplify_if_false_stmt_stmt(ss1: StatementSequence, ss2: StatementSequence) : IfStatement -> StatementSequence = " IF FALSE THEN \ss1 ELSE \ss2 END " -> " \ss2 "; rule simplify_if_cond_empty_stmt(c: Expression, ss: StatementSequence) : Statement -> Statement = " IF \c THEN ; ELSE \ss END " -> " IF ~(\c) THEN \ss END "; rule simplify_if_cond_stmt_empty(c: Expression, ss: StatementSequence) : Statement -> Statement = " IF \c THEN \ss ELSE ; END " -> " IF \c THEN \ss END "; -------------------------------------------------------------------------------------- rule remove_empty_statement(ss: StatementSequence): StatementSequence -> StatementSequence = " \ss ; " -> " \ss " ; -------------------------------------------------------------------------------------- rule zombie_removal_enablement(config_variable:ident): Factor->Factor = " \config_variable " -> " FALSE " if is_dead_feature(config_variable); -------------------------------------------------------------------------------------- external condition no_side_effects(e: Expression) = 'Misc/Oberon_NoSideEffects'; external condition is_dead_feature(cv: ident) = 'Misc/Oberon_IsDeadFeature'; public ruleset GeneralSimplifications = { simplify_boolean_not_not, simplify_boolean_and_true, simplify_boolean_and_false, simplify_boolean_and_self, simplify_boolean_or_true, simplify_boolean_or_false, simplify_boolean_or_self, reduce_strength_exponentation, reduce_strength_multiplication, simplify_if_true_stmt, simplify_if_false_stmt, simplify_if_true_stmt_stmt, simplify_if_false_stmt_stmt, simplify_if_cond_empty_stmt, simplify_if_cond_stmt_empty, remove_empty_statement, }; public ruleset RemoveDeadCode = GeneralSimplifications + zombie_removal_enablement;

Note that some general simplification rules are grouped into a single ruleset **GeneralSimplifications**.
When applied, this ruleset simplifies boolean conditions where possible,
and then removes or cleans up **IF** statements where boolean expression
collapse to **TRUE** or **FALSE**. A metaprogram might use this ruleset by itself to clean up Oberon code.

An additional ruleset **RemoveDeadCode** is defined that includes (**+**)
the simplification rules and the special zombie handling rule.
This set of rules is designed to do *dead code removal*.
The **zombie_removal_enablement** rule converts a configuration variable
to **FALSE** if an "oracle" **dead_feature**
(controlled by the software engineer) says that the configuration variable
will henceforth always be false. This can trigger a wave of boolean simplifications
and then conditional statement removals, from the general simplifications.
The effect is to delete features no longer used by application (*zombie code*).

Note that the **simplify_boolean_...** rules only express
rewrites over very simple expessions, and not even the commutative variant. DMS's
associative/commutative rewrite engine handles the rest. One can
write more rules to handle numeric expression simplifications, including constant folding.

Note the semantic conditional **no_side_effect**; this is
additional DMS metaprogramming provided by the tool engineer, not shown here.

The seemingly peculiar rule **remove_if_false** rewrites
a useless if statement to a trivial statement " **;** " rather
than to some kind of "empty statement" because Oberon's syntax does
not allow the latter. The additional rule
**remove_empty_statement** makes the trivial statement
vanish when it is found in a set of other statements.
There are obvious extensions of the **IF** statement simplifications to other
Oberon control constructs.

More complex transformations are often implemented as a set of individual helper transformations such as these, combined by metaprogramming to guide the transformation process, and to handle context issues (such as scopes, dataflows, ...)

Given the following Oberon program (actual DMS input):

MODULE UserFunctionModule; PROCEDURE user_function( key: string; h: DB_connection); VAR s: STRING; BEGIN; IF feature1 THEN s := READ(h,k) IF desired_property1(s) THEN WRITE(h,k,revise1(s)) END END IF feature2 THEN s := READ(h,k) IF desired_property2(s) & feature1 THEN WRITE(h,k,revise2(s) ELSE WRITE(h,k,revise3(s) END END END user_function; END UserFunctionModule.

and applying the transforms above with **dead_feature(feature1)==true**
(intending to remove zombie functionality **feature1**)
results in the following Oberon program (actual DMS output):

MODULE UserFunctionModule; PROCEDURE user_function( key : string; h : DB_connection); VAR s : STRING; BEGIN; IF feature2 THEN s := READ(h, k) WRITE(h, k, revise3(s) END END user_function; END UserFunctionModule.

The useless (zombie) feature and its support has been eliminated, automatically. Back to DMS overview