Syntax
Syntax is a central part of Storm. The Storm compiler provides a parser that reads syntax from the type system. This system is then used to implement other languages, like Basic Storm. The syntax definitions in the compiled are designed so that it is easy to choose a set of packages and combine their syntax, and in that way extend the language you are using.
This section will not discuss how syntax rules are written, please refer to BNF Syntax to learn how to use the built in syntax definition language.
Definition
The compiler implements a parser for context free grammars (with a few extensions, as we will see), and therefore rules look a lot like regular BNF grammar. Grammars in Storm consist of a set of rules. Each rule consists of a number of alternative matches, which are called options in Storm. Each of these options contains a number of tokens. A token is either a reference to another rule that should be matched, or a regular expression.
This is slightly different compared to regular BNF syntax. The BNF syntax does not have a separate
entry for options, but instead includes the alternative operator |
to accomplish the same
goal. Storm provides the separate concept of options to allow rules to be easily extended from other
parts of the system, and to make the semantics of constructing a syntax tree clearer.
Storm syntax also diverges slightly from BNF by allowing the use of repetitions. Each option may contain zero or one repetition. Currently, the repetitions present in regular expressions are supported, i.e. zero or one, zero or more and one or more times.
Semantics
When parsing a string, using core.lang.Parser<X>
, the caller provides the rule that the parser
should start from. The parser starts from there and finds a valid parse tree that matched the given
string, or returns an error. The parser tries to match as far as possible in the given source
string, and returns how much of the source string successfully matches the input. This allows the
user of the parser to decide if a complete match is an error or not.
Some times, grammars are ambiguous. The Storm parser helps you to specify which of the many valid parses should be chosen by looking at the priority that is assigned to each option. This priority is an integer, and higher numbers have priority. Using these priorities, the storm parser tries to emulate the behavior of Parsing Expression Grammar, where higher priorities are executed "first". Since the Storm parser does not work the same way, there may be differences in some cases, but in return Storm allows left-recursive rules.
Regular expressions are always greedy, which means that they sometimes interact in unexpected ways with the rest of the tokens. This is most clearly illustrated by the following example:
A ::= "a*" "a" B ::= "a"* "a"
In this case, the rule A
uses two different regular expressions to match a sequence of at least
one a
. However, this rule will never succeed, since the regular expression "a*"
always matches
greedily, leaving nothing left for the final "a"
. The rule B
, on the other hand, works as
expected since we are instead instructing the parser to match the regular expression zero or more
times. This is not greedy.
Output
Each rule and option declared in Storm live as a type in the type system. Therefore, the root rule
when parsing is specified as a parameter to the Parser
type like: core.lang.Parser<Root>
. Rules
generate types with the same name as the rule, and options generate either anonymous types or named
types depending on how they are declared. These types are used to represent the syntax tree from a
parse. The types representing each option inherit from the type representing the rule, and in that
way the syntax tree can be conveniently represented in the type system.
By calling the tree
member on the Parser
object, you create a syntax tree that can be explored
and manipulated from any Storm language. Each node contains variables corresponding to the
declaration of the rule. Exact names of these variables depend on the language that were used to
declare the syntax. The type of each variable corresponds to the rule that was matched, and if the
match is inside a repetition block, it could be Array<T>
or Maybe<T>
as well. Regular
expressions are represented using the core.lang.SStr
type in the syntax tree.
Transform
To conveniently transform the syntax tree into another representation, each type generated by rules
and options provide the transform
function. The transform
function is automatically generated
according to how the rules were declared. Each option overloads the rule's version with the specific
code required for that option.
Each option has a function call declared, that tells the parser how to construct that part of the
syntax tree when that option is matched. Each of the tokens in the option may also be bound to a
variable and passed on to the function. These variables are evaluated by calling the matched
sub-rule whenever the value of the variable is needed. Regular expressions may also be bound to
variables, in which case a Str
is created. It is also possible to mark a part of the tokens in the
option, and bind these to a variable as a Str
. The start position of the rule is bound to the
variable pos
.
Aside from the function call, it is also possible to execute member functions on the return value
before it is returned. This is useful when tokens are within a repetition syntax. In this case, the
function call may create an array or other similar container, and instruct the parser to call the
push
member of the array for each item in the repetition syntax. When the result is created it is
bound to the variable me
in case it is needed.
To ease the development of transforms, syntax rules may also be declared to take a number of parameters. These parameters are treated as regular variables.
Everything in the syntax is resolved lazily. This means, for example, that any rules you refer to
does not need to be declared until the rule is actually considered by the parser. This can be
convenient if you know that the user will always include another package when using your syntax, or
something similar. The same holds true for function calls. The actual function and overload to be
called is not decided until the call is about to be made. This means that function calls does not
necessarily have to be valid if they are never used, and that it is possible to express some kind
of overloads that are not possible in other, more static, languages. For example, a rule could call
the function a(x)
, where x
can be either a A
or a B
. In this case, we can declare two
versions of a()
: a(A x)
and a(B x)
, even though A
and B
are not related. In other
languages it is necessary to declare it like a(Object x)
, and then manually check which type was
actually sent to a
. Another convenient usage of this is that it is possible to indicate that
options should not return anything by specifying them returning the function call void
(or
anything, really). Since the function call will never be used unless it actually has to be used, the
parser will work without problems as long as no one tries to get the result from the rule.
Parsing and threading
At the moment, the parser is always executed on the Compiler
thread. When creating the syntax
tree, only classes and actors are supported. Furthermore, actors have to be declared to run on the
compiler thread, since the parser does not yet implement support for message passing. This
restriction will probably be lifted in the future.