Parsing
To aid in creating extensible languages, the compiler library also contains a parser that is made to be easily extensible, alongside with the syntax language for defining the grammar used by the parser. As with previous sections of the compiler library, this part of the documentation focuses on the interfaces themselves. The conceptual parts are described in the corresponding section of the language reference.
Note: The parser described here is powerful and is able to parse all context-free languages. This does, however, come at a cost in complexity and performance. For application code that parses simple input formats, using the parser library, which provides simpler but more performant parsers for such tasks.
Source Positions
An important part of creating a language is to be able to refer back to the source code to report
problems. To achieve this, it is necessary to have a representation of a location in the source
code. The value core.lang.SrcPos
has this role. It stores a source position in the form
of a core.io.Url
that refers to a file, and two integers that refer to a range of
characters in the file. For the purposes of the integers, a text file is thought to consist of an
array of codepoints, and line-endings are represented as a single codepoint (\n
) regardless of how
they are stored in the file. The integers are indices into this array of codepoints.
Instances of SrcPos
are produced by the parser to indicate source locations. They are also
typically found in parse trees and abstract syntax trees to properly report errors.
The SrcPos
class has the following members:
-
init(core.lang.SrcPos& other)
Copy constructor.
-
init()
Create an unknown position.
-
init(core.io.Url file, core.Nat start, core.Nat end)
Create.
-
core.Maybe<core.io.Url> file
The file. May be
null
in case the file is not known. -
core.Nat start
Start position in the file.
-
core.Nat end
End position in the file.
-
core.Bool any()
Any data?
-
core.Bool empty()
Any data?
-
core.lang.SrcPos extend(core.lang.SrcPos other)
Merge with another range.
-
core.lang.SrcPos firstCh()
Get an SrcPos that represents only the first character in the range.
-
core.lang.SrcPos lastCh()
Get an SrcPos that represents only the last character in the range.
Parsers
The compiler library provides a generic and powerful parser that is able to parse all context-free
grammars. Due to its tight interaction with grammars from the name tree, this parser is always
executed on the Compiler
thread. Since the parser is powerful, it is comparatively expensive to
set up to parse strings. This is generally fine when parsing source code, but might not be desirable
for interactive applications, or for parsing file formats. The former case is addressed by the
stormname:core.lang.InfoParser and the latter is addressed by the parser
library.
The Main Parser
The main parser is implemented as the actor core.lang.Parser<T>
. The parameter T
is the name of
a rule in the name tree that shall be used as the starting point in the grammar. The information
from this rule is also used to provide a type-safe way of extracting the parse tree, which later
allows executing syntax transforms in a type-safe manner.
The parser initially only considers productions that are in the same package as the starting rule to
be visible. To consider productions, more can be added using the addSyntax
function. The
addSyntax
function automatically takes exported packages into account. This mechanism is used by
Basic Storm to provide extensible syntax.
The parser class has the following members:
-
init()
Creates the parser.
-
void addSyntax(core.lang.Package pkg)
Make all productions in
pkg
visible to the parser. -
core.Bool parse(core.Str str, core.io.Url file)
Parse the string
str
using the currently visible grammar. Returnstrue
if a match was found, andfalse
otherwise. Note that a match always starts from the beginning of the string, but may end before the end of the string. To ensure that the entire string was matched, usehasError
or inspectmatchEnd
after parsing. The parser always attempts to match as much as possible of the input.The parameter
file
is used to createSrcPos
instances in the parse tree and for reporting errors. -
core.Bool parse(core.Str str, core.io.Url file, core.Str.Iter start)
Like above, but starts parsing at
start
rather than at the beginning of the string. -
void clear()
Clears all parse-related information. Included packages are, however, retained.
-
core.Bool hasError()
Check if an error message is available. If
matchEnd
does not refer to the end of the string, this is always true (since the error could be "unexpected character X" if nothing else). -
core.Bool hasTree()
Check if the parse was successful, and that we can produce a parse tree. This is equivalent to the return value of
parse
. -
core.Str.Iter matchEnd()
Get an iterator to the end of the match.
-
core.lang.SyntaxError error()
Get the current syntax error, if
hasError
returned true. -
void throwError()
Throw the current syntax error, if any.
-
T tree()
Create and return the parse tree from the match. Assumes that
hasTree
returns true. Throws otherwise. Note thatT
is the same as the parameter that represents the start of the parse. -
lang.bnf.InfoNode infoTree()
Create and return an info tree that describes the current match.
The Info Parser
In certain situations it is useful to be able to dynamically specify the starting point of a parse
(e.g. in the language server). This is allowed by the actor core.lang.InfoParser
. It
provides a slightly different interface to the same parser as described above. All members are the
same, except that tree
is missing since it is not possible to achieve in a type-safe manner. It is
thus only possible to extract info trees from the info parser.
Note that since the parser in Storm needs to create parse-tables to parse efficiently, it is preferred to re-use the same parser instance for as long as possible. This makes the info parser beneficial in situations like the language server. Since it is possible to modify the start production, it means that large portions of the parse tables can be re-used, even with different starting points in the grammar.
The info parser has the following members in addition to the generic parser:
-
init(lang.bnf.Rule rootRule)
Create an info parser.
-
void root(lang.bnf.Rule rule)
Set the root rule of the parse after creation.
-
lang.bnf.Rule root()
Get the current root rule of the parse.
-
lang.bnf.InfoErrors parseApprox(core.Str str, core.io.Url file)
Parse a string using error recovery. Returns an
InfoErrors
object that describes how "bad" the match was. There are also overloads where the starting position is specified, and that provides anInfoInternal
node that is used as a state for the non-context free parts of the grammar. -
lang.bnf.InfoNode fullInfoTree()
Return an info tree that is guaranteed to match to the end of the parsed string, regardless of the length of the match.
Parse Trees
All nodes in a parse tree are derived from the actor lang.bnf.Node
. The type defined by
each rule inherits from Node
. Then types for individual productions inherit from the type defined
from the rules. As such, all nodes in a parse tree have the following members in common:
-
init()
Base class for all syntax nodes. This class will be overridden by 'Rule' to add the 'transform' function and then by 'Production' to implement the 'transform' function for that specific option.
-
init(core.lang.SrcPos pos)
Base class for all syntax nodes. This class will be overridden by 'Rule' to add the 'transform' function and then by 'Production' to implement the 'transform' function for that specific option.
-
core.lang.SrcPos pos
Start position for this node.
-
core.Array<lang.bnf.Node> children()
Get all direct children in this node as an array.
-
core.Array<lang.bnf.Node> allChildren()
Get all children (both direct and indirect) as a flat array.
-
core.Array<lang.bnf.Node> allChildren(core.lang.Type type)
Get all children of a specific type as a plain array.
Note that the transform
function is defined in a derived class.
Info Trees
As mentioned in the language reference, an info tree is a representation of the parse tree that includes all matches, even those not captured in the grammar. It is therefore useful for syntax highlighting and indentation.
Since the info tree contains matches for the entire input string, it aims to be compact to avoid excessive memory usage. As such, the representation is not designed to be modified after creation (apart from re-linking the tree). Since the info tree is complete, it is possible to re-construct the input string from the info tree.
The info tree consists of three actors: lang.bnf.InfoNode
, which is a generic node,
lang.bnf.InfoInternal
, which is an internal node, and lang.bnf.InfoLeaf
,
which is a leaf node.
Info Nodes
The lang.bnf.InfoNode
contains the interface common for both child- and internal nodes.
It has the following members:
-
lang.bnf.TokenColor color
Specified color of this node.
-
core.Nat length()
Length of this match (in codepoints).
-
core.Bool error()
Error occured during parsing of this node?
-
core.Bool delimiter()
Does this node represent a node captured using a delimiter in the grammar?
-
core.Maybe<lang.bnf.InfoLeaf> leafAt(core.Nat pos)
Find the first leaf node with a non-zero length at position
pos
relative to this node.pos
is a number of codepoints in the input string. -
lang.bnf.TextIndent indentAt(core.Nat pos)
Find the indentation of characters at offset 'pos'. This returns either an absolute number of indentation levels, or another position which indicates that the indentation should be the same as the indentation on that line.
-
core.Str format()
Format this info node into a human-readable representation.
-
core.Maybe<lang.bnf.InfoInternal> parent()
Get our parent.
Internal Nodes
The lang.bnf.InfoInternal
is an internal node in an info tree. It thus corresponds to
matching a production in the text. It contains the following additional members:
-
core.Maybe<lang.bnf.InfoIndent> indent
Information about indentation.
-
core.Maybe<lang.bnf.Production> production()
Get our production.
-
lang.bnf.InfoNode [](core.Nat id)
Get child at offset.
Leaf Nodes
The lang.bnf.InfoLeaf
is a leaf node in an info tree. It corresponds to matching a
single regular expression to the input, and contains the actual string that was matched. Since error
recovery might have been used to parse the input, information about the regex may not always be
present, and the contained string may not always match the regex.
-
core.Maybe<lang.bnf.RegexToken> matches()
Get the matching regex.
-
core.Bool matchesRegex()
Does the content of this node match the regex in here? Returns 'false' if this node does not contain a regex.