DynGenPar
Dynamic Generalized Parser
|
DynGenPar [1] is an innovative parser based on a new principle combining bottom-up and top-down features of traditional parsers. The most unique feature of the algorithm is the possibility to add rules to the grammar at almost any time, even during parsing. DynGenPar has the following characterizing properties:
Dynamic = The grammar is not hardcoded as in usual table-driven approaches, such as (Generalized) LR or Earley's algorithm. Instead, the algorithm works on a runtime representation of the grammar, which allows efficient handling of dynamic grammar changes. To decide when and how to shift or reduce, we use, instead of the usual LR tables, the initial graph, a graph which is easily updated as the grammar changes, along with some runtime top-down information.
Generalized = The algorithm exhaustively parses ambiguous grammars. In addition, epsilon productions are considered in order to support arbitrary CFGs. (Left recursion is naturally supported thanks to the bottom-up nature of the algorithm.) We use graph-structured (DAG-structured) stacks similar to the ones used in Tomita's Generalized LR algorithm. As additional generalizations, we also support Parallel Multiple Context-Free Grammars (PMCFGs) and next token constraints (useful, e.g. for scannerless parsing).
Due to the dynamic design, a parser generator is not needed. Instead, the parser can simply be used as a library.
DynGenPar supports dynamic grammar additions, incremental parsing, prediction, rule labels, custom parse actions, arbitrary token sources, hierarchical parsing, parallel multiple context-free grammars (PMCFGs), and next token constraints. See section Features for details.
All the APIs provided by DynGenPar are in the DynGenPar namespace. You can use
to bring in the entire namespace.
The main class of DynGenPar is DynGenPar::Parser. To do any parsing, you will always have to instantiate at least one object of the DynGenPar::Parser class, or a subclass such as DynGenPar::PgfParser. Parsing is done through the DynGenPar::Parser::parse methods, either the fine-grained version with several arguments or the binding-friendly version which operates on a DynGenPar::ParseState object grouping all the arguments.
The parser operates on a stream of tokens. Those tokens have to be provided by a token source, i.e. a class implementing the DynGenPar::TokenSource interface. DynGenPar provides several ready-made token sources. You can also implement your own: Your class only has to derive from DynGenPar::TokenSource and implement the needed virtual methods, at least the pure virtual DynGenPar::TokenSource::readToken. In the following example, we will use the simplest token source, which operates directly on a list of tokens: the DynGenPar::ListTokenSource.
This is a basic example for how to use DynGenPar::Parser:
The DynGenPar::Match class contains the results of the parsing process; in particular, the parsed tree(s) in a packed, i.e. DAG (directed acyclic graph) -structured, representation, see the DynGenPar::Match::tree field.
The following example shows how to iterate over a parse tree:
You would call the above printParseTree
function using e.g.
on the result of the previous example.
By default, category names, i.e. the DynGenPar::Cat typedef, are strings, which is convenient for debugging, but often not what is wanted in practice. For efficiency and/or interoperability, categories can be forced to be integers instead, by #defining DYNGENPAR_INTEGER_CATEGORIES. Some contexts such as the PGF support (pgf.h), the DynGenPar::FlexLexerTokenSource (flexlexertokensource.h) and the Java bindings always force this option.
Java bindings for DynGenPar based on Qt Jambi are provided. Most of the DynGenPar API is available also to Java developers.
All the APIs provided by DynGenPar are in the concise.DynGenPar
package. You can use
to import the entire package.
The following mappings and restrictions apply:
concise.DynGenPar
package.concise.DynGenPar.Util
class.concise.DynGenPar.Pgf.Token*
concise.DynGenPar.Pgf.PreludeBind
concise.DynGenPar.ByteTokenSource.*
concise.DynGenPar.ParseState
, i.e. DynGenPar::ParseState). The DynGenPar::ParseState::incrementalStacks member cannot be accessed from Java. In particular, only the overloads of DynGenPar::Parser::parse and of the Prediction methods in DynGenPar::Parser which take a DynGenPar::ParseState are available.concise.DynGenPar.Parser.tokens
, and setters, e.g. concise.DynGenPar.Parser.setTokens
. This is because Java only supports public member fields for Java code, it does not support native fields nor a property mechanism that would allow hiding the required function calls to access the C++ class's member field.add
method, and converted to an iterable Java list using a toList
method. (In C++, the toList
method is just an alias for a trivial cast and returns a writable reference to a QList. In Java, due to the required conversion, the returned list can be used for reading only.)This documentation documents the C++ API. With the exception of the above changes, the same interfaces can also be used from Java. In particular, it is transparently possible to implement virtual methods of C++ classes in derived Java classes, and thus interfaces required by the C++ code can also be implemented in Java.
In Java, the basic example for how to use DynGenPar::Parser would look as follows:
The DynGenPar Java bindings are used in the FMathL Concise environment. [2]
This section lists the advanced features supported by DynGenPar.
The most unique feature of DynGenPar is the possibility to add rules to the grammar at almost any time, even during parsing. See DynGenPar::Parser::addRule.
DynGenPar allows operating on its input incrementally, parsing interactively as input is produced and remembering its state. See DynGenPar::Parser::parse and DynGenPar::ParseState.
It is also possible to go back by resuming at a previous saved state.
Possible continuations, i.e. tokens which can come next, can be predicted at any saved incremental parsing state, see DynGenPar::Parser::computePredictions and DynGenPar::Parser::computeConstrainedPredictions.
It is also possible to predict entire "literals", which are sequences of tokens appearing sequentially within a rule, see DynGenPar::Parser::computeMultiPredictions and DynGenPar::Parser::computeConstrainedMultiPredictions. This is particularly useful for scannerless parsing.
CFG rules can have labels, which are reproduced in the produced parse tree. Those labels can be anything which can be stored in a QVariant, including strings, integers and pointers to arbitrary objects. See DynGenPar::Rule::Rule, DynGenPar::Rule::label and DynGenPar::Rule::setLabel.
It is also possible to attach an action to a rule, which will be executed when the rule is matched. Note that actions will currently only be executed for nonempty matches (and in particular, actions on epsilon rules will always be ignored) and that an action will be executed even if the detected match only appears in some of the considered parses and is later discarded due to the input that follows. An action must be a subclass of DynGenPar::Action implementing the pure virtual DynGenPar::Action::execute method. See DynGenPar::Action and DynGenPar::Rule::action.
DynGenPar accepts tokens from any source implementing the DynGenPar::TokenSource interface. Your class only has to derive from DynGenPar::TokenSource and implement the needed virtual methods, at least the pure virtual DynGenPar::TokenSource::readToken.
A number of ready-made token sources are also provided. See DynGenPar::ListTokenSource, DynGenPar::ByteTokenSource and DynGenPar::TextByteTokenSource.
In particular, it is possible to use a lexer generated by Flex as a token source. See DynGenPar::FlexLexerTokenSource.
A token source can return, along with a token, a complete parse (sub)tree to use in lieu of a leaf node in the resulting parse tree, making it possible to call, from your token source, another parser, or another instance of DynGenPar (which is fully reentrant), and to return the result as a single token for the higher-level grammar. See DynGenPar::TokenSource::tree.
PMCFGs (Parallel Multiple Context-Free Grammars) are an extension of context-free grammars used for natural language, e.g. by the Grammatical Framework GF. They are natively supported by DynGenPar. See DynGenPar::Pmcfg, DynGenPar::Parser::loadPmcfg, DynGenPar::Parser::addPmcfgRule and DynGenPar::parseTreeToPmcfgSyntaxTree.
In particular, DynGenPar can import compiled grammars produced by the GF compiler, in the PGF (Portable Grammatical Framework) file format, automatically converting them to PMCFGs in standard form and providing a GF-compatible lexer. See DynGenPar::Pgf and DynGenPar::PgfParser.
Rules can have constraints attached that require (expect) or forbid (taboo) certain tokens following the rule. This is particularly useful for scannerless parsing, where it allows one to implement the usual context-sensitive scannerless parsing primitives such as maximal matches. It can also be used to enforce grammatically correct word sequences, e.g. for the singular indefinite article ("a" vs. "an") in English. See DynGenPar::NextTokenConstraints and DynGenPar::Rule::nextTokenConstraints.
The following desirable features are currently not supported:
(These features may or may not turn out to be required in practice.)
It is planned to add support for these features to the algorithm as needed, without changing the basic structure.
DynGenPar: Dynamic Generalized Parser
Copyright (C) 2010-2013 Kevin Kofler <kevin.kofler@chello.at>
Copyright (C) 2014-2018 DAGOPT Optimization Technologies GmbH, written by Kevin Kofler <kofler@dagopt.com>
Support by the Austrian Science Fund FWF under contract numbers P20631, P23554 and P22239 is gratefully acknowledged.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
The publication [1] constitutes a suitable citation for academic works. Though not required by the software license, it is kindly requested to include that citation in any research papers using this program.