Lagarto Parser
To parse HTML/XML content with Lagarto you must do just two steps:
- create a Lagarto parser instance by providing the HTML content; and then
- invoke parser using the implementation of
TagVisitor
:
In other words:
LagartoParser lagartoParser = new LagartoParser(htmlContent, false); TagVisitor tagVisitor = new FooTagVisitor(); lagartoParser.parse(tagVisitor);
That is it!
Specifications and compatibility
Processing HTML, in general, has two phases:
- parsing HTML - to recognize tokens from the input content; and
- building DOM tree - create and organize DOM tree from the tokens.
Lagarto parser, as its name say, performs the first task. Parsing is done strictly by the HTML5 specification! We put huge effort to add all the parsing rules. So you can use Lagarto for checking if the syntax of some HTML is valid.
Lagarto parser follows HTML5 rules for parsing HTML into tokens.
Lagarto extension can also build a DOM tree; you can read more about this on the following pages.
Events
While content is parsing, Lagarto calls various callback methods of
TagVisitor
. Here is the list of most important callbacks; check the
javadoc for more details.
start() and end()
Invoked before and after content is parsed.
text(CharSequence)
Invoked on plain text.
comment(CharSequence)
Invoked on HTML comment. Argument contains just comment content, without comment boundaries.
tag(Tag)
Invoked on a HTML tag: open, close or empty tag. Argument is a Tag
instance, that contains further information about the tag: tag name,
attributes, depth level and so on.
Tag
instance is reused during visiting HTML code for better
performances! The same instance is passed to all callback methods.
script(Tag, CharSequence)
Invoked on all script tags. Passed is a script Tag
instance and the
body.
See javadoc for other callback methods. Also, you can use
EmptyTagVisitor
to write your own visitors in more convenient ways.
Emit strings while parsing
LagartoParser
emits all the text as CharSequence
. By default, we are
using CharBuffer
for CharSequence
implementation. If the CharSequence
interface works for you, then you are set - and you will have great performances.
However, sometimes CharSequence
API is not enough, or you simply need strings.
For example, to build our DOM tree we store all texts as strings. To get strings,
we call toString()
on char sequences (implemented by char buffers). Overall
that is less performant - because char buffer is created first and then a string.
If you know that you gonna need strings, you can enable option to emit String
as implementation for CharSequence
(instead of CharBuffer
).
Then calling toString()
will have no performance penalties.
To recap: emitStrings
flag determines the implementation of CharSequence
:
CharBuffer
of String
, so you can have the best performances depending on
your usage.
Errors in HTML
Lagarto is very error-friendly. It will try to do its best to resolve
an error and continue parsing. Every error is reported by invoking
error()
method of a visitor. By default, errors are written in the log
as warnings.
By default, Lagarto will append the file offset where the error occurs. Optionally, by enabling property, Lagarto can calculate the real position of the error in the file: it's line and column.
The format of error position is the following: [line:column @offset]
.
Please note that error is not exactly at the calculated character, but
somewhere near.
Writer and adapter
Lagarto may be used not only for parsing and analyzing HTML content,
but also for modifying it. For that purpose you can use TagWriter
and
TagAdapter
.
TagWriter
is a simple TagVisitor
that writes content to some
Appendable
. So, if you pass TagWriter
instance to parse()
method
the resulting code will be the same as the input HTML code.
TagAdapter
is a generic adapter for underlying TagVisitor
. It may be
used over some e.g. TagWriter
to perform some HTML transformation.
This approach allows to perform HTML modifications during just one parsing! For example, if you have 3 different adapters that modifies HTML code in some way, they all can be applied during just single content parsing.
LagartoParserEngine
The core parsing engine is implemented in an abstract class:
LagartoParserEngine
. It's purpose is to be used for creating custom
or specific parsers. LagartoParser
is just simple generic parser built
on top of this parsing engine.
LagartoServletFilter
One common use of Lagarto is as servlet filter. There it may parse and
modify the content before it is written to the output stream. For this
purpose there is LagartoServletFilter
, an abstract filter that can be
overridden.
SimpleLagartoServletFilter
LagartoServletFilter
is quite generic filter and can be used by any
other parsing tool or solution. Anyhow, in most cases, the
SimpleLagartoServletFilter
will be enough. Here user just has to
override one method to build custom LagartoParsingProcessor
. Inside
the implementation, user builds set of nested TagAdapter
s and pass it
to the method invokeLagarto()
.
public class AppLagartoServletFilter extends SimpleLagartoServletFilter { @Override protected LagartoParsingProcessor createParsingProcessor() { return new LagartoParsingProcessor() { @Override protected char[] parse( TagWriter rootTagWriter, HttpServletRequest request) { TagAdapter1 tagAdapter1 = new FooTagAdapter(rootTagWriter); TagAdapter2 tagAdapter2 = new BarTagAdapter(tagAdapter1, request); char[] content = invokeLagarto(tagAdapter2); return content; } }; } }
In above example we created Lagarto servlet filter with two content processors (i.e. tag adapters) that execute during the same single parsing operation. In other words, HTML content is parser only once, but the content is modified by two different processors. In some cases this may save time and prevent from double parsing!