class LibXML::XML::HTMLParser

The HTML parser implements an HTML 4.0 non-verifying parser with an API compatible with XML::Parser. In contrast with XML::Parser, it can parse “real world” HTML, even if it is severely broken from a specification point of view.

The HTML parser creates an in-memory document object that consists of any number of XML::Node instances. This is a simple and powerful model, but it has one major limitation: the size of the document that can be processed is bounded by the amount of memory available.

Using the HTML parser is simple:

parser = XML::HTMLParser.file('my_file')
doc = parser.parse

You can also parse documents (see XML::HTMLParser.document), strings (see XML::HTMLParser.string) and IO objects (see XML::HTMLParser.io).
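
As a minimal sketch of the string entry point mentioned above, the following parses a small HTML string and reads back its title. It assumes the libxml-ruby gem is installed and that the resulting document supports find_first with an XPath expression; SAMPLE_HTML and page_title are hypothetical names used only for illustration.

```ruby
begin
  require 'libxml' # provided by the libxml-ruby gem (assumed to be installed)
rescue LoadError
  warn 'libxml-ruby is not installed; the example below is skipped'
end

# Hypothetical sample input for illustration.
SAMPLE_HTML = '<html><head><title>Hello</title></head>' \
              '<body><p>First paragraph</p></body></html>'

# Parse an HTML string and return the text of its <title> element.
def page_title(html)
  parser = LibXML::XML::HTMLParser.string(html)
  doc = parser.parse
  doc.find_first('//title').content
end

puts page_title(SAMPLE_HTML) if defined?(LibXML)
```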

Attributes

input[R]

Attributes

Public Class Methods

XML::HTMLParser.file(path) → XML::HTMLParser
XML::HTMLParser.file(path, :encoding => XML::Encoding::UTF_8,
:options => XML::HTMLParser::Options::NOENT) → XML::HTMLParser

Creates a new parser by parsing the specified file or URI.

You may provide an optional hash table to control how the parsing is performed. Valid options are:

encoding - The document encoding, defaults to nil. Valid values
           are the encoding constants defined on XML::Encoding.
options - Parser options.  Valid values are the constants defined on
          XML::HTMLParser::Options.  Multiple options can be combined
          by using Bitwise OR (|).
# File lib/libxml/html_parser.rb, line 21
def self.file(path, options = {})
  context = XML::HTMLParser::Context.file(path)
  context.encoding = options[:encoding] if options[:encoding]
  context.options = options[:options] if options[:options]
  self.new(context)
end
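
The option list above says that several options can be combined with bitwise OR. The sketch below shows only that mechanic, using stand-in constants in a hypothetical DemoOptions module; the bit values are illustrative and are not read from the library.

```ruby
# Stand-in bit flags for illustration only; the real constants live on
# XML::HTMLParser::Options and their values are not taken from the library here.
module DemoOptions
  RECOVER   = 1 << 0
  NOERROR   = 1 << 5
  NOWARNING = 1 << 6
end

# Each flag occupies its own bit, so OR-ing them yields one integer mask.
COMBINED = DemoOptions::RECOVER | DemoOptions::NOERROR | DemoOptions::NOWARNING

# True when every bit of +flag+ is present in +mask+.
def option_set?(mask, flag)
  (mask & flag) == flag
end

puts COMBINED
puts option_set?(COMBINED, DemoOptions::NOERROR)
```

With the real library, a mask built the same way would be passed as the :options entry of the hash, assuming the constants you combine are defined in your version.
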
XML::HTMLParser.io(io) → XML::HTMLParser
XML::HTMLParser.io(io, :encoding => XML::Encoding::UTF_8,
:options => XML::HTMLParser::Options::NOENT,
:base_uri => "http://libxml.org") → XML::HTMLParser

Creates a new parser by parsing the specified IO object.

Parameters:

io - IO object that contains the HTML to parse.
base_uri - The base URL for the parsed document.
encoding - The document encoding, defaults to nil. Valid values
           are the encoding constants defined on XML::Encoding.
options - Parser options.  Valid values are the constants defined on
          XML::HTMLParser::Options.  Multiple options can be combined
          by using Bitwise OR (|).
# File lib/libxml/html_parser.rb, line 45
def self.io(io, options = {})
  context = XML::HTMLParser::Context.io(io)
  context.base_uri = options[:base_uri] if options[:base_uri]
  context.encoding = options[:encoding] if options[:encoding]
  context.options = options[:options] if options[:options]
  self.new(context)
end
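
A short sketch of the IO entry point, assuming the libxml-ruby gem is installed and that any object responding to read (here a StringIO) is accepted as the io argument. IO_SAMPLE_HTML and first_paragraph are hypothetical names.

```ruby
require 'stringio'

begin
  require 'libxml' # provided by the libxml-ruby gem (assumed to be installed)
rescue LoadError
  warn 'libxml-ruby is not installed; the example below is skipped'
end

# Hypothetical sample input for illustration.
IO_SAMPLE_HTML = '<html><body><p>From an IO</p></body></html>'

# Parse HTML read from an IO-like object and return the first paragraph's text.
def first_paragraph(io)
  doc = LibXML::XML::HTMLParser.io(io).parse
  doc.find_first('//p').content
end

puts first_paragraph(StringIO.new(IO_SAMPLE_HTML)) if defined?(LibXML)
```
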
XML::HTMLParser.initialize → parser

Initializes a new parser instance from the given context. The context argument is required: an ArgumentError is raised if none is passed.

static VALUE rxml_html_parser_initialize(int argc, VALUE *argv, VALUE self)
{
  VALUE context = Qnil;

  rb_scan_args(argc, argv, "01", &context);

  if (context == Qnil)
  {
    rb_raise(rb_eArgError, "An instance of a XML::Parser::Context must be passed to XML::HTMLParser.new");
  }

  rb_ivar_set(self, CONTEXT_ATTR, context);
  return self;
}
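
The C constructor above accepts zero or one argument (the "01" spec given to rb_scan_args) and then rejects a nil context. A rough Ruby equivalent of that check, written against a hypothetical DemoParser class rather than the real one, might look like this:

```ruby
# Stand-in class mirroring the argument check in the C constructor above.
class DemoParser
  attr_reader :context

  # The argument is syntactically optional, but nil is rejected explicitly.
  def initialize(context = nil)
    raise ArgumentError, 'a context must be passed to DemoParser.new' if context.nil?

    @context = context
  end
end

begin
  DemoParser.new
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```
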
XML::HTMLParser.string(string) → XML::HTMLParser
XML::HTMLParser.string(string, :encoding => XML::Encoding::UTF_8,
:options => XML::HTMLParser::Options::NOENT,
:base_uri => "http://libxml.org") → XML::HTMLParser

Creates a new parser by parsing the specified string.

You may provide an optional hash table to control how the parsing is performed. Valid options are:

base_uri - The base URL for the parsed document.
encoding - The document encoding, defaults to nil. Valid values
           are the encoding constants defined on XML::Encoding.
options - Parser options.  Valid values are the constants defined on
          XML::HTMLParser::Options.  Multiple options can be combined
          by using Bitwise OR (|).
# File lib/libxml/html_parser.rb, line 70
def self.string(string, options = {})
  context = XML::HTMLParser::Context.string(string)
  context.base_uri = options[:base_uri] if options[:base_uri]
  context.encoding = options[:encoding] if options[:encoding]
  context.options = options[:options] if options[:options]
  self.new(context)
end
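
The following sketch passes an options hash to the string entry point in order to parse deliberately broken markup. It assumes the libxml-ruby gem is installed and that the RECOVER, NOERROR and NOWARNING constants are defined on XML::HTMLParser::Options in your version; BROKEN_HTML and bold_text are hypothetical names.

```ruby
begin
  require 'libxml' # provided by the libxml-ruby gem (assumed to be installed)
rescue LoadError
  warn 'libxml-ruby is not installed; the example below is skipped'
end

# Hypothetical input: no <html>/<body>, and neither tag is closed.
BROKEN_HTML = '<p>unclosed <b>bold'

# Parse lenient HTML and return the text inside the first <b> element.
def bold_text(html)
  opts = LibXML::XML::HTMLParser::Options::RECOVER |
         LibXML::XML::HTMLParser::Options::NOERROR |
         LibXML::XML::HTMLParser::Options::NOWARNING
  doc = LibXML::XML::HTMLParser.string(html, :options => opts).parse
  doc.find_first('//b').content
end

puts bold_text(BROKEN_HTML) if defined?(LibXML)
```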

Public Instance Methods

parse → XML::Document

Parses the input HTML and creates an XML::Document with its content. If a parse error occurs and the context is not in recovery mode, XML::Parser::ParseError is raised.

static VALUE rxml_html_parser_parse(VALUE self)
{
  xmlParserCtxtPtr ctxt;
  VALUE context = rb_ivar_get(self, CONTEXT_ATTR);
  
  Data_Get_Struct(context, xmlParserCtxt, ctxt);

  if (htmlParseDocument(ctxt) == -1 && ! ctxt->recovery)
  {
    rxml_raise(&ctxt->lastError);
  }

  rb_funcall(context, rb_intern("close"), 0);

  return rxml_document_wrap(ctxt->myDoc);
}
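
Because parse can raise when recovery is off, callers may want to wrap it. The sketch below is one way to do that, assuming the libxml-ruby gem is installed. It rescues LibXML::XML::Error, which is an assumption about the error class in recent releases; the description above names XML::Parser::ParseError, so adjust the rescue clause to match your version. CLEAN_HTML and parse_or_nil are hypothetical names.

```ruby
begin
  require 'libxml' # provided by the libxml-ruby gem (assumed to be installed)
rescue LoadError
  warn 'libxml-ruby is not installed; the example below is skipped'
end

# Hypothetical sample input for illustration.
CLEAN_HTML = '<html><head><title>Ok</title></head><body><p>Fine</p></body></html>'

# Return the parsed document, or nil if the parser raised an error.
def parse_or_nil(html)
  LibXML::XML::HTMLParser.string(html).parse
rescue LibXML::XML::Error => e # assumed error class; see the note above
  warn "HTML parse failed: #{e.message}"
  nil
end

if defined?(LibXML)
  doc = parse_or_nil(CLEAN_HTML)
  puts doc.root.name unless doc.nil?
end
```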