Glen Pitt-Pladdy :: Blog

HTML Processing (to text) with Perl

For a project I am working on, I needed to process some HTML pages and extract information from them. Simply turning them into text would just about do the job, but ideally I wanted to know a bit about the structure as well, so that I could extract the title, headings, image alt attributes and so on.

After spending a morning searching for a suitable off-the-shelf Perl module and, as you would expect these days, not getting many relevant results, I decided to use HTML::TreeBuilder (part of the HTML::Tree distribution) and then walk the tree myself.

HTML::TreeBuilder

This is a rather useful class: it parses an HTML page into a tree representing the structure of the elements in the page. In my case I am fetching a URL and then building the tree from the response. I used LWP::UserAgent, which gives me lots of control, but there are many other approaches you could use:

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# create the UserAgent
my $ua = LWP::UserAgent->new;
# set the User-Agent string sent with requests
$ua->agent ( 'Some name for your agent' );
# optional: route requests through a proxy (remove this line if not needed)
$ua->proxy ( ['http', 'https', 'ftp'], 'http://proxy.domain.tld:3128/' );
my $data = $ua->get( 'http://www.pitt-pladdy.com/blog/_20120225-154058_0000_HTML_Processing_to_text_with_Perl/' );
print $data->code."\n";
# stop here if the fetch failed rather than parsing an error page
die "Fetch failed: ".$data->status_line."\n" unless $data->is_success;
my $htmltree = HTML::TreeBuilder->new_from_content ( $data->decoded_content() );

At this point we have the HTML in tree form, ready to be processed further. Each node is an HTML::Element whose content list holds its child elements and any text, the text being plain strings rather than objects.
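To get a feel for what the tree looks like before walking it, here is a minimal sketch that parses a small inline page instead of fetching one. The HTML string and variable names are made up for illustration; tag(), look_down(), as_text() and content_list() are standard HTML::Element methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A small hypothetical page, so the sketch runs without network access
my $html = '<html><head><title>Demo page</title></head>'
         . '<body><h1>First heading</h1><p>Some <b>bold</b> text.</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# The root of the tree is the <html> element
my $root_tag = $tree->tag();
print "root: $root_tag\n";

# look_down() in scalar context returns the first matching element (or undef)
my $title      = $tree->look_down( _tag => 'title' );
my $title_text = $title ? $title->as_text() : '';
print "title: $title_text\n";

# content_list() mixes child elements (objects) and text (plain strings)
my $para  = $tree->look_down( _tag => 'p' );
my @kinds = map { ref($_) ? $_->tag() : 'text' } $para->content_list();
print "children of p: @kinds\n";
```

The ref() check on each child is the same test the tree walker relies on to tell elements from text.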

Walking the tree

There are already some classes around for doing this, but it's rather easy to do yourself and that allows a great deal of customisation. This is a simple pre-order traversal. I initially tried a basic stack-based algorithm but quickly realised that recursion makes good sense here, as it makes it much easier to keep track of parent tags and to process child elements in that context.
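For comparison, a stack-based pre-order traversal might look something like the sketch below. This is not the code I originally tried, just an illustration on a made-up page; @stack and @order are names chosen for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content(
    '<html><head><title>T</title></head><body><p>One <b>two</b></p></body></html>'
);

my @order;              # tags in the order they are visited
my @stack = ($tree);    # nodes still waiting to be visited
while (@stack) {
    my $node = pop @stack;
    next unless ref($node);    # plain text is skipped in this sketch
    push @order, $node->tag();
    # push the children reversed so that the first child is popped first
    push @stack, reverse $node->content_list();
}
print join( ' ', @order ), "\n";
```

The visiting order is right, but when a node is popped there is no record of which tags enclose it, so tracking parents needs extra bookkeeping. With recursion the call stack provides that for free.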

A skeleton recursion routine would be:

sub walktree {
    my ( $element, $tagstack, $tagcount, $result ) = @_;
    # record that we are now inside this tag
    push @$tagstack, $element->tag();
    if ( ! exists $$tagcount{$element->tag()} ) { $$tagcount{$element->tag()} = 0; }
    ++$$tagcount{$element->tag()};
    # TODO deal with this tag here TODO
    my @children = $element->content_list ();
    if ( @children ) {
        # further child elements (or text) below this
        foreach (@children) {
            if ( ref $_ ) {
                walktree ( $_, $tagstack, $tagcount, $result );
            } else {
                # must be text
                # TODO deal with this text here TODO
            }
        }
    }
    # leaving this tag: undo the bookkeeping
    --$$tagcount{pop @$tagstack};
}

To kick off walking the tree you can do something like this:

my %tagcount;
my @tagstack;
my @result;
walktree ( $htmltree, \@tagstack, \%tagcount, \@result );

Obviously, @result could be %result or even $result, and you can fill it in the routine in whatever way is relevant for your application. For example, we can make walktree do a basic text extraction into the @result array:

sub walktree {
    my ( $element, $tagstack, $tagcount, $result ) = @_;
    # record that we are now inside this tag
    push @$tagstack, $element->tag();
    if ( ! exists $$tagcount{$element->tag()} ) { $$tagcount{$element->tag()} = 0; }
    ++$$tagcount{$element->tag()};
    # deal with this tag here
    if ( $element->tag() eq 'img' and defined $element->attr ( 'alt' ) ) {
        # only keep alt text when the attribute is actually present
        push @$result, $element->attr ( 'alt' );
    }
    my @children = $element->content_list ();
    if ( @children ) {
        # further child elements (or text) below this
        foreach (@children) {
            if ( ref $_ ) {
                walktree ( $_, $tagstack, $tagcount, $result );
            } else {
                # must be text: skip script and style content
                if ( $element->tag() ne 'script'
                    and $element->tag() ne 'style' ) {
                    push @$result, $_;
                }
            }
        }
    }
    # leaving this tag: undo the bookkeeping
    --$$tagcount{pop @$tagstack};
}

That will leave the text fragments in @result, which can all be put together with something like "join ' ', @result;" or processed further.
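As a rough sketch of that further processing, the fragments can be filtered, joined and tidied up like this. The contents of @result here are invented sample data, including an undefined entry such as an image without alt text might produce.

```perl
use strict;
use warnings;

# Hypothetical fragments as they might come back from walking a small page
my @result = ( 'Demo page', ' First   heading ', undef, "Some\ntext", '' );

# Drop undefined and empty fragments, then join with single spaces
my $text = join ' ', grep { defined($_) && length($_) } @result;

# Collapse runs of whitespace and trim both ends
$text =~ s/\s+/ /g;
$text =~ s/^\s+|\s+$//g;

print "$text\n";
```

Filtering before joining avoids "uninitialized value" warnings from join and stops empty fragments adding stray spaces.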

This isn't aimed at reproducing the page layout; the aim is to identify the text in a particular field, or to extract information like image or link URLs along with their context.
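To illustrate pulling out particular fields, here is a minimal sketch that fills a hash of arrays rather than a flat array. The routine name walkfields and the keys title, heading, alt and link are assumptions made up for this example, as is the sample page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Collect selected fields into a hash of arrays keyed by the kind of field
sub walkfields {
    my ( $element, $result ) = @_;
    my $tag = $element->tag();
    if ( $tag eq 'title' ) {
        push @{ $$result{title} }, $element->as_text();
    } elsif ( $tag =~ /^h[1-6]$/ ) {
        # as_text() gathers the text of all descendants of the heading
        push @{ $$result{heading} }, $element->as_text();
    } elsif ( $tag eq 'img' and defined $element->attr('alt') ) {
        push @{ $$result{alt} }, $element->attr('alt');
    } elsif ( $tag eq 'a' and defined $element->attr('href') ) {
        push @{ $$result{link} }, $element->attr('href');
    }
    # recurse into child elements only; plain text strings are skipped
    walkfields( $_, $result ) foreach grep { ref($_) } $element->content_list();
}

my $tree = HTML::TreeBuilder->new_from_content(
    '<html><head><title>Demo</title></head><body>'
  . '<h1>First <em>heading</em></h1>'
  . '<p><a href="/next"><img src="a.png" alt="Next page"></a></p>'
  . '</body></html>'
);
my %result;
walkfields( $tree, \%result );
foreach my $field ( sort keys %result ) {
    print "$field: ", join( ' | ', @{ $result{$field} } ), "\n";
}
```

URLs come back exactly as written in the page, so relative links would still need resolving against the page address.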

Context

Inside walktree you also have @$tagstack and %$tagcount available. These can be used to help get context when you are evaluating tags:

print "tagstack: ".join(':',@$tagstack)."\n";
if ( $$tagcount{'a'} ) { print "within link\n"; }

This is a very basic example that shows how to get some context of the element you are processing.
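Putting that to work, the sketch below records only text that sits somewhere inside a link, together with the tag path at that point. The routine name walklinks, the path/text keys and the sample page are all invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Variant of walktree that keeps text found inside links, with its tag path
sub walklinks {
    my ( $element, $tagstack, $tagcount, $result ) = @_;
    push @$tagstack, $element->tag();
    ++$$tagcount{ $element->tag() };
    foreach my $child ( $element->content_list() ) {
        if ( ref($child) ) {
            walklinks( $child, $tagstack, $tagcount, $result );
        } elsif ( $$tagcount{'a'} ) {
            # text below an <a>, however deeply it is nested
            push @$result, { path => join( ':', @$tagstack ), text => $child };
        }
    }
    --$$tagcount{ pop @$tagstack };
}

my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body><p>See <a href="/x"><b>this</b> page</a> now</p></body></html>'
);
my ( @tagstack, %tagcount, @links );
walklinks( $tree, \@tagstack, \%tagcount, \@links );
printf "%s: %s\n", $_->{path}, $_->{text} foreach @links;
```

Because %$tagcount counts open tags rather than looking only at the immediate parent, the text inside the nested b element is still recognised as being within the link.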
