Glen Pitt-Pladdy :: Blog

HTML Processing (to text) with Perl
For a project I am working on I needed to process some HTML pages and extract some information from them. Simply turning them into text would just about do the job, but ideally I wanted to know a bit about the structure and be able to extract the title, headings, image alt fields etc. After spending a morning searching for a suitable off-the-shelf Perl module and, as you would expect these days, not getting many relevant results, I decided to use HTML::TreeBuilder (part of the HTML::Tree distribution) and then walk the tree myself.

HTML::TreeBuilder

This is rather a useful class as it processes an HTML page into a tree representing the structure of the elements in the page. In my case I am fetching a URL and then building the tree from there. I used LWP::UserAgent which gives me lots of control but there are many other approaches you could use:
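As a rough sketch of the fetch-and-parse step (not the original listing), something along these lines should work. The helper names fetch_html and build_tree, the 30 second timeout and the example URL are my own placeholders; the LWP::UserAgent and HTML::TreeBuilder calls are the standard ones.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# Hypothetical helper: fetch a page and return its decoded body,
# or die with the HTTP status line if the request failed.
sub fetch_html {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new( timeout => 30 );    # timeout is an assumption
    my $response = $ua->get($url);
    die "Fetch of $url failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Hypothetical helper: parse an HTML string into a tree of HTML::Element nodes.
sub build_tree {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;    # tell the parser the document is complete
    return $tree;
}

# Example with a placeholder URL:
# my $tree = build_tree( fetch_html('http://www.example.com/') );
# ... walk the tree ...
# $tree->delete;    # free the tree when finished with it
```

Keeping the fetch separate from the parse means the tree building can be tried out on a literal string without touching the network.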
use LWP::UserAgent;

At this point we have the HTML in tree form, ready to be processed further. Each node is an HTML::Element which links to its child elements.

Walking the tree

There are already some classes around for doing this but it's rather easy to do yourself, which allows a great deal of customisation. This is a simple pre-order traversal. I initially tried a basic stack-based algorithm but quickly realised that recursion makes good sense here as it makes keeping track of parent tags, and processing child elements in that context, much easier. A skeleton recursion routine would be:
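A minimal sketch of such a recursive pre-order walker might look like this. The argument order, the $result parameter and the meaning of the counter (how many of each tag are currently open) are assumptions on my part rather than the original routine; tag and content_list are standard HTML::Element methods.

```perl
use strict;
use warnings;

# Hypothetical signature: element, tag stack ref, tag count ref, result ref.
sub walktree {
    my ( $element, $tagstack, $tagcount, $result ) = @_;
    my $tag = $element->tag;

    # Entering this element: it becomes the innermost open tag.
    push @$tagstack, $tag;
    ++$tagcount->{$tag};

    # ... process the element itself here (pre-order: parent before children) ...

    foreach my $child ( $element->content_list ) {
        if ( ref $child ) {
            # Child is another HTML::Element, so recurse into it.
            walktree( $child, $tagstack, $tagcount, $result );
        }
        else {
            # Child is a plain string of text; handle it as needed.
        }
    }

    # Leaving this element: it is no longer an open ancestor.
    --$tagcount->{$tag};
    pop @$tagstack;
    return;
}
```

Because the push and pop bracket the recursion, the stack always holds the path from the root to the element being processed.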
sub walktree {

To kick off walking the tree you can do something like this:
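Kicking off the walk could be sketched as follows. The walker here is a cut-down stand-in that only records tag names so the snippet runs on its own, and the literal HTML replaces the fetched page; both are assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Stand-in walker (hypothetical): records the tag of every element visited.
sub walktree {
    my ( $element, $tagstack, $tagcount, $result ) = @_;
    my $tag = $element->tag;
    push @$tagstack, $tag;
    ++$tagcount->{$tag};
    push @$result, $tag;
    foreach my $child ( grep { ref } $element->content_list ) {
        walktree( $child, $tagstack, $tagcount, $result );
    }
    --$tagcount->{$tag};
    pop @$tagstack;
    return;
}

# A literal document keeps the example offline.
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><head><title>T</title></head><body><h1>Hi</h1></body></html>');

my %tagcount;    # how many of each tag are currently open
my @tagstack;    # open tags from the root down to the current element
my @result;      # whatever the walker collects
walktree( $tree, \@tagstack, \%tagcount, \@result );
$tree->delete;   # free the tree when finished with it

print join( ' ', @result ), "\n";
```

The tree object is itself the root html element, so it can be passed straight in as the starting point.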
my %tagcount;

Obviously, @result could be %result or even $result, and you can fill it in the routine as is relevant for your application. For example we can make walktree do a basic text extraction into the @result array:
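One way a text-extracting walktree might look is sketched below. Skipping script and style elements, trimming whitespace and picking up image alt text are my own choices for the illustration, and the sample HTML is made up.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

sub walktree {
    my ( $element, $tagstack, $tagcount, $result ) = @_;
    my $tag = $element->tag;

    # Assumption: script and style content is not wanted as text.
    return if $tag eq 'script' or $tag eq 'style';

    push @$tagstack, $tag;
    ++$tagcount->{$tag};

    # Images have no text children, so take the alt attribute instead.
    if ( $tag eq 'img' and defined $element->attr('alt') ) {
        push @$result, $element->attr('alt');
    }

    foreach my $child ( $element->content_list ) {
        if ( ref $child ) {
            walktree( $child, $tagstack, $tagcount, $result );
        }
        else {
            # Plain text: trim it and keep it if anything is left.
            ( my $text = $child ) =~ s/^\s+|\s+$//g;
            push @$result, $text if length $text;
        }
    }

    --$tagcount->{$tag};
    pop @$tagstack;
    return;
}

my $tree = HTML::TreeBuilder->new_from_content(
    '<html><head><title>Demo</title></head><body><h1>Heading</h1>'
  . '<p>Some <b>bold</b> text</p><img src="x.png" alt="A picture"></body></html>');
my ( @tagstack, %tagcount, @result );
walktree( $tree, \@tagstack, \%tagcount, \@result );
$tree->delete;

print join( ' ', @result ), "\n";
```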
sub walktree {

That will leave the text in @result and it can all be put together with something like "join ' ', @result;" or processed further. This isn't aimed at creating a layout or anything, but rather at being able to identify the text in a particular field or extract information like image or link URLs with their context.

Context

Among the things available are @$tagstack and %$tagcount - these can be used to help get context when you are evaluating tags:
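To illustrate, here is a sketch that labels each piece of text by the tags open around it. The labels (title, link, heading, text), the precedence between them and the sample HTML are hypothetical, and it assumes the counter holds the number of currently open instances of each tag.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

sub walktree {
    my ( $element, $tagstack, $tagcount, $result ) = @_;
    my $tag = $element->tag;
    push @$tagstack, $tag;
    ++$tagcount->{$tag};

    foreach my $child ( $element->content_list ) {
        if ( ref $child ) {
            walktree( $child, $tagstack, $tagcount, $result );
            next;
        }

        # Classify the text by its open ancestors.
        my $context =
              $tagcount->{title}                         ? 'title'
            : $tagcount->{a}                             ? 'link'
            : scalar( grep { /^h[1-6]$/ } @$tagstack )   ? 'heading'
            :                                              'text';

        # Store the label, the path of open tags and the text itself.
        push @$result, [ $context, join( ':', @$tagstack ), $child ];
    }

    --$tagcount->{$tag};
    pop @$tagstack;
    return;
}

my $tree = HTML::TreeBuilder->new_from_content(
    '<html><head><title>Demo</title></head><body><h2>News</h2>'
  . '<p>Read <a href="/more">more</a></p></body></html>');
my ( @tagstack, %tagcount, @result );
walktree( $tree, \@tagstack, \%tagcount, \@result );
$tree->delete;

printf "%-8s %-16s %s\n", @$_ for @result;
```

The counter answers "am I anywhere inside an a or title element?" cheaply, while the stack gives the exact path when the nesting order matters.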
print "tagstack: ".join(':',@$tagstack)."\n";

This is a very basic example that shows how to get some context of the element you are processing.
Copyright Glen Pitt-Pladdy 2008-2023