Tuesday, September 27, 2011

Hound: Website crawler

Hound is a website crawler I developed a couple of months ago. Today I'm releasing version 0.11, which includes some bug fixes and new features.

The crawler starts by fetching a given base URL. It then analyses the page's HTML code and searches for other URLs, which are collected and enqueued for analysis in turn.
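
To give an idea of the approach, here is a minimal sketch of such a crawl loop in Python (this is not Hound's actual code, just an illustration):
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the values of href and src attributes found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def crawl(base_url):
    queue, seen = deque([base_url]), {base_url}
    while queue:
        url = queue.popleft()
        html = urlopen(url).read().decode("utf-8", errors="replace")
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)     # enqueue for later analysis
        yield url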

The crawler's behaviour is driven by plugins, and different kinds of plugins affect it in different ways. For example, you can activate certain filter plugins that restrict which URLs the crawler visits. Under most circumstances it is undesirable to let Hound wander off to Google, Facebook or YouTube, so a HostFilter can be used to make the crawler visit only URLs that belong to the base host.
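
The check such a filter performs essentially boils down to comparing hosts; a tiny illustration (not the plugin's actual code):
from urllib.parse import urlparse

def same_host(base_url, url):
    """Accept a URL only if it belongs to the same host as the base URL."""
    return urlparse(url).netloc == urlparse(base_url).netloc

# same_host("http://website.to.crawl.com", "http://www.google.com/")  ->  False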

There are different types of plugins, each executed at a certain phase of the crawling session (a rough sketch of how these phases fit together follows the list):
  • Parsers: These are applied to downloaded HTML in order to normalise it, so that the other plugins can collect data without worrying about aspects such as character encodings or HTML entities.
  • Crawl filters: These are applied to every URL found. If a crawl filter matches a URL, that URL is discarded. Host, extension and network filters are examples.
  • Collect filters: These are applied to collected URLs to keep them out of the crawling results. Again, you might not want to include Google links in the results.
  • Form collect filters: These are applied to form tags found in HTML files.
  • Header filters: These are applied to a downloaded file's headers; they are typically used to filter by MIME type.
  • Collectors: The most important plugins. They analyse the HTML, parse attributes such as href and src, and feed the crawler with new URLs.
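
The plugin interface itself isn't reproduced here; the following sketch only illustrates roughly where each plugin type comes into play during one crawling step (all names and methods are hypothetical):
def process_page(url, headers, raw_html, plugins, results):
    """Apply the plugin phases to one downloaded page (hypothetical API)."""
    # Header filters: discard the file early, e.g. because of its MIME type.
    if any(f.rejects(headers) for f in plugins["header_filters"]):
        return []
    # Parsers: normalise the HTML (encoding, entities) for the other plugins.
    html = raw_html
    for parser in plugins["parsers"]:
        html = parser.normalise(html)
    # Collectors: extract new URLs (and forms) from the normalised HTML.
    found = [u for c in plugins["collectors"] for u in c.collect(url, html)]
    # Collect filters: keep matching URLs out of the crawling results.
    results.extend(u for u in found
                   if not any(f.matches(u) for f in plugins["collect_filters"]))
    # Crawl filters: decide which of the found URLs will actually be visited.
    return [u for u in found
            if not any(f.matches(u) for f in plugins["crawl_filters"])]
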
The file hound.conf contains a list of active plugins, including their arguments. Once you've picked the right configuration, you can start a crawling session by executing:
./hound http://website.to.crawl.com
By default, the output is only written to stdout. If you want to store it in a file, use the -o parameter followed by a path; the results will then be written both to stdout and to the given file. If you don't want to write results to stdout, use the -n parameter.
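
For example, to keep a copy of the results in the file used in the examples below (the option placement shown here is assumed):
./hound -o /tmp/hound.out http://website.to.crawl.com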

Once the crawl session has ended, you can use hound to parse the results. Run the following command to list the URLs found:
./hound -i /tmp/hound.out -p urls
Where /tmp/hound.out is the output file used during the crawling session. You can always parse the results manually, since they're stored in text files. To list all the form tags found, execute:
./hound -i /tmp/hound.out -p forms
Which will print something like:
0 POST http://blablabla/search cms_search --- hidden +++ query --- text +++ commit --- image
1 POST http://blablabla/contact/send article_id --- hidden +++ subject --- text +++ sender_name --- text +++ sender_mail --- text +++ reset --- reset
The number on the left of each line identifies the form; this is how Hound encodes forms. To generate the HTML code for a given form id, run:
./hound -i /tmp/hound.out -p form:0
Which will print the HTML code for the first form.
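
Judging by the encoded line above, the output for form 0 would look roughly like this (the exact markup Hound generates may differ):
<form method="POST" action="http://blablabla/search">
  <input type="hidden" name="cms_search" />
  <input type="text" name="query" />
  <input type="image" name="commit" />
</form>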

To download Hound, visit the SourceForge project's site. It is open source and written in Python, so you can have a look at the code and create new plugins to suit your needs.
