Create a PHP web crawler or scraper in 5 minutes
Using PHP, this tutorial shows how to build an easily extendable web crawler in under 5 minutes that collects images and links from a page.
The Crawler Framework
First we need to create the crawler class as follows:
<?php
class Crawler {
}
?>
Next we create methods to fetch a web page's markup and to parse it for the data we want to collect. The only public methods are getMarkup() and get(). The parsing methods are meant for the crawler's internal use, but their visibility is set to protected rather than private, since you never know who will want to extend the class.
<?php
class Crawler {
protected $markup = '';
public function __construct($uri) {
}
public function getMarkup($uri) {
}
public function get($type) {
}
protected function _get_images() {
}
protected function _get_links() {
}
}
?>
Fetching Site Markup
The constructor accepts a URI, so the class can be instantiated with new Crawler('http://vision-media.ca'). It sets our $markup property using PHP's file_get_contents() function, which fetches the site's markup. Note that file_get_contents() can only open URLs when the allow_url_fopen setting is enabled.
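As an aside, file_get_contents() returns FALSE and raises a warning when a request fails, and a slow host can stall the script. A more defensive fetch might look like the sketch below. The function name fetchMarkup(), its parameters and the user agent string are assumptions made for illustration and are not part of the tutorial's class. The demonstration uses PHP's data:// wrapper so it runs without a network connection.

```php
<?php
// Hypothetical helper, not part of the tutorial's Crawler class.
// Fetches a URI with a timeout and returns '' instead of FALSE on failure.
function fetchMarkup($uri, $timeoutSeconds = 5) {
    // The 'http' options only apply to http:// and https:// URIs.
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => $timeoutSeconds,      // read timeout in seconds
            'user_agent' => 'ExampleCrawler/0.1', // assumed value, pick your own
        ),
    ));
    // @ suppresses the warning that file_get_contents() raises on failure.
    $markup = @file_get_contents($uri, false, $context);
    return $markup === false ? '' : $markup;
}

// Offline demonstration: a data:// URI stands in for a live site.
$sample = 'data://text/plain;base64,' . base64_encode('<p>Hello</p>');
echo fetchMarkup($sample), "\n";                    // prints the sample markup
var_dump(fetchMarkup('/no/such/file.html') === ''); // a failed fetch yields ''
```

Returning an empty string keeps the !empty($this->markup) checks used by the parsing methods working unchanged.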
<?php
public function __construct($uri) {
$this->markup = $this->getMarkup($uri);
}
public function getMarkup($uri) {
return file_get_contents($uri);
}
?>
Crawling The Markup For Data
Our get() method accepts a $type string, which is used to invoke the method that does the actual processing. As you can see below, we build the method name as a string and check that it exists before calling it, so developers can simply write $crawl->get('images');
We set the visibility of _get_images() and _get_links() to protected so that developers use our public get() method rather than trying to invoke them directly.
Each protected data collection method uses the PCRE (Perl Compatible Regular Expressions) function preg_match_all() to find every tag in the markup that matches our patterns, /<img([^>]+)\/>/i and /<a([^>]+)\>(.*?)\<\/a\>/i. Each method returns the captured attribute strings (everything between the tag name and the closing bracket), or FALSE when nothing matches. Note that the image pattern only matches self-closing <img ... /> tags. For more information on regular expressions visit http://en.wikipedia.org/wiki/Regular_expression
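To see what these two patterns capture, here is a small standalone sketch that runs them against a made-up markup string. The sample markup and the helper extractAttribute() are assumptions for illustration only. The helper shows one possible way to turn a captured attribute string into a single value such as the href.

```php
<?php
// Sample markup standing in for a fetched page (made up for this sketch).
$markup = '<p><a href="/about">About</a> '
        . '<img src="logo.png" alt="Logo" /> '
        . '<a class="ext" href="http://example.com/">Example</a></p>';

// The tutorial's two patterns, unchanged.
preg_match_all('/<img([^>]+)\/>/i', $markup, $images);
preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $markup, $links);

// Index 1 holds the first capture group: the raw attribute string of each tag.
// For links, index 2 holds the second group: the text between <a> and </a>.
print_r($images[1]);
print_r($links[1]);
print_r($links[2]);

// Hypothetical helper: pull one double-quoted attribute value out of an
// attribute string such as ' class="ext" href="http://example.com/"'.
function extractAttribute($attributes, $name) {
    $pattern = '/\b' . preg_quote($name, '/') . '\s*=\s*"([^"]*)"/i';
    return preg_match($pattern, $attributes, $m) ? $m[1] : null;
}

$hrefs = array();
foreach ($links[1] as $attributes) {
    $hrefs[] = extractAttribute($attributes, 'href');
}
print_r($hrefs);
```

The helper only handles double-quoted values. Single-quoted or unquoted attributes would need a broader pattern.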
<?php
public function get($type) {
$method = "_get_{$type}";
if (method_exists($this, $method)){
return $this->$method(); // call_user_method() was removed in PHP 7
}
}
protected function _get_images() {
if (!empty($this->markup)){
preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
return !empty($images[1]) ? $images[1] : FALSE;
}
}
protected function _get_links() {
if (!empty($this->markup)){
preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
return !empty($links[1]) ? $links[1] : FALSE;
}
}
?>
Final PHP Web Crawler Code And Usage
<?php
class Crawler {
protected $markup = '';
public function __construct($uri) {
$this->markup = $this->getMarkup($uri);
}
public function getMarkup($uri) {
return file_get_contents($uri);
}
public function get($type) {
$method = "_get_{$type}";
if (method_exists($this, $method)){
return $this->$method(); // call_user_method() was removed in PHP 7
}
}
protected function _get_images() {
if (!empty($this->markup)){
preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
return !empty($images[1]) ? $images[1] : FALSE;
}
}
protected function _get_links() {
if (!empty($this->markup)){
preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
return !empty($links[1]) ? $links[1] : FALSE;
}
}
}
$crawl = new Crawler('http://vision-media.ca');
$images = $crawl->get('images');
$links = $crawl->get('links');
?>
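Because get() builds the method name from its argument, a new data type can be added by defining one more protected _get_*() method in a subclass. The sketch below illustrates this with a hypothetical HeadingCrawler class. The class name, the _get_headings() method and the canned markup are assumptions for illustration. A compact copy of the base class is included only so the example runs on its own, and getMarkup() is overridden so no network access is needed.

```php
<?php
// Compact copy of the tutorial's class so this sketch is self-contained.
class Crawler {
    protected $markup = '';
    public function __construct($uri) {
        $this->markup = $this->getMarkup($uri);
    }
    public function getMarkup($uri) {
        return file_get_contents($uri);
    }
    public function get($type) {
        $method = "_get_{$type}";
        if (method_exists($this, $method)) {
            return $this->$method();
        }
    }
    protected function _get_links() {
        if (!empty($this->markup)) {
            preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
            return !empty($links[1]) ? $links[1] : FALSE;
        }
    }
}

// Hypothetical subclass: adds a "headings" type and serves canned markup.
class HeadingCrawler extends Crawler {
    public function getMarkup($uri) {
        // Canned markup for an offline demo; a real subclass would not override this.
        return '<h1>Welcome</h1><a href="/a">A</a><h2 class="sub">News</h2>';
    }
    protected function _get_headings() {
        if (!empty($this->markup)) {
            preg_match_all('/<h[1-6][^>]*>(.*?)<\/h[1-6]>/si', $this->markup, $headings);
            return !empty($headings[1]) ? $headings[1] : FALSE;
        }
    }
}

$crawl = new HeadingCrawler('http://example.com/'); // URI is ignored by the canned getMarkup()
print_r($crawl->get('headings')); // the text of each heading
print_r($crawl->get('links'));    // inherited behaviour still works
var_dump($crawl->get('videos'));  // no _get_videos() method exists, so get() returns NULL
```

Note that get() returns NULL for an unknown type and FALSE for a known type with no matches, so callers should check the result before looping over it.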
Comments
Thanks for this awesome tutorial. I searched for a long time until I found this great tutorial about programming a PHP crawler!
How could I make this spider store only the sites with the "dofollow" attribute?
Thanks a lot for sharing this.
I'll try out this code. It helps me learn both OOP and pattern matching!
I wrote a simple crawler about a month ago that dealt with going through the directories, opening/reading the files, filtering the files for PHP and HTML code, then putting that into a database. It didn't turn out so well.
But this crawler....WOW. Simple yet concise! It gives the post-processed PHP, along with the associated links (now to just make the function recursive with the links, and I'm all set)! No filtering, no directory finding, no hard "if..then" functions....just simplicity.
Very nice! Thank you for providing this crawler example!
I admire the thought process that has gone into this...but you will get a lot further faster with Perl + WWW::Mechanize + HTML::TreeBuilder.
Once a result is fetched you can do simple things like $h->look_down("alt", undef), which would deliver you an array of all tags where you have an alt attribute but it's empty (ie '').
Didn't work on my host, but it gives me enough idea to start writing up my own script.
Here is a recursive perl version of your code
http://codediaries.blogspot.com/2009/11/how-to-write-simple-recursive-we...
Hi all,
I've pasted the code into an index.php file, but I get the following timeout, even if I change the URL to scan.
Warning: file_get_contents(http://www.corriere.it) [function.file-get-contents]: failed to open stream: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. in C:\xampp\htdocs\test\crawler\index.php on line 11
Fatal error: Maximum execution time of 60 seconds exceeded in C:\xampp\htdocs\test\crawler\index.php on line 11
Can someone help me?
Thanks
Add this to the end of the Script to see the contents of the fetched array...Nice tutorial.
print_r($crawl);
print_r($links);
now i can web scrape your captcha question and post from my bot! just kidding. : )
your script works nicely.
Hi there, I'm using PHP 5 and this code doesn't seem to work. I'm trying the mods laid out by Anonymous. Is there any updated code for this scraper? Thanks.
I used your script and I cannot see any results. When I load it into a browser there is nothing. What am I doing wrong?
Here is my version of your code:
<?php
class Crawler {
protected $markup='';
public function __construct($uri){
$this->markup = $this->getMarkup($uri);
}
public function getMarkup($uri) {
$ch = curl_init();
$timeout = 5;
curl_setopt ($ch, CURLOPT_URL, $uri);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$contents = curl_exec($ch);
curl_close($ch);
return $contents;
}
public function get($type){
$method = "_get_{$type}";
if (method_exists($this, $method)){
return call_user_method($method, $this);
}
}
protected function _get_images(){
if(!empty($this->markup)){
preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
return !empty($images[1]) ? $images[1] : FALSE;
}
}
protected function _get_links(){
if(!empty($this->markup)){
preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
return !empty($links[1]) ? array_flip(array_flip($links[1])) : FALSE;
}
}
protected function _get_pagetitle() {
if (!empty($this->markup)){
preg_match_all('/<title>(.*?)\<\/title\>/si', $this->markup, $pagetitles); // si for multi line
return !empty($pagetitles[1]) ? $pagetitles[1] : FALSE;
}
}
}
$crawl = new Crawler('http://www.ritacaz.com');
$images = $crawl->get('images');
$links = $crawl->get('links');
?>
<?php echo($links);
?>
Here are a few function changes and additions to help out with some of those errors
public function getMarkup($uri) {
$ch = curl_init();
$timeout = 5;
curl_setopt ($ch, CURLOPT_URL, $uri);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$contents = curl_exec($ch);
curl_close($ch);
return $contents;
}
I built a crawler, and to speed things up I used this to get unique URLs in my array:
protected function _get_links() {
if (!empty($this->markup)){
preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
return !empty($links[1]) ? array_flip(array_flip($links[1])) : FALSE;
}
}
To grab a page title (sometimes title tags span multiple lines, so I added the "s" modifier):
protected function _get_pagetitle() {
if (!empty($this->markup)){
preg_match_all('/<title>(.*?)\<\/title\>/si', $this->markup, $pagetitles); // si for multi line
return !empty($pagetitles[1]) ? $pagetitles[1] : FALSE;
}
}
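For readers curious about the array_flip(array_flip()) trick used above to remove duplicates, the following sketch compares it with array_unique() on a made-up list of attribute strings. Both approaches drop duplicates. They differ in which keys survive, which matters if the result is later accessed by index.

```php
<?php
// Made-up attribute strings with one duplicate.
$links = array(' href="/a"', ' href="/b"', ' href="/a"');

// Double flip: values become keys (duplicates collapse), then flip back.
// The keys are no longer sequential, and the last duplicate's key wins.
$flipped = array_flip(array_flip($links));

// array_unique() keeps the first occurrence; array_values() reindexes from 0.
$unique = array_values(array_unique($links));

print_r($flipped);
print_r($unique);
```

array_flip() only works when the values are strings or integers, which holds for the attribute strings returned by the crawler.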
Tutorial is very good.
But I would like to capture a screenshot of those links/URLs. Is it possible?
Thanks,
Nag.
This crawler collects images and links, which is fine,
but I don't see any output. Why?
Should I store the results in a database and then read them back on my page to see them?
How do I do that? Could you explain in detail, please? Thanks.
Best Regards,
Floriano
What error are you getting? It might be a PHP version issue. I believe method visibility was introduced in PHP 5, but I'm not sure.
Hello
The crawler is not working for me. Why?
I have not changed anything in the source code; I uploaded it to the server without any changes.
Do I have to change something to see it working?
I want to extract currency exchange rates and run it every 3 minutes.
Can you help?
Thank you for everything.
Best Regards,
Floriano
Hi,
This is very helpful. We can extend the crawling: first crawl a website that has a number of external links (say Yahoo.com), which gives us plenty of URIs, and then crawl those recursively until we reach a website with no external links.
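The recursive idea described here can be sketched as a breadth-first loop with a visited list and a page limit, so the crawl always terminates even when sites link back to each other. Everything below is an assumption for illustration: the function crawlSite(), its parameters, the simplified href pattern, and the three made-up pages that stand in for the web so the example runs offline.

```php
<?php
// Hypothetical breadth-first crawl. $fetch is any callable that maps a URI to
// markup (or '' on failure); $maxPages caps the crawl so it always terminates.
function crawlSite($startUri, $fetch, $maxPages = 50) {
    $queue   = array($startUri);
    $visited = array(); // URI => list of hrefs found on that page

    while (!empty($queue) && count($visited) < $maxPages) {
        $uri = array_shift($queue);
        if (isset($visited[$uri])) {
            continue; // already crawled, avoids endless loops
        }
        $markup = call_user_func($fetch, $uri);
        // Simplified pattern: double-quoted href values inside <a ...> tags.
        preg_match_all('/<a[^>]+href="([^"]+)"/i', $markup, $matches);
        $visited[$uri] = array_values(array_unique($matches[1]));

        foreach ($visited[$uri] as $href) {
            // Only absolute http(s) links are followed in this sketch.
            if (preg_match('#^https?://#i', $href) && !isset($visited[$href])) {
                $queue[] = $href;
            }
        }
    }
    return $visited;
}

// Offline stand-in for the web: three made-up pages that link to each other.
$pages = array(
    'http://a.example/' => '<a href="http://b.example/">B</a> <a href="/local">L</a>',
    'http://b.example/' => '<a href="http://a.example/">A</a> <a href="http://c.example/">C</a>',
    'http://c.example/' => '<p>No links here</p>',
);
$fetch = function ($uri) use ($pages) {
    return isset($pages[$uri]) ? $pages[$uri] : '';
};

print_r(crawlSite('http://a.example/', $fetch));
```

On a real site the $fetch callable would wrap file_get_contents() or cURL, and relative links such as /local would need to be resolved against the page URI before being queued.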
Thx for the crash course !
I got impressed as a beginner. thanks
Nice tutorial... love it...
Or to use Ruby Hpricot, which is far more robust than any PHP DOM parser I have run across :)
A faster, effective and more powerful solution in 5 minutes is to use the PHP DOM.
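For comparison, here is a minimal sketch of the DOM approach mentioned in this comment, using PHP's bundled DOMDocument class (it assumes the dom extension is enabled). The sample markup is made up. Unlike the tutorial's image pattern, this also picks up an <img> tag that is not self-closing.

```php
<?php
// Made-up sample markup; in the crawler this would be $this->markup.
$markup = '<html><body><a href="/about">About</a>'
        . '<img src="logo.png" alt="Logo">'
        . '<a href="http://example.com/">Example</a></body></html>';

// Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

// Collect the href of every anchor that has one.
$hrefs = array();
foreach ($dom->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('href')) {
        $hrefs[] = $anchor->getAttribute('href');
    }
}

// Collect the src of every image.
$sources = array();
foreach ($dom->getElementsByTagName('img') as $image) {
    $sources[] = $image->getAttribute('src');
}

print_r($hrefs);
print_r($sources);
```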
Just what I was looking for
Good article. It holds the basic foundation of a search engine spider. It would need enhancing to look for header tags and perhaps weight the results with scores, but all in all a very good tutorial.