Crawling the pages of a domain!

**degseb** · 25/05/2011, 16h08

Hello,

Just for fun, I'm trying to build a "sitemap generator", and to do that I need to crawl every page of a website.

Crawling a single page is no problem. Crawling the links found on that page, however, is already trickier, and that's where I'm a bit stuck...

Here is my main page:

Code:

<?php 
include("crawler.php"); 
$mycrawler=new Crawler(); 
$url='http://www.sebastiendegreve.com/'; 
// crawl the first page
$link=$mycrawler->crawlLinks($url); 
// crawl every link found on the first page
$link2=array('link'=>array(),'text'=>array());
foreach($link['link'] as $value) {
	// check that it is not an external link
	//if (preg_match($url,$value) > 0){
		// crawl the linked page once and keep the links it contains
		$found=$mycrawler->crawlLinks($value);
		if (!empty($found['link'])){
			$link2['link']=array_merge($link2['link'],$found['link']);
			$link2['text']=array_merge($link2['text'],$found['text']);
		}
	//}
}
 
echo "<table width=\"100%\" border=\"1\"> 
  <tr> 
    <td width=\"30%\"><div align=\"center\"><b>Link Text </b></div></td> 
    <td width=\"30%\"><div align=\"center\"><b>Link</b></div></td> 
    <td width=\"40%\"><div align=\"center\"><b>Text with Link</b> </div></td> 
  </tr>"; 
for($i=0;$i<sizeof($link['link']);$i++) 
{ 
echo "<tr> 
    <td><div align=\"center\">".$link['text'][$i]."</div></td> 
    <td><div align=\"center\">".$link['link'][$i]."</div></td> 
    <td><div align=\"center\"><a href=\"".$link['link'][$i]."\">".$link['text'][$i]."</a></div></td> 
  </tr>";         
 
}   
echo "</table>"; 
 
 
echo "<table width=\"100%\" border=\"1\"> 
  <tr> 
    <td width=\"30%\"><div align=\"center\"><b>Link Text </b></div></td> 
    <td width=\"30%\"><div align=\"center\"><b>Link</b></div></td> 
    <td width=\"40%\"><div align=\"center\"><b>Text with Link</b> </div></td> 
  </tr>"; 
for($i=0;$i<sizeof($link2['link']);$i++) 
{ 
echo "<tr> 
    <td><div align=\"center\">".$link2['text'][$i]."</div></td> 
    <td><div align=\"center\">".$link2['link'][$i]."</div></td> 
    <td><div align=\"center\"><a href=\"".$link2['link'][$i]."\">".$link2['text'][$i]."</a></div></td> 
  </tr>";         
 
}   
echo "</table>";
?>

and my functions page (crawler.php):

Code:

<?php 
class Crawler 
{ 
    private $curl; 
    function __construct() 
    {         
        $this->curl= curl_init(); 
    } 
    function getContent($url) 
    { 
        curl_setopt($this->curl, CURLOPT_URL, $url);     
        curl_setopt ($this->curl, CURLOPT_RETURNTRANSFER, 1); 
        $content=curl_exec ($this->curl);     
        return $content; 
    } 
 
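    function hasProtocol($url) 
    {             
        // strpos() returns 0 for protocol-relative links ("//host/..."), so test against false
        return strpos($url,"//") !== false;         
    } 
    // returns scheme and host only, e.g. "http://www.sebastiendegreve.com"
    function getDomain($url) 
    { 
        return parse_url($url,PHP_URL_SCHEME)."://".parse_url($url,PHP_URL_HOST); 
    } 
    function convertLink($domain,$url,$link) 
    { 
        if($this->hasProtocol($link)) 
        { 
            return $link; // already absolute
        }         
        elseif (($link=='#')||($link=="/")) 
        {             
            return $url;             
        }         
        else if(substr($link,0,1)=="/") 
        { 
            return $domain.$link; // root-relative link
        } 
        else  
        { 
            // relative link: resolve against the directory of the current page
            $path=(string)parse_url($url,PHP_URL_PATH); 
            $dir=substr($path,0,(int)strrpos($path,"/")); 
            return $domain.$dir."/".$link;             
        } 
    } 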
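    function crawlLinks($url) 
    { 
        // always return both keys, even when the page has no links
        $links=array('link'=>array(),'text'=>array()); 
        $content=$this->getContent($url); 
        if($content===false || $content==='') 
        { 
            return $links; // download failed or empty page
        } 
        $domain=$this->getDomain($url); 
        $dom = new DOMDocument(); 
        @$dom->loadHTML($content);         
        $xpath = new DOMXPath($dom); 
        $hrefs = $xpath->evaluate("//a[@href]"); // only anchors that carry an href
        for ($i = 0; $i < $hrefs->length; $i++)  
        { 
            $href = $hrefs->item($i);                                    
            $links['link'][$i]=$this->convertLink($domain,$url,$href->getAttribute('href')); 
            $links['text'][$i]=$href->nodeValue;         
        } 
        return  $links;   
    }  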
} 
?>

As you can see, I commented out a few lines in my main page: I wanted to keep the crawler from following external links, but I realized that preg_match did not allow backslashes o_O.
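
On the preg_match point: the failure probably comes from missing pattern delimiters rather than from the characters in the URL, since PCRE patterns in PHP must be wrapped in delimiters and a raw URL begins with a letter. Below is a minimal sketch of two ways to test whether a link is internal. The helper names `isInternalByRegex` and `isInternalByHost` are hypothetical and are not part of the Crawler class.

```php
<?php
// Hypothetical helpers for illustration; neither exists in crawler.php.

// Variant 1: keep preg_match, but quote the URL and wrap it in '#' delimiters.
function isInternalByRegex($baseUrl, $link)
{
    // preg_quote() escapes regex metacharacters and the chosen delimiter.
    $pattern = '#^' . preg_quote($baseUrl, '#') . '#i';
    return preg_match($pattern, $link) === 1;
}

// Variant 2: compare host names, which avoids regular expressions entirely.
function isInternalByHost($baseUrl, $link)
{
    $baseHost = parse_url($baseUrl, PHP_URL_HOST);
    $linkHost = parse_url($link, PHP_URL_HOST);
    // parse_url() returns null or false when no host can be extracted.
    return is_string($baseHost) && is_string($linkHost)
        && strcasecmp($baseHost, $linkHost) === 0;
}

$base = 'http://www.sebastiendegreve.com/';
var_dump(isInternalByRegex($base, 'http://www.sebastiendegreve.com/contact.php'));
var_dump(isInternalByHost($base, 'http://www.example.org/'));
```

The host comparison is usually the more robust of the two, because it does not depend on the base URL having a particular path or trailing slash.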

How can I recursively crawl all the pages of my domain, without leaving it and without crawling the same links more than once?
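
One common way to approach this is a queue of pages still to fetch plus a set of URLs already seen, with a host check before anything is queued. The sketch below is only an illustration under stated assumptions: `crawlSite`, its `$fetchLinks` callback and the `$maxPages` limit are hypothetical names, and the callback is assumed to return absolute URLs for a given page.

```php
<?php
// Hypothetical sketch: crawlSite() and $fetchLinks are illustrative names.

/**
 * Breadth-first crawl restricted to the host of $startUrl.
 * $fetchLinks is any callable that takes a URL and returns an array
 * of absolute URLs found on that page.
 */
function crawlSite($startUrl, $fetchLinks, $maxPages = 500)
{
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $queue   = array($startUrl);          // pages still to fetch
    $visited = array($startUrl => true);  // URLs already queued, used as a set
    $pages   = array();                   // pages fetched, in crawl order

    while ($queue && count($pages) < $maxPages) {
        $url = array_shift($queue);
        $pages[] = $url;

        foreach (call_user_func($fetchLinks, $url) as $link) {
            // Drop the fragment so page.php#top and page.php count as one page.
            $link = preg_replace('/#.*$/', '', $link);
            if ($link === '' || isset($visited[$link])) {
                continue; // empty or already seen
            }
            if (parse_url($link, PHP_URL_HOST) !== $host) {
                continue; // external link: stay on the start domain
            }
            $visited[$link] = true;
            $queue[] = $link;
        }
    }
    return $pages;
}

// Usage with an in-memory fake site, so the sketch runs without network access.
$fakeSite = array(
    'http://example.test/'  => array('http://example.test/a', 'http://other.test/'),
    'http://example.test/a' => array('http://example.test/', 'http://example.test/b#top'),
    'http://example.test/b' => array(),
);
$fetch = function ($url) use ($fakeSite) {
    return isset($fakeSite[$url]) ? $fakeSite[$url] : array();
};
print_r(crawlSite('http://example.test/', $fetch));
```

With the Crawler class, the callback could wrap `crawlLinks()` and return its `'link'` array. A loop with an explicit queue is used here instead of true recursion so that a large site cannot exhaust the call stack, and `$maxPages` keeps a test run bounded.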

Thanks for your help!

PS: you can see the result of crawling the index of sebastiendegreve.com here: sebastiendegreve.com/crawler2/exemple.php

PS2: the table output is only there for the testing phase.
