<?php
header('Content-Type: text/html; charset=UTF-8');
//<!--meta http-equiv="Content-Type" content="text/html; charset=utf-8" /-->
set_time_limit(0);
$sUrl = 'http://www.echoroukonline.com/ara/editorial/index.1.html';
$sUrlSrc = getWebsiteContent($sUrl,0);
// Load the source
// DOMDocument's constructor takes (version, encoding), so the version must come first
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($sUrlSrc);
$xpath = new DomXPath($dom);
// =================================== step 1 - links:
$vRes = $xpath->query("/html/body/div/div[2]/div/div[2]/div[4]/div/div/div/h2/a");
// =================================== step 2 - follow the first link and load the article:
$aLinks = $vRes->item(0)->getAttribute("href");
echo "<br />aLinks : ".$aLinks."<br />";
$sUrl2 = 'http://www.echoroukonline.com/ara/'.$aLinks;
echo "<br />sUrl2 : ".$sUrl2."<br />";
$sUrlSrc2 = getWebsiteContent($sUrl2,1);
@$dom->loadHTML($sUrlSrc2);
$xpath = new DomXPath($dom);
// =================================== step 3 - titles:
$vRes = $xpath->query(".//*[@id='article_holder']/h1");
$aTitles= $vRes->item(0)->nodeValue;
// =================================== step 4 - Metas:
$vRes = $xpath->query(".//*[@class='article_metadata']");
$aMetas= $vRes->item(0)->nodeValue;
//==================================== step 5 - descriptions:
// The XPath expression is plain ASCII, so no utf8_encode() is needed (it is deprecated as of PHP 8.2)
$vRes = $xpath->query(".//*[@id='article_body']");
$aDescriptions= $vRes->item(0)->nodeValue;
//=============================
echo '<link href="css/styles.css" type="text/css" rel="stylesheet"/><div class="main">';
echo '<h1>Using xpath for dom html</h1>';
//echo "<br />".$aTitles."<br />".$aMetas."<br />".$aDescriptions."<br />";
echo "
<div class='unit'>
<a href='{$sUrl2}'>{$aTitles}</a>
<div>{$aMetas}</div>
<div>{$aDescriptions}</div>
</div>";
echo '</div>';
// this function will return page content using caches (we will load original sources not more than once per hour)
function getWebsiteContent($sUrl,$f=0) {
// our folder with cache files
$sCacheFolder = 'cache';
if(!is_dir($sCacheFolder)){
mkdir($sCacheFolder,0777);
}
// cache filename: hour-granular timestamp, so each source is fetched at most once per hour
if ($f == 0) {
$sFilename = 'ech-'.date('YmdH').'.html';
} else {
$sFilename = 'eftch-'.date('YmdH').'.html';
}
if (!file_exists($sCacheFolder."/".$sFilename)) {
$ch = curl_init($sUrl);
$fp = fopen($sCacheFolder."/".$sFilename, 'w');
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_HTTPHEADER, Array('User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15'));
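// perform the request; without curl_exec() nothing is downloaded and the cache file stays empty
curl_exec($ch);
curl_close($ch);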
fclose($fp);
}
//return file_get_contents($sCacheFolder.$sFilename);
return file_get_contents_utf8($sCacheFolder."/".$sFilename);
}
function file_get_contents_utf8($fn) {
$content = file_get_contents($fn);
return mb_convert_encoding($content, 'UTF-8',
mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)
);
}
?>