Parsing Web Pages with PHP Simple HTML DOM Parser

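For those of you fortunate enough to follow me on Twitter (…), you may know that I'm a complete football (soccer) fanatic. I even started a separate Twitter account to voice my thoughts. If you follow football yourself, you know that the international transfer window has just opened and there are a billion rumors about a billion players going to a billion clubs. It's enough to drive you mad, but I simply must know who will be in the Arsenal and Liverpool first teams next season.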

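Beyond all of the rubbish reports, my problem is that I don't have time to check every website every hour. Twitter is a big help, but during this period nothing beats an official report from each club's website. To keep an eye on those reports, I used the power of PHP Simple HTML DOM Parser to write a small PHP script that sends me an email whenever a specific page is updated.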

PHP Simple HTML DOM Parser

PHP Simple HTML DOM Parser is a dream utility for developers who work with both PHP and the DOM, because it makes finding DOM elements from PHP easy. Here are a few examples of PHP Simple HTML DOM Parser in use:

// Include the library
include('simple_html_dom.php');
 
// Retrieve the DOM from a given URL
$html = file_get_html('https://davidwalsh.name/');

// Find all "A" tags and print their HREFs
foreach($html->find('a') as $e) 
    echo $e->href . '<br>';

// Retrieve all images and print their SRCs
foreach($html->find('img') as $e)
    echo $e->src . '<br>';

// Find all images, print their text with the "<>" included
foreach($html->find('img') as $e)
    echo $e->outertext . '<br>';

// Find the DIV tag with an id of "myId"
foreach($html->find('div#myId') as $e)
    echo $e->innertext . '<br>';

// Find all SPAN tags that have a class of "myClass"
foreach($html->find('span.myClass') as $e)
    echo $e->outertext . '<br>';

// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
    echo $e->innertext . '<br>';
    
// Extract all text from the second matching cell (the index is zero-based)
echo $html->find('td[align="center"]', 1)->plaintext.'<br><hr>';

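Like I said earlier, this library is a dream for finding elements, much like the early JavaScript frameworks and selector engines. With the ability to pick content out of DOM nodes using PHP, it's time to analyze websites for changes.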
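The change check itself comes down to comparing freshly selected text with a saved snapshot. Here is a minimal sketch of that step on its own, with no parser involved. The helper name `contentChanged`, the cache file location, and the sample strings are all made up for illustration.

```php
<?php
// Hypothetical helper: report whether $content differs from the snapshot
// stored in $cacheFile, then store $content as the new snapshot.
// Returns false on the first run, when there is nothing to compare against.
function contentChanged(string $cacheFile, string $content): bool {
    $old = is_file($cacheFile) ? file_get_contents($cacheFile) : null;
    file_put_contents($cacheFile, $content);
    return $old !== null && $old !== $content;
}

// Usage with a throwaway cache file in the system temp directory
$cacheFile = sys_get_temp_dir() . '/snapshot_' . md5('http://example.com/') . '.txt';
@unlink($cacheFile); // start from a clean state

var_dump(contentChanged($cacheFile, 'Squad: A, B'));    // first run, nothing cached yet
var_dump(contentChanged($cacheFile, 'Squad: A, B'));    // same text as the snapshot
var_dump(contentChanged($cacheFile, 'Squad: A, B, C')); // text differs from the snapshot
```

Keying the cache file on an MD5 of the URL gives every monitored page its own snapshot without having to sanitize the URL into a file name.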

The Script

The following script checks two websites for changes:

// Pull in PHP Simple HTML DOM Parser
include("simplehtmldom/simple_html_dom.php");

// Settings on top
$sitesToCheck = array(
					// "selector" picks the element whose text is compared between runs
					array("url" => "http://www.arsenal.com/first-team/players", "selector" => "#squad"),
					array("url" => "http://www.liverpoolfc.tv/news", "selector" => "ul[style='height:400px;']")
				);
$savePath = "cachedPages/";
$emailContent = "";

// For every page to check...
foreach($sitesToCheck as $site) {
	$url = $site["url"];
	
	// Calculate the cachedPage name; reset old and current content for this site
	$fileName = md5($url);
	$oldContent = "";
	$currentContent = "";
	
	// Get the URL's current page content; skip this site if the fetch failed
	$html = file_get_html($url);
	if(!$html) {
		continue;
	}
	
	// Find content by querying with a selector, just like a selector engine!
	foreach($html->find($site["selector"]) as $element) {
		$currentContent = $element->plaintext;
	}
	
	// If a cached file exists
	if(file_exists($savePath.$fileName)) {
		// Retrieve the old content
		$oldContent = file_get_contents($savePath.$fileName);
	}
	
	// If different, notify!
	if($oldContent && $currentContent != $oldContent) {
		// Here's where we can do a whoooooooooooooole lotta stuff
		// We could tweet to an address
		// We can send a simple email
		// We can text ourselves
		
		// Build simple email content; append so every changed page lands in one digest
		$emailContent .= "David, the following page has changed!\n\n".$url."\n\n";
	}
	
	// Save new content
	file_put_contents($savePath.$fileName,$currentContent);
}

// Send the email if there's content!
if($emailContent) {
	// Sendmail! (the fifth argument of mail() takes extra sendmail parameters, not a line ending, so it is omitted)
	mail("david@davidwalsh.name","Sites Have Changed!",$emailContent,"From: alerts@davidwalsh.name");
	// Debug
	echo $emailContent;
}

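The code and comments are self-explanatory. I've set the script up so that I get one "digest" alert if many pages change. The script is the hard part; to run it, I've set up a CRON job that executes the script every 20 minutes.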
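A crontab entry for a 20-minute schedule might look like the following. The PHP binary path and the script location are placeholders, not taken from the original setup.

```
# Hypothetical paths: adjust the PHP binary and script location for your server
*/20 * * * * /usr/bin/php /path/to/check-sites.php
```

The `*/20` in the minute field fires at :00, :20 and :40 of every hour.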

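This solution isn't specific to spying on footy; you could use this type of script on any number of websites. The script is, however, a bit simplistic. If you want to watch a website with extremely dynamic markup (for example, a timestamp in the code), you may need a regular expression that isolates the content to just the block you're looking for. Since every website is built differently, I'll leave it to you to create page-specific isolators. Have fun spying on websites though... and be sure to let me know if you hear a good, reliable footy rumor!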
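As a rough sketch of such an isolator, the helper below keeps only the part of the page text captured by a regular expression, so a timestamp elsewhere on the page no longer triggers an alert. The function name `isolateBlock`, the pattern, and the sample snapshots are invented for illustration; a real page needs its own pattern.

```php
<?php
// Hypothetical helper: return only the part of $content captured by group 1
// of $pattern, or null when the pattern does not match.
function isolateBlock(string $content, string $pattern): ?string {
    if (preg_match($pattern, $content, $matches) === 1 && isset($matches[1])) {
        return trim($matches[1]);
    }
    return null;
}

// Two made-up snapshots that differ only in a timestamp outside the block of interest
$before  = "Generated 10:15:02 | Squad: Player A, Player B | Footer";
$after   = "Generated 10:35:47 | Squad: Player A, Player B | Footer";
$pattern = '/Squad:(.*?)\|/s';

// Compare the isolated blocks instead of the full text
$sameBlock = isolateBlock($before, $pattern) === isolateBlock($after, $pattern);
var_dump($sameBlock);
```

In the monitoring script, the isolated block would take the place of the raw text before it is compared and cached. Treating a `null` result as "no match" rather than as empty content makes it possible to tell a layout change apart from a genuinely empty block.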
