Question

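I'm building a tool for my websites to check where they rank on Google for different keywords.

Right now, I want to extract this part of the page source: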

<a href="http://www.test.com/" class=l onmousedown="return clk(this.href,'','','','1','','0CBoQFjAA')">Linktitle in Google!</a>

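The problem is that preg_match and preg_match_all will not match "onmousedown", "this.href", or the '1' part of the link. And that is exactly the part I need...

Does anyone know why this happens and, more importantly, how to fix it?

The code I'm using is straightforward. I even tried "/onmousedown/" and "/\'1\'/", but that didn't help.

Thanks a lot!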

Answer 1

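Apart from the ethical and possibly legal implications of scraping Google, you shouldn't use regular expressions to pull parts out of HTML. Regular expressions aren't designed to parse HTML and have no knowledge of its syntax.

Try an HTML parser such as DOMDocument instead. It is designed to parse HTML/XML.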
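As a rough illustration of the DOMDocument route, here is a minimal sketch that runs against the snippet from the question. The helper name extractResults() is made up for this example, and it assumes that the result links carry class "l" and that the position is the fifth argument of clk(...), exactly as in the posted markup. Real pages may differ.

```php
<?php
// Sample markup copied from the question; fetching the page is out of scope here.
$html = <<<'HTML'
<html><body><a href="http://www.test.com/" class=l onmousedown="return clk(this.href,'','','','1','','0CBoQFjAA')">Linktitle in Google!</a></body></html>
HTML;

/**
 * Return href, title and position for every <a class="l"> in $html.
 * extractResults() is a hypothetical helper, not part of any library.
 */
function extractResults($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        // Assumption: the class attribute is exactly "l", as in the question.
        if ($a->getAttribute('class') !== 'l') {
            continue;
        }
        // Assumption: the position is the fifth comma-separated argument of clk(...).
        $args = explode(',', $a->getAttribute('onmousedown'));
        $position = isset($args[4]) ? trim($args[4], "'") : null;

        $results[] = array(
            'href'     => $a->getAttribute('href'),
            'title'    => $a->textContent,
            'position' => $position,
        );
    }
    return $results;
}

print_r(extractResults($html));
```

Because the parser reads attributes by name, this does not care whether href comes before or after onmousedown, or whether class=l is quoted.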

Answer 2

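According to Google, you are not allowed to scrape their site.

Their robots.txt is here: http://www.google.com/robots.txt.

That said, it's a bit hypocritical, given that the company's entire business model is scraping other people's sites.

Consider yourself warned.

The regex is simple: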

<a [^<]*class=l.*?</a>

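Now, for the people who claim you can't parse HTML with regex... yes, you're right, you can't parse HTML with regex. But let's not get ridiculous here.

Extracting a specific block of text with a known format from an HTML page is absolutely possible (and easy) with regex. That is what regex is for.

This is not "parsing HTML", and in cases like this one, where the format is known and the application is not critical, regex is fine.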
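To make the regex route concrete, here is a small sketch that applies the pattern above with preg_match_all and then pulls the href and the position out of each match. The second and third patterns are my own additions and assume the clk(...) argument layout shown in the question. Note that the first pattern only matches an unquoted class=l.

```php
<?php
// Sample markup copied from the question, wrapped in a list item for context.
$html = <<<'HTML'
<li><a href="http://www.test.com/" class=l onmousedown="return clk(this.href,'','','','1','','0CBoQFjAA')">Linktitle in Google!</a></li>
HTML;

// Step 1: the pattern from this answer. "~" is used as the delimiter so that
// the "/" in </a> needs no escaping; "s" lets "." span line breaks.
preg_match_all('~<a [^<]*class=l.*?</a>~s', $html, $anchors);

$hits = array();
foreach ($anchors[0] as $anchor) {
    // Step 2: extract the href value and the fifth clk() argument.
    // Assumption: clk() always has the argument layout shown in the question.
    $hasHref = preg_match('~href="([^"]*)"~', $anchor, $href);
    $hasPos  = preg_match("~clk\\(this\\.href,'[^']*','[^']*','[^']*','(\\d+)'~", $anchor, $pos);
    if ($hasHref && $hasPos) {
        $hits[] = array('href' => $href[1], 'position' => (int) $pos[1]);
    }
}

print_r($hits);
```

These patterns do match the snippet as posted, so if they fail on a fetched page, it is worth dumping the raw HTML your script receives and comparing it with what the browser shows.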

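I just checked: Google offers an API that lets you make 100 free queries against a Custom Search Engine. http://www.google.com/cse/ https://code.google.com/apis/console/?api=customsearch&pli=1#welcome

It requires a Google account and an API key, which you can get through the links above.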
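If you go the API route, the lookup could look roughly like the sketch below. It assumes the Custom Search JSON API endpoint with its key, cx and q parameters and a JSON response whose items each have a link field. Both helper names are invented for this example, and the sample response is hand-written rather than real API output. Keep in mind that positions in a custom search engine may not match the rankings on google.com.

```php
<?php
/**
 * Build a Custom Search JSON API request URL (hypothetical helper).
 * $apiKey and $engineId are placeholders for values from your own Google account.
 */
function buildSearchUrl($apiKey, $engineId, $keyword)
{
    return 'https://www.googleapis.com/customsearch/v1?' . http_build_query(array(
        'key' => $apiKey,
        'cx'  => $engineId,
        'q'   => $keyword,
    ));
}

/**
 * Return the 1-based position of the first result whose host equals $host,
 * or -1 if it does not appear in this page of results (hypothetical helper).
 */
function findPosition(array $response, $host)
{
    $items = isset($response['items']) ? $response['items'] : array();
    foreach ($items as $i => $item) {
        if (parse_url($item['link'], PHP_URL_HOST) === $host) {
            return $i + 1;
        }
    }
    return -1;
}

// Hand-written stand-in for json_decode($body, true); not real API output.
$sample = array('items' => array(
    array('link' => 'http://example.org/page'),
    array('link' => 'http://www.test.com/'),
));

echo findPosition($sample, 'www.test.com'), "\n";
```

In a real script you would fetch buildSearchUrl(...) with your HTTP client of choice and pass the decoded body to findPosition().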

Warning: wading through the legalese is much harder than writing your scraper.

Answer 3

Use this code to parse the anchor tag of a Google search result:

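function parseAnchor($strAnchor)
{
    // Example input. The attribute order matters: onmousedown, class, href.
    //$strAnchor = "<a onmousedown=\"return clk(this.href,'','','','2','','0CBwQFjAB')\" class=\"l\" href=\"http://php.net/manual/en/function.strpos.php\"><em>PHP</em>: strpos - Manual</a>";

    // Split on spaces. With the attribute order above, part 2 holds the
    // clk(...) call and part 4 holds the href attribute.
    $str_parts = explode(" ", $strAnchor);

    // The link is the text between the first and last double quote of part 4.
    $start_index = strpos($str_parts[4], "\"");
    $length = strrpos($str_parts[4], "\"") - $start_index;
    $link = substr($str_parts[4], $start_index + 1, $length - 1);
    print $link; // prints the link

    // Now get the position: the fifth comma-separated argument of clk(...).
    $onmousedown_parts = explode(",", $str_parts[2]);
    $position = trim($onmousedown_parts[4], "'");
    print "<br>$position"; // prints the position
}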

Try this for parsing HTML pages:

http://simplehtmldom.sourceforge.net/

Answer 4

Use this PHP class: Google Keyword Position

http://www.phpclasses.org/package/5554-PHP-Determine-the-position-of-a-keyword-in-Google.html

Example:

File: google_position.php

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
    <title>Google Keyword Position</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body >
  <form name="url_kw" action="search.php" method="get">
      <label for="url">URL:</label>
      <input type="text" name="url" id="url" size="55" value="<?= htmlspecialchars(isset($_GET['url']) ? $_GET['url'] : 'http://') ?>" />
      <br />
      <label for="keyword">Keyword:</label>
      <input type="text" name="keyword" id="keyword" size="35" value="<?= htmlspecialchars(isset($_GET['keyword']) ? $_GET['keyword'] : '') ?>" />
      <br />
      <input type="submit" name="submit_button" value="SEARCH" onclick="this.value='Searching...';" />
      <input type="button" value="CANCEL" onclick="window.location=<?= htmlspecialchars(json_encode(isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : ''), ENT_QUOTES) ?>;" />
      <br />
  </form>
</body>
</html>

File: search.php

<?php
include('KeywordPosition.php');

// You can change the 10 to 100 to get more results :)
$position = new KeywordPosition($_GET['url'], $_GET['keyword'], 10);
$index = $position->GetPosition();

// -1 means the URL was not found in the results.
if ($index == -1) {
    echo 'Not in search results';
} else {
    echo 'You are at ' . $index;
}
?>

Live example @ http://x.co/Z493

preg_match(_all) fails to collect some data from Google
