Question

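I have just started learning how to use regular expressions to extract data from websites. My first goal is to extract a site's title. Here is my code: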

<?php 
    $data = file_get_contents('http://bctia.org');
    $regex = '/<title>(.+?)<\/title>/';
    preg_match($regex,$data,$match);
    var_dump($match); 
?>

The result of var_dump is empty:

array(0) { }

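At first I thought, "Maybe bctia.org has no title?" But that is not the case: I checked the source of bctia.org, and there is definitely content between <title> and </title>.

Then I thought, maybe my code doesn't work? But that is not the case either: when I replace bctia.org with other sites such as bing.com or apple.com, they all return the correct result. For example, with apple.com I get the correct result: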

array(2) { [0]=> string(20) "<title>Apple</title>" [1]=> string(5) "Apple" }

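So I have to conclude that bctia.org is a very special website that prevents me from extracting its title...

Is that really the case? Or does my code have a problem that I haven't spotted?

Thanks in advance!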

Answer 1

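The server-side code of this particular site assumes that the client sends a User-Agent header, and apparently your PHP installation is not configured to send one. The server therefore responds with a 500 Internal Server Error, which causes file_get_contents to return false.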

Source Error:
Line 66: //LOAD: Compatibility Mode
Line 67: //<meta http-equiv="X-UA-Compatible" content="IE=7,IE=9" />
Line 68: string BrowserOS = Request.ServerVariables["HTTP_USER_AGENT"].ToString();
Line 69: HtmlMeta compMode = new HtmlMeta();
Line 70: compMode.Content = "IE=7,IE=9";


Source File: c:\inetpub\wwwroot\BCTIA\Website\bctia\layouts\Main Layout.aspx.cs   
Line: 68

Stack Trace:
[NullReferenceException: Object reference not set to an instance of an object.]
   Layouts.Main_Layout.Page_Load(Object sender, EventArgs e) in c:\inetpub\wwwroot\BCTIA\Website\bctia\layouts\Main Layout.aspx.cs:68
   System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr fp, Object o, Object t, EventArgs e) +24
   System.Web.UI.Control.LoadRecursive() +70
   System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) +3063

To fix this, simply set a user agent string before making the request:

ini_set('user_agent', 'Mozilla/5.0 (compatible; Examplebot/0.1; +http://www.example.com/bot.html)');
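
As an alternative to changing the ini setting globally, the header can be set for a single request through a stream context. The following is only a minimal sketch: the function name fetch_with_user_agent and the 10-second timeout are assumptions made for illustration and are not part of the original code. It also returns null when the request fails, so that a failed download is not silently passed on to preg_match.

```php
<?php
// Hypothetical helper (not from the original question): fetch a URL while
// sending an explicit User-Agent header, and report failure as null.
function fetch_with_user_agent(string $url, string $userAgent): ?string
{
    // 'user_agent' and 'timeout' are HTTP context options of PHP's
    // http:// stream wrapper; they are ignored for other wrappers.
    $context = stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; an arbitrary choice for this sketch
        ],
    ]);

    // file_get_contents returns false on failure (for example on an HTTP 500),
    // so the result is checked before it is handed back to the caller.
    $data = @file_get_contents($url, false, $context);

    return $data === false ? null : $data;
}
```

Calling fetch_with_user_agent('http://bctia.org', 'Mozilla/5.0 (compatible; Examplebot/0.1; +http://www.example.com/bot.html)') should then return the page body, assuming the missing header was the only problem; a null result means the request itself failed and there is nothing to run the regex on.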

Answer 2

Don't use regular expressions!

Use XPath instead. See: xpath

Regular expressions won't work well for this.
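
A rough sketch of what the XPath approach could look like with PHP's bundled DOM extension. The function name extract_title_xpath is made up for this example, and it assumes the HTML has already been downloaded into a string:

```php
<?php
// Hypothetical helper: return the text of the first <title> element,
// or null when the document has no usable title.
function extract_title_xpath(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML rejects empty input
    }

    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings internally
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    // string(//title) yields the text of the first <title> node,
    // or an empty string if there is none.
    $xpath = new DOMXPath($doc);
    $title = trim($xpath->evaluate('string(//title)'));

    return $title === '' ? null : $title;
}
```

Because the parser works on the document tree rather than on raw text, line breaks inside the title or attributes on the tag do not need any special handling.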

Answer 3

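Parsing HTML with regular expressions is not a good approach, because you may be surprised by how loose its structure can be.

The reason your pattern doesn't work is that the dot does not match newlines.

If you want the dot to match newlines, use the s modifier at the end of the pattern, or don't use the dot at all: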

$regex = '/<title>(.+?)<\/title>/s';

or

$regex = '/<title>([^<]+)<\/title>/';

[^<] is a character class that matches any character except <. As you can see, you then no longer need a lazy quantifier: + instead of +?
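
To see the difference between the patterns, here is a small self-contained comparison. The sample HTML string is invented for the demonstration (a title that spans several lines) and is not taken from any real site:

```php
<?php
// Made-up sample input: the title text sits on its own line.
$html = "<head>\n<title>\n  Example Title\n</title>\n</head>";

$patterns = [
    'original'      => '/<title>(.+?)<\/title>/',   // dot stops at newlines
    's modifier'    => '/<title>(.+?)<\/title>/s',  // dot also matches newlines
    'negated class' => '/<title>([^<]+)<\/title>/', // anything except <
];

$results = [];
foreach ($patterns as $label => $regex) {
    // preg_match returns 1 on a match, 0 on no match, false on error.
    $results[$label] = preg_match($regex, $html, $match) === 1
        ? trim($match[1])
        : null;
}

var_dump($results);
```

With this input the original pattern finds nothing, while the other two both capture the title; trim() removes the surrounding line breaks and indentation from the captured text.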

Why can't some websites be scraped?