Extracting the title and abstract from a web page

Time: 2015-08-15 16:29:42

Tags: php string url meta

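I am trying to extract the title and abstract from an arXiv page, for example http://arxiv.org/abs/1207.0102. My code currently fetches the page with file_get_contents().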

When I run this code, I get this error:

  

Warning: file_get_contents(http://arxiv.org/abs/1207.0102): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in C:\wamp\www\mysite\index.php

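The problem does not occur when I try a different URL, http://www.washingtontimes.com/.

Does anyone know why this happens?

Also, is it possible to extract the abstract from this page?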

1 Answer:

Answer 0 (score: 0):

The site refuses requests that arrive with an empty user agent, as its response says.


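If you send a user agent such as "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko" with your request, it will work. For reference, this is the response the server returns when no user agent is sent: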

HTTP/1.1 403 Forbidden

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>403 Forbidden</title></head>
<body>
<h1>Access Denied</h1>

 <p>Sadly, your client does not supply a proper User-Agent,
 and is consequently excluded.</p>
 <p>We have an inordinate number of problems with automated scripts
 which do not supply a User-Agent, and violate the automated access
 guidelines posted at arxiv.org
 -- hence we now exclude them all.</p>
 <p>(In rare cases, we have found that accesses through proxy servers
 strip the User-Agent information. If this is the case, you need to contact
 the administrator of your proxy server to get it fixed.)</p>


<p>If you believe this determination to be in error, see
<b>http://arxiv.org/denied.html</b> for additional information.</p>
</body>
</html>
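
To make the fix concrete, here is a minimal sketch of how the request could be sent with a user agent and how the title and abstract might then be read from the returned HTML. The function names fetch_with_user_agent and extract_title_and_meta are hypothetical helpers, not part of the original code, and the meta tag names citation_title and citation_abstract mentioned below are an assumption about arXiv's markup, so check the page source before relying on them.

```php
<?php
/**
 * Hypothetical helper: fetch a page while sending an explicit User-Agent.
 * Returns the response body as a string, or false on failure.
 */
function fetch_with_user_agent($url, $userAgent)
{
    // 'user_agent' is a standard option of PHP's http stream context.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
        ],
    ]);
    return file_get_contents($url, false, $context);
}

/**
 * Hypothetical helper: collect the <title> text plus the content of the
 * named <meta> tags from an HTML string.
 */
function extract_title_and_meta($html, array $metaNames)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser errors silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $result = ['title' => null];
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $result['title'] = trim($titles->item(0)->textContent);
    }
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = $meta->getAttribute('name');
        if (in_array($name, $metaNames, true)) {
            $result[$name] = trim($meta->getAttribute('content'));
        }
    }
    return $result;
}
```

With these in place, something like extract_title_and_meta(fetch_with_user_agent('http://arxiv.org/abs/1207.0102', 'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko'), ['citation_title', 'citation_abstract']) should return an array holding the page title and, if those meta tags are present, the abstract. Check the fetch result for false before parsing, and keep arXiv's automated access guidelines in mind when scripting requests.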