I need to read three lines from a remote page using PHP. I'm using Jose Vega's code from here to read the title:
<?php
function get_title($url){
$str = file_get_contents($url);
if(strlen($str)>0){
$str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
preg_match("/\<title\>(.*)\<\/title\>/i",$str,$title); // ignore case
return $title[1];
}
}
//Example:
echo get_title("http://www.washingtontimes.com/");
?>
Given a URL, I want to extract the following information:
<title>TITLE HERE</title>
<meta property="end_date" content="Tue Aug 28 2018 03:59:59 GMT+0000 (UTC)" />
<meta property="start_date" content="Mon Aug 06 2018 04:00:00 GMT+0000 (UTC)" />
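Output: $title, $start, $end
Displayed as the title linked to the URL, followed by Start: ____, End: ____, preferably converted to a simple date.
Bonus question: how can I use this script to efficiently parse dozens of sites? The sites all increase numerically: index.php?id=103 index.php?id=104 index.php?id=105
Display: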
ID Title Start End
#103 TitleWithLink StartDate EndDate
#104 TitleWithLink StartDate EndDate
#105 TitleWithLink StartDate EndDate
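Answer 0 (score: 0)
Based on your question, I assume you want to read the meta tags. Part of the code I suggest below is taken from http://php.net/manual/en/function.get-meta-tags.php. It works on this SO page, so it should work on your pages as well. Of course, you will need to adapt it slightly for your task.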
function getUrlData($url, $raw=false) // $raw - enable for raw display
{
$result = false;
$contents = getUrlContents($url);
if (isset($contents) && is_string($contents))
{
$title = null;
$metaTags = null;
$metaProperties = null;
preg_match('/<title>([^>]*)<\/title>/si', $contents, $match );
if (isset($match) && is_array($match) && count($match) > 0)
{
$title = strip_tags($match[1]);
}
preg_match_all('/<[\s]*meta[\s]*(name|property)="?' . '([^>"]*)"?[\s]*' . 'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $contents, $match);
if (isset($match) && is_array($match) && count($match) == 4)
{
$originals = $match[0];
$names = $match[2];
$values = $match[3];
if (count($originals) == count($names) && count($names) == count($values))
{
$metaTags = array();
$metaProperties = $metaTags;
if ($raw) {
if (version_compare(PHP_VERSION, '5.4.0') == -1)
$flags = ENT_COMPAT;
else
$flags = ENT_COMPAT | ENT_HTML401;
}
for ($i=0, $limiti=count($names); $i < $limiti; $i++)
{
if ($match[1][$i] == 'name')
$meta_type = 'metaTags';
else
$meta_type = 'metaProperties';
if ($raw)
${$meta_type}[$names[$i]] = array (
'html' => htmlentities($originals[$i], $flags, 'UTF-8'),
'value' => $values[$i]
);
else
${$meta_type}[$names[$i]] = array (
'html' => $originals[$i],
'value' => $values[$i]
);
}
}
}
$result = array (
'title' => $title,
'metaTags' => $metaTags,
'metaProperties' => $metaProperties,
);
}
return $result;
}
function getUrlContents($url, $maximumRedirections = null, $currentRedirection = 0)
{
$result = false;
$contents = @file_get_contents($url);
// Check if we need to go somewhere else
if (isset($contents) && is_string($contents))
{
preg_match_all('/<[\s]*meta[\s]*http-equiv="?REFRESH"?' . '[\s]*content="?[0-9]*;[\s]*URL[\s]*=[\s]*([^>"]*)"?' . '[\s]*[\/]?[\s]*>/si', $contents, $match);
if (isset($match) && is_array($match) && count($match) == 2 && count($match[1]) == 1)
{
if (!isset($maximumRedirections) || $currentRedirection < $maximumRedirections)
{
return getUrlContents($match[1][0], $maximumRedirections, ++$currentRedirection);
}
$result = false;
}
else
{
$result = $contents;
}
}
return $result; // false when the redirect limit is exceeded, otherwise the page contents
}
$result = getUrlData('https://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html', true);
print_r($result);
The output is:
Array
(
[title] => file get contents - PHP - Read three lines of remote html - Stack Overflow
[metaTags] => Array
(
[viewport] => Array
(
[html] => <meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">
[value] => width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0
)
[twitter:card] => Array
(
[html] => <meta name="twitter:card" content="summary"/>
[value] => summary
)
[twitter:domain] => Array
(
[html] => <meta name="twitter:domain" content="stackoverflow.com"/>
[value] => stackoverflow.com
)
[twitter:app:country] => Array
(
[html] => <meta name="twitter:app:country" content="US" />
[value] => US
)
[twitter:app:name:iphone] => Array
(
[html] => <meta name="twitter:app:name:iphone" content="Stack Exchange iOS" />
[value] => Stack Exchange iOS
)
[twitter:app:id:iphone] => Array
(
[html] => <meta name="twitter:app:id:iphone" content="871299723" />
[value] => 871299723
)
[twitter:app:url:iphone] => Array
(
[html] => <meta name="twitter:app:url:iphone" content="se-zaphod://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html" />
[value] => se-zaphod://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html
)
[twitter:app:name:ipad] => Array
(
[html] => <meta name="twitter:app:name:ipad" content="Stack Exchange iOS" />
[value] => Stack Exchange iOS
)
[twitter:app:id:ipad] => Array
(
[html] => <meta name="twitter:app:id:ipad" content="871299723" />
[value] => 871299723
)
[twitter:app:url:ipad] => Array
(
[html] => <meta name="twitter:app:url:ipad" content="se-zaphod://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html" />
[value] => se-zaphod://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html
)
[twitter:app:name:googleplay] => Array
(
[html] => <meta name="twitter:app:name:googleplay" content="Stack Exchange Android">
[value] => Stack Exchange Android
)
[twitter:app:url:googleplay] => Array
(
[html] => <meta name="twitter:app:url:googleplay" content="http://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html">
[value] => http://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html
)
[twitter:app:id:googleplay] => Array
(
[html] => <meta name="twitter:app:id:googleplay" content="com.stackexchange.marvin">
[value] => com.stackexchange.marvin
)
)
[metaProperties] => Array
(
[og:url] => Array
(
[html] => <meta property="og:url" content="https://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html"/>
[value] => https://stackoverflow.com/questions/51939042/php-read-three-lines-of-remote-html
)
[og:site_name] => Array
(
[html] => <meta property="og:site_name" content="Stack Overflow" />
[value] => Stack Overflow
)
)
)
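Then, to actually use it for your purpose:
How can I use this script to efficiently parse dozens of sites? The sites all increase numerically. index.php?id=103 index.php?id=104 index.php?id=105
You need to:
- First, create an array containing your URLs: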
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h2>HTML Table</h2>
<table>
<tr>
<th >Id</th>
<th >Title</th>
<th >start_date</th>
<th>end_date</th>
</tr>
<?php
$urls=array(103=>'index.php?id=103',104=> 'index.php?id=104', 105=>'index.php?id=105');
- Then iterate over this array:
foreach($urls as $id=>$url):
- On each iteration, call the function getUrlData(), like this:
$result=getUrlData($url, true);
- Then retrieve the information you need, for example:
?><tr>
<td><?php echo $id; ?></td>
<td><?php echo $result['title']; ?></td>
<td><?php echo $result['metaProperties']['start_date']['value']; ?></td>
<td><?php echo $result['metaProperties']['end_date']['value']; ?></td>
</tr>
This builds one table row per URL. At the end of the process, you get the expected table:
<?php endforeach; ?>
</table></body>
</html>
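The table above prints the raw `content` values and hard-codes the URL list. As a minimal sketch of the two missing pieces (a numeric URL range and a simple date), the helpers below could be used. `build_urls()` and `simple_date()` are hypothetical names, not part of the code above, and `http://example.com/index.php` is a placeholder base address. Note that `file_get_contents()` needs a full URL to fetch a remote page; a bare `index.php?id=103` would be treated as a local path. The date format string assumes the exact layout shown in the question (`Tue Aug 28 2018 03:59:59 GMT+0000 (UTC)`).

```php
<?php
// Hypothetical helper: build an id => URL map for a numeric id range.
// $base is an assumed placeholder; substitute the real site address.
function build_urls(string $base, int $firstId, int $lastId): array
{
    $urls = [];
    foreach (range($firstId, $lastId) as $id) {
        $urls[$id] = $base . '?id=' . $id;
    }
    return $urls;
}

// Hypothetical helper: turn a date such as
// "Tue Aug 28 2018 03:59:59 GMT+0000 (UTC)" into a short form.
// Returns null when the string does not match the assumed layout.
function simple_date(string $raw, string $format = 'Y-m-d'): ?string
{
    // Drop the trailing "(UTC)" style suffix, which duplicates the offset.
    $clean = trim(preg_replace('/\s*\([^)]*\)\s*$/', '', $raw));
    // "\G\M\T" matches the literal text GMT; "O" matches an offset like +0000.
    $date = DateTimeImmutable::createFromFormat('D M d Y H:i:s \G\M\TO', $clean);
    return $date === false ? null : $date->format($format);
}

$urls = build_urls('http://example.com/index.php', 103, 105);
print_r($urls);
echo simple_date('Tue Aug 28 2018 03:59:59 GMT+0000 (UTC)'), "\n";
```

In the loop above, `$urls` would replace the hand-written array, and each date cell could be wrapped in `simple_date(...)` before it is echoed.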
Answer 1 (score: 0)
Well, you can solve the problem with the DomDocument class.
$doc = new \DomDocument();
$title = $start = $end = '';
if ($doc->loadHTMLFile($url)) {
// Get the title
$titles = $doc->getElementsByTagName('title');
if ($titles->length > 0) {
$title = $titles->item(0)->nodeValue;
}
// get meta elements
$xpath = new \DOMXPath($doc);
$ends = $xpath->query('//meta[@property="end_date"]');
if ($ends->length > 0) {
$end = $ends->item(0)->getAttribute('content');
}
$starts = $xpath->query('//meta[@property="start_date"]');
if ($starts->length > 0) {
$start = $starts->item(0)->getAttribute('content');
}
var_dump($title, $start, $end);
}
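Using the getElementsByTagName method of the DomDocument class, you can find the title element anywhere in the HTML of the given URL. Using the DOMXPath class, you can retrieve the specific meta elements you need. You do not need much code to find specific information in an HTML string.
The code shown above is untested.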
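For readers who want to try the DOM approach without a remote page, here is a minimal sketch that applies the same `getElementsByTagName` and `DOMXPath` calls to an HTML string. The function name `extract_event_info()` and the `$sample` markup are assumptions made for illustration; the sample reuses the three lines from the question. Parser warnings are silenced with `libxml_use_internal_errors()`, since real pages are often not valid HTML.

```php
<?php
// Hypothetical helper: extract the title and the start/end meta values
// from an HTML string. Missing values are returned as empty strings.
function extract_event_info(string $html): array
{
    $info = ['title' => '', 'start' => '', 'end' => ''];
    if ($html === '') {
        return $info;
    }

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return $info;
    }

    // Title: first <title> element in the document.
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $info['title'] = trim($titles->item(0)->nodeValue);
    }

    // Meta values: look up each property with an XPath attribute match.
    $xpath = new DOMXPath($doc);
    foreach (['start' => 'start_date', 'end' => 'end_date'] as $key => $property) {
        $nodes = $xpath->query('//meta[@property="' . $property . '"]');
        if ($nodes->length > 0) {
            $info[$key] = $nodes->item(0)->getAttribute('content');
        }
    }
    return $info;
}

$sample = <<<HTML
<html><head>
<title>TITLE HERE</title>
<meta property="end_date" content="Tue Aug 28 2018 03:59:59 GMT+0000 (UTC)" />
<meta property="start_date" content="Mon Aug 06 2018 04:00:00 GMT+0000 (UTC)" />
</head><body></body></html>
HTML;

print_r(extract_event_info($sample));
```

To use it on a remote page, the string could come from `file_get_contents($url)` after checking that the call did not return `false`.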