I'm trying to do a simple extraction, but I end up with unpredictable results.
I have this HTML:
<div class="thread" style="margin-bottom:25px;">
<div class="message">
<span class="profile">Suzy Creamcheese</span>
<span class="time">December 22, 2010 at 11:10 pm</span>
<div class="msgbody">
<div class="subject">New digs</div>
Hello thank you for trying our soap. <BR> Jim.
</div>
</div>
<div class="message reply">
<span class="profile">Lars Jörgenmeier</span>
<span class="time">December 22, 2010 at 11:45 pm</span>
<div class="msgbody">
I never sold you any soap.
</div>
</div>
</div>
I'm trying to extract the outertext of "msgbody", but only when "profile" equals a certain value. Something like this:
$contents = $html->find('.msgbody');
$elements = $html->find('.profile');
$length = sizeof($contents);
while($x != sizeof($elements)) {
$var = $elements[$x]->outertext;
//If profile = the right name
if ($var = $name) {
$text = $contents[$x]->outertext;
echo $text;
}
$x++;
}
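I'm getting text from the wrong profile rather than from the one associated with the name I need. Is there a way to pull the needed information with one line of code?
Something like: if span-profile = "correct name", then pull its div-msgbody.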
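Answer 0 (score: 3)
OK, I'd use DOMXPath for this. I'm not sure what "outertext" means, but I'll go with this requirement:
If span-profile = "correct name", then pull its div-msgbody.
First, here is the minimized HTML test case I used: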
<html>
<body>
<div class="thread" style="margin-bottom:25px;">
<div class="message">
<span class="profile">Suzy Creamcheese</span>
<span class="time">December 22, 2010 at 11:10 pm</span>
<div class="msgbody">
<div class="subject">New digs</div>
Hello thank you for trying our soap. <BR> Jim.
</div>
</div>
<div class="message reply">
<span class="profile">Lars Jörgenmeier</span>
<span class="time">December 22, 2010 at 11:45 pm</span>
<div class="msgbody">
I never sold you any soap.
</div>
</div>
</div>
</body>
</html>
So we'll write an XPath query for this. Let's show the whole thing and then break it down:
$messages = $xpath->query("//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']");
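Breakdown:
//span
Give me the spans.
//span[@class='profile']
Give me the spans whose class is profile.
//span[@class='profile' and contains(.,'$profile_name')]
Give me the spans whose class is profile and whose inner text contains $profile_name, i.e. the name you're after.
//span[@class='profile' and contains(.,'$profile_name')]/../
Give me the spans whose class is profile and whose inner text contains $profile_name, i.e. the name you're after. Now go up one level, which takes us to <div class="message">.
//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']
Give me the spans whose class is profile and whose inner text contains $profile_name, i.e. the name you're after. Now go up one level, which takes us to <div class="message">, and finally give me all divs under <div class="message"> whose class is msgbody.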
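To see the breakdown in action, here is a minimal sketch that runs each stage of the query against an inline copy of the test case and reports how many nodes match. The inline markup, the ASCII spelling of the second name, and the variable names `$steps` and `$counts` are assumptions made for illustration; the trailing slash of the fourth step is dropped because a standalone XPath expression cannot end in `/`.

```php
<?php
// Inline copy of the test case (ASCII-only names to sidestep encoding questions).
$html = <<<HTML
<html><body>
<div class="thread">
  <div class="message">
    <span class="profile">Suzy Creamcheese</span>
    <span class="time">December 22, 2010 at 11:10 pm</span>
    <div class="msgbody"><div class="subject">New digs</div>Hello thank you for trying our soap.</div>
  </div>
  <div class="message reply">
    <span class="profile">Lars Jorgenmeier</span>
    <span class="time">December 22, 2010 at 11:45 pm</span>
    <div class="msgbody">I never sold you any soap.</div>
  </div>
</div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$profile_name = 'Lars';

// Each entry adds one more piece of the final query.
$steps = [
    "//span",
    "//span[@class='profile']",
    "//span[@class='profile' and contains(.,'$profile_name')]",
    "//span[@class='profile' and contains(.,'$profile_name')]/..",
    "//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']",
];

$counts = [];
foreach ($steps as $step) {
    $nodes = $xpath->query($step);
    $counts[] = $nodes->length;
    // Report the match count and the tag name of the first matched node.
    echo $step, "\n  matches: ", $nodes->length,
         ", first node: <", $nodes->item(0)->nodeName, ">\n";
}
```

With this sample the match counts narrow from 4 spans, to 2 profile spans, to the 1 span containing the name, then to its 1 parent div and finally the 1 msgbody div.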
Now, here is an example of the PHP code:
$doc = new DOMDocument();
$doc->loadHTMLFile("test.html");
$xpath = new DOMXpath($doc);
$profile_name = 'Lars Jörgenmeier';
$messages = $xpath->query("//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']");
foreach ($messages as $message) {
echo trim("{$message->nodeValue}") . "\n";
}
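Two details of the snippet above may matter in practice. First, contains() is a substring test, so a name like 'Suzy' would match every profile that includes it. Second, nodeValue returns only the text, whereas outertext in Simple HTML DOM returns the element's markup. The following is a rough sketch of one way to handle both, assuming an exact match is wanted; the helper name `find_message_bodies` and the inline sample markup are hypothetical and not part of the answer.

```php
<?php
/**
 * Hypothetical helper: returns the outer HTML of every msgbody div that
 * shares a parent with a profile span whose text equals $profile_name.
 */
function find_message_bodies(DOMDocument $doc, string $profile_name): array
{
    // Assumption: the name contains no single quote, because it is
    // interpolated directly into a single-quoted XPath string literal.
    if (strpos($profile_name, "'") !== false) {
        throw new InvalidArgumentException('Name must not contain a single quote.');
    }

    $xpath = new DOMXPath($doc);
    // normalize-space(.) = '...' compares the whole trimmed text,
    // unlike contains(), which also accepts substrings.
    $query = "//span[@class='profile' and normalize-space(.)='$profile_name']"
           . "/../div[@class='msgbody']";

    $bodies = [];
    foreach ($xpath->query($query) as $node) {
        // saveHTML($node) serializes the node itself together with its children.
        $bodies[] = $doc->saveHTML($node);
    }
    return $bodies;
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body><div class="thread">'
    . '<div class="message"><span class="profile">Suzy Creamcheese</span>'
    . '<div class="msgbody">New digs</div></div>'
    . '<div class="message"><span class="profile">Suzy Yogurt</span>'
    . '<div class="msgbody">No Creamcheese</div></div>'
    . '</div></body></html>'
);

$exact   = find_message_bodies($doc, 'Suzy Creamcheese'); // one match expected
$partial = find_message_bodies($doc, 'Suzy');             // no exact match expected
echo implode("\n", $exact), "\n";
```

Whether exact or substring matching is the better fit depends on how the names appear in the real page, so treat the choice here as an assumption rather than a recommendation.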
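XPath is very powerful. I recommend looking at a basic tutorial; after that, you can check out the XPath standard if you want to see more advanced usage.
Answer 1 (score: 0)
Here is a working example using Simple HTML DOM.
I changed your sample HTML so that Suzy Creamcheese has multiple profiles, like this (file: test_class_class.htm):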
<div class="message">
<span class="profile">Suzy Creamcheese</span>
<span class="time">December 22, 2010 at 11:10 pm</span>
<div class="msgbody">
<div class="subject">New digs</div>
Hello thank you for trying our soap. <BR> Jim.
</div>
</div>
<div class="message reply">
<span class="profile">Lars Jörgenmeier</span>
<span class="time">December 22, 2010 at 11:45 pm</span>
<div class="msgbody">
I never sold you any soap.
</div>
</div>
</div>
<div class="message">
<span class="profile">Suzy Yogurt</span>
<span class="time">December 22, 2010 at 11:10 pm</span>
<div class="msgbody">
<div class="subject">No Creamcheese</div>
This is not Suzy Creamcheese <BR> Jim.
</div>
</div>
<div class="message reply">
<span class="profile">Suzy Creamcheese</span>
<span class="time">December 22, 2010 at 11:45 pm</span>
<div class="msgbody">
A reply from Suzy Creamcheese.
</div>
</div>
</div>
</div>
Here is the test using Simple HTML DOM:
include('simple_html_dom.php');
function getMessage_for_profile($iUrl,$iProfile)
{
// create HTML DOM
$html = file_get_html($iUrl);
// get text elements
$aoProfile = $html->find('span[class=profile]');
echo "Found ".count($aoProfile)." profiles.<br />";
foreach ($aoProfile as $key=>$oProfile)
{
if ($oProfile->plaintext == $iProfile)
{
echo "<b>Profile ".$key.": ".$oProfile->plaintext."</b><br />";
// Using $e->next_sibling ()
$oCurrent = $oProfile;
while ($oNext = $oCurrent->next_sibling())
{
if ( $oNext->class == "msgbody" )
{
echo "<hr />";
echo $oNext->outertext;
echo "<hr />";
}
$oCurrent = $oNext;
}
}
}
// clean up memory
$html->clear();
unset($html);
return;
}
// --------------------------------------------
// test it!
// user_agent header...
ini_set('user_agent', 'My-Application/2.5');
getMessage_for_profile('test_class_class.htm','Suzy Creamcheese');
echo "<br /><br /><br />";
getMessage_for_profile('test_class_class.htm','Suzy Yogurt');
My output is:
Found 4 profiles.
Profile 0: Suzy Creamcheese
--------------------------------
New digs
Hello thank you for trying our soap.
Jim.
---------------------------------
Profile 3: Suzy Creamcheese
---------------------------------
A reply from Suzy Creamcheese.
---------------------------------
Found 4 profiles.
Profile 2: Suzy Yogurt
---------------------------------
No Creamcheese
This is not Suzy Creamcheese
Jim.
---------------------------------
See, it can be done with Simple HTML DOM. And because I already know how the DOM works (or at least enough to get myself into trouble), I didn't need to learn any new syntax!