Question

I am trying to extract some information from a website.

One part of the page looks like this:

<th>Some text here</th><td>text to extract</td>

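I want to find (using a regular expression or some other solution) the part that starts with Some text here and extract text to extract from it.

I tried the following regex solution: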

$reg = '/<th>Some text here<\/th><td>(.*)<\/td>/'; 
preg_match_all($reg, $content, $result, PREG_PATTERN_ORDER);

print_r($result);

But it only gives me empty arrays:

Array ( [0] => Array ( ) [1] => Array ( ) )

How should I construct the regular expression to extract the desired value? Or what other solution could I use to extract it?

Answer 1

Use XPath:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);

$content = $xp->evaluate('string(//th[.="Some text here"]/following-sibling::*[1][name()="td"])');

echo $content;

XPath query details:

string(  # return a string instead of a node list
    //   # anywhere in the DOM tree
    th   # a th node
    [.="Some text here"] # predicate: its content is "Some text here"
    /following-sibling::*[1] # first following sibling
    [name()="td"] # predicate: must be a td node
)

Your pattern probably fails because the td content contains newline characters, which the dot (.) does not match by default.
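If you still want to stay with a regex, here is a minimal sketch of that point. It assumes the newline explanation above is the actual cause; the sample $content string is made up for illustration. The s modifier lets the dot match newlines, and the lazy .*? stops at the first closing td tag.

```php
<?php
// Hypothetical sample input: the td content spans two lines.
$content = "<table><tr><th>Some text here</th><td>text\nto extract</td></tr></table>";

// Original pattern: without the s modifier, . cannot cross the newline.
$withoutS = preg_match('/<th>Some text here<\/th><td>(.*)<\/td>/', $content, $m1);

// With the s modifier, . also matches newlines. \s* tolerates whitespace
// between the two cells, and .*? (lazy) stops at the first </td>.
$withS = preg_match('/<th>Some text here<\/th>\s*<td>(.*?)<\/td>/s', $content, $m2);

var_dump($withoutS); // int(0): no match
var_dump($withS);    // int(1): one match
echo $m2[1];         // the captured td content, newline included
```

This only helps when the markup is as regular as in the sample; attributes on the tags or nested cells would break it, which is why the DOM approach is the more robust choice.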

Answer 2

You can use DOMDocument.

$domd=new DOMDocument();
@$domd->loadHTML($content); // loadHTML() is an instance method; @ suppresses warnings about malformed HTML
$extractedText=NULL;
foreach($domd->getElementsByTagName("th") as $ele){
    if($ele->textContent!=='Some text here'){continue;}
    $extractedText=$ele->nextSibling->textContent;
    break;
}
if($extractedText===NULL){
//extraction failed
} else {
//extracted text is in $extractedText
}
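One caveat with nextSibling: if the real page has whitespace between the th and the td, the next sibling may be a text node rather than the td. A rough sketch of a more defensive variant follows, assuming the same $content variable; the sample markup and the trim() call are my additions, not part of the original page.

```php
<?php
// Hypothetical sample input with whitespace between the th and the td.
$content = "<table><tr><th>Some text here</th>\n  <td>text to extract</td></tr></table>";

$domd = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$domd->loadHTML($content);
libxml_clear_errors();

$extractedText = null;
foreach ($domd->getElementsByTagName('th') as $ele) {
    if (trim($ele->textContent) !== 'Some text here') {
        continue;
    }
    // Walk forward past any text nodes until the next element sibling.
    $node = $ele->nextSibling;
    while ($node !== null && $node->nodeType !== XML_ELEMENT_NODE) {
        $node = $node->nextSibling;
    }
    // Accept the sibling only if it really is a td.
    if ($node !== null && $node->nodeName === 'td') {
        $extractedText = $node->textContent;
    }
    break;
}

echo $extractedText;
```

If no matching th is found, or it is not followed by a td, $extractedText stays null, so the same null check as above still applies.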

(Regular expressions are generally a poor tool for parsing HTML, as several people have already pointed out in the comments.)

Parsing HTML and extracting a value in PHP
