Question

I have this HTML snippet:

<tr>
    <th scope="row" style="text-align:left;">Appeared in</th>
    <td class="" style="">1972<sup id="cite_ref-dottcl_2_2-0" class="reference"><a href="#cite_note-dottcl_2-2"><span>[</span>2<span>]</span></a></sup></td>
</tr>
<tr>
    <th scope="row" style="text-align:left;">Usual 
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
    </th>
    <td class="" style="">.h .c</td>

</tr>

I am parsing it with the expression //th//text().

The problem is that it returns ['Appeared in', 'Usual', 'filename extensions'].

What I want is ['Appeared in', 'Usual filename extensions'].

Answer 1

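Doing this in a single expression requires XPath 2.0, which most XML libraries for these scripting languages (including Scrapy) do not support.

If you can use a more powerful XPath processor (also have a look at XQuery 1.0 and newer, which all include at least XPath 2.0 as a subset), use: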

//th/data()

/data() is equivalent to /data(.), that is, calling the function on the current context item.

`data()` vs `text()`

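text() is not a function call but a node test, so //text() is an axis step that adds every text node to the result sequence separately. data(), by contrast, is a function that aggregates all the data of the current context item (here: each <th/> individually).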
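The difference can be illustrated outside XPath as well. The following is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree and a shortened copy of the second <th> from the question; itertext() is used here only as an analogy for selecting text nodes, not as an implementation of data().

```python
import xml.etree.ElementTree as ET

# Shortened copy of the second header cell from the question (an assumption
# made for brevity; attributes that do not matter here are dropped).
th = ET.fromstring(
    '<th>Usual <a href="/wiki/Filename_extension">filename extensions</a></th>'
)

# Comparable to th//text(): each descendant text node is a separate item.
text_nodes = list(th.itertext())

# Comparable to aggregating per element: one string for the whole <th>.
string_value = "".join(text_nodes)

print(text_nodes)    # ['Usual ', 'filename extensions']
print(string_value)  # Usual filename extensions
```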

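XPath 1.0 limitations

In XPath 1.0 there is no way to call a function that joins the strings of each table header element individually: it supports neither function calls in axis steps nor the explicit loops that are available in XPath 2.0.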
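A common way around this limitation is to do the loop in the host language: select the <th> elements first, then join the descendant text of each one. The sketch below assumes Python's standard-library xml.etree.ElementTree and wraps the fragment in a hypothetical <table> root so that it parses as XML; header_texts is an illustrative name, not part of any library. With Scrapy or lxml the same idea should apply by selecting //th and then evaluating normalize-space(.) relative to each selected element.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, wrapped in a <table> root (an assumption)
# so that it forms a single well-formed XML document.
FRAGMENT = """<table>
<tr>
    <th scope="row" style="text-align:left;">Appeared in</th>
    <td class="" style="">1972<sup id="cite_ref-dottcl_2_2-0" class="reference"><a href="#cite_note-dottcl_2-2"><span>[</span>2<span>]</span></a></sup></td>
</tr>
<tr>
    <th scope="row" style="text-align:left;">Usual 
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
    </th>
    <td class="" style="">.h .c</td>
</tr>
</table>"""


def header_texts(xml_source):
    """Return one whitespace-normalized string per <th> element."""
    root = ET.fromstring(xml_source)
    headers = []
    for th in root.iter("th"):
        # Join all descendant text nodes of this one header cell.
        joined = "".join(th.itertext())
        # Collapse line breaks and runs of spaces into single spaces.
        headers.append(" ".join(joined.split()))
    return headers


print(header_texts(FRAGMENT))  # ['Appeared in', 'Usual filename extensions']
```

The whitespace normalization matters because the second header cell contains a line break between "Usual" and the link text.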

Answer 2

Ah, I will get downvoted for parsing HTML with regex, but I can't help it:

$html = '<tr>
    <th scope="row" style="text-align:left;">Appeared in</th>
    <td class="" style="">1972<sup id="cite_ref-dottcl_2_2-0" class="reference"><a href="#cite_note-dottcl_2-2"><span>[</span>2<span>]</span></a></sup></td>
</tr>
<tr>
    <th scope="row" style="text-align:left;">Usual 
<a href="/wiki/Filename_extension" title="Filename extension">filename extensions</a>
    </th>
    <td class="" style="">.h .c</td>

</tr>';

$html = str_replace(["\r", "\n"], '', $html); // Remove line breaks
preg_match_all('#<th[^>]*>(.*?)</th>#is', $html, $m); // Capture what is between the <th> tags

$result = array_map('strip_tags', $m[1]); // Get rid of the HTML tags inside each match
print_r($result); // Print the results

Output:

Array
(
    [0] => Appeared in
    [1] => Usual filename extensions    
)

Joining descendant text nodes of XML / HTML in XPath
