Question

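I am trying to extract the links to all videos on a particular WordPress site. Each page contains exactly one video.

Every scraped page contains the following code: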

<p><script src="https://www.vooplayer.com/v3/watch/video.js"></script>
<iframe id="" voo-auto-adj="true" name="vooplayerframe" style="max-width:100%" allowtransparency="true" allowfullscreen="true" src="//www.vooplayer.com/v3/watch/watch.php?v=123456;clearVars=1" frameborder="0" scrolling="no" width="660" height="410" >
</iframe></p>

I want to extract the text from the iframe's src attribute, that is, the video link.

Google Chrome's inspector tells me the element can be addressed as:

XPath: //*[@id="post-255"]/div/p/iframe
Selector: #post-255 > div > p > iframe

But every page I scrape has a different "post" number. The numbers are essentially arbitrary, so I can't simply reuse the selectors above.

Answer 1

If the id attribute has a dynamic part, you can handle it with a partial match:

[id^=post] > div > p > iframe

where ^= means "starts with".

The XPath alternative:

//*[starts-with(@id, "post")]/div/p/iframe

Also check whether you can skip the intermediate div and p elements altogether and match any descendant iframe instead:

[id^=post] iframe
//*[starts-with(@id, "post")]//iframe

You can additionally check the iframe's name:

[id^=post] iframe[name=vooplayerframe]
//*[starts-with(@id, "post")]//iframe[@name = "vooplayerframe"]
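To show what the last pair of selectors matches, here is a minimal sketch using only the Python standard library. It is not the answer's method: it emulates `[id^=post] iframe[name=vooplayerframe]` by hand with `html.parser` and then reads the `src` attribute. The `<article id="post-255">` wrapper in the sample, the `VideoFrameFinder` and `extract_video_links` names, and the `https://example.com/` base URL are all assumptions made for illustration, and the depth tracking assumes reasonably well-balanced markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VideoFrameFinder(HTMLParser):
    """Collect the src of iframe[name=vooplayerframe] nested under [id^=post]."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # current element nesting depth
        self.post_depths = []   # depths of open elements whose id starts with "post"
        self.sources = []       # raw src values of matching iframes

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inside_post = bool(self.post_depths)
        if tag == "iframe" and inside_post and attrs.get("name") == "vooplayerframe":
            self.sources.append(attrs.get("src"))
        if tag in VOID_TAGS:
            return
        self.depth += 1
        # Equivalent of [id^=post] / starts-with(@id, "post")
        if (attrs.get("id") or "").startswith("post"):
            self.post_depths.append(self.depth)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.post_depths and self.post_depths[-1] == self.depth:
            self.post_depths.pop()
        self.depth -= 1


def extract_video_links(html, base_url="https://example.com/"):
    """Return absolute video URLs; base_url is a hypothetical page address."""
    finder = VideoFrameFinder()
    finder.feed(html)
    finder.close()
    # The src is protocol-relative ("//www..."), so resolve it against the page URL.
    return [urljoin(base_url, src) for src in finder.sources if src]


# Sample page: the <article id="post-255"> wrapper is assumed, the rest
# follows the snippet from the question.
SAMPLE = """
<article id="post-255"><div>
<p><script src="https://www.vooplayer.com/v3/watch/video.js"></script>
<iframe id="" voo-auto-adj="true" name="vooplayerframe" src="//www.vooplayer.com/v3/watch/watch.php?v=123456;clearVars=1" frameborder="0"></iframe></p>
</div></article>
"""

print(extract_video_links(SAMPLE))
```

With a library that evaluates CSS or XPath directly, the expressions above can be used as they are; in XPath, appending /@src to the iframe step selects the attribute value instead of the element.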
