Question

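I am trying to extract the link (href) and the text inside <a> tags for a number of links in an HTML page.

I only want specific links, which are matched by a substring.

A sample of my HTML: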

<a href="/this/dir/1234/">This should be 1234</a> some other html
<a href="/this/dir/1236/">This should be 1236</a> some other html
<a href="/about_us/">Not important link</a> some other html

I am using Xidel, which lets me avoid regular expressions. It seems to be the simplest tool for the job.

What I have so far:

xidel -e "//a/(@href[contains(.,'/this/dir')],text())"

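This mostly works, but two problems remain:

The href and the text come out separated by a newline. I want them on the same line.
The text of every link is returned, so I also get the text "Not important link".

What is the recommended way to get output like this?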

/this/dir/1234  ; This should be 1234
/this/dir/1236  ; This should be 1236

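Thanks for any feedback or tips.

Edit:

Martin's solution gets 99% of the way there. It did not output a newline for me, though, so I am joining on a dummy string and using awk to replace that string with a newline.

Note: I am on Windows.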

xidel myhtml.htm -e "string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), 'XXX')" | awk -F "XXX" "{$1=$1}1" "OFS=\n"
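
The awk step in that pipeline can be tried on its own, without Xidel. The following is a minimal sketch, assuming a POSIX shell with awk available (so single quotes are used instead of the double quotes needed on the Windows console). The `joined` variable is a hand-written stand-in for the single line Xidel would emit, not real Xidel output.

```shell
# Stand-in for the one-line result of string-join(..., 'XXX') (assumed, not captured from Xidel).
joined='/this/dir/1234/ ; This should be 1234XXX/this/dir/1236/ ; This should be 1236'

# -F 'XXX' splits the line into fields on the placeholder.
# Assigning $1=$1 forces awk to rebuild the record with OFS between fields,
# and OFS is set to a newline, so each field ends up on its own line.
# The trailing 1 is an always-true pattern that prints the rebuilt record.
result=$(printf '%s\n' "$joined" | awk -F 'XXX' '{$1=$1}1' OFS='\n')

printf '%s\n' "$result"
```

Any placeholder works as long as it cannot occur in the hrefs or link texts, since awk treats every occurrence as a field boundary.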

Answer 1

You can move the condition into a predicate, e.g. //a[contains(@href, '/this/dir')]!(@href, string()). As for the result format, if you want to delegate all of it to XQuery, use:

string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), '&#10;')
