Question

To extract URLs, I use the following:

require 'open-uri'
html = URI.open('http://lab/links.html').read
urls = URI.extract(html)

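This works well.

Now I need to extract a list of URLs that have no http or https prefix and that sit between <br> tags. Because there is no http or https prefix, URI.extract does not work.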

domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php

Each unprefixed URL sits between <br> tags.

~~I have been looking at "Nokogiri Xpath to retrieve text after <BR> within <TD> and <SPAN>", but could not get it to work.~~
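The difference can be reproduced in isolation. The snippet below is a minimal sketch, assuming only Ruby's standard uri library; both sample strings are made up for illustration. URI.extract looks for a scheme followed by a colon, so a string without one yields nothing. Note that recent Ruby versions mark URI.extract as obsolete.

```ruby
require 'uri'

# Hypothetical sample strings: one carries a scheme, the other does not.
with_scheme    = "see http://lab/links.html for details"
without_scheme = "domain1.com/index.html<br >domain2.com/home/~john/index.html"

# URI.extract only recognizes URIs that start with "scheme:".
found_with    = URI.extract(with_scheme)
found_without = URI.extract(without_scheme)

p found_with
p found_without
```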

Desired output

domain1.com/index.html
domain2.com/home/~john/index.html
domain3.com/a/b/c/d/index.php

~~Interim solution~~

~~doc = Nokogiri::HTML(URI.open("http://lab/noprefix_domains.html")); doc.search('br').each { |n| n.replace("\n") }; puts doc~~

~~I still need to remove the remaining HTML markup (!DOCTYPE, html, body, p)...~~

Solution

# Collect only the text nodes and the <br> elements, then split on the <br> tags.
str = ""
doc.traverse { |n| str << n.to_s if n.name == "text" || n.name == "br" }
puts str.split(/\s*<\s*br\s*>\s*/)

Thanks.

Answer 1

Assuming you already have a way to get the sample string shown in your question, you can call split on that string:

str = "domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php"
str.split(/\s*<\s*br\s*>\s*/)
#=> ["domain1.com/index.html",
#    "domain2.com/home/~john/index.html",
#    "domain3.com/a/b/c/d/index.php"]

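This splits the string at each <br> tag. It also removes whitespace before and after the <br>, and it allows whitespace inside the tag, such as <br > or < br>. If you also need to handle self-closing tags (e.g. <br />), use this regex instead: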

/\s*<\s*br\s*\/?\s*>\s*/
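As a quick check of that pattern, here is a minimal sketch using only core Ruby. The sample string, the BR_TAG constant name, the added i flag (so that an uppercase BR also matches), and the reject call are assumptions made for illustration and are not part of the answer above.

```ruby
# Hypothetical input that mixes several spellings of the tag.
str = "domain1.com/index.html<br>domain2.com/home/~john/index.html<br />domain3.com/a/b/c/d/index.php< BR/ >"

# Same pattern as above, plus the i flag (an addition) for case-insensitive matching.
BR_TAG = /\s*<\s*br\s*\/?\s*>\s*/i

# Drop empty pieces, which appear when the string starts with a <br> tag.
urls = str.split(BR_TAG).reject(&:empty?)
puts urls
```

Keeping the pattern in a constant makes it easy to reuse if the same split is needed in more than one place.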
