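I need to extract the actual phone number from the HTML below, but I'm not sure how to do it with Nokogiri CSS selectors, because the number isn't wrapped in an HTML tag of its own. Calling at_css('.phonetitle') returns only "Phone", not the number.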
<div class="detail">
<span class="address">Corner of Toorak Road and Chapel Street, South Yarra</span><br>
<span class="phonetitle">Phone</span> 95435 34341
<br><br>
</div>
Answer 0 (score: 0)
Nothing a little XPath can't handle:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri::HTML(<<-HERE)
<div class="detail">
<span class="address">
Corner of Toorak Road and Chapel Street, South Yarra
</span><br>
<span class="phonetitle">Phone</span> 95435 34341
<br><br>
</div>
HERE
# Select the text nodes that are direct children of the "detail" element and join them.
puts doc.xpath('//*[@class="detail"]/text()').text.strip
# => 95435 34341
Answer 1 (score: 0)
Try this:
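public static final int MAX_HTML_TAG_LENGTH = 10;
// Applied in order; every match is replaced with a single space.
public static final String[] REGEX_HTTP_TAG_FILTER = new String[] {
    "[\\t\\n\\r\\f]+",  // tabs and line breaks, so ".+?" below can span former lines
    "<(s|S)(c|C)(r|R)(i|I)(p|P)(t|T)[^>]*>.+?</(s|S)(c|C)(r|R)(i|I)(p|P)(t|T)>",  // <script> blocks
    "<(s|S)(t|T)(y|Y)(l|L)(e|E)[^>]*>.+?</(s|S)(t|T)(y|Y)(l|L)(e|E)>",  // <style> blocks
    "<[a-zA-Z]{1," + MAX_HTML_TAG_LENGTH + "}\\s*[^>]*>",  // opening tags
    "</[a-zA-Z]{1," + MAX_HTML_TAG_LENGTH + "}>", "<!--.+?-->",  // closing tags, comments
    "&nbsp;",  // non-breaking space entity
    "[ ]{2,}+"  // runs of spaces
};
// `html` is assumed to hold the markup to clean.
String result = html;
for (String regex : REGEX_HTTP_TAG_FILTER) {
    result = result.replaceAll(regex, " ");
}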
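Stripping the tags leaves the address, the label and the number in one string, so a further step is still needed to isolate the number. A rough Ruby equivalent, using only the standard library, might look like the following. It assumes the number follows the literal label "Phone" and consists of digit groups separated by single spaces; extract_phone and RAW_HTML are hypothetical names for this sketch.

```ruby
# Sample markup copied from the question.
RAW_HTML = <<~MARKUP
  <div class="detail">
  <span class="address">Corner of Toorak Road and Chapel Street, South Yarra</span><br>
  <span class="phonetitle">Phone</span> 95435 34341
  <br><br>
  </div>
MARKUP

# Hypothetical helper: strips tags with a crude regex, then captures the
# digit groups that follow the word "Phone". Returns nil when nothing matches.
def extract_phone(html)
  text = html.gsub(/<[^>]*>/, ' ')   # replace every tag with a space
  text = text.gsub(/\s+/, ' ')       # collapse whitespace, including line breaks
  text[/Phone\s+(\d+(?: \d+)*)/, 1]  # first capture group, or nil
end

puts extract_phone(RAW_HTML)
```

Regex-based tag stripping is fragile on real-world HTML, so this is better suited to small, predictable snippets than to arbitrary pages.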
Answer 2 (score: 0)
Here is an XPath expression that finds the phone number:
*[@class='phonetitle']/following-sibling::text()
An example in Python (you can port it to Ruby and Nokogiri using @Jörg W Mittag's answer):
#!/usr/bin/env python
from lxml import html
doc = html.fromstring("""
<div class="detail">
<span class="address">
Corner of Toorak Road and Chapel Street, South Yarra
</span><br>
<span class="phonetitle">Phone</span> 95435 34341
<br><br>
</div>
""")
# Take the first text node after the label; later siblings may be whitespace only.
pn = doc.xpath("*[@class='phonetitle']/following-sibling::text()")[0]
print(pn.strip())
# -> 95435 34341
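Answer 3 (score: -1)
This is not easy to parse, because the phone number has no explicit wrapper; it is not inside an element of its own.
If you have the whole thing as a string in JavaScript, I think you can break it apart with the split() method.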
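var string = '<div class="detail">' +
  '<span class="address">Corner of Toorak Road and Chapel Street, South Yarra</span><br>' +
  '<span class="phonetitle">Phone</span> 95435 34341' +
  '<br><br>' +
  '</div>';
// a[1] is everything after the "Phone" label.
var a = string.split('Phone</span>');
// Cut that remainder at the next <br>; b[0] holds the number plus whitespace.
var b = a[1].split('<br>');
var phone = b[0].trim(); // "95435 34341"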