I have googled half the internet looking for help.
So, here is what I need:
I have an HTML structure to parse that looks like this:
<div class="foo">
<div class='bar' dir='ltr'>
<div id='p1' class='par'>
<p class='sb'>
<span id='dc_1_1' class='dx'>
<a href='/bar32560'>1</a>
</span>
Neque porro
<a href='/xyz' class='mr'>+</a>
quisquam est
<a href='/xyz' class='mr'>+</a>
qui.
</p>
</div>
<div id='p2' class='par'>
<p class='sb'>
<span id='dc_1_2' class='dx'>
<a href='/foo12356'>2</a>
</span>
dolorem ipsum
<a href='/xyz' class='mr'>+</a>
quia dolor sit amet,
<a href='/xyz' class='mr'>+</a>
consectetur, adipisci velit.
</p>
</div>
<div id='p3' class='par'>
<p class='sb'>
<span id='dc_1_3' class='dx'>
<a href='/foobar4586'>3</a>
</span>
Neque porro quisquam
<a href='/xyz' class='mr'>+</a>
est qui dolorem ipsum quia dolor sit
<a href='/xyz' class='mr'>+</a>
amet, t.
<a href='/xyz' class='mr'>+</a>
<span id='dc_1_4' class='dx'>
<a href='/barefoot4135'>4</a>
</span>
consectetur,
<a href='/xyz' class='mr'>+</a>
adipisci veli.
<span id='dc_1_5' class='dx'>
<a href='/barfoo05123'>5</a>
</span>
Neque porro
<a href='/xyz' class='mr'>+</a>
quisquam est
<a href='/xyz' class='mr'>+</a>
qui.
</p>
</div>
</div>
</div>
What I need: to grab each paragraph, but the final scraped text objects should contain:
scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body 4 => 4 consectetur, adipisci veli.
scraped_body 5 => 5 Neque porro quisquam est qui.
The code I am using now:
page = Nokogiri::HTML(open(url))
x = page.css('.mr').remove
x.xpath("//div[contains(@class, 'par')]").map do |node|
body = node.text
end
My result is:
scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t. 4 consectetur, adipisci veli. 5 Neque porro quisquam est qui.
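So this scrapes the whole text out of each div with class 'par'. I need to scrape the text that follows each span, together with the span's own content (the number). Or cut up those divs before each span.
I need something like: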
SPAN.text + P.text - a.mr
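I don't know how to do it.
Please help me with this parsing. I think I need to scrape after/before each span.
Please help, I have tried everything I could find.
I used the following code: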
def verses
page = Nokogiri::HTML(open(url))
i=0
x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM").map do |node|
i+=1
body = node
VerseSource.new(body, book_num, number, i)
end
end
I need this because I am parsing a large website full of text. There are more methods besides this one. So my final output looks like this:
Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam est qui.
However, if a single verse contains more than one sentence, your code splits it at every sentence. So it splits too much.
For example:
<div id='p1' class='par'>
<p class='sb'>
<span id='dc_1_3' class='dx'>
<a href='/foobar4586'>1</a>
</span>
Neque porro quisquam. Est qui dolorem
<a href='/xyz' class='mr'>+</a>
<span id='dc_1_3' class='dx'>
<a href='/foobar4586'>2</a>
</span>
est qui dolorem ipsum quia dolor sit.
<a href='/xyz' class='mr'>+</a>
amet, t.
Your code splits it like this:
Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam.
Saved record with: book: 1, chapter: 1, verse: 2, body: Est qui dolorem
Saved record with: book: 1, chapter: 1, verse: 3, body: 2 est qui dolorem ipsum quia dolor sit.
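I hope you see what I mean. I really appreciate your support. It would be great if you could modify it!
Thank you for your answer! When I use your code in my method, it parses very badly.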
def verses
page = Nokogiri::HTML(open(url))
i=0
#page.css(".mr").remove
page.xpath("//div[contains(@class, 'par')]//span").map do |node|
node.content.strip.tap do |out|
while nn = node.next
break if nn.name == 'span'
out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
node = nn
end
end
i+=1
body = node
VerseSource.new(body, book_num, number, i)
end
end
The output looks like this:
Saved record with: book: 1, chapter: 1, verse: 1, body: <here is last part of last sentence in first paragraph after "+" sign(href) and before last "+"(href)>
Saved record with: book: 1, chapter: 1, verse: 2, body: <here is last part of last sentence in second paragraph after "+" sign(href) and before last "+"(href)>
Saved record with: book: 1, chapter: 1, verse: 3, body:
Saved record with: book: 1, chapter: 1, verse: 4, body:
Saved record with: book: 1, chapter: 1, verse: 5, body: <here is last sentence in third paragraph. It is after last "+" in this paragraph and have no more "+" signs(href)
As you can see, I have no idea how it produced such a mess ;] Could you do something more with it? Thanks a lot!
Regards!
Answer 0 (score: 0)
I saved your input as "temp.html" on my desktop.
require 'open-uri'
require 'nokogiri'
$page_html = Nokogiri::HTML.parse(open("/home/user/Desktop/temp.html"))
output = $page_html.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM")
# I found the pattern ". " in every line, so I replaced ". " with ". HAM"
# using gsub(". ", ". HAM").
# Then I split the string on " HAM", which preserves the "." in each item of the array.
output = ["1 Neque porro quisquam est qui.", "2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.", "3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.", "4 consectetur, adipisci veli.", "5 Neque porro quisquam est qui."]
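The marker trick described in the comments above can be tried on a plain string, without any HTML. This is only a sketch: `split_sentences` is a hypothetical helper name, and the sample strings are shortened from the question. The second call shows the limitation: every ". " becomes a break, so a verse that contains two sentences is cut in two, and a verse number that does not follow a ". " is not treated as a boundary at all.

```ruby
# Hypothetical helper illustrating the ". " -> ". HAM" marker trick.
def split_sentences(text)
  text.gsub("+", " ")        # drop the "+" link text
      .split.join(" ")       # collapse runs of whitespace
      .gsub(". ", ". HAM")   # mark each sentence boundary
      .split(" HAM")         # cut at the markers, keeping the "."
end

# One sentence per verse: the split lines up with the verses.
p split_sentences("1 Neque porro + quisquam est + qui. 2 dolorem ipsum.")

# Two sentences in verse 1: the split no longer lines up with the verses.
p split_sentences("1 Neque porro quisquam. Est qui dolorem + 2 est qui.")
```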
Edit:
%w[nokogiri open-uri].each{|gem| require gem}
$url = "/home/user/Desktop/temp.html"
def verses
page = Nokogiri::HTML(open($url))
i=0
x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM").map do |node|
i+=1
body = node
VerseSource.new(body, book_num, number, i)
end
end
Answer 1 (score: 0)
Try something like:
x.xpath("//div[contains(@class, 'par')]//span").map do |node|
out = node.content.strip
if following = node.at_xpath('following-sibling::text()')
out << ' ' << following.content.strip
end
out
end
The following-sibling::text() XPath will get the first text node after the span.
Edit
I think this does what you need:
html.xpath("//div[contains(@class, 'par')]//span").map do |node|
node.content.strip.tap do |out|
while nn = node.next
break if nn.name == 'span'
out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
node = nn
end
end
end
Output:
[
"1 Neque porro quisquam est qui.",
"2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.",
"3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.",
"4 consectetur, adipisci veli.",
"5 Neque porro quisquam est qui."
]
This can also be done with pure XPath (see "XPath axis, get all following nodes until"), but this solution is simpler from a coding point of view.
Edit 2
Try this:
def verses
page = Nokogiri::HTML(open(url))
i=0
page.xpath("//div[contains(@class, 'par')]//span").map do |node|
body = node.content.strip.tap do |out|
while nn = node.next
break if nn.name == 'span'
out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
node = nn
end
end
i+=1
VerseSource.new(body, book_num, number, i)
end
end
Answer 2 (score: 0)
require 'nokogiri'
your_html =<<END_OF_HTML
<your html here>
END_OF_HTML
doc = Nokogiri::HTML(your_html)
text_nodes = doc.xpath("//div[contains(@class, 'par')]/p/child::text()")
results = text_nodes.reject do |text_node|
text_node.text.match /\A \s+ \z/x #Eliminate whitespace nodes
end
results.each_with_index do |node, i|
puts "scraped_body#{i+1} => #{node.text.strip}"
end
--output:--
scraped_body1 => Neque porro quisquam est qui.
scraped_body2 => dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.
scraped_body3 => Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body4 => consectetur, adipisci veli.
scraped_body5 => Neque porro quisquam est qui.
Answer for the new HTML:
require 'nokogiri'
html = <<END_OF_HTML
your new html here
END_OF_HTML
html_doc = Nokogiri::HTML(html)
current_group_number = nil
non_ws_text = [] #non_whitespace_text for each group
html_doc.css("div.par > p").each do |p| #p's that are direct children of <div class="par">
p.xpath("./node()").each do |node| #All Text and Element nodes that are direct children of p tag.
case node
when Nokogiri::XML::Element
if node.name == 'span'
node.xpath(".//a").each do |a| #Step through all the <a> tags inside the <span>
md = a.text.match(/\A (\d+) \z/xm) #Check for numbers
if md #Then found a number, so it's the start of the next group
if current_group_number #then print the results for the current group
print "scraped_body #{current_group_number} => "
puts "#{current_group_number} #{non_ws_text.join(' ')}"
non_ws_text = []
end
current_group_number = md[1] #Record the next group number
break #Only look for the first <a> tag containing a number
end
end
end
when Nokogiri::XML::Text
text = node.text
non_ws_text << text.strip if text !~ /\A \s+ \z/xm
end
end
end
#For the last group:
print "scraped_body #{current_group_number} => "
puts "#{current_group_number} #{non_ws_text.join(' ')}"
--output:--
scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body 4 => 4 consectetur, adipisci veli.
scraped_body 5 => 5 Neque porro quisquam est qui.