Question

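I'm trying to recreate the text of comment messages from a website I'm scraping, but I'm running into trouble when an image appears in the middle of the text. The images are smiley emoticons.

For example, the following comment appears as the HTML below (pretend the "alt" text is the actual image):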

text text text blah blah blah :3some more text that will come directly after

<div>
    "text text text blah blah blah "
    <img src="/smiley.png" width="16" height="16" alt=":3" title>
    "some more text that will come directly after"
</div>

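I was hoping there is a way to get the number of characters before the <img ...>, so that I could use the string's insert() method to put the alt text into the message itself.

Does anyone have other ideas, or know how to implement a solution like this?

When I call inspect on the div element, I get the following: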

[#<Nokogiri::XML::Element:0x3fda6dc527cc name="div" children=[#<Nokogiri::XML::Text:0x3fda6dc52484 "text text text blah blah blah ">, #<Nokogiri::XML::Element:0x3fda6dc523a8 name="img" attributes=[#<Nokogiri::XML::Attr:0x3fda6dc52330 name="src" value="/smiley.png">, #<Nokogiri::XML::Attr:0x3fda6dc52308 name="width" value="16">, #<Nokogiri::XML::Attr:0x3fda6dc522b8 name="height" value="16">, #<Nokogiri::XML::Attr:0x3fda6dc522a4 name="alt" value=":3">]>, #<Nokogiri::XML::Text:0x3fda6d487470 "some more text that will come directly after">]>]

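Before posting this, I didn't realize I could do that. I'm guessing the children list/array can be accessed individually?

I ended up converting the div element to a string and parsing it to get what I wanted.

If anyone has a more elegant solution, please let me know! I'm all for learning more about this.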

Answer 1

You asked:

How can I use Nokogiri to find out how many characters come before an image tag?

img = doc.at('img')        # first <img> in the parsed document
img.previous.text.length   # length of the node immediately before it

Answer 2

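I'm not sure I fully understand. It sounds like you want to take the original HTML and replace every image tag with its alt text? If so, this will work: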

> html = '<div>
*     text text text blah blah blah
*     <img src="/smiley.png" width="16" height="16" alt=":3" title>
*     some more text that will come directly after
* </div>'

> doc = Nokogiri::HTML.fragment(html)
> doc.css('img').each {|img| img.replace(img.attr('alt'))}

> puts doc.at('div').text

    text text text blah blah blah
    :3
    some more text that will come directly after
