Question

I can't remove whitespace from a string.

My HTML is:

<p class='your-price'>
Cena pro Vás: <strong>139&nbsp;<small>Kč</small></strong>
</p>

My code is:

#encoding: utf-8
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
site  = agent.get("http://www.astratex.cz/podlozky-pod-raminka/doplnky")
price = site.search("//p[@class='your-price']/strong/text()")

val = price.first.text  # => "139 "
val.strip               # => "139 "
val.gsub(" ", "")       # => "139 "

gsub, strip, etc. don't work. Why, and how do I fix this?

val.class      # => String
val.dump       # => "\"139\\u{a0}\""      !
val.encoding   # => #<Encoding:UTF-8>

__ENCODING__               # => #<Encoding:UTF-8>
Encoding.default_external  # => #<Encoding:UTF-8>

I'm using Ruby 1.9.3, so Unicode shouldn't be a problem.

Answer 1

strip only removes ASCII whitespace, and the character you have here is a Unicode non-breaking space (U+00A0).

Removing the character is easy. You can use gsub with a regex that gives the character code: gsub(/\u00a0/, '')

You could also call gsub(/[[:space:]]/, '') to remove all Unicode whitespace. See the documentation for details.
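A minimal sketch of the three calls side by side. The string literal below is an assumption standing in for the scraped value: the digits followed by U+00A0, as the dump output in the question shows.

```ruby
# `val` mimics the scraped value from the question: digits followed by U+00A0.
val = "139\u00A0"

stripped   = val.strip                    # strip leaves the NBSP in place
nbsp_only  = val.gsub(/\u00a0/, '')       # removes only the non-breaking space
all_spaces = val.gsub(/[[:space:]]/, '')  # removes any Unicode whitespace

p stripped.length  # 4
p nbsp_only        # "139"
p all_spaces       # "139"
```

The [[:space:]] form is the broader of the two, so it also removes ordinary spaces, tabs and newlines anywhere in the string.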

Answer 2

If I wanted to remove non-breaking spaces "\u00A0", AKA &nbsp;, I'd do something like:

require 'nokogiri'

doc = Nokogiri::HTML("&nbsp;")

s = doc.text # => " "

# s is the NBSP
s.ord.to_s(16)                   # => "a0"

# and here's the translate changing the NBSP to a SPACE
s.tr("\u00A0", ' ').ord.to_s(16) # => "20"

So tr("\u00A0", ' ') gets you where you want to be: at this point, the NBSP is now an ordinary space.

tr is extremely fast and easy to use.
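Applied to the value from the question, this could look like the sketch below. The literal "139\u00A0" is an assumed stand-in for the scraped text, and String#delete is shown as an alternative that is not part of the answer above.

```ruby
# Assumed stand-in for the scraped value: digits followed by U+00A0.
val = "139\u00A0"

# tr swaps the NBSP for an ordinary space, which strip then removes.
price = val.tr("\u00A0", ' ').strip

# delete drops the character outright, with no intermediate space.
digits = val.delete("\u00A0")

p price       # "139"
p digits      # "139"
p price.to_i  # 139
```

tr followed by strip keeps any interior spacing intact, which matters when the NBSP separates words rather than trailing a number.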

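An alternative is to pre-process the actual encoded entity "&nbsp;" before it has been extracted from the HTML. This is simplified, but it would work for an entire HTML file just as well as for a single entity in a string: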

s = "&nbsp;"

s.gsub('&nbsp;', ' ') # => " "

Using a fixed string for the target is faster than using a regular expression:

s = "&nbsp;" * 10000

require 'fruity'

compare do
  fixed { s.gsub('&nbsp;', ' ') }
  regex { s.gsub(/&nbsp;/, ' ') }
end

# >> Running each test 4 times. Test will take about 1 second.
# >> fixed is faster than regex by 2x ± 0.1

Regular expressions are useful if you need their capability, but they can drastically slow code.
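If the fruity gem isn't available, a rough version of the same comparison can be sketched with the standard library's Benchmark module. The iteration count of 100 is an arbitrary choice, and timings depend on the Ruby version and machine, so the 2x figure above may not reproduce.

```ruby
require 'benchmark'

s = "&nbsp;" * 10_000

fixed_result = nil
regex_result = nil

# Time 100 runs of each variant; both must produce the same string.
fixed_time = Benchmark.realtime { 100.times { fixed_result = s.gsub('&nbsp;', ' ') } }
regex_time = Benchmark.realtime { 100.times { regex_result = s.gsub(/&nbsp;/, ' ') } }

puts format("fixed: %.4fs  regex: %.4fs", fixed_time, regex_time)
```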
