Question

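I'm writing a Ruby (1.9.3) script that reads XML files from a folder and edits them where necessary.

My problem is that the XML files I receive have been converted by Tidy, and its output is a little strange, for example: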

<?xml version="1.0" encoding="utf-8"?>
<XML>
  <item>
      <ID>000001</ID>
      <YEAR>2013</YEAR>
      <SUPPLIER>Supplier name test,
      Coproration</SUPPLIER>
...

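As you can see, there are extra CRLFs in the output. I don't know why Tidy behaves this way, but I'm working around it with a Ruby script. I'm stuck, though, because I need to check whether the last character of a line is ">" or the first character is "<", so that I can tell whether something is wrong with the markup.

I tried: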

Dir.glob("C:/testing/corrected/*.xml").each do |file|

  puts file

  File.open(file, 'r+').each_with_index do |line, index|

    first_char = line[0,1]

    if first_char != "<"
        # copy this line to the previous line and delete this one?
    end

  end

end

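I also feel I should copy the original file's contents to a temporary file as I read it, and then overwrite the original. Is that the best way to do it? Any tips are welcome, since I don't have much experience with changing file contents.

Regards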

Answer 1

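Does the extra \n always appear in the <SUPPLIER> node? As others have suggested, Nokogiri is an excellent choice for parsing XML (or HTML). You can iterate over each <SUPPLIER> node, remove the \n characters, and then save the XML as a new file.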

require 'nokogiri'

# read and parse the old file
file = File.read("old.xml")
xml = Nokogiri::XML(file)

# replace \n and any additional whitespace with a space
xml.xpath("//SUPPLIER").each do |node|
  node.content = node.content.gsub(/\n\s+/, " ")
end

# save the output into a new file
File.open("new.xml", "w") do |f|
  f.write xml.to_xml
end
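
If you would rather stay close to the line-by-line idea from the question and overwrite the files in place, something like the following might work. This is only a rough sketch using the standard library: the helper names join_broken_lines and rewrite_file are made up for illustration, and it assumes every tag starts at the beginning of a line (after indentation), so any line that does not start with "<" is treated as a continuation of the previous one. It does not try to handle comments, CDATA sections, or text that is meant to span several lines.

```ruby
require 'fileutils'

# Hypothetical helper: merge every line that does not start with "<"
# (ignoring indentation) into the line before it, separated by one space.
# Blank lines are dropped and line endings are normalized to "\n".
def join_broken_lines(text)
  merged = []
  text.each_line do |line|
    stripped = line.strip
    next if stripped.empty?
    if merged.empty? || stripped.start_with?("<")
      # A line that starts a tag is kept, with its indentation intact.
      merged << line.rstrip
    else
      # A continuation line is appended to the previous line.
      merged[-1] = "#{merged[-1]} #{stripped}"
    end
  end
  merged.join("\n") + "\n"
end

# Hypothetical helper: write the corrected text to a temporary file next
# to the original, then move it over the original, so the original is
# only replaced after the new content has been written completely.
def rewrite_file(path)
  fixed = join_broken_lines(File.read(path))
  tmp_path = "#{path}.tmp"
  File.open(tmp_path, "w") { |f| f.write(fixed) }
  FileUtils.mv(tmp_path, path)
end

# Usage, with the folder from the question:
#   Dir.glob("C:/testing/corrected/*.xml").each { |file| rewrite_file(file) }
```

Writing to a temporary file first and moving it afterwards is the "copy, then overwrite" approach you describe. For anything beyond this simple case, the Nokogiri version above is the safer choice, because it works on the parsed document instead of guessing from line boundaries.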
