Question

I have CSV files that use tt as the separator. In some edge cases parsing breaks, because a value may itself end in t. I am using this gem, https://github.com/tilo/smarter_csv, to read the CSV data.

Example of a line that breaks:

4909ttZSWttPrince RupertttCanadattCAttNorth Americatt54.3333tt-130.283

Output (note the city and country values):

{:id=>4909, :code=>"ZSW", :city=>"Prince Ruper", :country=>"tCanada", :country_code=>"CA", :continent=>"North America", :coordinate_x=>54.3333, :coordinate_y=>-130.283}

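Is there a way to tell the CSV reader that, when a value ends in t, it should split only where the next field starts with a capital letter, and otherwise not split? (Note the repeated t: (tt)t.) This is my current code: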

options = {
  :col_sep => 'tt',
  :headers_in_file => false,
  :user_provided_headers => [
    "id",
    "code",
    "city",
    "country",
    "country_code",
    "continent",
    "coordinate_x",
    "coordinate_y"
  ]
}
records = SmarterCSV.process(filename, options)

Answer 1

The "smarter" gem seems to be too dumb for this case.

I would go with:

# chomp: true strips the trailing newline so it does not end up in the last field
File.readlines('path/to/file', chomp: true).map do |line|
  line.split(/tt(?=[^t])/)
end

This produces an array of arrays. If you want the kind of output the "smarter" gem gives you:

# headers come from the options hash in the question
File.readlines('path/to/file', chomp: true).map do |line|
  options[:user_provided_headers].zip(line.split(/tt(?=[^t])/)).to_h
end

Voilà. The above works as long as a cell value cannot start with a lowercase "t".
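As a quick check, the split can be run against the sample line from the question. This is only a sketch: HEADERS is a hypothetical constant mirroring the asker's user_provided_headers (as symbols), and the values stay strings here, whereas SmarterCSV converts numeric fields.

```ruby
# HEADERS is a hypothetical stand-in for the asker's user_provided_headers.
HEADERS = %i[id code city country country_code continent coordinate_x coordinate_y]

# Sample line from the question, with the newline File.readlines would leave on it.
line = "4909ttZSWttPrince RupertttCanadattCAttNorth Americatt54.3333tt-130.283\n"

naive = line.chomp.split('tt')           # plain string separator
fixed = line.chomp.split(/tt(?=[^t])/)   # split only where the next character is not "t"

p naive[2..3]  # => ["Prince Ruper", "tCanada"]
p fixed[2..3]  # => ["Prince Rupert", "Canada"]

record = HEADERS.zip(fixed).to_h
p record[:city]     # => "Prince Rupert"
p record[:country]  # => "Canada"
```

The lookahead (?=[^t]) does not consume the following character, so the first letter of the next field is kept.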

Sidenote: I wonder whether we have become too quick to reach for code written by others, and too lazy to simply write a small piece of code ourselves.

Answer 2

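Inspired by @mudasobwa's answer, I found another solution that did not require changing my code much. I replaced the :col_sep value tt with the regular expression provided by @mudasobwa.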

options = {
  :col_sep => /tt(?=[^t]|tt)/,
  :headers_in_file => false,
  :user_provided_headers => [
    "id",
    "code",
    "city",
    "country",
    "country_code",
    "continent",
    "coordinate_x",
    "coordinate_y"
  ]
}
records = SmarterCSV.process(filename, options)

Edit note: I have replaced the regular expression

  /tt(?=[^t])/

with

  /tt(?=[^t]|tt)/

to allow null (empty) values.
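The difference between the two expressions can be seen with plain String#split, independent of the gem. The rows below are made-up examples, not data from the question, and the last one shows a case that stays ambiguous even with the extended expression.

```ruby
OLD_SEP = /tt(?=[^t])/      # separator from Answer 1
NEW_SEP = /tt(?=[^t]|tt)/   # separator from this answer

# Hypothetical row whose third field is empty.
row = "4909ttZSWttttCanada"
old_fields = row.split(OLD_SEP)
new_fields = row.split(NEW_SEP)
p old_fields  # => ["4909", "ZSWtt", "Canada"]
p new_fields  # => ["4909", "ZSW", "", "Canada"]

# Hypothetical ambiguous row: a value ending in "t" followed by an empty field.
ambiguous = "RupertttttCanada".split(NEW_SEP)
p ambiguous   # => ["Ruper", "t", "Canada"]
```

So the extended expression handles empty fields, but a value ending in t directly before an empty field still cannot be recovered reliably from this file format.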

Answer 3

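Why not replace the separator string with something else, such as ;? It may sound like extra work, but it will save you a lot of time. Just run something like this: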

 "HolattCatttHey".gsub(/(tt[A-Z])/) { |m| ";#{($1).sub('tt','')}"}
 => "Hola;Cat;Hey"

Once that is done, you can happily use your gem.
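One caveat, sketched below on the sample line from the question: matching on [A-Z] only rewrites separators that precede a capital letter, so the numeric fields stay glued together. A possible variant, borrowing the lookahead from Answer 1, rewrites every tt that is not followed by another t. It assumes that no value contains ; and that no value starts with a lowercase t.

```ruby
line = "4909ttZSWttPrince RupertttCanadattCAttNorth Americatt54.3333tt-130.283"

# Pattern in the spirit of this answer: only separators before a capital letter.
upper_only = line.gsub(/tt([A-Z])/) { ";#{$1}" }

# Variant (an assumption, not part of the original answer): any "tt" not followed by "t".
any_non_t = line.gsub(/tt(?=[^t])/, ';')

puts upper_only  # 4909;ZSW;Prince Rupert;Canada;CA;North Americatt54.3333tt-130.283
puts any_non_t   # 4909;ZSW;Prince Rupert;Canada;CA;North America;54.3333;-130.283
```

After the rewrite, ; can be passed as :col_sep in the options from the question.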

Poorly designed CSV file breaks values
