FasterCSV: trying to find unique items

Date: 2011-12-06 03:20:17

Tags: ruby file-io unique fastercsv

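I have a CSV file. For all rows that share the same value in column 1, I am trying to find the unique values in the columns after it and merge them into a single row of a new CSV file. I know that sounds confusing, so here is an example:

Sample of the original file, foo.csv: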

"Boom Lifts","Model Number","Manufacturer","Platform Height","Horizontal Outreach","Lift Capacity"
"Boom Lifts","Model Number","Platform Height","Horizontal Outreach","Up & Over Height","Platform Capacity"
"Boom Lifts","Model Number","Platform Height","Horizontal Outreach","Up & Over Height"
"Pusharound Lifts","Model Number","Manufacturer","Platform Height","Stowed Height"
"Scissor Lifts","Model Number","Manufacturer","Platform Height","Stowed Height","Overall Dimensions","Platform Extension"
"Scissor Lifts","Overall Dimensions","Platform Size","Platform Extension","Lift Capacity"

Desired result, bar.csv:

"Boom Lifts","Model Number","Manufacturer","Platform Height","Horizontal Outreach","Lift Capacity","Up & Over Height","Platform Capacity",,,
"Pusharound Lifts","Model Number","Manufacturer","Platform Height","Stowed Height"
"Scissor Lifts","Model Number","Manufacturer","Platform Height","Stowed Height","Overall Dimensions","Platform Size","Platform Extension","Lift Capacity"

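The rows all have different lengths, and it is a fairly large file (over 5k lines). I have no idea how to go about the matching/string manipulation. And yes, some of the rows have trailing commas for "empty cells". I have been using FasterCSV, so if there is a way to do this with it, that would be great.

Any pointers? Preferably something that won't bring my MBP to a grinding halt?

1 answer:

Answer 0 (score: 1)

Assuming you can use FasterCSV to read the file into a two-dimensional array: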

a = [
  ["Boom Lifts","Model Number","Manufacturer","Platform Height","Horizontal Outreach","Lift Capacity"],
  ["Boom Lifts","Model Number","Platform Height","Horizontal Outreach","Up & Over Height","Platform Capacity"],
  ["Boom Lifts","Model Number","Platform Height","Horizontal Outreach","Up & Over Height"],
  ["Pusharound Lifts","Model Number","Manufacturer","Platform Height","Stowed Height"],
  ["Scissor Lifts","Model Number","Manufacturer","Platform Height","Stowed Height","Overall Dimensions","Platform Extension"],
  ["Scissor Lifts","Overall Dimensions","Platform Size","Platform Extension","Lift Capacity"]
]

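# group_by returns a hash of first-column value => rows; each e is a [key, rows] pair,
# so flattening it and calling uniq keeps the key first, followed by each distinct cell once.
a.group_by {|e| e[0]}.map {|e| e.flatten.uniq}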

which gives you:

[
  ["Boom Lifts", "Model Number", "Manufacturer", "Platform Height", "Horizontal Outreach", "Lift Capacity", "Up & Over Height", "Platform Capacity"],
  ["Pusharound Lifts", "Model Number", "Manufacturer", "Platform Height", "Stowed Height"],
  ["Scissor Lifts", "Model Number", "Manufacturer", "Platform Height", "Stowed Height", "Overall Dimensions", "Platform Extension", "Platform Size", "Lift Capacity"]
]

It won't be instantaneous, but it shouldn't bring your MBP down.
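
For completeness, here is a rough end-to-end sketch of the same idea that reads foo.csv and writes bar.csv. It is only an illustration under a few assumptions: it uses Ruby's standard `csv` library (which FasterCSV became in Ruby 1.9) rather than the `fastercsv` gem, the helper names `merge_rows_by_first_column` and `merge_csv` are made up for this example, and it assumes that the empty cells left by trailing commas should simply be dropped.

```ruby
require "csv"

# Hypothetical helper (not part of any library): merges rows that share a
# first-column value, keeping each distinct non-empty cell once, in the
# order it was first seen.
def merge_rows_by_first_column(rows)
  rows
    .reject { |row| row[0].nil? }          # skip blank lines and keyless rows
    .group_by { |row| row[0] }             # first-column value => its rows
    .map do |key, group|
      cells = group.flat_map { |row| row.drop(1) }
      # Assumption: cells left empty by trailing commas are not wanted.
      cells = cells.reject { |cell| cell.nil? || cell.strip.empty? }
      [key, *cells.uniq]
    end
end

# Hypothetical helper: reads the whole input file into memory, merges it,
# and writes the result with every field quoted, like the sample files.
def merge_csv(input_path, output_path)
  rows = CSV.read(input_path)
  CSV.open(output_path, "w", force_quotes: true) do |out|
    merge_rows_by_first_column(rows).each { |row| out << row }
  end
end
```

Calling `merge_csv("foo.csv", "bar.csv")` would then produce the merged file. Reading everything into memory keeps the code short, and at roughly 5k rows that should be a small amount of data.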