Question

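I have two large CSV files. One is simply a list of records. The other is also a list of records, but its first column is the line number of the record it modifies in the first file. It does not replace the whole row; it only replaces the values in the columns whose headers match.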

For example:

File 1:

"First","Last","Lang"
"John","Doe","Ruby"
"Jane","Doe","Perl"
"Dane","Joe","Lisp"

File 2:

"Seq","Lang"
2,"Ruby"

The goal is to end up with a file that looks like this:

"First","Last","Lang"
"John","Doe","Ruby"
"Jane","Doe","Ruby"
"Dane","Joe","Lisp"

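However, the real data is much more complex than this and may even contain newlines inside CSV fields. So I can't rely on line numbers; I have to rely on record counts instead. (Unless, of course, I preprocess both files to replace the newlines and carriage returns, which I think is possible but less interesting.)

The problem I'm having is how to iterate through both files and make the correct replacements without loading the whole files into memory. I believe loading 100 MB+ files into memory is a bad idea, right?

Also, the records in the resulting file should be in the same order as before.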

Answer 1

If the files are too large to load into memory, this is basically how I would handle it:

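// pseudocode (assumes file2 is sorted by seq, one patch per record)

f1 = fopen(file1)
f2 = fopen(file2)
f3 = fopen(newfile) // opened for writing

// loop through exceptions
foreach row2, index2 of f2

  // copy file1 up to and including the record this exception targets
  do
    row1, index1 = next record of f1
    if row1 is null then break

    // patch
    if row2[seq] == index1
      row1[lang] = row2[lang]
    endif

    // write out to new file
    f3.write row1

  while index1 < row2[seq]
endforeach

// copy the records that follow the last exception
while (row1 = next record of f1) && (row1 not null)
  f3.write row1
endwhile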

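†Since your file2 uses 1-based indexes (rather than 0-based), you need to start the index1 and index2 counters at 1.

††If lang is not always the column you will be replacing: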

// at the beginning of the foreach loop
if col is null
  cols = array_keys row2
  col = cols[2] // 1-based index
end

// the new patch block
if row2[seq] == index1
  row1[col] = row2[col]
endif
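
The pseudocode above, together with the two footnotes, could be sketched in Ruby roughly as follows. This is only a minimal sketch under stated assumptions: apply_patches and next_patch are hypothetical names, both files are assumed to have a header row, and file2 is assumed to be sorted by its first column. It counts parsed records rather than physical lines, so fields containing newlines do not shift the numbering.

```ruby
require 'csv'

# Hypothetical helper: returns the next patch row, or nil once the
# patch file is exhausted (Enumerator#next raises StopIteration then).
def next_patch(enum)
  enum.next
rescue StopIteration
  nil
end

# Hypothetical helper: streams base_path to out_path one record at a time,
# applying the patches from patch_path. Assumes the first column of the
# patch file is a 1-based record number and that the file is sorted by it.
def apply_patches(base_path, patch_path, out_path)
  CSV.open(patch_path, headers: true) do |patch_csv|
    patches = patch_csv.each
    patch = next_patch(patches)

    CSV.open(out_path, 'w') do |out|
      index = 0 # record counter, not a physical line counter
      CSV.foreach(base_path, headers: true) do |row|
        index += 1
        # Header row is written with the first record (an assumption:
        # a base file with no records produces an empty output file).
        out << row.headers if index == 1

        # Apply every patch aimed at this record (usually zero or one).
        while patch && patch[0].to_i == index
          # Copy each patch column except the first into the matching header.
          patch.headers.drop(1).each { |col| row[col] = patch[col] }
          patch = next_patch(patches)
        end

        out << row
      end
    end
  end
end
```

With the file names from the question, the call would be apply_patches('file1.csv', 'file2.csv', 'output.csv'). The output quotes fields only where necessary; passing force_quotes: true when opening the output should quote every field as in the sample data.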

Answer 2

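You need two enumerators, but since they aren't nested, one of them has to be advanced with Enumerator#next, which means you need to watch for the StopIteration exception it raises once the file is exhausted: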

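require 'csv'

e = CSV.open('file2.csv', :headers => true).each
seq = e.next

output = CSV.open('output.csv', 'w')

csv = CSV.open('file1.csv')
csv.each do |row|
  # file1 is read without :headers, so its header row is line 1
  # and the first record has csv.lineno - 1 == 1
  if seq['Seq'].to_i == csv.lineno - 1
    row[2] = seq['Lang'] # Lang is the third column
    # no patches left: fall back to a Seq that never matches
    seq = e.next rescue ({'Seq' => -1})
  end
  output << row
end

output.close # flush the remaining rows to disk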
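
To make the end-of-file behaviour of Enumerator#next concrete, here is a small sketch that uses inline sample data in place of file2.csv (the data and variable names are assumptions for illustration only). StopIteration is a subclass of StandardError, which is why the rescue modifier in the code above catches it.

```ruby
require 'csv'

# Sample patch data inlined for illustration; normally this would be file2.csv.
data = %("Seq","Lang"\n2,"Ruby"\n)
patches = CSV.new(data, headers: true).each

first = patches.next   # the only patch row (Seq 2)

begin
  patches.next         # nothing left to read
  exhausted = false
rescue StopIteration
  exhausted = true     # reached once the patch data runs out
end
```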

Using Ruby to replace specific records in a CSV file with records from another CSV file