CSV: extracting data in Ruby and writing it to another CSV

Time: 2018-10-25 18:24:49

Tags: ruby-on-rails csv export-to-csv

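I have a large file named "texas_boundaries.csv" made up of latitude/longitude pairs that describe hundreds of school attendance boundaries. It is an 800+ MB file, too large to upload to Heroku. I only need the boundaries for certain schools, so I am trying to find just the rows I need and write them to a new file with the following code: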

desc "Reduce texas csv to only needed schools"
task :reduce_texas => :environment do

  require 'csv'

  file = "texas_boundaries.csv"
  headers = CSV.open(file, &:readline)
  nces_ids = School.pluck(:nces_id).uniq
  nces_ids_track = nces_ids
  file_name = 'texas_reduced_boundaries.csv'

  CSV.open(file_name, 'a') do |csv|
    csv << headers
  end

  CSV.foreach(file, :headers => true, encoding: "UTF-8") do |row|
    if nces_ids.include?(row['ncessch'])
      CSV.open(file_name, 'a') do |csv|
        csv << row
        p row['ncessch']
        nces_ids_track.delete(row['ncessch'])
      end
    end
  end

  p "Nces_ids not in reduced boundaries file: #{nces_ids_track.count}"
  p nces_ids_track

end

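Each school has dozens of points describing its boundary, but when I run this code only one point per school is recorded in the new file. The console output confirms this: I would expect the same nces_id to appear many times before it changes to a new nces_id.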

tomb$ rake reduce_texas
"480000801507"
"480000801508"
"480000806094"
"480000806989"
"480000811280"
"480000905246"

Here is a screenshot of the actual data file, showing many rows with nces_id = 480000801507.

texas_boundaries.csv

Only the first of those rows is recorded in the new file.

texas_reduced_boundaries.csv

Any help would be greatly appreciated! As a side note, this process is very slow, so if any reader sees a way to speed it up, please let me know.

1 answer:

Answer 0: (score: 3)

This looks suspicious:

nces_ids = School.pluck(:nces_id).uniq
nces_ids_track = nces_ids

Assignment does not copy the nces_ids array; it only copies the reference. As a result, nces_ids and nces_ids_track refer to the same array. Later you do this:
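A minimal sketch of that aliasing, using two hypothetical IDs in place of the result of School.pluck(:nces_id).uniq so that it runs without a database:

```ruby
# Hypothetical IDs standing in for School.pluck(:nces_id).uniq
nces_ids = ["480000801507", "480000801508"]
nces_ids_track = nces_ids   # copies the reference, not the array

# Deleting through one name changes the array seen through the other.
nces_ids_track.delete("480000801507")

same_object  = nces_ids.equal?(nces_ids_track)      # identity check on the two names
still_listed = nces_ids.include?("480000801507")    # lookup through the "other" array

p same_object
p still_listed
```

Here equal? compares object identity, so it reports whether both names point at one array rather than at two arrays with equal contents.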

if nces_ids.include?(row['ncessch'])
  CSV.open(file_name, 'a') do |csv|
    #...
    nces_ids_track.delete(row['ncessch'])
  end
end

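But nces_ids and nces_ids_track refer to the same array, not the two different arrays you expected. Deleting an ID from nces_ids_track therefore also removes it from nces_ids, so nces_ids.include? returns false for every later row with that ID, and only the first row for each school gets written.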

Perhaps you meant to say:

nces_ids = School.pluck(:nces_id).uniq
nces_ids_track = nces_ids.dup
# -----------------------^^^^

That way you are working with two separate copies of the array.
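On the speed question, one possible sketch (not tested against the real 800 MB file) is to open the output file once instead of once per matching row, and to keep the IDs in a Set so each lookup does not scan an array. The method name reduce_boundaries and its parameters are made up for illustration, and the IDs are assumed to compare as strings, since CSV fields are read as strings. Unlike the original task, this version opens the output in 'w' mode, so a rerun overwrites the file rather than appending to it.

```ruby
require 'csv'
require 'set'

# Hypothetical helper: copies only the rows whose id_column value is in
# wanted_ids from source to target, and returns the IDs that never appeared.
def reduce_boundaries(source, target, wanted_ids, id_column: 'ncessch')
  wanted  = wanted_ids.map(&:to_s).to_set   # Set gives fast membership checks
  missing = wanted.dup                      # independent copy used for tracking

  headers = CSV.open(source, &:readline)    # read the header row, as in the question

  CSV.open(target, 'w') do |out|            # open the output file a single time
    out << headers
    CSV.foreach(source, headers: true, encoding: 'UTF-8') do |row|
      id = row[id_column]
      next unless wanted.include?(id)

      out << row                            # every matching row is kept
      missing.delete(id)                    # only the tracking copy shrinks
    end
  end

  missing.to_a
end
```

Inside the rake task this could be called as reduce_boundaries("texas_boundaries.csv", "texas_reduced_boundaries.csv", School.pluck(:nces_id).uniq), printing the returned array to see which IDs were not found.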