Question

I'm having a hard time getting my head around this.

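I have two CSVs that make up a fiber network: one for latitude and one for longitude. They were extracted from a KMZ file, and because the KMZ was poorly constructed, each CSV contains 170k rows.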

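I have a CSV of potential customers that I want to compare against the fiber network. If the minimum distance (calculated with the Haversine formula) is less than 5280 feet, the customer is written to an output CSV file.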

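In the past I've had success when I didn't have that many lat/long pairs: it used to be 20k, but now we have 170k. As you can imagine, the output CSV file gets huge: 3 million rows and counting.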

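What I do next is run a check (usually with the MySQL MIN() function, though I'm sure there is a better way) that returns the minimum distance for each address, grouped by address, because you really only care about the minimum distance per address. You don't want multiple rows per address.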

require 'csv'
require 'haversine'

# These could be combined into one file; works as is.
fib_lat = CSV.read("swfl_fiber_lat.csv")
fib_long = CSV.read("swfl_fiber_long.csv")

# Use zip to pair each latitude with its longitude.
fib_coords = fib_lat.map(&:last).zip(fib_long.map(&:last))

# Multi-column CSV with customer data, headers turned on.
customers = CSV.read("swfl_1a_geocoded.csv", headers: true)

CSV.open('swfl-output-data-within-1mile.csv', 'w', write_headers: true, headers: ['First Name','Last Name','Latitude','Longitude','Feet to Fiber','Address','City','State','Zip','County','Company','Title Code Description','PrimarySIC6 Description','Business Status Code Description','Phone Number','Tollfree Phonenumber','EmployeeSize Location Description','Sales Volume Location Decode','Telecommunications Expense','Email Address']) do |csv_object|
  fib_coords.each do |fib_lat, fib_long|
    customers.each do |cust|
      # Compute the distance once per customer/fiber pair instead of twice.
      feet = Haversine.distance(cust[2].to_f, cust[3].to_f, fib_lat.to_f, fib_long.to_f).to_feet
      next unless feet < 5280

      # One array element per column, so each value lands in its own CSV field
      # (a single interpolated string would be written as one quoted field).
      csv_object << [cust[0], cust[1], cust[2].to_f, cust[3].to_f, feet.round(2), *(5..18).map { |i| cust[i] }]
    end
  end
end

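I'm trying to think of a way to de-duplicate customers (perhaps with .uniq or arr#min) and keep only the minimum distance for each customer, without first pushing every match to the output CSV. Then, only if that distance really is under 5,280 feet and it belongs to the relevant customer, put just that row into the output CSV array.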

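In pseudocode terms: if the distance is the minimum for each customer, make sure the customer value is unique, then push it to the output CSV. I'm just not 100% sure how to implement that in my series of loops.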

Any and all insight is appreciated.

Answer 1

First, where is your performance problem? I assume it's not in computing fib_coords but in looping over customers. I would make a few changes:

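1) Rather than reading the entire customers CSV file into memory at once, I would iterate over it with the CSV.foreach method. Loading the whole CSV file probably uses a fair amount of memory that could be better spent on the fib_coords array. This means reversing the order of the customers and fib_coords loops, so that customers becomes the outer loop.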

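2) Second, you can avoid searching the entire fib_coords array. If you sort it by its first column so that it is in latitude order, you can compute the minimum possible latitude (customer.latitude - 5280ft), use bsearch on fib_coords to find the first possible match, which is much faster than a linear search, and then loop through fib_coords from there until the latitude in fib_coords is out of range (> customer.latitude + 5280ft).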
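The two suggestions above could be combined roughly as follows. This is only a sketch under a few assumptions, not a drop-in replacement: haversine_feet is a plain-Ruby stand-in for the haversine gem (so the example runs on its own), nearest_fiber_feet and write_customers_near_fiber are hypothetical helper names, the Earth radius is the usual mean value of about 6371 km, and column positions (latitude in column 2, longitude in column 3, column 4 replaced by the distance) follow the question's code. Because only the smallest distance per customer is kept, each customer produces at most one output row.

```ruby
require 'csv'

EARTH_RADIUS_FEET = 20_902_231.0 # assumed mean Earth radius (about 6371 km) in feet
FEET_PER_DEGREE_LAT = EARTH_RADIUS_FEET * Math::PI / 180

# Plain-Ruby haversine distance in feet (stand-in for the haversine gem).
def haversine_feet(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_FEET * Math.asin(Math.sqrt(a))
end

# sorted_coords: array of [lat, long] floats sorted by latitude, ascending.
# Returns the smallest distance in feet that is under max_feet, or nil if none.
def nearest_fiber_feet(lat, lon, sorted_coords, max_feet = 5280)
  band = max_feet / FEET_PER_DEGREE_LAT # latitude window in degrees
  # Binary search for the first fiber point that could be in range.
  start = sorted_coords.bsearch_index { |coord| coord[0] >= lat - band }
  return nil if start.nil?

  best = nil
  (start...sorted_coords.size).each do |i|
    f_lat, f_lon = sorted_coords[i]
    break if f_lat > lat + band # every later point is too far north

    d = haversine_feet(lat, lon, f_lat, f_lon)
    best = d if d < max_feet && (best.nil? || d < best)
  end
  best
end

# Streams the customers file row by row and writes one row per customer
# that has fiber within range, with column 4 set to the distance in feet.
def write_customers_near_fiber(customers_path, sorted_coords, out_path)
  CSV.open(out_path, 'w') do |out|
    CSV.foreach(customers_path, headers: true) do |cust|
      feet = nearest_fiber_feet(cust[2].to_f, cust[3].to_f, sorted_coords)
      next if feet.nil?

      row = cust.fields
      row[4] = feet.round(2)
      out << row
    end
  end
end
```

With the question's files, fib_coords would be built once as floats and sorted before the loop, for example `fib_coords.map { |la, lo| [la.to_f, lo.to_f] }.sort_by(&:first)`, and then passed to write_customers_near_fiber along with "swfl_1a_geocoded.csv". The latitude window only narrows the candidates; the haversine check still decides whether a point is actually within 5280 feet.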

Ruby: check whether an array has unique values and return the minimum distance (Haversine formula)
