Python: Matching values between two csv files

Time: 2016-11-30 22:09:56

Tags: python python-2.7 csv

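I'm parsing two different csv files and need to match a column between them. Currently, when I run the snippet, it doesn't return a matching value even when the same address exists in both csv files. The problem I'm running into is that the address field in the OnlineData csv file is abbreviated. For example: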

In the Addresses csv                             In the OnlineData csv
  4587 Newton Road                                    4587 Newton Rd
  7854 Food Court                                     7854 Food Ct

How can I tell Python to look only at the number ('4587') and the first word ('Newton') in both csv files when looking for matching values?

import csv


Addresses = set()

with open ('Addresses.csv') as f:
    for row in csv.reader(f):
        Addresses.add(row[1])

OnlineData = set()

with open ('C:/Users/OnlineData.csv') as g:
    for row in csv.reader(g):
        OnlineData.add(row[1])


results = Addresses & OnlineData


print 'There are', len(results), 'matching addresses between the two csv files'

for result in sorted(results):
    print result

1 Answer:

Answer 0: (score: 1)

Since you're only interested in matching part of the data, you can load just that part into a set and then take the intersection.

import csv

Addresses = set()
with open ('Addresses.csv') as f:
    for row in csv.reader(f):
        portion = ' '.join(row[1].split()[:-1])  # Loads "4587 Newton" instead of "4587 Newton Road"
        Addresses.add(portion)

OnlineData = set()
with open ('C:/Users/OnlineData.csv') as g:
    for row in csv.reader(g):
        portion = ' '.join(row[1].split()[:-1])
        OnlineData.add(portion)

results = Addresses & OnlineData

print 'There are', len(results), 'matching addresses between the two csv files'

for result in sorted(results):
    print result
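
If you want the key to be exactly the house number plus the first word, as the question asks, and you still want the full address from each file, one possible variation is to index the full addresses by that key. This is only a sketch: `match_key` and `index_by_key` are hypothetical helper names, the sample lists stand in for `row[1]` of each file, and lowercasing the key is an assumption that case should not matter.

```python
def match_key(address):
    """Return the house number plus the first word of the street name."""
    parts = address.split()
    # Assumption: case differences should not prevent a match.
    return ' '.join(parts[:2]).lower()


def index_by_key(addresses):
    """Map each match key to the list of full addresses that produce it."""
    index = {}
    for address in addresses:
        index.setdefault(match_key(address), []).append(address)
    return index


# Sample values standing in for column 1 of the two csv files.
addresses = index_by_key(['4587 Newton Road', '7854 Food Court', '12 Elm Street'])
online_data = index_by_key(['4587 Newton Rd', '7854 Food Ct', '99 Oak Ave'])

# Keys present in both files, with the original spelling from each side.
matches = dict((key, (addresses[key], online_data[key]))
               for key in set(addresses) & set(online_data))
```

With real data you would pass the values read by `csv.reader` instead of the sample lists. Note that a key this short can also pair up different streets, such as "4587 Newton Road" and "4587 Newton Avenue".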

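The obvious downside is that you lose some of the information, although you could still retrieve it. Another option is to normalize the input, meaning you replace Rd with Road and Ct with Court so that the values always match.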
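
A minimal sketch of the normalization approach might look like this. The `ABBREVIATIONS` table and the `normalize` helper are hypothetical; the table only covers the two abbreviations from the question, and the sketch assumes the abbreviation is always the last word of the address.

```python
# Hypothetical abbreviation table; extend it to cover the suffixes in your data.
ABBREVIATIONS = {
    'rd': 'Road',
    'ct': 'Court',
}


def normalize(address):
    """Expand a trailing abbreviation such as "Rd" or "Ct" to its full form."""
    parts = address.split()
    if not parts:
        return address
    # Ignore case and a trailing period when looking up the last word.
    last = parts[-1].rstrip('.').lower()
    if last in ABBREVIATIONS:
        parts[-1] = ABBREVIATIONS[last]
    return ' '.join(parts)
```

In both loops you would then call `Addresses.add(normalize(row[1]))` and `OnlineData.add(normalize(row[1]))`, so the intersection compares full addresses and nothing is discarded.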