I have two csv files with the following structure:
file1.csv:
66054,14.7065,42.1115
66054,14.7085,42.106
66054,14.7268,42.0937
66054,14.6739,42.125
66054,14.7268,42.0937
66100,14.116,42.3301
66100,14.1405,42.3392
88067,16.431,38.7287
88068,16.5339,38.6899
88068,16.5499,38.685
88068,16.5419,38.6875
87076,16.4795,39.7905
87076,16.4743,39.8161
87100,16.2531,39.2989
87100,16.2944,39.2674
87100,16.3039,39.2709
87052,16.43,39.3449
87053,16.3399,39.3101
87054,16.3171,39.1784
file2.csv:
ABC,66100
"CDF",65125
"123",65125
1234,64100
0123,75025
lmn,85025
abc,88046
"Random",88068
"Raond2",87100
"Raondm3",87100
Raondom4,87054
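What I want to do is this: for each row2[1] in file2.csv, find its first match in row1[0] of file1.csv, then take row1[1] and row1[2] from that matching row and write them, together with row2[0] and row2[1], as one row of another csv file. This is the code I wrote for it: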
import csv

updated_list = []
with open("file1.csv", "r") as in_file1, open("file2.csv", "r") as in_file2, open("file3.csv", "w", newline='') as out_file:
    reader1 = csv.reader(in_file1)
    reader2 = csv.reader(in_file2)
    writer_final = csv.writer(out_file)
    for row2 in reader2:  # reader2 is for file2
        for row1 in reader1:  # reader1 is for file1
            if str(row2[1].strip()) == str(row1[0].strip()):
                print("Found match for {}".format(row2[1]))
                updated_list.append([row2[0], row2[1], row1[1], row1[2]])
                break
            else:
                continue
    writer_final.writerows(updated_list)
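The code above matches some of the row2[1] values from file2.csv, but many of them are never matched against row1[0] in file1.csv even though the value exists there. For example, in the sample data above, the code fails to match 87100 and 87054 from file2.csv, although file1.csv contains both values. I thought the strings might contain some extra whitespace, so I also used strip(), but it still does not work. Why is no match found?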
Answer 0 (score: 0)
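After running the code with print statements in a few places, I observed that the inner loop does real work only at the very start of file2.csv, beginning with "ABC,66100". For the remaining rows it is simply skipped.
That is because csv.reader returns a reader object that is an iterator.
An iterator can be consumed only once: the search for the first row stops at the break, the next search continues from that position instead of from the top, and as soon as one search runs to the end of file1.csv without a match, the reader is empty for every later row of file2.csv.
The fix is to store the rows of the reader in a list, so that they can be iterated over repeatedly.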
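The exhaustion described above can be reproduced in isolation. This is a minimal sketch that uses io.StringIO as an in-memory stand-in for file1.csv; the three sample rows and the names first_pass and second_pass are only illustrative.

```python
import csv
import io

# In-memory stand-in for file1.csv (illustrative sample, three rows).
in_file1 = io.StringIO(
    "66054,14.7065,42.1115\n"
    "66100,14.116,42.3301\n"
    "87100,16.2531,39.2989\n"
)
reader1 = csv.reader(in_file1)

first_pass = [row[0] for row in reader1]   # consumes every row of the reader
second_pass = [row[0] for row in reader1]  # the reader has nothing left to yield

print(first_pass)   # ['66054', '66100', '87100']
print(second_pass)  # []
```

The second loop does not raise an error; it just runs zero times, which is why the original code fails silently.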
Changing the line
reader1 = csv.reader(in_file1)
to
reader1 = list(csv.reader(in_file1))
should give you the desired result.
import csv

updated_list = []
with open("file1.csv", "r") as in_file1, open("file2.csv", "r") as in_file2, open("file3.csv", "w", newline='') as out_file:
    reader2 = csv.reader(in_file2)
    reader1 = list(csv.reader(in_file1))  # a list can be iterated repeatedly
    writer_final = csv.writer(out_file)
    for row2 in reader2:  # reader2 is for file2
        for row1 in reader1:  # reader1 is for file1
            if str(row2[1].strip()) == str(row1[0].strip()):
                updated_list.append([row2[0], row2[1], row1[1], row1[2]])
                break
    writer_final.writerows(updated_list)
cat file3.csv
ABC,66100,14.116,42.3301
Random,88068,16.5339,38.6899
Raond2,87100,16.2531,39.2989
Raondm3,87100,16.2531,39.2989
Raondom4,87054,16.3171,39.1784
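Note
If the files are large, converting the reader to a list can hurt, because every row of file1.csv is then held in memory at once. An alternative worth considering is to load the data into a pandas DataFrame and do the matching there (or to use openpyxl if the data actually lives in an Excel workbook).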
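Without either library, memory use can also be reduced with the standard library alone by keeping only the first row per key instead of every row. The sketch below is one possible variant, not part of the original answer: the function name join_first_match is hypothetical, and io.StringIO objects stand in for the two files. It still holds one entry per distinct key of file1.csv, so it helps mainly when keys repeat often.

```python
import csv
import io

def join_first_match(lines1, lines2):
    """Return [row2[0], row2[1], row1[1], row1[2]] for the first file1 match of each file2 row."""
    first_seen = {}
    for row1 in csv.reader(lines1):
        if not row1:
            continue  # skip blank lines
        # setdefault keeps the value stored for the FIRST occurrence of a key
        first_seen.setdefault(row1[0].strip(), row1[1:3])
    result = []
    for row2 in csv.reader(lines2):
        if len(row2) < 2:
            continue  # skip blank or malformed lines
        key = row2[1].strip()
        if key in first_seen:
            result.append([row2[0], row2[1]] + first_seen[key])
    return result

# Illustrative excerpts of the sample data from the question.
file1 = io.StringIO(
    "66100,14.116,42.3301\n66100,14.1405,42.3392\n"
    "87100,16.2531,39.2989\n87100,16.2944,39.2674\n"
)
file2 = io.StringIO('ABC,66100\n"CDF",65125\n"Raond2",87100\n"Raondm3",87100\n')

matched = join_first_match(file1, file2)
for row in matched:
    print(row)
```

Each file is read exactly once, and the lookup per file2 row is a dictionary access rather than a scan over all of file1.csv.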
Answer 1 (score: 0)
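As mentioned in my comment: file objects are streams. Once you have read past a certain point, you do not get to see it again, so you need to hold one of the files in memory in order to compare all of its lines with the lines of the other file.
This code reads the smaller file into memory and processes the larger file line by line.
The first line of the larger file that matches a key supplies the requested data for all lines of the smaller file with that key. Those lines are then removed from memory, so they cannot match any later line of the larger file.
Create the files: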
with open("f1.txt", "w") as f:
    f.write("""66054,14.7065,42.1115
66054,14.7085,42.106
66054,14.7268,42.0937
66054,14.6739,42.125
66054,14.7268,42.0937
66100,14.116,42.3301
66100,14.1405,42.3392
88067,16.431,38.7287
88068,16.5339,38.6899
88068,16.5499,38.685
88068,16.5419,38.6875
87076,16.4795,39.7905
87076,16.4743,39.8161
87100,16.2531,39.2989
87100,16.2944,39.2674
87100,16.3039,39.2709
87052,16.43,39.3449
87053,16.3399,39.3101
87054,16.3171,39.1784""")
with open("f2.txt", "w") as f:
    f.write("""ABC,66100
"CDF",65125
"123",65125
1234,64100
0123,75025
lmn,85025
abc,88046
"Random",88068
"Raond2",87100
"Raondm3",87100
Raondom4,87054""")
Program:
import csv

d2 = {}
# smaller file: load in memory
with open("f2.txt") as f:
    cr = csv.reader(f)
    for row in cr:
        # store under same key as list of rows to keep same order and
        # allow multiple rows with same row[1] value
        k = d2.setdefault(row[1], [])
        k.append(row)

# process larger file
with open("f1.txt") as f, open("f3.txt", "w", newline="") as nf:
    cr = csv.reader(f)
    writer = csv.writer(nf)
    for row in cr:
        if d2.get(row[0], []):
            for sl in d2.get(row[0]):
                writer.writerow(sl + [row[1], row[2]])
            # remove from d2 so no reappearing rows will be written
            del d2[row[0]]

with open("f3.txt") as f:
    print(f.read())
Output:
ABC,66100,14.116,42.3301
Random,88068,16.5339,38.6899
Raond2,87100,16.2531,39.2989
Raondm3,87100,16.2531,39.2989
Raondom4,87054,16.3171,39.1784
Only the rows of file2 whose key has an exact match in file1 are written.
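Because the dictionary lookup is an exact string comparison, a key such as " 87100 " would not match "87100". The sample data shows no such whitespace, but if real data has it, the fields could be normalized while reading. This is a small sketch under that assumption; the helper name normalized_rows and the sample lines are hypothetical.

```python
import csv
import io

def normalized_rows(lines):
    """Yield csv rows with surrounding whitespace stripped from every field."""
    for row in csv.reader(lines):
        yield [field.strip() for field in row]

# Hypothetical input with stray spaces around the key of the first row.
sample = io.StringIO('abc, 87100 \n"Random",88068\n')
rows = list(normalized_rows(sample))
print(rows)  # [['abc', '87100'], ['Random', '88068']]
```

Iterating over normalized_rows(f) instead of csv.reader(f) for both files would make the keys comparable without changing the rest of the program.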