Question

I have a very large CSV file with more than 50K entries, and it keeps growing. The file is structured like this:

    ID;name;battery;... 
    101;a;3.3;...
    102;b;3.3;...
    103;c;3.2;...

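I know how to read a CSV file in Python, but I would like to know the best way to check whether an entry is already in the file, so that I don't write the same row again.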

What I'm currently doing is something like this:

if new_id in open('log.csv').read():

Any help or suggestions would be much appreciated.

Edit: I want to filter by ID.

Answer 1

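A good way to avoid duplicates is to use a data structure that is optimized for lookups. In Python, for example, you can use set(). A set is backed by a hash table, so a membership test takes O(1) time on average, whereas searching the whole file contents for each new ID takes O(n). The approach is as follows: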

Read the existing IDs from the file into a set():

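# 'a+' opens the file for reading and appending ('rw' is not a valid mode)
file = open('log.csv', 'a+')
file.seek(0)  # 'a+' starts at the end of the file, so rewind before reading
# keep only the IDs (first ';'-separated field), skipping the header row
entries = set(line.split(';')[0] for line in file.readlines()[1:])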

Check each new row before inserting it:

# new_entry is the new line to insert
new_id = new_entry.split(';')[0]  # extract the ID (the file is ';'-delimited)
if new_id not in entries:
    file.write(new_entry)  # append '\n' here if new_entry does not end with one
    entries.add(new_id)  # keep the set of known IDs up to date
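
For completeness, here is a minimal self-contained sketch of the same idea built on the standard csv module. It assumes the file is ';'-delimited, has a header row, and stores the ID in the first column. The helper names load_ids and append_if_new are hypothetical, chosen only for this illustration:

```python
import csv
import os


def load_ids(path, delimiter=";"):
    """Return the set of IDs (first column) already stored in the file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        next(reader, None)  # skip the header row (assumed to be present)
        return {row[0] for row in reader if row}


def append_if_new(path, row, known_ids, delimiter=";"):
    """Append row to the file unless its ID is already in known_ids.

    Returns True if the row was written, False if it was a duplicate.
    """
    new_id = str(row[0])
    if new_id in known_ids:
        return False
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter=delimiter, lineterminator="\n")
        writer.writerow(row)
    known_ids.add(new_id)  # keep the in-memory set in sync with the file
    return True
```

A typical call would be known = load_ids('log.csv') once at startup, followed by append_if_new('log.csv', ['104', 'd', '3.1'], known) for every incoming row. The file is read only once, and each later check is a set lookup rather than a scan of the whole file.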
