Question

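How can I remove duplicate rows from a csv file based on two columns, where one of the columns uses a regular expression to determine a match, grouping by the first field (IPAddress)? Finally, I want to add a Count field to each row that counts the duplicate rows:

The csv file: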

IPAddress, Value1, Value2, Value3
127.0.0.1, Test1ABC, 10, 20
127.0.0.1, Test2ABC, 20, 30
127.0.0.1, Test1ABA, 30, 40
127.0.0.1, Value1BBA, 40, 50
127.0.0.1, Value1BBA, 40, 50
127.0.0.2, Test1ABC, 10, 20
127.0.0.2, Value1AAB, 20, 30
127.0.0.2, Value2ABA, 30, 40
127.0.0.2, Value1BBA, 40, 50

I want to match on IPAddress and Value1 (Value1 matches if the first 5 characters match).

That would give me:

IPAddress, Value1, Value2, Value3, Count
127.0.0.1, Test1ABC, 10, 20, 2
127.0.0.1, Test2ABC, 20, 30, 1
**127.0.0.1, Test1ABA, 30, 40** (Line would be removed but counted)
127.0.0.1, Value1BBA, 40, 50, 2
**127.0.0.1, Value1BBA, 40, 50** (Line would be removed but counted)
127.0.0.2, Test1ABC, 10, 20, 1
127.0.0.2, Value1AAB, 20, 30, 2
127.0.0.2, Value2ABA, 30, 40, 1
**127.0.0.2, Value1BBA, 40, 50** (Line would be removed but counted)

New output:

IPAddress, Value1, Value2, Value3, Count
127.0.0.1, Test1ABC, 10, 20, 2
127.0.0.1, Test2ABC, 20, 30, 1
127.0.0.1, Value1BBA, 40, 50, 2
127.0.0.2, Test1ABC, 10, 20, 1
127.0.0.2, Value1AAB, 20, 30, 2
127.0.0.2, Value2ABA, 30, 40, 1

I tried using a set, but apparently a set can't be indexed.

entries = set()
writer = csv.writer(open('myfilewithoutduplicates.csv', 'w'), delimiter=',')
for row in list:
    key = (row[0], row[1])
    if re.match(r"(Test1)", key[1]) not in entries:
        entries.add(key)

Pseudocode?:

# I want to iterate through rows of a csv file and
if row[0] and row[1][:5] match a previous entry:
    remove row
    add count
else:
    add row

Any help or guidance is greatly appreciated.

Answer 1

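You need a dictionary to track the matches. You don't need a regular expression; you only need to track the first 5 characters. Store each row under a key made up of the first column and the first 5 characters of the second column, and add a count. You need to count first, then write out the collected rows and their counts.

If ordering matters, you can replace the dictionary with collections.OrderedDict() (on Python 3.7 and later a plain dict already preserves insertion order), but the code is otherwise the same: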

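import csv

inputfilename = 'input.csv'  # placeholder name; point this at your own file
rows = {}

# Python 3: open csv files in text mode with newline=''
with open(inputfilename, newline='') as inputfile:
    # skipinitialspace drops the blank after each comma in the sample data,
    # so row[1][:5] is 'Test1' rather than ' Test'
    reader = csv.reader(inputfile, skipinitialspace=True)
    headers = next(reader)  # collect first row as headers for the output
    for row in reader:
        key = (row[0], row[1][:5])
        if key not in rows:
            rows[key] = row + [0]  # keep the first row seen for this key
        rows[key][-1] += 1  # count

with open('myfilewithoutduplicates.csv', 'w', newline='') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(headers + ['Count'])
    writer.writerows(rows.values())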
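
To try the grouping against the sample data from the question without touching any files, here is a minimal in-memory sketch. The names dedupe_with_counts, prefix_key, regex_key and the LABEL pattern are hypothetical helpers introduced only for illustration. The pattern assumes that the part of Value1 that matters is a run of letters followed by digits.

```python
import csv
import io
import re

SAMPLE = """\
IPAddress, Value1, Value2, Value3
127.0.0.1, Test1ABC, 10, 20
127.0.0.1, Test2ABC, 20, 30
127.0.0.1, Test1ABA, 30, 40
127.0.0.1, Value1BBA, 40, 50
127.0.0.1, Value1BBA, 40, 50
127.0.0.2, Test1ABC, 10, 20
127.0.0.2, Value1AAB, 20, 30
127.0.0.2, Value2ABA, 30, 40
127.0.0.2, Value1BBA, 40, 50
"""

# Assumed pattern: letters followed by digits, e.g. "Test1" or "Value2".
LABEL = re.compile(r'[A-Za-z]+\d+')


def prefix_key(row, length=5):
    """Key on the IP address plus a fixed-length prefix of Value1."""
    return (row[0], row[1][:length])


def regex_key(row):
    """Key on the IP address plus the part of Value1 matched by LABEL."""
    match = LABEL.match(row[1])
    # Fall back to the whole value when the pattern does not match.
    return (row[0], match.group(0) if match else row[1])


def dedupe_with_counts(lines, key_func):
    """Keep the first row per key and append how many rows shared that key."""
    reader = csv.reader(lines, skipinitialspace=True)
    headers = next(reader) + ['Count']
    rows = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        key = key_func(row)
        if key not in rows:
            rows[key] = row + [0]
        rows[key][-1] += 1
    return headers, list(rows.values())


for key_func in (prefix_key, regex_key):
    headers, rows = dedupe_with_counts(io.StringIO(SAMPLE), key_func)
    print(key_func.__name__, [row[-1] for row in rows])
```

With a strict 5-character prefix, Value1AAB, Value2ABA and Value1BBA under 127.0.0.2 all share the prefix "Value", so they collapse into one row with a count of 3 (counts 2, 1, 2, 1, 3). The expected output in the question keeps Value1 and Value2 apart, which the regex key reproduces (counts 2, 1, 2, 1, 2, 1). A 6-character prefix would give the same result for this sample.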

Answer 2

You can use numpy:

import numpy as np

# import data from file (assume file called a.csv), store as record array;
# autostrip removes the blank after each comma, encoding gives str fields
a = np.genfromtxt('a.csv', delimiter=',', skip_header=1, dtype=None,
                  autostrip=True, encoding='utf-8')

# join the first column and the first 5 chars of the 2nd column, store in p
p = [x + y[:5] for x, y in zip(a['f0'], a['f1'])]

# compare elements in p, get indexes of the first occurrence of each entry (m)
k, m = np.unique(p, return_index=True)

# use indexes (sorted back into file order) to create new list without dupes
newlist = [a[v] for v in sorted(m)]

# the difference in lengths is the total number of removed rows,
# not a per-row count
count = len(a) - len(newlist)
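
The snippet above yields only the total number of removed rows, while the question asks for a count per kept row. A minimal sketch of one way to get that with np.unique and its return_counts option follows. The keys array is typed in by hand to mirror the sample data (IP address, a "|" separator chosen for illustration, and a 5-character prefix) rather than read from a file.

```python
import numpy as np

# Hand-written keys mirroring the question's sample rows: IP + '|' + prefix.
keys = np.array([
    '127.0.0.1|Test1', '127.0.0.1|Test2', '127.0.0.1|Test1',
    '127.0.0.1|Value', '127.0.0.1|Value',
    '127.0.0.2|Test1', '127.0.0.2|Value', '127.0.0.2|Value',
    '127.0.0.2|Value',
])

# first_index: position of the first row for each key; counts: rows per key.
unique_keys, first_index, counts = np.unique(
    keys, return_index=True, return_counts=True)

# np.unique orders its results by key, so reorder by first occurrence
# to get back to file order.
order = np.argsort(first_index)
first_index = first_index[order]
counts = counts[order]

print(first_index.tolist())
print(counts.tolist())
```

Here first_index holds the rows to keep and counts holds the matching Count column, so the two can be zipped together when writing the output.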
