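I have a .csv file that comes from a network capture. In this file I need to identify repeated messages.

    Sender A,Receiver G,43,Info...
    Sender H,Receiver R,43,Info...
    Sender A,Receiver G,27,Info...
    Sender N,Receiver Z,43,Info...
    Sender A,Receiver G,1367,Info...
    Sender R,Receiver P,43,Info...
    Sender A,Receiver G,43,Info...
    Sender H,Receiver R,111,Info...

The repeated parameter is the identifier (the third column), but a repeated identifier does not necessarily mean the message is repeated. I also need to check the sender and the receiver. My idea was to sort the file by the third column and then loop from top to bottom, comparing the values in those columns.

I have managed to isolate the rows with repeated numbers, but this is where my problems start. First, I can't get the sorting right. Second, I don't know how to read a row and at the same time compare one column (or two, in my case) with the values in the row below. I think the idea would involve a nested if (if row[2] == row[2] of the line below, then check whether row[0] and row[1] == row[0] and row[1] of the line below), but after a long time thinking about it I haven't managed to come up with any decent comparison.

The goal is to print or save the cases where all three conditions are repeated at the same time (basically the first three columns):

    Sender A,Receiver G,43,Info...
    Sender A,Receiver G,43,Info...

Maybe I am making this too complicated and there is a simpler or faster way. Anyway, I am posting my code; I would be grateful if someone could help. Regards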
    entries = []
    duplicated = []
    with open('file.csv', 'rt') as my_file:
        for line in my_file:
            columns = line.strip().split(',')
            if columns[2] not in entries:
                entries.append(columns[2])
            else:
                duplicated.append(columns[2])

    # List with duplicated=null -> no error
    if duplicated == []:
        print "\nNo duplicated\n"
    # Other case, there might be duplicates
    else:
        # Store error cases in New.csv
        with open('New.csv', 'w') as out_file:
            with open('file.csv', 'r') as my_file:
                for line in my_file:
                    columns = line.strip().split(',')
                    # was `duplicate_entries`, a name that is never defined
                    if columns[2] in duplicated:
                        out_file.write(line)

    # TO SORT THE EXCEL FILE. CURRENTLY NOT WORKING PROPERLY
    ## data = csv.reader(open('Other.csv'),delimiter=',')
    ## sortedlist = sorted(data, key=operator.itemgetter(2), reverse=True)
    ## with open('Other.csv', 'w') as out_file:
    ##     for item in sortedlist:
    ##         out_file.write(item)
Answer 0 (score: 2)
There is really no need to sort your file, but your sorting problems are probably caused by sorting strings rather than numbers. Strings are sorted lexicographically, which means that '10' sorts before '2', because 1 comes earlier in the character set than 2 and the 0 never comes into play.

You can track duplicates by storing the rows in a dictionary; this lets you look up previously seen matches. Using collections.defaultdict():

    import csv
    from collections import defaultdict

    seen = defaultdict(list)

    with open('file.csv', 'rb') as my_file:
        reader = csv.reader(my_file)
        for row in reader:
            key = (row[0], row[1], row[2])  # sender, receiver, id
            seen[key].append(row)

    with open('new.csv', 'wb') as outf:
        writer = csv.writer(outf)
        for collected in seen.values():
            if len(collected) > 1:
                writer.writerows(collected)

This version groups the rows of the input CSV by their (sender, receiver, id) triplet, then writes all rows out again, but only for triplets that have more than one row.

You could also just keep counts, tallying in a dictionary how often you have seen each triplet; a collections.Counter() makes that easy and then gives you the triplets sorted by frequency:

    import csv
    from collections import Counter

    with open('file.csv', 'rb') as my_file:
        reader = csv.reader(my_file)
        counts = Counter((r[0], r[1], r[2]) for r in reader)

    with open('new.csv', 'wb') as outf:
        writer = csv.writer(outf)
        for (sender, receiver, id), count in counts.most_common():
            writer.writerow([sender, receiver, id, count])
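The code in this thread is written for Python 2 (print statements, binary file modes for csv). As a minimal sketch of the same grouping idea under Python 3, the block below reads the question's sample from memory instead of file.csv; the function name `find_duplicates` and the decision to skip rows with fewer than three fields are assumptions made for illustration, not part of the original answer.

```python
import csv
import io
from collections import defaultdict


def find_duplicates(lines):
    """Return rows whose (sender, receiver, id) triplet occurs more than once.

    `lines` is any iterable of CSV text lines, for example an open file.
    (Hypothetical helper, named here only for illustration.)
    """
    seen = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # assumption: ignore blank or malformed lines
        seen[tuple(row[:3])].append(row)
    # Keep only the groups with more than one row, flattened into one list.
    return [row for rows in seen.values() if len(rows) > 1 for row in rows]


# Sample data from the question, held in memory instead of file.csv.
sample = io.StringIO(
    "Sender A,Receiver G,43,Info...\n"
    "Sender H,Receiver R,43,Info...\n"
    "Sender A,Receiver G,27,Info...\n"
    "Sender N,Receiver Z,43,Info...\n"
    "Sender A,Receiver G,1367,Info...\n"
    "Sender R,Receiver P,43,Info...\n"
    "Sender A,Receiver G,43,Info...\n"
    "Sender H,Receiver R,111,Info...\n"
)

duplicates = find_duplicates(sample)
for row in duplicates:
    print(",".join(row))
```

With a real file under Python 3, the csv module expects text mode, so the file would be opened with `open('file.csv', newline='')` rather than `'rb'`.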
Demo using your sample data:

    >>> import csv
    >>> from collections import defaultdict
    >>> sample = '''\
    ... Sender A,Receiver G,43,Info...
    ... Sender H,Receiver R,43,Info...
    ... Sender A,Receiver G,27,Info...
    ... Sender N,Receiver Z,43,Info...
    ... Sender A,Receiver G,1367,Info...
    ... Sender R,Receiver P,43,Info...
    ... Sender A,Receiver G,43,Info...
    ... Sender H,Receiver R,111,Info...
    ... '''.splitlines(True)
    >>> seen = defaultdict(list)
    >>> reader = csv.reader(sample)
    >>> for row in reader:
    ...     key = (row[0], row[1], row[2])  # sender, receiver, id
    ...     seen[key].append(row)
    ...
    >>> import sys
    >>> writer = csv.writer(sys.stdout)
    >>> for collected in seen.values():
    ...     if len(collected) > 1:
    ...         writer.writerows(collected)
    ...
    Sender A,Receiver G,43,Info...
    Sender A,Receiver G,43,Info...

or with the Counter approach:

    >>> from collections import Counter
    >>> reader = csv.reader(sample)
    >>> counts = Counter((r[0], r[1], r[2]) for r in reader)
    >>> writer = csv.writer(sys.stdout)
    >>> for (sender, receiver, id), count in counts.most_common():
    ...     writer.writerow([sender, receiver, id, count])
    ...
    Sender A,Receiver G,43,2
    Sender A,Receiver G,1367,1
    Sender A,Receiver G,27,1
    Sender N,Receiver Z,43,1
    Sender H,Receiver R,111,1
    Sender H,Receiver R,43,1
    Sender R,Receiver P,43,1
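On the sorting point made at the start of this answer, a small Python 3 sketch may make the difference concrete. It sorts a few identifier values from the question's sample first as text and then with `key=int`; the variable names are illustrative only.

```python
# Third-column values as they come out of the CSV: always strings.
ids = ["43", "27", "1367", "111"]

as_text = sorted(ids)              # compared character by character
as_numbers = sorted(ids, key=int)  # compared by numeric value

print(as_text)     # ['111', '1367', '27', '43']
print(as_numbers)  # ['27', '43', '111', '1367']

# The same idea applied to whole rows: convert the id inside the key function.
rows = [
    ["Sender A", "Receiver G", "43"],
    ["Sender A", "Receiver G", "1367"],
    ["Sender H", "Receiver R", "111"],
]
by_id = sorted(rows, key=lambda row: int(row[2]))
print([row[2] for row in by_id])  # ['43', '111', '1367']
```

In the question's commented-out sort, `operator.itemgetter(2)` returns the third column as a string, so replacing it with a key that converts to int, as in the last lines here, would give numeric order.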
Answer 1 (score: 1)
Martijn Pieters showed you a very good solution in "pure" Python.

I'll show you something different: the pandas module (I use StringIO to simulate reading a file).
    data = """Sender A,Receiver G,43,Info...
    Sender H,Receiver R,43,Info...
    Sender A,Receiver G,27,Info...
    Sender N,Receiver Z,43,Info...
    Sender A,Receiver G,1367,Info...
    Sender R,Receiver P,43,Info...
    Sender A,Receiver G,43,Info...
    Sender H,Receiver R,111,Info..."""

    import pandas as pd
    from StringIO import StringIO

    # read all file
    df = pd.read_csv(StringIO(data), index_col=None, header=None)
    print df

    # group rows by values in columns 0, 1, 2
    for name, group in df.groupby([0,1,2]):
        print '\n', '-'*40, '\n'
        print 'name:', name
        print 'len:', len(group)
        print
        print group
        if len(group) > 1:
            # append (`mode='a'`) data to `results.csv`
            group.to_csv('results.csv', mode='a', header=False, index=False)
            #group.to_csv('results.csv', mode='a', header=False)
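I use pd.read_csv() to read the whole file (I assume the file has no header row, hence header=None, and I don't use any column as the row index, hence index_col=None).

Then I group the rows by the values in columns 0, 1, 2 (and print each group).

If a group has more than one element, I append it to the file 'results.csv'.

The file I get contains

    Sender A,Receiver G,43,Info...
    Sender A,Receiver G,43,Info...

or, if I don't use index=False in to_csv(), I also get the row numbers (the index):

    0,Sender A,Receiver G,43,Info...
    6,Sender A,Receiver G,43,Info...

This is what gets printed on screen: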
              0           1     2        3
    0  Sender A  Receiver G    43  Info...
    1  Sender H  Receiver R    43  Info...
    2  Sender A  Receiver G    27  Info...
    3  Sender N  Receiver Z    43  Info...
    4  Sender A  Receiver G  1367  Info...
    5  Sender R  Receiver P    43  Info...
    6  Sender A  Receiver G    43  Info...
    7  Sender H  Receiver R   111  Info...

    ----------------------------------------

    name: ('Sender A', 'Receiver G', 27)
    len: 1

              0           1   2        3
    2  Sender A  Receiver G  27  Info...

    ----------------------------------------

    name: ('Sender A', 'Receiver G', 43)
    len: 2

              0           1   2        3
    0  Sender A  Receiver G  43  Info...
    6  Sender A  Receiver G  43  Info...

    ----------------------------------------

    name: ('Sender A', 'Receiver G', 1367)
    len: 1

              0           1     2        3
    4  Sender A  Receiver G  1367  Info...

    ----------------------------------------

    name: ('Sender H', 'Receiver R', 43)
    len: 1

              0           1   2        3
    1  Sender H  Receiver R  43  Info...

    ----------------------------------------

    name: ('Sender H', 'Receiver R', 111)
    len: 1

              0           1    2        3
    7  Sender H  Receiver R  111  Info...

    ----------------------------------------

    name: ('Sender N', 'Receiver Z', 43)
    len: 1

              0           1   2        3
    3  Sender N  Receiver Z  43  Info...

    ----------------------------------------

    name: ('Sender R', 'Receiver P', 43)
    len: 1

              0           1   2        3
    5  Sender R  Receiver P  43  Info...