Question

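I want to find the most frequently occurring substrings in the rows of a CSV file, either on their own or by looking them up from a list of keywords.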

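Using the following response, I have found a way to get the top 5 most frequently occurring words in each row of a CSV file with Python, but it does not serve my purpose. It gives me results such as: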

[(' Trojan.PowerShell.LNK.Gen.2', 3),
(' Suspicious ZIP!lnk', 2),
(' HEUR:Trojan-Downloader.WinLNK.Powedon.a', 2),
(' TROJ_FR.8D496570', 2),
('Trojan.PowerShell.LNK.Gen.2', 1),
(' Trojan.PowerShell.LNK.Gen.2 (B)', 1),
(' Win32.Trojan-downloader.Powedon.Lrsa', 1),
(' PowerShell.DownLoader.466', 1),
(' malware (ai score=86)', 1),
(' Probably LNKScript', 1),
(' virus.lnk.powershell.a', 1),
(' Troj/LnkPS-A', 1),
(' Trojan.LNK', 1)]

However, I want something like 'Trojan', 'Downloader', 'PowerShell', ... as the top results.

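A matched word may be a substring of a value (cell) in the CSV, or a combination of two or more words. Can someone help me solve this, with or without a keyword list?

Thanks!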

Answer 1

Let my_values = ['A', 'B', 'C', 'A', 'Z', 'Z', 'X', 'A', 'X', 'H', 'D', 'A', 'S', 'A', 'Z'] be the list of words you want to rank.

Now create a dictionary to store how many times each word occurs.

count_dict={}

Fill the dictionary with the appropriate counts:

for i in my_values:
    if count_dict.get(i) is None:  # Not in the dictionary yet, so this is the first occurrence of the value
        count_dict[i] = 1
    else:
        count_dict[i] = count_dict[i] + 1  # Seen before, so increment its count
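
As an aside, the standard library's collections.Counter builds the same tally in one call. A minimal sketch, assuming the same my_values list as above (the name counts is only illustrative):

```python
from collections import Counter

my_values = ['A', 'B', 'C', 'A', 'Z', 'Z', 'X', 'A', 'X', 'H', 'D', 'A', 'S', 'A', 'Z']

# Counter is a dict subclass that maps each value to its number of occurrences
counts = Counter(my_values)

# most_common(n) returns the n highest counts as (value, count) pairs
print(counts.most_common(3))  # [('A', 5), ('Z', 3), ('X', 2)]
```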

Now sort the dictionary items by their number of occurrences, highest first:

import operator

sorted_items = sorted(count_dict.items(), key=operator.itemgetter(1), reverse=True)

Now you have the expected result! The 3 most frequent values are:

print(sorted_items[:3])

Output:

[('A', 5), ('Z', 3), ('X', 2)]

The 2 most frequent values are:

print(sorted_items[:2])

Output:

[('A', 5), ('Z', 3)]

And so on.
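
The counting above works on whole values, so for the CSV in the question it would still rank full detection names rather than parts like 'Trojan' or 'PowerShell'. One possible way to get closer is to split every cell into tokens first and count those. This is only a sketch under assumptions: the function name top_tokens, the minimum token length of 3, the case-insensitive matching, and the sample rows (taken from the detection names in the question) are all illustrative choices, not something the question specifies.

```python
import csv
import io
import re
from collections import Counter


def top_tokens(rows, n=5, keywords=None):
    """Count lower-cased alphanumeric tokens across every cell of every row.

    If `keywords` is given (a set of lower-case strings), only those
    tokens are counted.
    """
    counts = Counter()
    for row in rows:
        for cell in row:
            # Split on any run of characters that is not a letter or digit
            for token in re.split(r"[^A-Za-z0-9]+", cell):
                token = token.lower()
                if len(token) < 3:  # skip empty strings and fragments like 'a' or '2'
                    continue
                if keywords is not None and token not in keywords:
                    continue
                counts[token] += 1
    return counts.most_common(n)


# Hypothetical sample data standing in for the real CSV file
sample = io.StringIO(
    "Trojan.PowerShell.LNK.Gen.2, HEUR:Trojan-Downloader.WinLNK.Powedon.a\n"
    "Win32.Trojan-downloader.Powedon.Lrsa, PowerShell.DownLoader.466\n"
)
rows = list(csv.reader(sample))

print(top_tokens(rows, n=3))
print(top_tokens(rows, keywords={"trojan", "powershell"}))
```

Note that this matches whole tokens, not arbitrary substrings: 'WinLNK' is counted as 'winlnk' and does not add to 'lnk'. Combinations of two or more words would need an extra step, such as counting pairs of adjacent tokens.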

Python - Find the most frequently occurring words in CSV rows
