Counting occurrences of words in a text file

Time: 2020-07-01 18:05:56

Tags: python counter

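I have a TXT file and a CSV file that record login attempts, including the usernames tried and other information. I want to count how many times particular usernames were tried, and more generally how many times each word occurs, for example: <hostname> = 12, ssh2 = 6, and so on.

A Python script would be perfect.

Example (key information has been changed to "ip" and "stuff"):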

sshd|XXX.XX.XX.XXX|1587574870|{"matches": ["Apr 22 18:53:46 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 18:53:48 <hostname> sshd[****]: Failed password for invalid user pengjing from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:55:14 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 18:55:15 <hostname> sshd[****]: Failed password for invalid user git from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:56:42 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 18:56:44 <hostname> sshd[****]: Failed password for invalid user test from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:58:14 <hostname> sshd[****]: Failed password for root from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:59:44 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 18:59:46 <hostname> sshd[****]: Failed password for invalid user za from XXX.XX.XX.XXX port **** ssh2", "Apr 22 19:01:09 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 19:01:10 <hostname> sshd[****]: Failed password for invalid user yw from XXX.XX.XX.XXX port **** ssh2"], "failures": 18, "mlfid": " <hostname> sshd[****]: ", "user": "root", "ip4": "XXX.XX.XX.XXX"}

2 answers:

Answer 0 (score: 0)

Here is how you can use the str.count() method:

s = """sshd|XXX.XX.XX.XXX|1587574870|{"matches": ["Apr 22 18:53:46 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 18:53:48 <hostname> sshd[****]: Failed password for invalid user pengjing from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:55:14 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 18:55:15 <hostname> sshd[****]: Failed password for invalid user git from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:56:42 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 18:56:44 <hostname> sshd[****]: Failed password for invalid user test from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:58:14 <hostname> sshd[****]: Failed password for root from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:59:44 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 18:59:46 <hostname> sshd[****]: Failed password for invalid user za from XXX.XX.XX.XXX port **** ssh2", "Apr 22 19:01:09 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 19:01:10 <hostname> sshd[****]: Failed password for invalid user yw from XXX.XX.XX.XXX port **** ssh2"], "failures": 18, "mlfid": " <hostname> sshd[****]: ", "user": "root", "ip4": "XXX.XX.XX.XXX"}"""

print(s.count('ssh2'))
print(s.count('<hostname>'))

Output:

6
12
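
One caveat worth illustrating: str.count() counts substrings, so a short term such as 'ssh' is also counted inside 'sshd' and 'ssh2'. The following is a minimal sketch of a whole-word count, assuming that "word" means a term not directly adjacent to another letter, digit or underscore. The helper name count_whole and the shortened sample line are hypothetical and not part of the original answer.

```python
import re

def count_whole(text, term):
    """Count occurrences of term that are not embedded in a longer word."""
    # re.escape keeps characters such as '<', '*' or '.' literal
    pattern = r'(?<!\w)' + re.escape(term) + r'(?!\w)'
    return len(re.findall(pattern, text))

# Shortened, made-up line in the style of the sample log
line = 'tty=ssh ruser= rhost=XXX sshd[****]: Failed password for root port **** ssh2"'

print(line.count('ssh'))         # 3: matches inside ssh, sshd and ssh2
print(count_whole(line, 'ssh'))  # 1: only the standalone 'ssh'
```

For the two terms in the question (ssh2 and <hostname>) both approaches should agree on the sample, since neither term appears inside a longer word there.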


Update:

from collections import Counter
from re import findall

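with open('file.txt', 'r') as f:
    # Raw string so the regex escapes reach re unchanged; the non-greedy .*? stops at the
    # first " from ..." so each username is matched separately even when all entries share one line.
    print(Counter(findall(r'(?<=Failed password for invalid user ).*?(?= from XXX\.XX\.XX\.XXX port \*\*\*\* ssh2)', f.read())))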

Output:

Counter({'pengjing': 1,
         'git': 1,
         'test': 1,
         'za': 1,
         'yw': 1})
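
The regex above hard-codes the masked IP and port and skips attempts for existing accounts such as root. A possible alternative, sketched here under the assumption that every record is a single line of the form service|ip|timestamp|JSON (as in the question's sample), is to parse the JSON payload and inspect each entry of "matches". The function name count_failed_users, the pattern USER_RE and the shortened sample record (including its "failures" value) are hypothetical.

```python
import json
import re
from collections import Counter

# Captures the username in both "invalid user X" and plain "X" messages.
# Assumes usernames contain no whitespace.
USER_RE = re.compile(r'Failed password for (?:invalid user )?(\S+) from ')

def count_failed_users(lines):
    """Count failed-password attempts per username across records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split at most 3 times so any '|' inside the JSON part is preserved
        parts = line.split('|', 3)
        if len(parts) != 4:
            continue  # not in the assumed record format
        payload = json.loads(parts[3])
        for message in payload.get('matches', []):
            found = USER_RE.search(message)
            if found:
                counts[found.group(1)] += 1
    return counts

# Shortened, made-up record in the same shape as the question's sample
sample = 'sshd|XXX.XX.XX.XXX|1587574870|{"matches": ["Apr 22 18:53:48 <hostname> sshd[****]: Failed password for invalid user git from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:58:14 <hostname> sshd[****]: Failed password for root from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:59:46 <hostname> sshd[****]: Failed password for invalid user git from XXX.XX.XX.XXX port **** ssh2"], "failures": 3}'

print(count_failed_users([sample]))  # Counter({'git': 2, 'root': 1})
```

With a real file, something like `with open('file.txt') as f: count_failed_users(f)` should work, since iterating over a file object yields one line at a time.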

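Answer 1 (score: 0)

Append this logic to your code. It works once the file has been read. Replace the str variable with whatever variable you already have. The text also has to be processed to remove unneeded characters such as double quotes, square brackets, commas, etc. You can add more.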

with open('input_file.txt', 'r') as file:
    str = file.read()  # note: this name shadows the built-in str type; rename it in your own code

# str = """sshd|XXX.XX.XX.XXX|1587574870|{"matches": ["Apr 22 18:53:46 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 18:53:48 <hostname> sshd[****]: Failed password for invalid user pengjing from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:55:14 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 18:55:15 <hostname> sshd[****]: Failed password for invalid user git from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:56:42 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 18:56:44 <hostname> sshd[****]: Failed password for invalid user test from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:58:14 <hostname> sshd[****]: Failed password for root from XXX.XX.XX.XXX port **** ssh2", "Apr 22 18:59:44 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 18:59:46 <hostname> sshd[****]: Failed password for invalid user za from XXX.XX.XX.XXX port **** ssh2", "Apr 22 19:01:09 <hostname> sshd[****]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=XXX.XX.XX.XXX", "Apr 22 19:01:10 <hostname> sshd[****]: Failed password for invalid user yw from XXX.XX.XX.XXX port **** ssh2"], "failures": 18, "mlfid": " <hostname> sshd[****]: ", "user": "root", "ip4": "XXX.XX.XX.XXX"} """

word_dict = {}
for k in str.split(" ") :
    # Strip double quotes, closing brackets and commas before using the token as a key
    word_dict[k.replace('"','').replace("]","").replace(",","")] = 0
print(word_dict)
# {'sshd|XXX.XX.XX.XXX|1587574870|{matches:': 0, '[Apr': 0, '22': 0, '18:53:46': 0, '<hostname>': 0, 'sshd[****:': 0, 'pam_unix(sshd:auth):': 0, 'authentication': 0, 'failure;': 0, 'logname=': 0, 'uid=0': 0, 'euid=0': 0, 'tty=ssh': 0, 'ruser=': 0, 'rhost=XXX.XX.XX.XXX': 0, 'Apr': 0, '18:53:48': 0, 'Failed': 0, 'password': 0, 'for': 0, 'invalid': 0, 'user': 0, 'pengjing': 0, 'from': 0, 'XXX.XX.XX.XXX': 0, 'port': 0, '****': 0, 'ssh2': 0, '18:55:14': 0, '18:55:15': 0, 'git': 0, '18:56:42': 0, '18:56:44': 0, 'test': 0, '18:58:14': 0, 'root': 0, '18:59:44': 0, '18:59:46': 0, 'za': 0, '19:01:09': 0, '19:01:10': 0, 'yw': 0, 'failures:': 0, '18': 0, 'mlfid:': 0, '': 0, 'user:': 0, 'ip4:': 0, 'XXX.XX.XX.XXX}': 0}

for i in word_dict.keys() :
    counter = 0
    for j in str.split(" ") :
        # Substring test against the raw (uncleaned) token
        if i in j :
            counter +=1
    word_dict[i] = counter

print(word_dict["ssh2"])
# 6
print(word_dict["<hostname>"])
# 12

for k, v in word_dict.items() :
  print("Word : ", k , "  Occurrences : ",v)

# Output for the single sample record above:
# Word :  sshd|XXX.XX.XX.XXX|1587574870|{matches:   Occurrences :  0
# Word :  [Apr   Occurrences :  0
# Word :  22   Occurrences :  11
# Word :  18:53:46   Occurrences :  1
# Word :  <hostname>   Occurrences :  12
# Word :  sshd[****:   Occurrences :  0
# Word :  pam_unix(sshd:auth):   Occurrences :  5
# Word :  authentication   Occurrences :  5
# .
# .
# .
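
The nested loops above rescan every token once per distinct word and compare by substring, which is why cleaned keys such as "[Apr" end up with a count of 0. A single-pass alternative with collections.Counter might look like the sketch below. It assumes exact matching on tokens after stripping punctuation from both ends; the helper name count_words, the strip_chars default and the shortened sample text are hypothetical.

```python
from collections import Counter

def count_words(text, strip_chars='"[],{}'):
    """Count whitespace-separated tokens after trimming punctuation at both ends."""
    tokens = (tok.strip(strip_chars) for tok in text.split())
    # Skip tokens that become empty once the punctuation is removed
    return Counter(tok for tok in tokens if tok)

# Shortened, made-up text in the style of the sample record
text = '["Apr 22 <hostname> sshd[****]: Failed password ssh2", "Apr 22 <hostname> sshd[****]: Failed password ssh2"]'

counts = count_words(text)
print(counts['ssh2'])        # 2
print(counts['<hostname>'])  # 2
print(counts['Apr'])         # 2
```

Because the counts are keyed by the cleaned token, `counts.most_common(10)` would then list the ten most frequent words directly.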