Loop through each line of a text file to extract a unique list

Time: 2014-10-24 11:03:09

Tags: python python-2.7

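I have been trying to extract a unique list of data names from a text file, but I can't seem to manage it because I don't understand regular expressions.

Given this example: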

[Friday 17/10/2014 @ 07:30:55] The user user01 | account01 | namename1 has been granted access.
[Friday 17/10/2014 @ 07:30:57] The user user two | account_two | name2 has been granted access.
[Friday 17/10/2014 @ 07:30:59] The user user_three | account_ | name3 here3 has been granted access.
[Friday 17/10/2014 @ 07:31:41] The user user01 | account01 | namename1 has been granted access.

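I basically want it to find the account information between the two pipes (|), strip the pipes and the surrounding spaces, and remove any duplicates, so that a plain list containing only the following is written to a text file: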

account01
account_two
account_

One check is required: the account information should be taken only if the line contains the phrase has been granted access., because the data may look like this:

[Friday 17/10/2014 @ 07:30:55] The user user01 | account01 | namename1 has been granted access.
[Friday 17/10/2014 @ 07:30:57] The user user two | account_two | name2 has been granted access.
[Friday 17/10/2014 @ 07:30:59] Details Granted | user two | access number 01239
[Friday 17/10/2014 @ 07:30:59] The user user_three | account_ | name3 here3 has been granted access.
[Friday 17/10/2014 @ 07:31:41] The user user01 | account01 | namename1 has been granted access.

I don't want it to pick up the account information user two from line 3 of that example.

Can anyone help with some code examples? Many thanks.

3 answers:

Answer 0 (score: 2)

>>> granted_accounts = [line.split('|')[1].strip() for line in open('file.txt') if 'has been granted access' in line]
>>> print(granted_accounts)
['account01', 'account_two', 'account_', 'account01']

If you want to run it from the command line, put the two lines in a .py file (search.py) with a shebang:

#!/usr/bin/env python
granted_accounts = [line.split('|')[1].strip() for line in open('file.txt') if 'has been granted access' in line]
print(granted_accounts)

Then run it like this:

$ python search.py

Or:

$ chmod +x search.py
$ ./search.py

If you have many accounts, you may want to print each account only once, one per line:

>>> granted_accounts = [line.split('|')[1].strip() for line in open('file.txt') if 'has been granted access' in line]
>>> print('\n'.join(sorted(set(granted_accounts))))
account01
account_
account_two
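The question asks for the result in a text file, and sorted(set(...)) reorders the accounts alphabetically. A minimal sketch, assuming the accounts should stay in the order they first appear; the helper names unique_accounts and write_unique_accounts are hypothetical and not part of the answer above:

```python
def unique_accounts(lines):
    """Return granted account names in first-seen order, without duplicates."""
    seen = set()      # fast membership test
    ordered = []      # keeps the order of first appearance
    for line in lines:
        if 'has been granted access' not in line:
            continue
        # The account is the second '|'-separated field.
        account = line.split('|')[1].strip()
        if account not in seen:
            seen.add(account)
            ordered.append(account)
    return ordered


def write_unique_accounts(src_path, dst_path):
    """Read src_path and write one unique account per line to dst_path."""
    with open(src_path) as src:
        accounts = unique_accounts(src)
    with open(dst_path, 'w') as dst:
        for account in accounts:
            dst.write(account + '\n')
    return accounts
```

With the sample data from the question, write_unique_accounts('file.txt', 'output.txt') should leave account01, account_two and account_ in output.txt, one per line.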

Answer 1 (score: 1)

def get_granted_accounts(filename):
    with open(filename) as f:
        return set(
            s.split('|')[1].strip()
            for s in f
            if "has been granted access" in s)

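This code relies on a few assumptions:

  • Pipes must not appear inside the first or second field (quoted or escaped)
  • "has been granted access" should appear only in the expected field (e.g. not as an account name)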
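Since the question mentions regular expressions, the assumptions listed above can be tightened with a stricter pattern. This is only a sketch: the line layout "[timestamp] The user <user> | <account> | <name> has been granted access." is inferred from the sample data, and the names GRANTED and granted_accounts are hypothetical. Lines that contain extra pipes, or that carry the phrase anywhere other than the end, are skipped rather than parsed.

```python
import re

# Assumed layout, inferred from the sample lines in the question:
# [timestamp] The user <user> | <account> | <name> has been granted access.
GRANTED = re.compile(
    r'^\[[^\]]*\]\s+The user\s+'          # timestamp and fixed prefix
    r'[^|]*\|\s*([^|]*?)\s*\|'            # user field, then the captured account
    r'[^|]*has been granted access\.\s*$' # name field ending with the phrase
)


def granted_accounts(lines):
    """Return the set of accounts from lines that match the assumed layout."""
    accounts = set()
    for line in lines:
        match = GRANTED.match(line)
        if match:
            accounts.add(match.group(1))
    return accounts
```

Because each field is matched with [^|]*, a line needs exactly two pipes after "The user" to match, so a malformed line is ignored instead of yielding a wrong account.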

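Answer 2 (score: 0)

I completely overlooked split... but here is a fully working version based on split:

Split on | and take the second part of the split, then strip the surrounding whitespace. The account list is built by appending an account only if it is not already in the list, which removes duplicates.

Last but not least, all the accounts are then written to output.txt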

accountlist = []
with open('mydatafile.txt', 'r') as infile:
    for line in infile:
        if "has been granted access." in line:
            # The account is the second '|'-separated field.
            account = line.split('|')[1].strip()
            if account not in accountlist:
                accountlist.append(account)
print(accountlist)

# Write one account per line.
with open('output.txt', 'w') as outfile:
    for account in accountlist:
        outfile.write("%s\n" % account)