python:提取不同列表的项目并将它们放在一个集合中

时间:2015-02-14 20:16:35

标签: python list set

我有一个这样的文件:

93.93.203.11|["['vmit.it', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'maurominnella.com']"]
168.144.9.16|["['iipmalumni.com','webdesignhostingindia.com', 'iipmstudents.in', 'iipmclubs.in']"]
195.211.72.88|["['tcmpraktijk-jingshen.nl', 'ellen-siemer.nl'']"]
129.35.210.118|["['israelinnovation.co.il', 'watec-peru.com', 'bsacimeeting.org', 'wsava2015.com', 'picsmeeting.com']"]

我想在所有列表中提取域并将它们添加到一个集合中。最终,我想在一行中对每个独特的域名进行罚款。这是我写的代码:

set_d = set()
f = open(file,'r')
for line in f:
    line = line.strip('\n')
    ip,list = line.split('|')
    l = json.loads(list)
    for e in l:
        domain = e.split(',')
        set_d.add(domain)
        print set_d

但是它给出了以下错误:

    set_d.add(domain)
TypeError: unhashable type: 'list'

有人可以帮帮我吗?

3 个答案:

答案 0 :(得分:1)

您应该拨打update而不是add;

set_d.update(domain)

实施例

>>> set_d = {'a', 'b', 'c'}
>>> set_d.update(['c', 'd', 'e'])
>>> print set_d
{'a', 'b', 'c', 'd', 'e'}

答案 1 :(得分:1)

使用str.translate清除文本并使用update:

添加到集合中
set_d = set()
with open(file,'r') as f:
    for line in f:
       lst = (x.strip() for x in line.split("|")[1].translate(None,"\"'[]").split(","
        set_d.update(lst)

输出一组独特的域名:

set(['vmit.it', 'tcmpraktijk-jingshen.nl', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'watec-peru.com', 'bsacimeeting.org', 'webdesignhostingindia.com', 'wsava2015.com', 'iipmstudents.in', 'maurominnella.com', 'ellen-siemer.nl', 'picsmeeting.com', 'iipmalumni.com', 'iipmclubs.in', 'israelinnovation.co.il'])

您可以写入新文件:

set_d = set()
with open(file,'r') as f,open("out.txt","w") as out:
    for line in f:
        lst = (x.strip() for x in line.split("|")[1].translate(None,"\"'[]").split(","))
        set_d.update(lst)
    for line in set_d:
        out.write("{}\n".format(line))

输出:

$ cat out.txt 
vmit.it
tcmpraktijk-jingshen.nl
umbertominnella.it
studioguizzardi.it
telestreet.it
watec-peru.com
bsacimeeting.org
webdesignhostingindia.com
wsava2015.com
iipmstudents.in
maurominnella.com
ellen-siemer.nl
picsmeeting.com
iipmalumni.com
iipmclubs.in
israelinnovation.co.il

您的代码不会分成单独的域,您的json调用实际上没有任何帮助。将代码更改为更新将输出如下内容:

{" 'maurominnella.com']", " 'wsava2015.com'", "'webdesignhostingindia.com'", " 'iipmclubs.in']", " 'ellen-siemer.nl'']", " 'umbertominnella.it'", " 'picsmeeting.com']", "['israelinnovation.co.il'", "['vmit.it'", " 'iipmstudents.in'", "['tcmpraktijk-jingshen.nl'", " 'studioguizzardi.it'", "['iipmalumni.com'", " 'watec-peru.com'", " 'bsacimeeting.org'", " 'telestreet.it'"}

也不要使用list作为变量名,否则它会隐藏python list

答案 2 :(得分:0)

由于split函数的结果是列表(domain = e.split(','))并且列表不可删除,因此您无法将它们添加到set。相反,您可以使用set.update()将这些元素添加到您的集合中,但是您不需要Json,因为它不会将您的域名分开并且不会为您提供所需的结果而是可以使用ast.literal_eval分割您的列表:

import ast
set_d = set()
f = open(file,'r')
for line in f:
    line = line.strip('\n')
    ip,li = line.split('|')
    l = ast.literal_eval(ast.literal_eval(li)[0])
    for e in l:
        domain = e.split(',')
        set_d.update(domain)
    print set_d

请注意,不要使用python内置函数或类型作为变量!

作为一种更有效的方法,您可以使用正则表达式来对您的域进行grub:

f = open(file,'r').read()
import re
print set(re.findall(r'[a-zA-Z\-]+\.[a-zA-Z]+',f))

结果:

set(['vmit.it', 'tcmpraktijk-jingshen.nl', 'umbertominnella.it', 'studioguizzardi.it', 'telestreet.it', 'israelinnovation.co', 'bsacimeeting.org', 'webdesignhostingindia.com', 'iipmstudents.in', 'maurominnella.com', 'ellen-siemer.nl', 'picsmeeting.com', 'watec-peru.com', 'iipmalumni.com', 'iipmclubs.in'])
[Finished in 0.0s]