Creating an iterator that counts intersections

Time: 2014-07-26 16:44:52

Tags: python

I need to read a text file, compute a particular score, and then write a new file.

f=open("test.txt")
fw=open("test_result.txt","w")

The file f contains three tab-separated fields per line, as shown below:

'Washington\t5\tWaterpark,Themepark,Playground,Spaceneedle,Carousel\n'
'California\t6\tWaterpark,Themepark,Disneyland,Legoland,Carousel,Sixflag\n'
'Arizona\t3\tWaterpark,Playground,Themepark\n'

For every pair of lines, I want to find the number of items that the two third-column lists have in common, that is, the size of their intersection.

len(intersect_WAandCA) #3: 'Waterpark, Themepark, Carousel': intersection of the 5 items in the first line and the 6 items in the second line
len(intersect_WAandAZ) #3: 'Waterpark, Themepark, Playground'
len(intersect_CAandAZ) #2: 'Waterpark, Themepark'

From this, I want to produce a new file as follows.

5 Washington 6 California 3
5 Washington 3 Arizona 3
6 California 3 Arizona 2 

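I tried to work this out with "from itertools import combinations", as in this question. To be honest, I am new to Python. I could not find a way to build the iterator in a loop and write the new file. The file I actually have contains more than 100 lines.

What should I do to generate all n * (n - 1) / 2 combinations?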

1 answer:

Answer 0: (score: 0)

Your input file is tab-delimited, so I use the csv module to read the data, then use set() to build a set from the third column:

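import csv

# 'rb' is the Python 2 convention for csv files; on Python 3, use
# open('test.txt', newline='') instead.
with open('test.txt', 'rb') as infh:
    reader = csv.reader(infh, delimiter='\t')
    # Keep the state name and turn the comma-separated third column into a set.
    data = [(row[0], set(row[2].split(','))) for row in reader]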

Now we have data we can work with. We can ignore the second column, because that number is simply the length of our set.
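As a quick check of the point about the second column, here is a minimal sketch that parses the three sample lines from the question in memory and compares the stored count with the size of each set. It assumes Python 3; the names `sample_lines`, `rows`, and `counts` are illustrative and not part of the original code.

```python
import csv

# Sample rows from the question, held in memory instead of test.txt.
sample_lines = [
    "Washington\t5\tWaterpark,Themepark,Playground,Spaceneedle,Carousel\n",
    "California\t6\tWaterpark,Themepark,Disneyland,Legoland,Carousel,Sixflag\n",
    "Arizona\t3\tWaterpark,Playground,Themepark\n",
]

rows = list(csv.reader(sample_lines, delimiter="\t"))

# Same shape as `data` in the answer: (state, set of features).
data = [(row[0], set(row[2].split(","))) for row in rows]

# Column 2 holds the same number as the size of each set,
# so it does not need to be stored separately.
counts = {row[0]: int(row[1]) for row in rows}
for state, features in data:
    print(state, counts[state], len(features), counts[state] == len(features))
```

If the counts in a real file ever disagree with the set sizes (for example because of duplicate entries in the third column), this comparison would show it.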

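from itertools import combinations

# 'wb' is the Python 2 convention; on Python 3, use
# open('test2.txt', 'w', newline='') instead.
with open('test2.txt', 'wb') as outfh:
    writer = csv.writer(outfh, delimiter='\t')
    # combinations(data, 2) yields every unordered pair of rows exactly once.
    for (state1, features1), (state2, features2) in combinations(data, 2):
        overlap = len(features1 & features2)  # size of the set intersection
        writer.writerow([
            len(features1), state1,
            len(features2), state2,
            overlap])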
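The two snippets above open the files in binary mode ('rb' and 'wb'), which is what the csv module expected on Python 2. For Python 3, the same steps might be sketched as below. The function name `write_overlaps` is hypothetical, and two details are assumptions of this sketch rather than part of the answer: blank input lines are skipped, and rows end with "\n" instead of the csv default "\r\n".

```python
import csv
from itertools import combinations


def write_overlaps(in_path, out_path):
    """Write one tab-separated row per pair of input lines (Python 3)."""
    # newline="" lets the csv module handle line endings itself.
    with open(in_path, newline="") as infh:
        reader = csv.reader(infh, delimiter="\t")
        # Assumption: skip blank lines, which csv.reader returns as [].
        data = [(row[0], set(row[2].split(","))) for row in reader if row]

    with open(out_path, "w", newline="") as outfh:
        # Assumption: plain "\n" line endings in the output file.
        writer = csv.writer(outfh, delimiter="\t", lineterminator="\n")
        for (state1, features1), (state2, features2) in combinations(data, 2):
            overlap = len(features1 & features2)
            writer.writerow(
                [len(features1), state1, len(features2), state2, overlap]
            )
```

It would be called as write_overlaps('test.txt', 'test2.txt').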

This produces:

>>> import csv
>>> from itertools import combinations
>>> data = '''\
... Washington\t5\tWaterpark,Themepark,Playground,Spaceneedle,Carousel
... California\t6\tWaterpark,Themepark,Disneyland,Legoland,Carousel,Sixflag
... Arizona\t3\tWaterpark,Playground,Themepark
... '''.splitlines(True)
>>> reader = csv.reader(data, delimiter='\t')
>>> data = [(row[0], set(row[2].split(','))) for row in reader]
>>> import sys
>>> writer = csv.writer(sys.stdout, delimiter='\t')
>>> for (state1, features1), (state2, features2) in combinations(data, 2):
...     overlap = len(features1 & features2)
...     writer.writerow([
...         len(features1), state1,
...         len(features2), state2,
...         overlap])
... 
5   Washington  6   California  3
5   Washington  3   Arizona 3
6   California  3   Arizona 2
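
On the question of generating all n * (n - 1) / 2 pairs: combinations(data, 2) already yields each unordered pair exactly once, so no extra loop is needed. A small sketch to confirm the count, assuming Python 3 (the helper `pair_count` is illustrative, and 100 is used only because the question mentions a file of more than 100 lines):

```python
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Compare the number of pairs actually generated with the formula.
for n in (3, 100):
    generated = sum(1 for _ in combinations(range(n), 2))
    print(n, generated, pair_count(n))
# prints:
# 3 3 3
# 100 4950 4950
```

The three sample lines give 3 pairs, which matches the three output rows above.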