I need to read a text file, compute a score from its contents, and write the results to a new file.
f=open("test.txt")
fw=open("test_result.txt","w")
The file "f" contains three tab-separated columns, as shown below:
'Washington\t5\tWaterpark,Themepark,Playground,Spaceneedle,Carousel\n'
'California\t6\tWaterpark,Themepark,Disneyland,Legoland,Carousel,Sixflag\n'
'Arizona\t3\tWaterpark,Playground,Themepark\n'
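For every pair of lines, I want the number of items that appear in both of their third-column lists, that is, the size of the intersection.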
len(intersect_WAandCA)  # 3: 'Waterpark, Themepark, Carousel' (intersection of the 5 items on the first line and the 6 items on the second line)
len(intersect_WAandAZ)  # 3: 'Waterpark, Themepark, Playground'
len(intersect_CAandAZ)  # 2: 'Waterpark, Themepark'
From this, I want to produce a new file like the following:
5 Washington 6 California 3
5 Washington 3 Arizona 3
6 California 3 Arizona 2
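I tried to find a way using "from itertools import combinations", as in this question. To be honest, I am new to Python. I could not work out how to build the iterator with a loop and write the new file. The actual file has more than 100 lines.
How can I generate all n * (n - 1) / 2 combinations?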
Answer 0 (score: 0)
Your input file is tab-separated, so I use the csv module to read the data, then use set() to build a set from the third column of each row:
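import csv

# Python 3: open in text mode; newline='' lets the csv module handle line endings
with open('test.txt', newline='') as infh:
    reader = csv.reader(infh, delimiter='\t')
    # Keep the state name and the set of comma-separated items from column 3
    data = [(row[0], set(row[2].split(','))) for row in reader]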
Now we have data we can work with. We can ignore the second column, because that number is simply the size of each set.
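from itertools import combinations

# Python 3: open in text mode with newline='' for the csv writer
with open('test2.txt', 'w', newline='') as outfh:
    writer = csv.writer(outfh, delimiter='\t')
    # combinations(data, 2) yields every unordered pair of rows exactly once
    for (state1, features1), (state2, features2) in combinations(data, 2):
        overlap = len(features1 & features2)  # size of the set intersection
        writer.writerow([
            len(features1), state1,
            len(features2), state2,
            overlap])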
This produces:
>>> import csv
>>> from itertools import combinations
>>> data = '''\
... Washington\t5\tWaterpark,Themepark,Playground,Spaceneedle,Carousel
... California\t6\tWaterpark,Themepark,Disneyland,Legoland,Carousel,Sixflag
... Arizona\t3\tWaterpark,Playground,Themepark
... '''.splitlines(True)
>>> reader = csv.reader(data, delimiter='\t')
>>> data = [(row[0], set(row[2].split(','))) for row in reader]
>>> import sys
>>> writer = csv.writer(sys.stdout, delimiter='\t')
>>> for (state1, features1), (state2, features2) in combinations(data, 2):
...     overlap = len(features1 & features2)
...     # assign the return value so the Python 3 REPL does not echo it
...     _ = writer.writerow([
...         len(features1), state1,
...         len(features2), state2,
...         overlap])
...
5 Washington 6 California 3
5 Washington 3 Arizona 3
6 California 3 Arizona 2
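As a rough check on the n * (n - 1) / 2 figure from the question, the two steps above can be folded into one self-contained Python 3 sketch. The function name pairwise_overlaps and the in-memory sample are illustrative assumptions, not part of the original answer; the sample stands in for test.txt so the snippet runs without any files.

```python
import csv
import io
from itertools import combinations


def pairwise_overlaps(lines):
    """Yield (count1, name1, count2, name2, overlap) for every pair of rows.

    `lines` is any iterable of tab-separated text lines (hypothetical helper).
    """
    reader = csv.reader(lines, delimiter="\t")
    # Keep the name and the set of comma-separated items; skip blank lines.
    data = [(row[0], set(row[2].split(","))) for row in reader if row]
    for (name1, items1), (name2, items2) in combinations(data, 2):
        yield len(items1), name1, len(items2), name2, len(items1 & items2)


# In-memory stand-in for test.txt, using the three rows from the question.
sample = io.StringIO(
    "Washington\t5\tWaterpark,Themepark,Playground,Spaceneedle,Carousel\n"
    "California\t6\tWaterpark,Themepark,Disneyland,Legoland,Carousel,Sixflag\n"
    "Arizona\t3\tWaterpark,Playground,Themepark\n"
)
rows = list(pairwise_overlaps(sample))

# For n input rows there should be n * (n - 1) / 2 unordered pairs.
n = 3
assert len(rows) == n * (n - 1) // 2

# Write the result as tab-separated text (to a string here instead of a file).
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
print(out.getvalue(), end="")
```

With 100 input lines the same loop would produce 100 * 99 / 2 = 4950 rows, so no manual nested loop is needed. To work on real files, pass an open file object to pairwise_overlaps and hand a file opened with newline='' to csv.writer.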