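I have a large .csv file of the form:
"String 1","String 2","String 3","String 4","String 5","String 6",etc
I'm interested in extracting information from one column, but only as it relates to the next column.
To give a clearer example, suppose columns 3 and 4 hold teams and each row represents a game between them (column 3 is the home team):
"First","Result","Philadelphia","Miami",etc
"Second","Result","Dallas","Cleveland",etc
"Third","Result","Miami","Cleveland",etc
"Fourth","Result","Cleveland","Miami",etc
"Fifth","Result","Dallas","Philadelphia",etc
"Sixth","Result","Cleveland","Dallas",etc
"Seventh","Result","Miami","Philadelphia",etc
"Eighth","Result","Philadelphia","Miami",etc
"Ninth","Result","Cleveland","Miami",etc
For each team, I'd like to get a list of the teams it has hosted, without duplicates: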
Cleveland hosts
    Dallas
    Miami
Dallas hosts
    Cleveland
    Philadelphia
Miami hosts
    Cleveland
    Philadelphia
Philadelphia hosts
    Miami
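After that, I want to write to a file all the rows that involve a given pair of teams. That is, if I want to see the games between Cleveland and Miami, I'd like a .csv like this:
"Third","Result","Miami","Cleveland",etc
"Fourth","Result","Cleveland","Miami",etc
"Ninth","Result","Cleveland","Miami",etc
With the following code I managed to read one column and store all of its unique elements in a dictionary, so that I can pick a word from it later. I can do the same for column 4 by changing the value of Wanted_Column to 3 and running the code again.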
import csv
from collections import Counter, defaultdict, OrderedDict
Var = 1
Wanted_Column = 2 # Col I want to analyze
with open('file.csv', "rb") as inputfile:
    data = csv.reader(inputfile)
    seen = defaultdict(set)
    countd = Counter(
        row[Wanted_Column]
        for row in data
        if row[Wanted_Column] and row[Wanted_Column] not in seen[row[Var]] and not seen[row[Var]].add(row[Wanted_Column])
    )
y = OrderedDict(sorted(countd.items(), key = lambda t: t[0]))
for line in y:
    print line
The result is:
Cleveland
Dallas
Miami
Philadelphia
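So my question is: what should I add to get the two-column condition and display the elements the way I showed above?
After that, to write the rows to another file, I have this code: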
look_for = set([ELEMENT IN DICTIONARY])
with open('file.csv','rb') as inf, open('output_file.csv','wb') as outf:
    incsv = csv.reader(inf, delimiter=',')
    outcsv = csv.writer(outf, delimiter=',')
    # keep only the rows whose wanted column holds one of the selected values
    outcsv.writerows(row for row in incsv if row[Wanted_Column] in look_for)
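With a single element it works well, but of course, since the earlier condition isn't well defined, I don't know what I should change to get the result I want.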
Answer 0 (score: 2)
Could you use a dictionary of sets?
f = open('test.csv')
hosts = {}
#read
for line in f:
    # drop the trailing newline and the quotes before splitting
    line = line.strip().replace('"', '')
    res = line.split(',')
    # first time this host is seen: start an empty set of guests
    if not hosts.get(res[2]):
        hosts[res[2]] = set([])
    hosts.get(res[2]).add(res[3])
f.close()
#print
for key in sorted(hosts.keys()):
    print 'HOST', key
    for guest in sorted(list(hosts[key])):
        print 'GUEST', guest
print hosts
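The final print loop then walks over the keys of hosts and prints the contents of each set.
If the number of columns isn't known in advance but you know the layout is host, guest, then it's just an inner loop that traverses the whole row starting from position 2.
I added the last lines to show the formatted output. The only difference between the input to this script and yours is that I removed the etc column and assumed the input stops there. Extending this should be trivial.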
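The same dictionary-of-sets idea can also be sketched in Python 3, letting the csv module handle the quoting instead of stripping quotes by hand. This is only a minimal sketch: the function name hosts_by_team, the in-memory sample, and the zero-based column indices 2 (host) and 3 (guest) are assumptions taken from the example in the question, not part of the original script.

```python
import csv
import io
from collections import defaultdict


def hosts_by_team(lines, host_col=2, guest_col=3):
    """Map each host team to the set of distinct teams it has hosted."""
    hosts = defaultdict(set)
    for row in csv.reader(lines):
        # Skip blank or short rows instead of raising IndexError.
        if len(row) <= max(host_col, guest_col):
            continue
        hosts[row[host_col]].add(row[guest_col])
    return hosts


# Rows copied from the question; a hypothetical stand-in for the real file.
sample = io.StringIO(
    '"First","Result","Philadelphia","Miami"\n'
    '"Second","Result","Dallas","Cleveland"\n'
    '"Third","Result","Miami","Cleveland"\n'
    '"Fourth","Result","Cleveland","Miami"\n'
    '"Fifth","Result","Dallas","Philadelphia"\n'
    '"Sixth","Result","Cleveland","Dallas"\n'
    '"Seventh","Result","Miami","Philadelphia"\n'
    '"Eighth","Result","Philadelphia","Miami"\n'
    '"Ninth","Result","Cleveland","Miami"\n'
)

hosts = hosts_by_team(sample)
for host in sorted(hosts):
    print(host, "hosts")
    for guest in sorted(hosts[host]):
        print("   ", guest)
```

For a real file, an object returned by open(path, newline='') can be passed in place of sample, since csv.reader accepts any iterable of lines.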
Answer 1 (score: 1)
You can use a dictionary of sets to keep track of the hosting teams and their unique visiting teams. Here is an example.
import csv
# load the csv file
rows = [r for r in csv.reader(file('sample.csv','r'))]
# order preservation list
preserve_order = []
# track the schedule from the hosting team's point of view
hosting_teams = {}
# change the wanted column here
wanted_column = 3
for row in rows:
    # strip out the double quotes
    row = [c.replace('"','') for c in row]
    the_host = row[2]
    the_order = row[0]
    preserve_order.append(the_order)
    # create a dictionary with a unique set of visiting teams
    host_schedule = hosting_teams.setdefault(the_host,set([]))
    # add the team visit
    visiting_team = row[wanted_column]
    host_schedule.add((visiting_team,the_order))
output = []
for hosting_team,host_schedule in hosting_teams.items():
    for visiting_team,the_order in host_schedule:
        output.append([the_order,"Result",hosting_team,visiting_team])
output.sort(key=lambda x:preserve_order.index(x[0]))
csv.writer(file('output.csv','wb')).writerows(output)
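For the second part of the question (writing out only the games between two given teams), one possible approach is to compare the two team columns of each row as a set, so that it doesn't matter which of the two is the host. The following is a minimal Python 3 sketch, not part of the script above; games_between, the sample rows, and the in-memory output buffer are illustrative assumptions.

```python
import csv
import io


def games_between(lines, team_a, team_b, host_col=2, guest_col=3):
    """Return the rows in which team_a and team_b play each other, in file order."""
    wanted = {team_a, team_b}
    matches = []
    for row in csv.reader(lines):
        # Skip blank or short rows instead of raising IndexError.
        if len(row) <= max(host_col, guest_col):
            continue
        # Set comparison ignores which of the two teams is the host.
        if {row[host_col], row[guest_col]} == wanted:
            matches.append(row)
    return matches


# Rows copied from the question; a hypothetical stand-in for the real file.
sample = io.StringIO(
    '"First","Result","Philadelphia","Miami"\n'
    '"Second","Result","Dallas","Cleveland"\n'
    '"Third","Result","Miami","Cleveland"\n'
    '"Fourth","Result","Cleveland","Miami"\n'
    '"Fifth","Result","Dallas","Philadelphia"\n'
    '"Sixth","Result","Cleveland","Dallas"\n'
    '"Seventh","Result","Miami","Philadelphia"\n'
    '"Eighth","Result","Philadelphia","Miami"\n'
    '"Ninth","Result","Cleveland","Miami"\n'
)

rows = games_between(sample, "Cleveland", "Miami")

# The buffer stands in for open('output_file.csv', 'w', newline='').
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_ALL).writerows(rows)
print(out.getvalue())
```

Because the rows are collected while reading, the output keeps the order of the input file without needing a separate sort step.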