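Starting from a large imported data set, I am trying to identify and print every row that corresponds to a city with at least 2 distinct colleges/universities in it.
So far (relevant code):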
    for line in file:
        fields = line.split(",")
        ID, name, city = fields[0], fields[1], fields[3]
        count = line.count()
        if line.count(city) >= 2:
            if line.count(ID) < 2:
                print "ID:", ID, "Name: ", name, "City: ", city
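In other words, I want to eliminate 1) any duplicate school listings (by ID; many institutions appear repeatedly in this file), and 2) any city that does not have two or more institutions.
Thanks!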
Answer 0 (score: 0)
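dicts will come in handy here. In your case, nested dicts indexed first by city and then by ID should do the trick. Because each ID is a dict key, repeated listings of the same school overwrite one another, so the number of keys for a city is its number of distinct schools.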
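    # will hold cities[city][ID] = [ID, name, city]
    cities = {}
    for line in file:
        # comma-separated; ID, name and city are columns 0, 1 and 3, as in the question
        fields = line.strip().split(",")
        ID, name, city = fields[0], fields[1], fields[3]
        # key by city, then by ID, so repeated IDs collapse into one entry
        cities.setdefault(city, {})[ID] = [ID, name, city]
    # each value of 'cities' maps IDs to schools for one city; keep the cities with at least 2 distinct IDs
    multi_schooled_cities = [ids_by_city.values() for ids_by_city in cities.values() if len(ids_by_city) >= 2]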
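For completeness, here is a minimal end-to-end sketch of the same nested-dict idea that also prints the surviving rows. It is written for Python 3 (the question uses Python 2 print statements) and assumes the file is plain comma-separated text with ID, name and city in columns 0, 1 and 3, as in the question. The function name `schools_in_multi_school_cities` and the sample rows are made up for illustration, and the `csv` module is swapped in for `line.split(",")` so that quoted fields containing commas would still parse.

```python
import csv
import io
from collections import defaultdict


def schools_in_multi_school_cities(rows):
    """Return one (ID, name, city) tuple per distinct school located
    in a city that has at least two distinct schools."""
    cities = defaultdict(dict)  # city -> {ID: (ID, name, city)}
    for fields in rows:
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        school_id = fields[0].strip()
        name = fields[1].strip()
        city = fields[3].strip()
        # setdefault keeps the first listing and ignores later duplicates of the same ID
        cities[city].setdefault(school_id, (school_id, name, city))
    return [
        school
        for schools in cities.values()
        if len(schools) >= 2  # at least two distinct IDs in this city
        for school in schools.values()
    ]


# Hypothetical sample data; the third column is filler standing in for whatever fields[2] holds.
sample = io.StringIO(
    "1,Alpha College,x,Boston\n"
    "2,Beta University,x,Boston\n"
    "1,Alpha College,x,Boston\n"
    "3,Gamma College,x,Austin\n"
    "3,Gamma College,x,Austin\n"
)

result = schools_in_multi_school_cities(csv.reader(sample))
for school_id, name, city in result:
    print("ID:", school_id, "Name:", name, "City:", city)
```

With a real file you would pass `csv.reader(open(path, newline=""))` instead of the in-memory sample. In the sample, the repeated Gamma College rows count as a single school, so Austin is dropped, while Boston keeps its two distinct schools.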