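I have a master list of 100K records, and each record belongs to a company number. I'm trying to create sample data by selecting 2 records from each company number. `def Company` runs through the data and returns the unique company numbers along with a count for each one.
My script only outputs 2 records for the first company in the list and then stops. How can I make the loop output 2 records for every company?
The data looks like this (column [0] is the company number):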
'54', '000054', '14571', ' 0000010023'
'54', '000054', '14571', ' 0000010033'
'4', '000054', '14571', ' 0000010024'
'4', '000054', '14571', ' 0000010023'
'433', '000054', '14571', ' 000001023423'
'433', '000054', '14571', ' 00000101563'
'433', '000054', '14571', ' 00000100234523'
'433', '000054', '14571', ' 00000100657823'
'433', '000054', '14571', ' 0000010SDF023'
'78', '000054', '14571', ' 000001002PIWEUR3'
'78', '000054', '14571', ' 00000100J23'
'78', '000054', '14571', ' 00000100222223'
'78', '000054', '14571', ' 000001002445'
'12', '000054', '14571', ' 0000010256'
'12', '000054', '14571', ' 000001005666'
import os
import sys
import csv
from collections import Counter

masterlist = open('P:/20140408.txt', 'rb')
data = csv.reader(masterlist, delimiter=",", quotechar='"')

def Company():
    masterlist.seek(0)
    cnt = Counter()
    for row in data:
        cnt[row[0]] += 1
    return cnt

def maintest():
    companylist = Company().keys()
    masterlist.seek(0)
    s = 2
    for rows in data:
        if rows[0] in companylist and s > 0:
            print rows
            s -= 1

maintest()
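Answer 0 (score: 1)
Rather than something with a Counter, I would keep a simple mapping of company ID -> number of times you have seen it as you walk through the loop: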
seen = dict()
for row in data:
    n = seen.setdefault(row[0], 0)
    if n < 2:
        print row
        seen[row[0]] += 1
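The script in the question stops after two rows because `s` is a single counter shared by every company: once it reaches 0, the `if` condition is never true again. Below is a minimal sketch of the per-company counter idea, wrapped in a generator so it can be fed any iterable of rows. The function name `first_n_per_key` and the inline rows are illustrative assumptions, not part of the original post, and the sketch uses `print()` calls so that it also runs on Python 3.

```python
from collections import defaultdict


def first_n_per_key(rows, n=2, key_index=0):
    """Yield at most n rows for each distinct value in column key_index."""
    seen = defaultdict(int)  # company number -> rows emitted so far
    for row in rows:
        key = row[key_index]
        if seen[key] < n:
            seen[key] += 1
            yield row


# Hypothetical rows in the same shape as the data in the question.
rows = [
    ['54', '000054', '14571', ' 0000010023'],
    ['54', '000054', '14571', ' 0000010033'],
    ['4', '000054', '14571', ' 0000010024'],
    ['433', '000054', '14571', ' 000001023423'],
    ['433', '000054', '14571', ' 00000101563'],
    ['433', '000054', '14571', ' 00000100234523'],
]

for row in first_n_per_key(rows):
    print(row)
```

Since the function only iterates once, a `csv.reader` object could be passed in place of `rows`, and the file never needs to be rewound with `seek(0)`.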
Answer 1 (score: 0)
If you are looking for a true 'sample' rather than just the first two, and you can hold all of the data in memory, you can do this:
import csv
from collections import defaultdict
from random import sample

data=defaultdict(list)
with open('/tmp/data.csv') as f:
    reader=csv.reader(f, skipinitialspace=True, quotechar="'")
    for line in reader:
        data[line[0]].append(line[1:])

for k in data:
    print k, sample(data[k], 2)
With the sample data saved as a csv file, this prints:
54 [['000054', '14571', ' 0000010023'], ['000054', '14571', ' 0000010033']]
12 [['000054', '14571', ' 0000010256'], ['000054', '14571', ' 000001005666']]
78 [['000054', '14571', ' 000001002445'], ['000054', '14571', ' 00000100J23']]
4 [['000054', '14571', ' 0000010023'], ['000054', '14571', ' 0000010024']]
433 [['000054', '14571', ' 00000100234523'], ['000054', '14571', ' 000001023423']]
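If the data does not fit in memory, or some companies have fewer than 2 records (in which case `random.sample` raises `ValueError`), one possible alternative is reservoir sampling per company. This is a different technique from the one in the answer above: it makes a single pass and keeps only `k` rows per company at any time. The sketch below is written for Python 3; the function name `sample_per_key`, its parameters, and the inline rows are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_per_key(rows, k=2, key_index=0, rng=random):
    """Return {key: up to k rows chosen uniformly at random}, in one pass."""
    reservoirs = defaultdict(list)  # company number -> current sample
    counts = defaultdict(int)       # company number -> rows seen so far
    for row in rows:
        key = row[key_index]
        counts[key] += 1
        bucket = reservoirs[key]
        if len(bucket) < k:
            # Fill the reservoir with the first k rows of this company.
            bucket.append(row)
        else:
            # Keep the new row with probability k / counts[key],
            # overwriting a uniformly chosen slot in the reservoir.
            j = rng.randrange(counts[key])
            if j < k:
                bucket[j] = row
    return dict(reservoirs)


# Hypothetical rows in the same shape as the data in the question.
rows = [
    ['54', '000054', '14571', ' 0000010023'],
    ['54', '000054', '14571', ' 0000010033'],
    ['4', '000054', '14571', ' 0000010024'],
    ['433', '000054', '14571', ' 000001023423'],
    ['433', '000054', '14571', ' 00000101563'],
    ['433', '000054', '14571', ' 00000100234523'],
]

for company, picked in sample_per_key(rows).items():
    print(company, picked)
```

A company with fewer than `k` records simply returns all of its rows. Passing `rng=random.Random(0)` (or any seeded `random.Random` instance) makes the selection reproducible.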