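I have a (large) directory CSV whose columns [0:3] are phone number, name, city, state.
I drew a random sample of 20,000 entries, but of course it is heavily weighted toward the more populous states and cities.
How can I write Python code (using csv or pandas; please no linecache) that prioritizes/weights each unique city and each state equally (separately, not as a pair), and limits each unique city to 3 picks?
TRICKIER idea: how can I write Python code such that, for each random line picked, it checks whether that city has been picked before? If it has, the script ignores the pick and chooses another random line, reducing by one the number of previous picks counted against that city. So say it randomly picks a San Antonio line, and San Antonio has already been picked twice. The script ignores this pick, puts the line back in the list, reduces the number of San Antonio picks currently being counted, and then picks a random line again. If it picks a San Antonio line again, it repeats the process, now reducing the counted San Antonio picks to 0. So it would have to pick San Antonio three times in a row to add another San Antonio line. For future picks, it would have to pick San Antonio four times in a row, plus one more time for each additional pick.
I don't know how well the second option would "spread out" my random picks; it's just an idea, and it looks like a fun way to learn more pythonese. Any other ideas along the same lines would be much appreciated. Insights into statistical sampling and sample spreading are welcome too.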
Answer 0 (score: 2)
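I may be misunderstanding what you're trying to do.
I think what you want is a bit more complicated than this. I don't fully understand your question, but hopefully this example gives you something to think about.
In any case, you'll probably want to use a library for the sampling. In a nutshell, you can use pandas: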
# Group by city, state
groups = df.groupby(['state', 'city'])

# Then get a result with n from each unique city,state
def choose_n(x, n):
    # replace=True draws with replacement, so one row can be picked more than once
    idx = np.random.choice(np.arange(len(x)), n, replace=True)
    return x.take(idx)

num_from_each = 2
sample = groups.apply(choose_n, num_from_each)
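Because choose_n draws with replace=True, the same row can come back twice for one group, and switching to replace=False raises an error for any group with fewer than n rows. If what you want is "at most 3 distinct rows per city", one possible sketch is to shuffle the whole frame once and keep the first few rows of each city. The toy DataFrame and the name max_per_city below are made up for illustration, and I'm grouping by city only, as the question asks:

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real directory
df = pd.DataFrame({
    "city": ["Houston"] * 5 + ["Dallas"] * 2 + ["Springfield"],
    "state": ["TX"] * 7 + ["IL"],
    "name": list("abcdefgh"),
})

max_per_city = 3  # assumed cap, taken from the question

# Shuffle all rows once, then keep the first rows of each city.
# No row is repeated, and small cities simply contribute fewer rows.
shuffled = df.sample(frac=1, random_state=0)
capped = shuffled.groupby("city").head(max_per_city)

print(capped.sort_values("city"))
```

Drop random_state=0 to get a different shuffle on every run.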
As a more complete example of the groupby approach, using the picka library to generate some random data:
import numpy as np
import pandas as pd
import picka

# Generate some realistic random data using "picka"
num = 200
names = [picka.name() for _ in range(num)]
phones = [picka.phone_number() for _ in range(num)]

# Let's limit it to a smaller number of cities and states...
cities = np.random.choice(['Springfield', 'Houston', 'Dallas'], num)
states = np.random.choice(['IL', 'TX', 'TN', 'CA'], num)

df = pd.DataFrame(dict(name=names, phone=phones, city=cities, state=states))

# Group by city, state
groups = df.groupby(['state', 'city'])

# Then get a result with n from each unique city,state
def choose_n(x, n):
    # replace=True draws with replacement, so one row can be picked more than once
    idx = np.random.choice(np.arange(len(x)), n, replace=True)
    return x.take(idx)

num_from_each = 2
sample = groups.apply(choose_n, num_from_each)

print(sample)
This yields:
                          city         name      phone         state
state  city
CA     Dallas       72    Dallas       Sarina    133-258-6775  CA
                    46    Dallas       Dusty     799-563-7043  CA
       Houston      158   Houston      Artie     591-835-3043  CA
                    195   Houston      Federico  899-315-1205  CA
       Springfield  66    Springfield  Ollie     326-076-1329  CA
                    53    Springfield  Li        702-555-6594  CA
IL     Dallas       154   Dallas       Lou       146-404-9668  IL
                    39    Dallas       Ollie     399-256-7836  IL
       Houston      190   Houston      Scarlett  278-499-6901  IL
                    89    Houston      Rhonda    619-966-3691  IL
       Springfield  119   Springfield  Jae       180-444-0253  IL
                    130   Springfield  Tawna     630-953-5200  IL
TN     Dallas       25    Dallas       Frank     475-964-0279  TN
                    50    Dallas       Kiara     764-240-4802  TN
       Houston      95    Houston      Britney   661-490-5178  TN
                    107   Houston      Tommie    648-945-5608  TN
       Springfield  55    Springfield  Kecia     891-643-2644  TN
                    55    Springfield  Kecia     891-643-2644  TN
TX     Dallas       116   Dallas       Mara      636-589-0435  TX
                    98    Dallas       Lajuana   759-788-4742  TX
       Houston      103   Houston      Casey     600-522-2874  TX
                    140   Houston      Rachal    762-082-9017  TX
       Springfield  197   Springfield  Staci     021-981-7593  TX
                    168   Springfield  Sherrill  754-736-8409  TX
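If you would rather draw one overall sample in which big cities and big states don't dominate, another option might be to pass per-row weights to DataFrame.sample. The sketch below is only one reading of "weight each city and each state equally, separately": half of the weight is spread evenly over the cities and half evenly over the states. The 50/50 mix, the toy data, and all variable names are my assumptions, not something from your question:

```python
import pandas as pd

# Hypothetical toy data standing in for the real directory
df = pd.DataFrame({
    "city": ["Houston"] * 6 + ["Dallas"] * 3 + ["Springfield"],
    "state": ["TX"] * 9 + ["IL"],
})

# Number of rows sharing each row's city / state
city_count = df["city"].map(df["city"].value_counts())
state_count = df["state"].map(df["state"].value_counts())

# Every city gets the same total weight, split evenly among its rows;
# likewise for every state.
city_share = 1.0 / (df["city"].nunique() * city_count)
state_share = 1.0 / (df["state"].nunique() * state_count)

# Assumed 50/50 mix of the two weightings
weights = 0.5 * city_share + 0.5 * state_share

# Weighted draw without replacement
sample = df.sample(n=4, weights=weights, random_state=0)
print(sample)
```

Since the draw is without replacement, the inclusion probabilities are only roughly proportional to the weights, and this does not by itself cap a city at 3 rows.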
Answer 1 (score: 1)
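Assuming that what you're actually after is the trickier idea, here is an implementation that should handle it. It doesn't use pandas, which may be a mistake, but I didn't see that as a strict requirement in your question, and I think this is more straightforward: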
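import collections
import csv
import random

def random_city_sample(n, input_file='my_csv.csv'):
    samples = set()
    city_counter = collections.Counter()
    with open(input_file, newline='') as f:
        reader = csv.reader(f, delimiter=",", quotechar="\"")
        # A set removes duplicate rows; its iteration order is arbitrary, not random
        sample_set = set(tuple(row) for row in reader)
    while len(samples) < n and sample_set:
        added_samples = sampling_run(sample_set, city_counter, n - len(samples))
        if not added_samples:
            break  # no streak was long enough in this pass; give up rather than loop forever
        # Add selected samples to universal sample set
        samples.update(added_samples)
        # Remove only those samples which have been successfully selected
        sample_set = sample_set.difference(added_samples)
    return samples

def sampling_run(master_set, city_counter, limit):
    city_ticker = 0
    current_city = ''
    samples_selected = set()
    # Shuffle explicitly so each pass visits the rows in random order
    for entry in random.sample(list(master_set), len(master_set)):
        if len(samples_selected) >= limit:
            break
        city = entry[2]
        if city == current_city:
            city_ticker += 1
        else:
            current_city = city
            city_ticker = 1
        # Accept only after more consecutive hits than the city already has picks
        if city_ticker > city_counter[city]:
            samples_selected.add(entry)  # add(), not update(): keep the row as one tuple
            city_counter[city] += 1
            city_ticker = 0  # the next pick for this city needs a fresh streak
    return samples_selected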
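Note that this can struggle with a very sparse CSV (few rows per city), because a city then rarely or never shows up often enough in a row. Changing the iteration to draw random samples gets around that, though I'm not sure that's what you want: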
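import collections
import csv
import random

def random_city_sample(n, input_file='my_csv.csv'):
    samples = set()
    city_counter = collections.Counter()
    with open(input_file, newline='') as f:
        reader = csv.reader(f, delimiter=",", quotechar="\"")
        # Remove duplicate rows, then keep a list so rows can be drawn by index
        sample_list = list(set(tuple(row) for row in reader))
    # The streak state lives outside the loop so it survives between draws
    city_ticker = 0
    current_city = ''
    while len(samples) < n and sample_list:
        i = random.randrange(len(sample_list))
        entry = sample_list[i]
        city = entry[2]
        if city == current_city:
            city_ticker += 1
        else:
            current_city = city
            city_ticker = 1
        # Accept only after more consecutive draws than the city already has picks
        if city_ticker > city_counter[city]:
            samples.add(entry)
            city_counter[city] += 1
            city_ticker = 0
            # Remove the row in O(1): overwrite it with the last row, then drop the last slot
            sample_list[i] = sample_list[-1]
            sample_list.pop()
    return samples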
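To see what the "consecutive picks" rule does without touching a file, here is a small deterministic sketch that replays a fixed sequence of city draws and reports which ones get accepted. The function name accepted_draws and the draw sequence are made up for illustration, and resetting the streak after each accepted pick is my reading of your description:

```python
from collections import Counter

def accepted_draws(city_draws):
    """Return the indices of the draws that the streak rule accepts."""
    picked = Counter()            # rows each city has contributed so far
    streak_city, streak = None, 0
    accepted = []
    for i, city in enumerate(city_draws):
        # Extend the streak if the same city repeats, else start a new one
        streak = streak + 1 if city == streak_city else 1
        streak_city = city
        # A city with k earlier picks needs k + 1 draws in a row
        if streak > picked[city]:
            accepted.append(i)
            picked[city] += 1
            streak = 0            # assumed: the next pick needs a fresh streak
    return accepted

draws = ["San Antonio", "Austin", "San Antonio", "San Antonio", "Austin",
         "San Antonio", "San Antonio", "San Antonio"]
print(accepted_draws(draws))  # [0, 1, 3, 7]
```

San Antonio gets in on its first draw, then needs two in a row (index 3), then three in a row (index 7), so each additional row from the same city becomes progressively harder to add.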
I hope that helps! Let me know if you have any other questions.