(Re-)weighting a random CSV sample in Python

Time: 2014-01-08 18:10:25

Tags: python python-2.7 csv random pandas

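I have a (large) directory CSV whose columns [0:3] are phone number, name, city, state.

I created a random sample of 20,000 entries, but of course it is heavily weighted toward the more populous states and cities.

How can I write Python code (using csv or pandas; please no linecache) that weights each unique city and each state equally (each on its own, not as a pair) and limits each unique city to 3 picks?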


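TRICKIER IDEA: How can I write Python code so that, for each random line that gets picked, it checks whether that city has been picked before? If the city has been picked before, it ignores the line and picks a random line again, reducing the number of previously considered picks for that city by one. So, say it randomly picks San Antonio, which has been picked twice before. The script ignores this pick, puts it back into the list, reduces the number of currently considered San Antonio picks to one, then randomly picks a line again. If it picks a line from San Antonio again, it repeats the previous process, now reducing the considered San Antonio picks to 0. So it would have to pick San Antonio three times in a row to add another line from San Antonio. For future picks, it would have to pick San Antonio four times in a row, plus one more for each additional pick.

I don't know how well the second option would "spread out" my random picks; it's just an idea, and it looks like a fun way to learn more pythonese. Any other ideas along the same line of thought would be greatly appreciated. Insights into statistical sampling and sample spreading are also welcome.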

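2 Answers:

Answer 0: (score: 2)

I may be misunderstanding what you want to do.

I think what you want is a bit more complicated. I don't quite follow your question, but hopefully this example gives you something to think about.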

That said, you may want to use one of the various libraries for sampling. Either way, you can do this in a few lines with pandas:

# Group by city, state
groups = df.groupby(['state', 'city'])

# Then get a result with n from each unique city,state
def choose_n(x, n):
    idx = np.random.choice(np.arange(len(x)), n, replace=True)
    return x.take(idx)

num_from_each = 2
sample = groups.apply(choose_n, num_from_each)

As a more complete example, using the picka library to generate some random data:

import numpy as np
import pandas as pd
import picka

# Generate some realistic random data using "picka"
num = 200
names = [picka.name() for _ in range(num)]
phones = [picka.phone_number() for _ in range(num)]
# Let's limit it to a smaller number of cities and states...
cities = np.random.choice(['Springfield', 'Houston', 'Dallas'], num)
states = np.random.choice(['IL', 'TX', 'TN', 'CA'], num)

df = pd.DataFrame(dict(name=names, phone=phones, city=cities, state=states))

# Group by city, state
groups = df.groupby(['state', 'city'])

# Then get a result with n from each unique city,state
def choose_n(x, n):
    idx = np.random.choice(np.arange(len(x)), n, replace=True)
    return x.take(idx)

num_from_each = 2
sample = groups.apply(choose_n, num_from_each)
print(sample)

This results in:

                              city      name         phone state
state city
CA    Dallas      72        Dallas    Sarina  133-258-6775    CA
                  46        Dallas     Dusty  799-563-7043    CA
      Houston     158      Houston     Artie  591-835-3043    CA
                  195      Houston  Federico  899-315-1205    CA
      Springfield 66   Springfield     Ollie  326-076-1329    CA
                  53   Springfield        Li  702-555-6594    CA
IL    Dallas      154       Dallas       Lou  146-404-9668    IL
                  39        Dallas     Ollie  399-256-7836    IL
      Houston     190      Houston  Scarlett  278-499-6901    IL
                  89       Houston    Rhonda  619-966-3691    IL
      Springfield 119  Springfield       Jae  180-444-0253    IL
                  130  Springfield     Tawna  630-953-5200    IL
TN    Dallas      25        Dallas     Frank  475-964-0279    TN
                  50        Dallas     Kiara  764-240-4802    TN
      Houston     95       Houston   Britney  661-490-5178    TN
                  107      Houston    Tommie  648-945-5608    TN
      Springfield 55   Springfield     Kecia  891-643-2644    TN
                  55   Springfield     Kecia  891-643-2644    TN
TX    Dallas      116       Dallas      Mara  636-589-0435    TX
                  98        Dallas   Lajuana  759-788-4742    TX
      Houston     103      Houston     Casey  600-522-2874    TX
                  140      Houston    Rachal  762-082-9017    TX
      Springfield 197  Springfield     Staci  021-981-7593    TX
                  168  Springfield  Sherrill  754-736-8409    TX
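
Note that the output above contains the same row twice (index 55, Kecia under TN / Springfield), because `replace=True` samples with replacement. If repeated rows are not wanted and each city should be capped at 3 picks, as the question asks, one possible variant is to shuffle the whole frame and keep the first few rows of each city. This is only a sketch: the helper name `sample_capped_per_city`, its `cap` and `seed` parameters, and the toy data are made up for illustration, and it groups by city alone rather than by state and city.

```python
import pandas as pd

def sample_capped_per_city(df, cap=3, seed=None):
    """Return at most `cap` distinct rows per city, chosen at random."""
    # Shuffle every row once; frac=1 keeps all rows, just in random order
    shuffled = df.sample(frac=1, random_state=seed)
    # head(cap) keeps the first `cap` rows of each group, so a city with
    # fewer than `cap` rows simply contributes everything it has
    return shuffled.groupby("city").head(cap)

# Hypothetical toy data: one large city and one with a single row
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "city": ["Houston"] * 5 + ["Dallas"],
    "state": ["TX"] * 6,
})
sample = sample_capped_per_city(df, cap=3, seed=0)
```

Because the rows are shuffled rather than drawn with replacement, no row can appear twice, and small cities are not padded with repeats.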

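Answer 1: (score: 1)

Assuming that what you are actually after is the trickier idea, here is an implementation that should handle it. It doesn't use pandas, which may be a mistake, but I didn't see that as a strict requirement of your question, and I think this is more straightforward: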

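import collections
import csv

def random_city_sample(n, input_file='my_csv.csv'):
    samples = set()
    city_counter = collections.Counter()
    with open(input_file) as f:
        reader = csv.reader(f, delimiter=",", quotechar="\"")
        # A set removes duplicate rows; its iteration order is arbitrary,
        # which is not the same as a true random shuffle
        sample_set = set(tuple(row) for row in reader)

    while len(samples) < n:
        added_samples = sampling_run(sample_set, city_counter)
        if not added_samples:
            # Nothing could be selected on this pass; stop instead of looping forever
            break

        # Add selected samples to universal sample set (may overshoot n)
        samples.update(added_samples)

        # Remove only those samples which have been successfully selected
        sample_set = sample_set.difference(added_samples)
    return samples

def sampling_run(master_set, city_counter):
    city_ticker = 0
    current_city = ''
    samples_selected = set()
    for entry in master_set:
        city = entry[2]
        if city == current_city:
            city_ticker += 1
        else:
            current_city = city
            city_ticker = 1
        # Select only when the run of consecutive hits exceeds the prior picks
        if city_ticker > city_counter[city]:
            samples_selected.add(entry)  # add the whole row, not its fields
            city_counter[city] += 1      # remember this pick for the city
            city_ticker = 0              # the next pick needs a fresh run
    return samples_selected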

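This does mean there could be a problem if you have a very sparse csv. Changing the iteration to a random sample gets around that, but I'm not sure whether that is what you want: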

import collections
import csv
import random

def random_city_sample(n, input_file='my_csv.csv'):
    samples = set()
    city_counter = collections.Counter()
    with open(input_file) as f:
        reader = csv.reader(f, delimiter=",", quotechar="\"")
        # Deduplicate rows; a list lets random.choice index into the pool
        sample_pool = list(set(tuple(row) for row in reader))

    # Run state lives outside the loop so consecutive hits are remembered
    city_ticker = 0
    current_city = None

    while len(samples) < n and sample_pool:
        entry = random.choice(sample_pool)
        city = entry[2]
        if city == current_city:
            city_ticker += 1
        else:
            current_city = city
            city_ticker = 1
        if city_ticker > city_counter[city]:
            samples.add(entry)
            sample_pool.remove(entry)
            city_counter[city] += 1  # remember this pick for the city
            city_ticker = 0          # the next pick needs a fresh run
    return samples
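
For comparison, the plainer requirement in the question (every city equally likely, at most 3 rows per city) could be approached with two-stage sampling: first pick a city uniformly at random, then pick one of its remaining rows. The sketch below is only one way to read that requirement. The function name `two_stage_sample`, its parameters, and the toy rows are made up for illustration, and it assumes the city is in column 2, as in the question.

```python
import collections
import random

def two_stage_sample(rows, n, key_index=2, cap=3, seed=None):
    """Pick a group uniformly at random, then a row uniformly within it."""
    rng = random.Random(seed)
    groups = collections.defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)

    keys = list(groups)                  # cities still eligible for a pick
    picked = collections.Counter()       # picks made so far per city
    sample = []
    while len(sample) < n and keys:
        key = rng.choice(keys)           # every remaining city equally likely
        pool = groups[key]
        # pop() removes the row, so no row can be drawn twice
        sample.append(pool.pop(rng.randrange(len(pool))))
        picked[key] += 1
        if not pool or picked[key] >= cap:
            keys.remove(key)             # city exhausted or cap reached
    return sample

# Hypothetical toy rows: (phone, name, city, state)
rows = [("555-000%d" % i, "name%d" % i, "Houston", "TX") for i in range(50)]
rows += [("555-100%d" % i, "other%d" % i, "Marfa", "TX") for i in range(2)]
rows += [("555-2000", "solo", "Springfield", "IL")]
sample = two_stage_sample(rows, n=10, seed=1)
city_counts = collections.Counter(row[2] for row in sample)
```

With these toy rows only 6 rows can ever be returned (3 from Houston, 2 from Marfa, 1 from Springfield), so asking for 10 simply exhausts every city. Passing `key_index=3` would apply the same idea to states instead of cities.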

I hope that helps! Let me know if you have any further questions.