Read a small random sample from a large CSV file into a Python data frame

Time: 2014-03-07 19:00:09

Tags: python pandas random io import-from-csv

The CSV file I want to read does not fit into main memory. How can I read a few (~10K) random rows of it and do some simple statistics on the selected data frame?

13 answers:

Answer 0: (score: 48)

Assuming there is no header in the CSV file:

import pandas
import random

n = 1000000  # number of records in file
s = 10000    # desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n), n - s))  # row indices to skip
df = pandas.read_csv(filename, skiprows=skip)

It would be better if read_csv had a keeprows option, or if skiprows accepted a callback function instead of a list.

With a header and unknown file length:

import pandas
import random

filename = "data.txt"
with open(filename) as f:
    n = sum(1 for line in f) - 1  # number of records in file (excludes header)
s = 10000  # desired sample size
skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)

Answer 1: (score: 23)

@dlm's answer is great, but since v0.20.0, skiprows does accept a callable. The callable receives the row number as its argument.

If you can specify the fraction of lines you want, rather than how many lines, you don't even need to get the file size, and you only need to read through the file once. Assuming there is a header on the first row:

import pandas as pd
import random

filename = "data.txt"
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of the lines:
# if a random draw from the [0, 1) interval is greater than 0.01, the row is skipped
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)

Or, if you want to take every nth line:

n = 100  # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)

Answer 2: (score: 15)

This is not in pandas, but it achieves the same result much faster via bash:

shuf -n 100000 data/original.tsv > data/sample.tsv

The shuf command shuffles its input, and the -n argument indicates how many lines are wanted in the output.

Related question: https://unix.stackexchange.com/q/108581

Benchmark on a 7M-line CSV available here (2008):

The top answer's method:

import random
import pandas

def pd_read():
    filename = "2008.csv"
    with open(filename) as f:
        n = sum(1 for line in f) - 1  # number of records in file (excludes header)
    s = 100000  # desired sample size
    skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")

%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s

With shuf:

time shuf -n 100000 2008.csv > temp.csv

real    0m1.583s
user    0m1.445s
sys     0m0.136s

So shuf is about 12 times faster and, importantly, does not read the whole file into memory.

Answer 3: (score: 10)

Here is an algorithm that does not require counting the number of lines in the file beforehand, so you only need to read the file once.

Say you want m samples. First, the algorithm keeps the first m lines. Then, when it sees the i-th line (i > m), with probability m/i it uses that line to randomly replace one of the already-selected samples.

By doing this, for any i > m, we always have a subset of m samples randomly selected from the first i lines.

See the code below:

import random

n_samples = 10
samples = []

with open("data.txt") as f:  # the big file (the original snippet assumed f was already open)
    for i, line in enumerate(f):
        if i < n_samples:
            samples.append(line)
        elif random.random() < n_samples / (i + 1):
            samples[random.randint(0, n_samples - 1)] = line
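Since the question ultimately wants a data frame, the sampled lines from this reservoir approach can be handed to pandas afterwards. A runnable sketch (the file name and toy contents are placeholders of mine; it assumes a header line read separately, before the reservoir loop):

```python
import io
import random

import pandas as pd

# Toy CSV standing in for the big file (hypothetical name and contents)
with open("big.csv", "w") as fh:
    fh.write("a,b\n")
    for i in range(10_000):
        fh.write(f"{i},{i + 1}\n")

n_samples = 10
samples = []

with open("big.csv") as f:
    header = f.readline()  # keep the header out of the reservoir
    for i, line in enumerate(f):
        if i < n_samples:
            samples.append(line)
        elif random.random() < n_samples / (i + 1):
            samples[random.randint(0, n_samples - 1)] = line

# Rebuild a small CSV in memory and let pandas parse it
df = pd.read_csv(io.StringIO(header + "".join(samples)))
```

The result is a data frame of exactly n_samples uniformly sampled rows, obtained in a single pass over the file.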

Answer 4: (score: 2)

The following code reads the header first, and then a random sample of the other lines:

import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)

Answer 5: (score: 2)

You can also create a sample of 10000 records before bringing it into the Python environment.

Using Git Bash (Windows 10), I just ran the following command to produce the sample:

shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv

Note: this is not the best solution if your CSV file has a header.

Answer 6: (score: 1)

No pandas!

import random
from os import fstat
from sys import exit

# Binary mode: relative seeks with a nonzero offset are not allowed on
# text-mode files in Python 3, so the sampled lines come back as bytes
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we may have stopped in the middle of a line
    # Skip the current, partial line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip bytes again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print("   min_bytes_to_skip")
            print("   max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline())

print(sampled_lines)

You end up with the sampled lines in the sampled_lines list.

Answer 7: (score: 1)

Use subsample


Answer 8: (score: 1)

TL;DR

If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code:

import pandas as pd
import numpy as np

filename = "data.csv"
sample_size = 10000
batch_size = 200

rng = np.random.default_rng()

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

sample = sample_reader.get_chunk(sample_size)

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk

Explanation

Knowing the size of the input CSV file is not always that simple.

If there are embedded line breaks, tools like wc or shuf will give you the wrong answer or just make a mess out of your data.
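For example (toy file name and contents are mine), a quoted field containing a newline makes the physical line count disagree with the actual row count:

```python
import pandas as pd

# A CSV whose single data row contains an embedded newline inside quotes
with open("tricky.csv", "w") as f:
    f.write('a,b\n1,"line one\nline two"\n')

df = pd.read_csv("tricky.csv")

with open("tricky.csv") as f:
    physical_lines = sum(1 for _ in f)

# pandas sees 1 data row, but a line-oriented tool like wc sees 3 physical lines
```

Line-oriented sampling on such a file could split that quoted field across two "rows", corrupting the data.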

So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample and then, for each subsequent line in the file, randomly replace a line in the initial sample.

To do that efficiently, we load the CSV file with a TextFileReader by passing the chunksize= parameter:

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

First, we get the initial sample:

sample = sample_reader.get_chunk(sample_size)

Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as the chunk, where each integer lies within the range of the index of the initial sample (which happens to be the same as range(sample_size)):

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))

and use this reindexed chunk to replace (some of the) lines in the sample:

sample.loc[chunk.index] = chunk

After the for loop, you will have a data frame at most sample_size rows long, but with random lines selected from the big CSV file.

To make the loop more efficient, you can set batch_size as large as your memory allows (and yes, even larger than sample_size if you can).

Note that, while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size, because the last chunk in the loop may be smaller.

On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though there may be fewer lines in the file than sample_size. This is because there won't be any chunks left to loop over in that case, so it will never be a problem.
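As a toy-sized sanity check of the chunked-replacement idea (the file name and sizes are mine, and I assign through .iloc with .to_numpy() rather than .loc to sidestep duplicate-label alignment), the sample stays at sample_size rows while rows from later chunks keep entering it:

```python
import numpy as np
import pandas as pd

# Toy CSV: header + 1000 data rows (hypothetical file name)
with open("toy_big.csv", "w") as f:
    f.write("x\n")
    for i in range(1000):
        f.write(f"{i}\n")

sample_size = 100
batch_size = 20
rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

reader = pd.read_csv("toy_big.csv", dtype=str, chunksize=batch_size)
sample = reader.get_chunk(sample_size)  # the initial sample

for chunk in reader:
    # Positions in the sample to overwrite with this chunk's rows
    pos = rng.integers(sample_size, size=len(chunk))
    sample.iloc[pos] = chunk.to_numpy()
```

Every row of the file is seen exactly once, and the sample never grows beyond sample_size.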

Answer 9: (score: 0)

from random import randint

class magic_checker:
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self, x):
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target * 2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)
with open("big.csv") as f:
    f.seek(seek_target)
    f.readline()  # discard this (probably partial) line
    # iter(callable, sentinel) stops when readline() "equals" the sentinel,
    # which magic_checker's __eq__ arranges to happen after nlines calls
    rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))

# do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))

Something like this should work, I think.

Answer 10: (score: 0)

If you have loan.csv, for example, you can use this script to easily load the specified number of random rows:

data = pd.read_csv('loan.csv').sample(10000, random_state=44)

Answer 11: (score: 0)

Read the data file

import pandas as pd
df = pd.read_csv('data.csv')

First check the shape of df

df.shape

Create a small sample of 1000 rows from the original df

sample_data = df.sample(n=1000, replace=False)

# Check the shape of sample_data

sample_data.shape

Answer 12: (score: -1)

Suppose you want to load a 20% sample of the dataset:

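One way to load an approximate 20% sample is the skiprows callable from Answer 1; a runnable sketch with a toy file of my own:

```python
import random

import pandas as pd

# Toy dataset standing in for the real file (hypothetical name and contents)
with open("dataset.csv", "w") as f:
    f.write("col\n")
    for i in range(10_000):
        f.write(f"{i}\n")

p = 0.2  # keep roughly 20% of the data rows
random.seed(1)  # seeded only so the sketch is reproducible
df = pd.read_csv(
    "dataset.csv",
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p,
)
```

The sample size is binomial around p times the row count, so expect roughly, not exactly, 20% of the rows.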