The CSV file I want to read does not fit into main memory. How can I read a few (~10K) random rows of it and do some simple statistics on the selected dataframe?
Answer 0 (score: 48)
Assuming there is no header in the CSV file:
import pandas
import random

n = 1000000  # number of records in file
s = 10000    # desired sample size
filename = "data.txt"
# build a sorted list of the n-s row numbers to skip, leaving s random rows
skip = sorted(random.sample(range(n), n - s))
df = pandas.read_csv(filename, skiprows=skip)
It would be better if read_csv had a keeprows option, or if skiprows accepted a callback function instead of a list.
With a header and unknown file length:
import pandas
import random

filename = "data.txt"
n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
s = 10000  # desired sample size
# the 0-indexed header row is never put in the skip list
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pandas.read_csv(filename, skiprows=skip)
Answer 1 (score: 23)
@dlm's answer is great, but since v0.20.0, skiprows does accept a callable. The callable receives the row number as its argument.
If you can specify what fraction of lines you want, rather than how many lines, you don't even need to get the file size; you just need to read through the file once. Assuming the first row is a header:
import pandas as pd
import random

filename = "data.csv"  # path to your file (name assumed)
p = 0.01  # 1% of the lines

# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)
Or, if you want to take every nth line:
n = 100 # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
Answer 2 (score: 15)
This is not in Pandas, but it achieves the same result much faster via bash:
shuf -n 100000 data/original.tsv > data/sample.tsv
The shuf command shuffles its input, and the -n argument indicates how many lines are wanted in the output.
Related question: https://unix.stackexchange.com/q/108581
Benchmark on a 7M-line csv, available here (2008):
Using the answer above:
import random
import pandas

def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
    s = 100000  # desired sample size
    # the 0-indexed header row is never put in the skip list
    skip = sorted(random.sample(range(1, n + 1), n - s))
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")
%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s
And when using shuf:
time shuf -n 100000 2008.csv > temp.csv
real 0m1.583s
user 0m1.445s
sys 0m0.136s
So shuf is about 12x faster and, importantly, does not read the whole file into memory.
Answer 3 (score: 10)
Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once. It is the classic reservoir-sampling technique.
Suppose you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), with probability m/i it uses that sample to replace one of the already-selected samples at random. By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.
See the code below:
import random

n_samples = 10
samples = []

with open("data.txt") as f:  # file to sample from (name assumed)
    for i, line in enumerate(f):
        if i < n_samples:
            # keep the first n_samples lines unconditionally
            samples.append(line)
        elif random.random() < n_samples * 1.0 / (i + 1):
            # with probability n_samples/(i+1), replace a random kept line
            samples[random.randint(0, n_samples - 1)] = line
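To run simple statistics on those sampled lines with pandas, one option (a sketch, assuming the file has no header row) is to parse them back into a dataframe:
import io
import pandas as pd

# join the sampled raw lines and parse them as csv
df = pd.read_csv(io.StringIO("".join(samples)), header=None)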
Answer 4 (score: 2)
The following code first reads the header, then reads a random sample over the remaining rows:
import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
# pick the line numbers to skip; row 0 (the header) is never skipped
lines2skip = np.random.choice(np.arange(1, nlinesfile + 1), (nlinesfile - nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)
Answer 5 (score: 2)
You can also create a sample with 10000 records before bringing it into the Python environment.
Using Git Bash (Windows 10), I simply ran the following command to produce the sample:
shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv
Something to note: this is not the best solution if your CSV file has a header row.
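One workaround, a sketch assuming a single header row and the same file names as above, is to copy the header first and shuffle only the remaining lines:
head -n 1 BIGFILE.csv > SAMPLEFILE.csv
tail -n +2 BIGFILE.csv | shuf -n 10000 >> SAMPLEFILE.csv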
Answer 6 (score: 1)
No pandas!
import random
from os import fstat
from sys import exit

# open in binary mode so relative seeks work in Python 3
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip the current (possibly partial) line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip bytes again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print("  min_bytes_to_skip")
            print("  max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline())

print(sampled_lines)
You will end up with a sampled_lines list. What kind of statistics do you mean?
Answer 7 (score: 1)
Use subsample.
Answer 8 (score: 1)
If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code:
import pandas as pd
import numpy as np

filename = "data.csv"
sample_size = 10000
batch_size = 200

rng = np.random.default_rng()

# read the file lazily, batch_size rows at a time
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

# the first sample_size rows form the initial sample
sample = sample_reader.get_chunk(sample_size)

# each later chunk overwrites randomly chosen rows of the sample
for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk
Knowing the size of the input CSV file is not always straightforward.
If there are embedded line breaks, tools like wc or shuf will give you the wrong answer or will simply mangle your data.
So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample, and then, for each subsequent line in the file, randomly replace a line in the initial sample.
To do that efficiently, we load the CSV file with a TextFileReader by passing the chunksize= parameter:
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
First, we get the initial sample:
sample = sample_reader.get_chunk(sample_size)
Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as the chunk, where each integer lies within the range of the index of the initial sample (which happens to be the same as range(sample_size)):
for chunk in sample_reader:
chunk.index = rng.integers(sample_size, size=len(chunk))
and use this reindexed chunk to replace (some of the) rows in the sample:
sample.loc[chunk.index] = chunk
After the for loop, you will have a dataframe at most sample_size rows long, with its rows picked at random from the large CSV file.
To make the loop more efficient, you can set batch_size as large as your memory allows (and yes, even larger than sample_size if you can).
Note that while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size, because the last chunk in the loop could be smaller.
On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though the file may have fewer lines than sample_size. This is because in that case there won't be any chunks left to loop over, so it will never be a problem.
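As a side note going beyond the original answer, you can seed the generator if you need the same sample on every run:
rng = np.random.default_rng(seed=42)  # a fixed seed makes the sample reproducible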
Answer 9 (score: 0)
from random import randint

class magic_checker:
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self, x):
        # claim "equality" once target_count comparisons have happened,
        # which makes iter() below stop after that many lines
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target * 2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)

with open("big.csv") as f:
    f.seek(seek_target)
    f.readline()  # discard this (probably partial) line
    rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))

# do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))
Something like this should work, I think.
Answer 10 (score: 0)
For example, if you have loan.csv, you can use this script to easily load the specified number of random items:
data = pd.read_csv('loan.csv').sample(10000, random_state=44)
Answer 11 (score: 0)
import pandas as pd

df = pd.read_csv('data.csv')
df.shape

# take 1000 random rows, without replacement
sample_data = df.sample(n=1000, replace=False)

# check the shape of sample_data
sample_data.shape
Answer 12 (score: -1)
Suppose you want to load a 20% sample of the dataset:
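A minimal sketch of one way to do that, reusing the skiprows callable from Answer 1; the file name and single-header-row assumption are mine:
import pandas as pd
import random

p = 0.2  # keep roughly 20% of the data rows
df = pd.read_csv(
    "data.csv",
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)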