The CSV file I want to read does not fit into main memory. How can I read a few (~10K) random rows of it and do some simple statistics on the selected dataframe?
Answer 0 (score: 48)
Assuming there is no header in the CSV file:
import pandas
import random

n = 1000000  # number of records in file
s = 10000    # desired sample size
filename = "data.txt"
# build a sorted list of the n-s row numbers to skip, leaving s random rows
skip = sorted(random.sample(range(n), n - s))
df = pandas.read_csv(filename, skiprows=skip)
It would be better if read_csv had a keeprows option, or if skiprows accepted a callback function instead of a list.
With a header and unknown file length:
import pandas
import random

filename = "data.txt"
n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
s = 10000  # desired sample size
# the 0-indexed header row is never put in the skip list
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pandas.read_csv(filename, skiprows=skip)
Answer 1 (score: 23)
@dlm's answer is great, but since v0.20.0, skiprows does accept a callable. The callable receives the row number as its argument.
If you can specify what fraction of lines you want, rather than how many lines, you don't even need to get the file size; you just need to read through the file once. Assuming the first row is a header:
import pandas as pd
import random

filename = "data.csv"  # path to your file (name assumed)
p = 0.01  # 1% of the lines

# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)
Or, if you want to take every nth line:
n = 100 # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
Answer 2 (score: 15)
This is not in Pandas, but it achieves the same result much faster via bash:
shuf -n 100000 data/original.tsv > data/sample.tsv
The shuf command shuffles its input, and the -n argument indicates how many lines are wanted in the output.
Related question: https://unix.stackexchange.com/q/108581
Benchmark on a 7M-line csv, available here (2008):
Using the answer above:
import random
import pandas

def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
    s = 100000  # desired sample size
    # the 0-indexed header row is never put in the skip list
    skip = sorted(random.sample(range(1, n + 1), n - s))
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")
%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s
And when using shuf:
time shuf -n 100000 2008.csv > temp.csv
real 0m1.583s
user 0m1.445s
sys 0m0.136s
So shuf is about 12x faster and, importantly, does not read the whole file into memory.
Answer 3 (score: 10)
Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once. It is the classic reservoir-sampling technique.
Suppose you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), with probability m/i it uses that sample to replace one of the already-selected samples at random. By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.
See the code below:
import random

n_samples = 10
samples = []

with open("data.txt") as f:  # file to sample from (name assumed)
    for i, line in enumerate(f):
        if i < n_samples:
            # keep the first n_samples lines unconditionally
            samples.append(line)
        elif random.random() < n_samples * 1.0 / (i + 1):
            # with probability n_samples/(i+1), replace a random kept line
            samples[random.randint(0, n_samples - 1)] = line
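To run simple statistics on those sampled lines with pandas, one option (a sketch, assuming the file has no header row) is to parse them back into a dataframe:
import io
import pandas as pd

# join the sampled raw lines and parse them as csv
df = pd.read_csv(io.StringIO("".join(samples)), header=None)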
Answer 4 (score: 2)
The following code first reads the header, then reads a random sample over the remaining rows:
import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
# pick the line numbers to skip; row 0 (the header) is never skipped
lines2skip = np.random.choice(np.arange(1, nlinesfile + 1), (nlinesfile - nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)
Answer 5 (score: 2)
You can also create a sample with 10000 records before bringing it into the Python environment.
Using Git Bash (Windows 10), I simply ran the following command to produce the sample:
shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv
Something to note: this is not the best solution if your CSV file has a header row.
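One workaround, a sketch assuming a single header row and the same file names as above, is to copy the header first and shuffle only the remaining lines:
head -n 1 BIGFILE.csv > SAMPLEFILE.csv
tail -n +2 BIGFILE.csv | shuf -n 10000 >> SAMPLEFILE.csv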
Answer 6 (score: 1)
No pandas!
import random
from os import fstat
from sys import exit

# open in binary mode so relative seeks work in Python 3
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip the current (possibly partial) line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip bytes again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print("  min_bytes_to_skip")
            print("  max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline())

print(sampled_lines)
You will end up with a sampled_lines list. What kind of statistics do you mean?
Answer 7 (score: 1)
Use subsample.
Answer 8 (score: 1)
If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code:
import pandas as pd
import numpy as np

filename = "data.csv"
sample_size = 10000
batch_size = 200

rng = np.random.default_rng()

# read the file lazily, batch_size rows at a time
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

# the first sample_size rows form the initial sample
sample = sample_reader.get_chunk(sample_size)

# each later chunk overwrites randomly chosen rows of the sample
for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk
Knowing the size of the input CSV file is not always straightforward.
If there are embedded line breaks, tools like wc or shuf will give you the wrong answer or will simply mangle your data.
So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample, and then, for each subsequent line in the file, randomly replace a line in the initial sample.
To do that efficiently, we load the CSV file with a TextFileReader by passing the chunksize= parameter:
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
First, we get the initial sample:
sample = sample_reader.get_chunk(sample_size)
Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as the chunk, where each integer lies within the range of the index of the initial sample (which happens to be the same as range(sample_size)):
for chunk in sample_reader:
chunk.index = rng.integers(sample_size, size=len(chunk))
and use this reindexed chunk to replace (some of the) rows in the sample:
sample.loc[chunk.index] = chunk
After the for loop, you will have a dataframe at most sample_size rows long, with its rows picked at random from the large CSV file.
To make the loop more efficient, you can set batch_size as large as your memory allows (and yes, even larger than sample_size if you can).
Note that while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size, because the last chunk in the loop could be smaller.
On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though the file may have fewer lines than sample_size. This is because in that case there won't be any chunks left to loop over, so it will never be a problem.
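As a side note going beyond the original answer, you can seed the generator if you need the same sample on every run:
rng = np.random.default_rng(seed=42)  # a fixed seed makes the sample reproducible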
Answer 9 (score: 0)
from random import randint

class magic_checker:
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self, x):
        # claim "equality" once target_count comparisons have happened,
        # which makes iter() below stop after that many lines
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target * 2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)

with open("big.csv") as f:
    f.seek(seek_target)
    f.readline()  # discard this (probably partial) line
    rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))

# do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))
Something like this should work, I think.
Answer 10 (score: 0)
For example, if you have loan.csv, you can use this script to easily load the specified number of random items:
data = pd.read_csv('loan.csv').sample(10000, random_state=44)
Answer 11 (score: 0)
import pandas as pd

df = pd.read_csv('data.csv')
df.shape

# take 1000 random rows, without replacement
sample_data = df.sample(n=1000, replace=False)

# check the shape of sample_data
sample_data.shape
Answer 12 (score: -1)
Suppose you want to load a 20% sample of the dataset:
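A minimal sketch of one way to do that, reusing the skiprows callable from Answer 1; the file name and single-header-row assumption are mine:
import pandas as pd
import random

p = 0.2  # keep roughly 20% of the data rows
df = pd.read_csv(
    "data.csv",
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)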