I want to read a random sample from a large .bz2 file,
similar to how you would sample rows from a CSV file, as in this example:
import pandas
import random
n = 1000000  # number of records in file
s = 10000  # desired sample size
filename = "data.csv"
# Skip n - s randomly chosen data rows; row 0 (the header) is never skipped
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pandas.read_csv(filename, skiprows=skip)
I have figured out how to read the file in chunks, but the result is not random.
import os, json
import pandas as pd
import numpy as np
import glob
import random
pd.set_option('display.max_columns', None)
temp = pd.DataFrame()
path_to_json = '/content/drive/My Drive/Loghost/'
json_pattern = os.path.join(path_to_json,'*.bz2')
file_list = glob.glob(json_pattern)
for file in file_list:
    chunks = pd.read_json(file, lines=True, chunksize=3000000)
    i = 0
    chunk_list = []
    for chunk in chunks:
        i += 1
        user = chunk[random.sample(chunk.UserName)]  # I want to take a random sample of 100 users
        chunk_list.append(user)
        print("Progress:", i)
        del chunk
    df = pd.concat(chunk_list, sort=True)
    temp = temp.append(df, sort=True)
The commented line above is where I try to randomize the rows by selecting a random sample of users, but it does not seem to work. Any ideas?
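For reference, here is a minimal sketch of one possible approach: reservoir sampling over the lines of the compressed file, using only the standard library. It samples rows, not users. It assumes each line of the .bz2 file holds one JSON object (which is what `lines=True` implies). The function name `reservoir_sample_jsonl_bz2` and its parameters are hypothetical, made up for this illustration.

```python
import bz2
import json
import random


def reservoir_sample_jsonl_bz2(path, k, seed=None):
    """Return up to k records chosen uniformly at random from a
    bz2-compressed JSON Lines file, reading it in a single pass.

    Hypothetical helper: assumes one JSON object per non-empty line.
    """
    rng = random.Random(seed)
    reservoir = []  # holds at most k raw lines
    seen = 0        # number of non-empty lines read so far
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # ignore blank lines
            if seen < k:
                # Fill the reservoir with the first k lines.
                reservoir.append(line)
            else:
                # Keep the new line with probability k / (seen + 1).
                j = rng.randint(0, seen)  # inclusive on both ends
                if j < k:
                    reservoir[j] = line
            seen += 1
    # Parse only the lines that survived the sampling.
    return [json.loads(line) for line in reservoir]
```

The returned list of dicts could then be passed to `pd.DataFrame(records)`. Memory use stays proportional to `k` rather than to the file size, and no row count is needed in advance. Sampling 100 distinct users instead of rows would need an extra step, such as first collecting the unique `UserName` values, which is not shown here.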