Some variants of this question are answered here and here, and I have used those successfully.
My case is slightly different, though. I have used BigQuery to export 1GB of data to Google Storage. The export is split across 5 CSV files, each of which includes the column names (which I think is what breaks things).
My code is:
# Run import
import pandas as pd
import numpy as np
from io import BytesIO
# Grab the file from the cloud storage
variable_list = ['part1', 'part2','part3','part4','part5']
for variable in variable_list:
file_path = "gs://[Bucket-name]/" + variable + ".csv"
%gcs read --object {file_path} --variable byte_data
# Read the dataset
data = pd.read_csv(BytesIO(byte_data), low_memory=False)
However, when I call len(data), I do not get the total number of rows. The code above only seems to load 1 file.
I could load 5 separate dataframes and then combine them in pandas via data=[df1, df2, df3, df4, df5], but that looks very ugly.
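For completeness, a rough sketch of the kind of loop I am aiming for (assuming the same Datalab %gcs magic and the placeholder bucket name above) would collect each part in a list and concatenate once at the end:

import pandas as pd
from io import BytesIO

frames = []
for variable in ['part1', 'part2', 'part3', 'part4', 'part5']:
    file_path = "gs://[Bucket-name]/" + variable + ".csv"
    %gcs read --object {file_path} --variable byte_data
    # each part carries its own header row, which read_csv consumes as column names
    frames.append(pd.read_csv(BytesIO(byte_data), low_memory=False))

data = pd.concat(frames, ignore_index=True)
print(len(data))  # total row count across all parts

Because read_csv treats each file's first line as the header, the repeated column names should not end up as data rows.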
Answer 0 (score: 0)
I found this and adapted it to my case. It loops over all the files in the bucket (folder):
from google.datalab import Context
import google.datalab.storage as storage
import pandas as pd

try:
    from StringIO import StringIO         # Python 2
except ImportError:
    from io import BytesIO as StringIO    # Python 3: the downloaded object is bytes

bucket_folder = 'ls_w'
bucket = storage.Bucket(bucket_folder)    # bucket whose objects are listed below
df = pd.DataFrame()                       # final dataframe

for obj in bucket.objects():              # loop over all objects in the bucket
    if '/' not in obj.key:                # add other options to exclude other files;
                                          # this only looks at bucket level,
                                          # not into subfolders!
        fn = obj.key                      # file name variable (optional)
        print(obj.key)
        bites = 'gs://%s/%s' % (bucket_folder, fn)
        %gcs read --object $bites --variable data
        tdf = pd.read_csv(StringIO(data))  # read one part
        df = pd.concat([df, tdf])          # concatenate results
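One design note on the snippet above: calling pd.concat inside the loop copies the accumulated dataframe on every pass. A variant sketch, assuming the same bucket object and %gcs magic, collects the parts in a list and concatenates once:

frames = []
for obj in bucket.objects():
    if '/' not in obj.key:                # still only looking at bucket level
        bites = 'gs://%s/%s' % (bucket_folder, obj.key)
        %gcs read --object $bites --variable data
        frames.append(pd.read_csv(StringIO(data)))

df = pd.concat(frames, ignore_index=True)  # single concatenation at the end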