Question

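I am trying to clean up a large (8 GB) .csv file in Python and then stream it into BigQuery. The code below starts off fine: the table is created and the first 1000 rows go in, but then I get this error: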

InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.

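Could this be related to the streaming buffer? My problem is that I have to delete the table before running the code again, otherwise the first 1000 entries get duplicated because of the 'append' method.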

import pandas as pd

destination_table = 'product_data.FS_orders'
project_id = '##'
pkey ='##'

chunks = []

columns = ['Orderdate', 'Weborderno', 'Productcode', 'Quantitysold',
           'Paymentmethod', 'ProductGender', 'DeviceType', 'Brand',
           'ProductDescription', 'OrderType', 'ProductCategory', 'UnitpriceGBP',
           'Webtype1', 'CostPrice', 'Webtype2', 'Webtype3', 'Variant',
           'Orderlinetax']

for chunk in pd.read_csv('Historic_orders.csv', chunksize=1000,
                         encoding='windows-1252', names=columns):
    # replace() returns a new DataFrame, so keep the result
    chunk = chunk.replace(r' *!', 'Null', regex=True)
    chunk.to_gbq(destination_table, project_id, if_exists='append', private_key=pkey)
    chunks.append(chunk)

df = pd.concat(chunks, axis=0)

print(df.head(5))

df.to_csv('Historic_orders_cleaned.csv')

Answer 1

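A question first: why stream instead of simply loading? With a load job you can upload batches of 1 GB rather than 1000 rows. Streaming is normally used when you really do have continuous data that needs to be appended as it arrives. If there is a gap of a day between collecting the data and the load job, loading is usually the safer option. See here.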
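
To make the load-job suggestion concrete, here is a minimal sketch that assumes the google-cloud-bigquery client library is installed and credentials are already configured. The function name load_csv_with_load_job is hypothetical, and both schema autodetection and WRITE_TRUNCATE (which replaces the table on every run, so a re-run would not duplicate rows) are assumptions rather than anything stated in the question.

```python
def load_csv_with_load_job(csv_path, table_id, project_id):
    """Load a local CSV file into BigQuery with a single load job.

    Hypothetical helper: the name, autodetect and WRITE_TRUNCATE are
    illustrative assumptions, not part of the original question.
    """
    # Imported lazily so the sketch can be read without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        # Assumption: let BigQuery infer the schema; pass an explicit
        # schema instead if the inferred column names or types are wrong.
        autodetect=True,
        # Replace the table contents on each run instead of appending.
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    with open(csv_path, "rb") as source:
        job = client.load_table_from_file(source, table_id, job_config=job_config)
    job.result()  # Block until the load job finishes; raises on failure.
    return job.output_rows
```

It would be called along the lines of load_csv_with_load_job('Historic_orders_cleaned.csv', 'product_data.FS_orders', project_id), after the cleaned file has been written.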

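Apart from that, I have run into my share of problems loading BigQuery tables from csv files, and most of the time the cause was either 1) the encoding (I see yours is not UTF-8) or 2) invalid characters, such as a stray comma in the middle of the file that breaks a line.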
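
Both causes can be checked before anything is uploaded. The sketch below uses only the standard library; the function names find_bad_rows and reencode_to_utf8 are hypothetical, and the windows-1252 default is taken from the encoding used in the question.

```python
import csv


def find_bad_rows(csv_path, expected_fields, encoding="windows-1252"):
    """Return (line_number, field_count) for every row whose number of
    fields differs from expected_fields."""
    bad = []
    with open(csv_path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        for row in reader:
            if len(row) != expected_fields:
                # line_num is the source line on which this row ended.
                bad.append((reader.line_num, len(row)))
    return bad


def reencode_to_utf8(src_path, dst_path, src_encoding="windows-1252"):
    """Copy the file line by line, converting it to UTF-8.

    A byte that is not valid in src_encoding raises UnicodeDecodeError,
    which points at the first invalid character.
    """
    with open(src_path, newline="", encoding=src_encoding) as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)
```

For the file in the question, find_bad_rows('Historic_orders.csv', 18) would list the rows that do not split into the 18 named columns.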

To verify this, what happens if you insert the rows in reverse order? Do you get the same error?
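
A different way to probe whether the data is at fault, offered as a sketch and not as what this answer proposes: pandas infers column types separately for each chunk, so a later chunk can end up with types that no longer match the table created from the first one. The hypothetical helper find_dtype_drift below reports where that happens; whether this is the actual cause of the InvalidSchema error here is an assumption.

```python
import pandas as pd


def find_dtype_drift(csv_path, chunksize=1000, **read_csv_kwargs):
    """Compare each chunk's inferred dtypes with those of the first chunk.

    Returns a list of (chunk_index, column, first_dtype, chunk_dtype)
    tuples, one for every column whose dtype differs.
    """
    reference = None
    drift = []
    reader = pd.read_csv(csv_path, chunksize=chunksize, **read_csv_kwargs)
    for index, chunk in enumerate(reader):
        if reference is None:
            # The first chunk defines the dtypes every later chunk is checked against.
            reference = chunk.dtypes
            continue
        for column in chunk.columns:
            if chunk[column].dtype != reference[column]:
                drift.append((index, column,
                              str(reference[column]), str(chunk[column].dtype)))
    return drift
```

Extra keyword arguments are passed straight to pd.read_csv, so the encoding and names arguments from the question can be reused. If drift turns up, passing an explicit dtype mapping to pd.read_csv is one way to keep every chunk consistent.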

Streaming a large file into BigQuery
