I am trying to process a dataset from Kaggle. I downloaded it to my hard drive and ran the following code:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import json
import multiprocessing
def worker():
    df = pd.DataFrame([])
    i = 0
    for dirname, _, filenames in os.walk('E:\Python_Project_Files\Finnhub API\Financials'):
        for filename in filenames:
            # i == 100 for testing purposes, otherwise 250000, because it's bigger than count of all of the files or remove completely
            if i <= 250000:
                df = df.append(pd.DataFrame(pd.read_json(os.path.join(dirname, filename))))
                # To debug where it stopped
                if i % 10000 == 0:
                    print(i)
                i += 1
            else:
                break
def export():
    # Flattening the column "data" because it contains json string
    json_struct = json.loads(df.to_json(orient="records"))
    df_flat = pd.json_normalize(json_struct)
    df_flat.to_csv(path_or_buf="E:\Python_Project_Files\Finnhub API\DF_flat.csv")
    symbols = df_flat.symbol.unique()
    # Filter for each of the ticker symbols and export (preferably only CSV, more formats takes too much time)
    for ticker in symbols:
        df_ticker = df_flat[df_flat["symbol"] == ticker]
        df_ticker.to_csv(path_or_buf="E:\Python_Project_Files\Finnhub API\DataFrame\CSV\df_" + ticker + "_flat.csv")
if __name__ == "__main__":
    jobs = []
    process = multiprocessing.Process(target=worker())
    process.start()
    tick = multiprocessing.Process(target=export())
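One detail in the `__main__` block may explain why CPU usage did not change: `target=worker()` calls `worker` immediately in the parent process and hands its return value (`None`) to `Process`, so nothing runs in parallel. Below is a minimal sketch, not the original code, of passing the function object itself and spreading per-file parsing over a pool. The names `load_file`, `list_files` and `load_all` and the `"Financials"` path are hypothetical, and the sketch assumes every file is plain JSON that the standard `json` module can parse.

```python
import json
import multiprocessing
import os


def load_file(path):
    """Parse one JSON file and return its content (runs in a worker process)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def list_files(root):
    """Collect the full path of every file below root."""
    return [os.path.join(d, f) for d, _, files in os.walk(root) for f in files]


def load_all(paths, processes=None):
    """Parse all files, using a process pool unless that is not worthwhile."""
    if processes == 1 or len(paths) < 2:
        # Serial path: no pool is started, which is easier to debug.
        return [load_file(p) for p in paths]
    with multiprocessing.Pool(processes=processes) as pool:
        # The function object is passed (load_file, not load_file()).
        return pool.map(load_file, paths, chunksize=64)


if __name__ == "__main__":
    # Placeholder directory name; the guard is required on Windows.
    records = load_all(list_files("Financials"))
    print(len(records))
```

Calling `load_all(paths, processes=1)` keeps everything in one process, which makes tracebacks easier to read. Whether the pool actually speeds things up depends on whether parsing or disk reads dominate the run time.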
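I am a beginner, so please bear with me.
What I want to achieve with this code is to build one large DataFrame and export the data for a specific ticker symbol (for example Microsoft) from it. When I tested it with the first 100 files the code worked, but on the whole dataset, after roughly 150,000 records (going by the value of "i"), I get this error message: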
Traceback (most recent call last):
  File "M:/Python/Projects/Kaggle_Finnhub_Financials/json_exporter.py", line 40, in <module>
    process = multiprocessing.Process(target=worker())
  File "M:/Python/Projects/Kaggle_Finnhub_Financials/json_exporter.py", line 16, in worker
    df = df.append(pd.DataFrame(pd.read_json(os.path.join(dirname, filename))))
  File "C:\Users\M&M\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\util\_decorators.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\M&M\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\json\_json.py", line 608, in read_json
    result = json_reader.read()
  File "C:\Users\M&M\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\json\_json.py", line 731, in read
    obj = self._get_object_parser(self.data)
  File "C:\Users\M&M\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\json\_json.py", line 753, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "C:\Users\M&M\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\json\_json.py", line 857, in parse
    self._parse_no_numpy()
  File "C:\Users\M&M\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\json\_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Value is too big
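The message "Value is too big" is raised by the JSON parser bundled with pandas, and it appears to indicate a number that does not fit into a 64-bit integer rather than a file or DataFrame that is too large. Under that assumption, the following sketch scans the directory with the standard `json` module (which accepts integers of any size) and lists the files that contain such a value or cannot be parsed at all. The function names are hypothetical, and the 64-bit limit is an assumption about the parser, not something confirmed by the traceback.

```python
import json
import os

# Assumed limits of the parser used by pandas.read_json (signed 64-bit).
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def has_oversized_int(value):
    """Return True if any integer nested in value is outside the int64 range."""
    if isinstance(value, int):
        return not (INT64_MIN <= value <= INT64_MAX)
    if isinstance(value, dict):
        return any(has_oversized_int(v) for v in value.values())
    if isinstance(value, list):
        return any(has_oversized_int(v) for v in value)
    return False


def find_suspect_files(root):
    """Return (path, reason) pairs for files likely to break the import."""
    suspects = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirname, filename)
            try:
                with open(path, encoding="utf-8") as fh:
                    data = json.load(fh)
            except (ValueError, OSError) as exc:
                # Invalid JSON, a bad encoding, or an unreadable file.
                suspects.append((path, f"unreadable: {exc}"))
                continue
            if has_oversized_int(data):
                suspects.append((path, "integer outside int64 range"))
    return suspects
```

Running `find_suspect_files` on the Financials directory and inspecting the reported files by hand would show whether a single malformed record, rather than the overall data volume, triggers the error.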
Main question:
How do I get rid of this error and process the data?
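Additional questions:
I tried to run it with multiprocessing because it is very slow and CPU usage on my Ryzen 2600 is only about 20%, but I did not see any significant change. Am I doing something wrong? The code produced the same error message before I added the final multiprocessing block.
Are there any bad practices in the code? I would like it to be as efficient and clean as possible.
Is the dataset really so large that pandas cannot handle it? It is about 2 GB. I have not found a clear answer about size limits for pandas DataFrames.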
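Since the end goal is one CSV per ticker symbol, one way to sidestep the size question is to never hold the whole dataset in memory. The sketch below is an illustration only: it flattens one record at a time into dotted column names (similar in spirit to `pd.json_normalize`) and appends it to the CSV for its symbol. `flatten` and `append_record` are hypothetical helpers, and the sketch assumes every record is a dict with a `symbol` key and that all records of one symbol share the same set of keys.

```python
import csv
import os


def flatten(obj, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def append_record(record, out_dir):
    """Append one flattened record to the CSV file of its ticker symbol."""
    row = flatten(record)
    path = os.path.join(out_dir, f"df_{row['symbol']}_flat.csv")
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as fh:
        # Sorted keys give a stable column order across calls
        # (assumes the same keys for every record of a symbol).
        writer = csv.DictWriter(fh, fieldnames=sorted(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

With this layout, memory use stays at roughly one record at a time, at the cost of reopening an output file for each record.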
Thank you very much.