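I am reading large pickle files into pandas DataFrames. I have loaded one of them and shaped it the way I need. However, I have a folder containing 40 pickle files named imdbnames0.pkl, imdbnames1.pkl, imdbnames2.pkl, ..., imdbnames40.pkl.
I would like to load all of them in the same way and merge them into a single pandas DataFrame.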
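import pickle
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

with open("ethnicity_files/imdbnames1.pkl", 'rb') as fh:
    d = pickle.load(fh)
# One small frame per name, stacked under a (name, row) MultiIndex
df = pd.concat({k: json_normalize(v, 'scores', ['best']) for k, v in d.items()})
# Drop the inner row counter and turn the name level into a 'names' column
df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
df.head()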
    names         ethnicity        score  best
0   !Gubi Tietie  Asian            0.03   GreaterEuropean
1   !Gubi Tietie  GreaterAfrican   0.01   GreaterEuropean
2   !Gubi Tietie  GreaterEuropean  0.96   GreaterEuropean
3   !Gubi Tietie  British          0.17   WestEuropean
4   !Gubi Tietie  Jewish           0.13   WestEuropean
5   !Gubi Tietie  WestEuropean     0.65   WestEuropean
6   !Gubi Tietie  EastEuropean     0.05   WestEuropean
7   !Gubi Tietie  Nordic           0.00   Italian
8   !Gubi Tietie  Italian          0.69   Italian
9   !Gubi Tietie  Hispanic         0.12   Italian
10  !Gubi Tietie  French           0.16   Italian
11  !Gubi Tietie  Germanic         0.02   Italian
12  $2 Tony       Asian            0.00   GreaterEuropean
13  $2 Tony       GreaterAfrican   0.00   GreaterEuropean
14  $2 Tony       GreaterEuropean  1.00   GreaterEuropean
15  $2 Tony       British          0.00   WestEuropean
16  $2 Tony       Jewish           0.00   WestEuropean
17  $2 Tony       WestEuropean     1.00   WestEuropean
18  $2 Tony       EastEuropean     0.00   WestEuropean
19  $2 Tony       Nordic           0.00   Italian
One of the files is available here: https://drive.google.com/file/d/10cjsoWFJ46w-2lEsxh6hmuRZlLunatf-/view?usp=sharing
I just want to add them all to a single pandas DataFrame.
Answer 0 (score: 1)
I think you need os.listdir():
# Be careful: this might raise a MemoryError if you
# don't have enough RAM for all your files.
# Also make sure the folder contains only the files you want to read.
import os
import pickle
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0

files = os.listdir('ethnicity_files/')
list_of_dfs = []
for file in files:
    # pickle.load needs an open binary file object, not a path string
    with open(os.path.join('ethnicity_files/', file), 'rb') as fh:
        d = pickle.load(fh)
    df = pd.concat({k: json_normalize(v, 'scores', ['best']) for k, v in d.items()})
    df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
    list_of_dfs.append(df)
big_df = pd.concat(list_of_dfs, ignore_index=True)  # ignore_index resets the index of big_df
big_df.head()
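Since os.listdir() returns names in arbitrary order and includes every file in the folder, it may help to filter and sort the listing before loading. The following is only a sketch: the helper name select_pickles and the sample listing are made up for illustration, and the pattern assumes the imdbnames<number>.pkl naming from the question.

```python
import re

# Assumed naming scheme from the question: imdbnames0.pkl, imdbnames1.pkl, ...
PATTERN = re.compile(r"imdbnames(\d+)\.pkl")

def select_pickles(filenames):
    """Keep only names matching the pattern and sort them by their number."""
    numbered = []
    for name in filenames:
        m = PATTERN.fullmatch(name)
        if m:
            numbered.append((int(m.group(1)), name))
    return [name for _, name in sorted(numbered)]

# Hypothetical directory listing with a stray file that should be skipped.
listing = ["imdbnames10.pkl", "notes.txt", "imdbnames2.pkl", "imdbnames0.pkl"]
ordered = select_pickles(listing)
print(ordered)  # ['imdbnames0.pkl', 'imdbnames2.pkl', 'imdbnames10.pkl']
```

In the loop above, files = select_pickles(os.listdir('ethnicity_files/')) would then skip unrelated files and read imdbnames2.pkl before imdbnames10.pkl, which a plain alphabetical sort would not.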
Answer 1 (score: 1)
You can use glob.glob to iterate over all files with a specific extension (.pkl in your case) in the current folder:
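import os
import glob
import pickle
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0

cd = os.getcwd()
os.chdir('path_to_your_folder')
list_of_dfs = []
for file in glob.glob("*.pkl"):
    with open(file, 'rb') as fh:
        d = pickle.load(fh)
    df = pd.concat({k: json_normalize(v, 'scores', ['best']) for k, v in d.items()})
    df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
    list_of_dfs.append(df)  # keep every file's frame instead of overwriting df
os.chdir(cd)  # return to the original working directory
big_df = pd.concat(list_of_dfs, ignore_index=True)
print(big_df.head())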
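If you would rather not change the working directory, the same idea can be sketched with pathlib. This is an untested-on-your-data sketch: load_one and load_folder are hypothetical helper names, and the two tiny sample pickles are invented, with a structure guessed from the json_normalize call in the question (each name maps to a list of dicts holding 'best' and a 'scores' list). It assumes pandas 1.0 or newer for pd.json_normalize.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

def load_one(path):
    """Load one pickle and flatten it like the single-file code in the question."""
    with open(path, "rb") as fh:
        d = pickle.load(fh)
    df = pd.concat({k: pd.json_normalize(v, "scores", ["best"]) for k, v in d.items()})
    return df.reset_index(level=1, drop=True).rename_axis("names").reset_index()

def load_folder(folder):
    """Concatenate every *.pkl file in `folder` into one DataFrame."""
    paths = sorted(Path(folder).glob("*.pkl"))
    return pd.concat([load_one(p) for p in paths], ignore_index=True)

# Invented sample data; the real files are assumed to share this layout.
sample = {
    "imdbnames0.pkl": {"Alice Example": [{
        "best": "GreaterEuropean",
        "scores": [{"ethnicity": "Asian", "score": 0.1},
                   {"ethnicity": "GreaterEuropean", "score": 0.9}],
    }]},
    "imdbnames1.pkl": {"Bob Example": [{
        "best": "WestEuropean",
        "scores": [{"ethnicity": "British", "score": 0.7}],
    }]},
}

with tempfile.TemporaryDirectory() as tmp:
    for name, content in sample.items():
        with open(Path(tmp) / name, "wb") as fh:
            pickle.dump(content, fh)
    big_df = load_folder(tmp)

print(big_df)
```

With your data, load_folder('ethnicity_files') would be the only call needed. Note that pd.concat raises a ValueError if the folder contains no .pkl files.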