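I'm trying to build a DataFrame that pulls its data from multiple files. I created an empty DataFrame with the desired shape, but I'm having trouble getting the data into it. I found this, but when I do the merge I still get NaN values.
Edit2: I changed the order in which df is created and moved the concat into the for loop, with the same result (for obvious reasons).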
import pandas as pd
import os
import glob


def daily_country_framer():
    # create assignments
    country_source = r"C:\Users\USER\PycharmProjects\Corona Stats\Country Series"
    list_of_files = glob.glob(country_source + r"\*.csv")
    latest_file = max(list_of_files, key=os.path.getctime)
    last_frame = pd.read_csv(latest_file)
    date_list = []
    label_list = []

    # build date_list values
    for file in os.listdir(country_source):
        file = file.replace('.csv', '')
        date_list.append(file)

    # build country_list values
    for country in last_frame['Country']:
        label_list.append(country)

    # create dataframe for each file in folder
    for filename in os.listdir(country_source):
        filepath = os.path.join(country_source, filename)
        if not os.path.isfile(filepath):
            continue
        df1 = pd.read_csv(filepath)
        df = pd.DataFrame(index=label_list, columns=date_list)
        df1 = pd.concat([df])
    print(df1)


daily_country_framer()
Two sample dataframes (note the different shapes):
                Country  Confirmed  Deaths  Recovered
0                 World    1595350   95455     353975
1           Afghanistan        484      15         32
2               Albania        409      23        165
3               Algeria       1666     235        347
4               Andorra        583      25         58
..                  ...        ...     ...        ...
180             Vietnam        255       0        128
181  West Bank and Gaza        263       1         44
182      Western Sahara          4       0          0
183              Zambia         39       1         24
184            Zimbabwe         11       3          0

[185 rows x 4 columns]

                Country  Confirmed  Deaths  Recovered
0                 World    1691719  102525     376096
1           Afghanistan        521      15         32
2               Albania        416      23        182
3               Algeria       1761     256        405
4               Andorra        601      26         71
..                  ...        ...     ...        ...
181  West Bank and Gaza        267       2         45
182      Western Sahara          4       0          0
183               Yemen          1       0          0
184              Zambia         40       2         25
185            Zimbabwe         13       3          0

[186 rows x 4 columns]
Current output:
                    01-22-2020  01-23-2020  ...  04-09-2020  04-10-2020
World                      NaN         NaN  ...         NaN         NaN
Afghanistan                NaN         NaN  ...         NaN         NaN
Albania                    NaN         NaN  ...         NaN         NaN
Algeria                    NaN         NaN  ...         NaN         NaN
Andorra                    NaN         NaN  ...         NaN         NaN
...                        ...         ...  ...         ...         ...
West Bank and Gaza         NaN         NaN  ...         NaN         NaN
Western Sahara             NaN         NaN  ...         NaN         NaN
Yemen                      NaN         NaN  ...         NaN         NaN
Zambia                     NaN         NaN  ...         NaN         NaN
Zimbabwe                   NaN         NaN  ...         NaN         NaN

[186 rows x 80 columns]
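Desired output: the same frame, but with each NaN replaced by the corresponding value from the target column, or by a list of the values from all columns. That is, 0, 1, 2, 3, 4 if the target is ['Confirmed'], or [0,0,0], [1,0,0], [2,0,0] if all columns are used.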
Answer 0 (score: 1)
Your code (with inline comments):
import pandas as pd
import os
import glob


def daily_country_framer():
    # create assignments
    country_source = r"C:\Users\USER\PycharmProjects\Corona Stats\Country Series"
    list_of_files = glob.glob(country_source + r"\*.csv")
    latest_file = max(list_of_files, key=os.path.getctime)
    last_frame = pd.read_csv(latest_file)
    date_list = []
    label_list = []

    # build date_list values
    for file in os.listdir(country_source):
        file = file.replace('.csv', '')
        date_list.append(file)

    # build country_list values
    for country in last_frame['Country']:  # == last_frame['Country'].tolist()
        label_list.append(country)

    # create dataframe for each file in folder
    for filename in os.listdir(country_source):
        filepath = os.path.join(country_source, filename)
        if not os.path.isfile(filepath):
            continue
        df1 = pd.read_csv(filepath)
        # you redefine df1 for every file in the loop. So if there
        # are 10 files, only the last one is actually used anywhere
        # outside this loop.
        df = pd.DataFrame(index=label_list, columns=date_list)
        df1 = pd.concat([df])
        # here you just redefined df1 again as the concatenation of the
        # empty dataframe you just created in the line above.
    print(df1)


daily_country_framer()
Hopefully that clarifies why you get the result you do. The code is doing exactly what you asked it to.
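To see the effect in isolation, here is a minimal sketch. The two labels and two dates are made-up stand-ins for the real label_list and date_list, not values read from your files:

```python
import pandas as pd

# Hypothetical stand-ins for the labels and dates built in the question.
label_list = ["World", "Afghanistan"]
date_list = ["04-09-2020", "04-10-2020"]

# A frame created from only an index and columns holds NaN in every cell.
df = pd.DataFrame(index=label_list, columns=date_list)

# Concatenating a one-element list just returns a copy of that one frame,
# so nothing read from a CSV ever reaches the result.
result = pd.concat([df])

print(result)
print(result.isna().all().all())
```

The frame read from the CSV never appears in the list passed to pd.concat, so it cannot influence the output.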
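What you want to do is build a dictionary with the dates as keys and the associated dataframes as values, and then concatenate that. This can be fairly expensive because of some quirks in how pandas concatenates, but if you concatenate along axis=0 you should be fine.
A better way might be: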
import pandas as pd
import os


def daily_country_framer(country_source):
    accumulator = {}
    # build a {date: dataframe} mapping, one entry per file
    for filename in os.listdir(country_source):
        date = filename.replace('.csv', '')
        filepath = os.path.join(country_source, filename)
        accumulator[date] = pd.read_csv(filepath)
    # now we have a dictionary of {date : data} -- perfect!
    # concatenating a dict uses its keys as the outer index level
    df = pd.concat(accumulator)
    return df


# raw string, so that "\U" in the Windows path is not parsed as an escape
daily_country_framer(r"C:\Users\USER\PycharmProjects\Corona Stats\Country Series")
Does that work?
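If you then want the countries-by-dates layout from the question, one possible follow-up is to index each frame by Country before concatenating and unstack the date level. This is only a sketch: the two small frames below are hypothetical stand-ins for your daily files, and the level names "Date" and "Country" are my own choice.

```python
import pandas as pd

# Hypothetical miniature versions of two daily files, keyed by date.
accumulator = {
    "04-09-2020": pd.DataFrame({
        "Country": ["World", "Zambia"],
        "Confirmed": [1595350, 39],
        "Deaths": [95455, 1],
        "Recovered": [353975, 24],
    }),
    "04-10-2020": pd.DataFrame({
        "Country": ["World", "Yemen", "Zambia"],
        "Confirmed": [1691719, 1, 40],
        "Deaths": [102525, 0, 2],
        "Recovered": [376096, 0, 25],
    }),
}

# Index each frame by country, then concatenate. The dict keys become the
# outer index level; rename_axis labels the two levels explicitly.
long_df = pd.concat(
    {date: frame.set_index("Country") for date, frame in accumulator.items()}
).rename_axis(["Date", "Country"])

# Move the date level into the columns: one row per country, one column per date.
confirmed = long_df["Confirmed"].unstack(level="Date")
print(confirmed)
```

A country that is missing from a given day's file (Yemen on 04-09-2020 in this toy data) ends up as NaN in that column, which is the only place NaN should remain. Calling long_df.unstack(level="Date") on the whole frame instead should give every column per date, with two-level column labels.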