Building dataframe values from multiple dataframes in pandas

Time: 2020-04-11 20:56:51

Tags: python pandas dataframe

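I'm trying to build a dataframe that pulls its data from multiple files. I've created an empty dataframe with the desired shape, but I'm having trouble getting the data into it. I found this, but when I do the merge I still get NaN values. Edit2: I changed the order in which the df is created and moved the concat into the for loop, with the same result (for obvious reasons).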

import pandas as pd
import os
import glob

def daily_country_framer():
    # create assignments
    country_source = r"C:\Users\USER\PycharmProjects\Corona Stats\Country Series"
    list_of_files = glob.glob(country_source + r"\*.csv")
    latest_file = max(list_of_files, key=os.path.getctime)
    last_frame = pd.read_csv(latest_file)
    date_list = []
    label_list = []

    # build date_list values
    for file in os.listdir(country_source):
        file = file.replace('.csv', '')
        date_list.append(file)

    # build country_list values
    for country in last_frame['Country']:
        label_list.append(country)

    # create dataframe for each file in folder
    for filename in os.listdir(country_source):
        filepath = os.path.join(country_source, filename)
        if not os.path.isfile(filepath):
            continue
        df1 = pd.read_csv(filepath)
    df = pd.DataFrame(index=label_list, columns=date_list)
    df1 = pd.concat([df])
    print(df1)


daily_country_framer()

Two sample dataframes (note the different shapes):

                Country  Confirmed  Deaths  Recovered
0                 World    1595350   95455     353975
1           Afghanistan        484      15         32
2               Albania        409      23        165
3               Algeria       1666     235        347
4               Andorra        583      25         58
..                  ...        ...     ...        ...
180             Vietnam        255       0        128
181  West Bank and Gaza        263       1         44
182      Western Sahara          4       0          0
183              Zambia         39       1         24
184            Zimbabwe         11       3          0

[185 rows x 4 columns]
                Country  Confirmed  Deaths  Recovered
0                 World    1691719  102525     376096
1           Afghanistan        521      15         32
2               Albania        416      23        182
3               Algeria       1761     256        405
4               Andorra        601      26         71
..                  ...        ...     ...        ...
181  West Bank and Gaza        267       2         45
182      Western Sahara          4       0          0
183               Yemen          1       0          0
184              Zambia         40       2         25
185            Zimbabwe         13       3          0

[186 rows x 4 columns]

Current output:

                   01-22-2020 01-23-2020  ... 04-09-2020 04-10-2020
World                     NaN        NaN  ...        NaN        NaN
Afghanistan               NaN        NaN  ...        NaN        NaN
Albania                   NaN        NaN  ...        NaN        NaN
Algeria                   NaN        NaN  ...        NaN        NaN
Andorra                   NaN        NaN  ...        NaN        NaN
...                       ...        ...  ...        ...        ...
West Bank and Gaza        NaN        NaN  ...        NaN        NaN
Western Sahara            NaN        NaN  ...        NaN        NaN
Yemen                     NaN        NaN  ...        NaN        NaN
Zambia                    NaN        NaN  ...        NaN        NaN
Zimbabwe                  NaN        NaN  ...        NaN        NaN

[186 rows x 80 columns]

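Desired output: the same frame, but with each NaN replaced by the corresponding value from the target column, or by a list of the values from all columns. For example, 0, 1, 2, 3, 4 if the target is ['Confirmed'], or [0,0,0], [1,0,0], [2,0,0] if all columns are used.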

1 Answer:

Answer 0: (score: 1)

Your code (with inline comments):

import pandas as pd
import os
import glob

def daily_country_framer():
    # create assignments
    country_source = r"C:\Users\USER\PycharmProjects\Corona Stats\Country Series"
    list_of_files = glob.glob(country_source + r"\*.csv")
    latest_file = max(list_of_files, key=os.path.getctime)
    last_frame = pd.read_csv(latest_file)
    date_list = []
    label_list = []

    # build date_list values
    for file in os.listdir(country_source):
        file = file.replace('.csv', '')
        date_list.append(file)

    # build country_list values
    for country in last_frame['Country']: # == last_frame['Country'].tolist()
        label_list.append(country)

    # create dataframe for each file in folder
    for filename in os.listdir(country_source):
        filepath = os.path.join(country_source, filename)
        if not os.path.isfile(filepath):
            continue
        df1 = pd.read_csv(filepath)
        # you redefine df1 for every file in the loop. So if there
        # are 10 files, only the last one is actually used anywhere
        # outside this loop.
    df = pd.DataFrame(index=label_list, columns=date_list)
    df1 = pd.concat([df])
    # here you just redefined df1 again as the concatenation of the
    # empty dataframe you just created in the line above.
    print(df1)


daily_country_framer()

Hopefully that clarifies why you got the result you did. The code is doing exactly what you asked it to do.
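
To see the effect in isolation, here is a minimal sketch with made-up labels (the two countries and two dates are illustrative stand-ins for label_list and date_list, not read from your files). A DataFrame built from only an index and columns contains nothing but NaN, and calling pd.concat on a one-element list simply hands back a copy of that frame:

```python
import pandas as pd

# Hypothetical labels standing in for label_list and date_list
label_list = ["World", "Afghanistan"]
date_list = ["04-09-2020", "04-10-2020"]

# No data is passed, so every cell starts out as NaN
df = pd.DataFrame(index=label_list, columns=date_list)

# Concatenating a single frame returns a copy with the same contents
df1 = pd.concat([df])

print(df1.shape)
print(df1.isna().all().all())
```

Nothing read from the CSV files ever reaches df, which is why the printed frame stays empty.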

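What you want is a dictionary with the dates as keys and the corresponding dataframes as values, which you then concatenate. Concatenation can be fairly expensive because of some quirks in how pandas handles it, but if you concatenate along axis=0 you should be fine.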

A better approach might be:

import pandas as pd
import os


def daily_country_framer(country_source):
    accumulator = {}
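    # read each daily CSV, keyed by the date taken from its file name
    for filename in os.listdir(country_source):
        if not filename.endswith('.csv'):
            continue  # skip anything that is not a daily CSV file
        date = filename.replace('.csv', '')
        filepath = os.path.join(country_source, filename)
        accumulator[date] = pd.read_csv(filepath)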
    # now we have a dictionary of {date : data} -- perfect!
    df = pd.concat(accumulator)
    return df


# raw string, so that "\U" in the Windows path is not parsed as an escape sequence
daily_country_framer(r"C:\Users\USER\PycharmProjects\Corona Stats\Country Series")
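
If the goal is the wide layout from the question (countries as rows, dates as columns), the concatenated result can be reshaped afterwards. The following is only a sketch under some assumptions: the two small frames are hypothetical stand-ins for the daily files, "Date" is a level name chosen here for illustration, and 'Confirmed' is taken as the target column. Countries missing on a given date come out as NaN.

```python
import pandas as pd

# Hypothetical stand-ins for two daily files; note the different shapes
accumulator = {
    "04-09-2020": pd.DataFrame({
        "Country": ["World", "Zambia"],
        "Confirmed": [1595350, 39],
        "Deaths": [95455, 1],
        "Recovered": [353975, 24],
    }),
    "04-10-2020": pd.DataFrame({
        "Country": ["World", "Yemen", "Zambia"],
        "Confirmed": [1691719, 1, 40],
        "Deaths": [102525, 0, 2],
        "Recovered": [376096, 0, 25],
    }),
}

# The dict keys become the outer index level, named "Date" here
stacked = pd.concat(accumulator, names=["Date"])

# One row per country, one column per date, for a single target column
confirmed = (
    stacked.reset_index(level="Date")
    .pivot(index="Country", columns="Date", values="Confirmed")
)
print(confirmed)
```

Swapping values="Confirmed" for another column name gives the same layout for that column.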

Does that work?