Question

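I'm not sure what the best way to achieve this is:

I have multiple xlsx files, and the customer ID column has a different name in each file. Consider the following example: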

xlsx1: customer_id
xlsx2: ID
xlsx3: client_ID
xlsx4: cus_id
xlsx5: consumer_number
xlsx6: customer_number
...etc

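I want to read all the xlsx files in a folder, extract the customer ID column from each one, and append them into a single DataFrame.

What I have done so far:

I created a list of the customer ID column names I expect to find in the xlsx files: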

customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]

Then I read all the xlsx files in the folder:

all_data = pd.DataFrame()
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f, usecols = customer_id)
    all_data = all_data.append(df,ignore_index=True)

Here I get the error:

ValueError: Usecols do not match columns, columns expected but not found:

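I believe usecols tries to match every column name in the list against each xlsx file, whereas I need to pick only the one column in each file whose name matches.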

Answer 1

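One way is to read the full Excel file and reindex its columns with the possible ID names in customer_id. This creates an all-NaN column for every name that does not exist in the file, and dropna then removes those columns. Rename the remaining column so that the later concat lines everything up. Also, do not use pandas append inside the loop (DataFrame.append was removed in pandas 2.0); append to a list and call concat once at the end, which is faster. So you get: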

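import glob
import pandas as pd

frames = []  # collect frames in a list and concat once; faster than append in the loop
for f in glob.glob("./*.xlsx"):
    # reindex adds an all-NaN column for each candidate name missing from this file; dropna removes them
    df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
    df.columns = ["ID"]  # assumes exactly one candidate column is left; gives every frame the same name
    frames.append(df)
all_data = pd.concat(frames, ignore_index=True)  # stack all IDs into one DataFrame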
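An alternative that avoids reading every column is to pass a callable as usecols, which pandas evaluates against each header name and which does not raise when a name is missing. The following is only a minimal sketch under assumptions: collect_ids and its reader parameter are hypothetical names introduced here for illustration, and when a file contains more than one candidate header the sketch simply keeps the first one.

```python
import glob

import pandas as pd

# Candidate header names, copied from the list in the question.
customer_id = ["ID", "customer_id", "consumer_number", "cus_id", "client_ID"]


def collect_ids(paths, candidates, reader=pd.read_excel):
    """Hypothetical helper: stack the first matching ID column of each file."""
    frames = []
    for path in paths:
        # A callable usecols keeps only the headers for which it returns True,
        # so candidate names that are absent from a file are simply ignored.
        df = reader(path, usecols=lambda name: name in candidates)
        if df.shape[1] == 0:
            continue  # no candidate header found in this file; skip it
        # Keep the first matching column and give it a common name.
        frames.append(df.iloc[:, 0].rename("ID").to_frame())
    if not frames:
        return pd.DataFrame(columns=["ID"])
    # One concat at the end instead of appending inside the loop.
    return pd.concat(frames, ignore_index=True)


if __name__ == "__main__":
    all_data = collect_ids(sorted(glob.glob("./*.xlsx")), customer_id)
    print(all_data)
```

The reader argument is there only so the same logic can be tried with pd.read_csv on CSV files; with the default it calls pd.read_excel as in the question.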

pandas usecols and append from multiple DataFrames
