Creating a DataFrame from inconsistently named columns

Asked: 2017-05-26 23:21:45

Tags: python python-3.x csv pandas dataframe

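Because my source files (.csv) name their columns inconsistently, I end up with a pandas.DataFrame that has redundant column names. As a result, most of the values in those columns are NaN: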

Bike #  Bikenumber  Bike#   SubscriberType  SubscriptionType
 NaN       NaN     W20848      NaN             Subscriber
 NaN       NaN     W20231      NaN             Subscriber
 NaN       NaN     W00785      NaN             Subscriber
 NaN       NaN     W00126      NaN             Subscriber
 NaN       NaN     W20929      NaN             Casual

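Is there a way to create a new column and fill it from whichever of several columns has a value? And if more than one of those columns is not NaN, can I choose which column the value is taken from?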

 Bike#   Bikenumber   Bike #   Selected_Num
number1   number2      NaN       number2

When I fill from a single column, I get what I want:

sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'])
print(sample)

    Bike #  Bikenumber  Bike#   SubscriberType  SubscriptionType   Bike_Num
     NaN       NaN     W20848      NaN             Subscriber       W20848
     NaN       NaN     W20231      NaN             Subscriber       W20231
     NaN       NaN     W00785      NaN             Subscriber       W00785
     NaN       NaN     W00126      NaN             Subscriber       W00126
     NaN       NaN     W20929      NaN             Casual           W20929

But passing a second column fails:
sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'], sample['Bikenumber'])

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

1 answer:

Answer 0 (score: 1)

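I suggest you fix this while reading the CSVs, rather than trying to untangle the columns afterwards. One way is to run the CSV file through a small parser before handing it to pandas.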

The parser takes an open file handle for the csv, and a dict that maps each desired column name to its possible synonyms.

Code:

def read_my_csv(file_handle, column_map):
    # reverse the column mapping dict to use for synonym lookup
    synoms = dict(sum([
        [(syn, k) for syn in v] for k, v in column_map.items()], []))

    # build csv reader
    reader = csv.reader(file_handle)

    # get the header, and map columns to desired names
    header = next(reader)
    header = [synoms.get(c, c) for c in header]

    # yield the header
    yield header

    # yield the remaining rows
    for row in reader:
        yield row
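
The `dict(sum(...))` expression above inverts the mapping, so that each synonym points back to its desired name. As a rough illustration (the name `synonyms` is just my choice here, not part of the parser), the same lookup can be written as a dict comprehension, which some may find easier to read:

```python
column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

# Form used in the parser: build one list of (synonym, name) pairs per
# desired name, then flatten the lists with sum(..., [])
via_sum = dict(sum([
    [(syn, k) for syn in v] for k, v in column_map.items()], []))

# Equivalent dict comprehension: synonym -> desired column name
synonyms = {syn: name for name, syns in column_map.items() for syn in syns}

# Both forms produce the same lookup table
assert via_sum == synonyms
print(synonyms['Bikenumber'])
```

Any header that is not listed as a synonym falls through `synoms.get(c, c)` unchanged.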

Test code:

import pandas as pd
import csv

column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

with open("sample.csv", newline='') as f:
    generator = read_my_csv(f, column_map)
    columns = next(generator)
    df = pd.DataFrame(generator, columns=columns)

print(df)

Sample.csv:

Bike #,SubscriptionType
W20848,Subscriber
W20231,Subscriber
W00785,Subscriber
W00126,Subscriber
W20929,Casual

Results:

  Bike_Num     Sub_Num
0   W20848  Subscriber
1   W20231  Subscriber
2   W00785  Subscriber
3   W00126  Subscriber
4   W20929      Casual
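
If the DataFrame has already been built with the redundant columns, as in the question, the columns can also be coalesced after the fact. The `fillna` call in the question fails because `fillna` takes a single fill value; a second Series is not treated as another fallback. A minimal sketch, assuming made-up sample values and a `priority` list of my own choosing: put the columns in the order you want them consulted, back-fill across each row, and keep the first column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three synonymous columns, mostly NaN
sample = pd.DataFrame({
    'Bike #': [np.nan, 'W00001', np.nan],
    'Bikenumber': [np.nan, 'W00002', 'W00003'],
    'Bike#': ['W20848', np.nan, np.nan],
})

# Columns listed from highest to lowest priority (an assumption for this
# sketch; reorder to choose which column wins when several have values)
priority = ['Bikenumber', 'Bike#', 'Bike #']

# bfill(axis=1) pulls the first non-NaN value in each row leftwards,
# so the first column of the result holds the selected value
sample['Bike_Num'] = sample[priority].bfill(axis=1).iloc[:, 0]

print(sample)
```

In the second row both 'Bike #' and 'Bikenumber' have values, and 'Bikenumber' wins because it comes first in `priority`. Chaining `fillna` calls, one per fallback column, should give the same result.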

Solution #2

A cleaner, but less fun, solution is to rename the columns of each DataFrame before performing the concat:

Code:

def fix_column_names(df, column_map):
    # reverse the column mapping dict to use for synonym lookup
    synoms = dict(sum([
        [(syn, k) for syn in v] for k, v in column_map.items()], []))

    # rename columns
    df.columns = [synoms.get(c, c) for c in df.columns]

Test code:

import pandas as pd

column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

df = pd.read_csv('sample.csv', header=0)
fix_column_names(df, column_map)
print(df)
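
The test above only reads one file, so here is a rough sketch of the rename-then-concat step itself. The two in-memory CSVs and the helper name `rename_synonyms` are hypothetical stand-ins for your real files; unlike `fix_column_names`, this variant returns a renamed copy via `DataFrame.rename` instead of modifying the frame in place.

```python
import io
import pandas as pd

column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

def rename_synonyms(df, column_map):
    # invert the mapping: synonym -> desired column name
    synonyms = {syn: name for name, syns in column_map.items() for syn in syns}
    # rename() leaves unlisted columns alone and returns a new DataFrame
    return df.rename(columns=synonyms)

# Hypothetical inputs: two files that name the same columns differently
csv_a = "Bike #,SubscriptionType\nW20848,Subscriber\nW20231,Subscriber\n"
csv_b = "Bikenumber,SubscriberType\nW00785,Casual\n"

# rename each frame first, so concat lines the columns up
frames = [rename_synonyms(pd.read_csv(io.StringIO(text)), column_map)
          for text in (csv_a, csv_b)]
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

Because the names already agree when `concat` runs, the combined frame has just the two columns and no NaN padding.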