Because my .csv files are named inconsistently, I have a pandas.DataFrame
with redundant column names. As a result, the columns are mostly NaN:
    Bike #  Bikenumber  Bike#   SubscriberType  SubscriptionType
    NaN     NaN         W20848  NaN             Subscriber
    NaN     NaN         W20231  NaN             Subscriber
    NaN     NaN         W00785  NaN             Subscriber
    NaN     NaN         W00126  NaN             Subscriber
    NaN     NaN         W20929  NaN             Casual
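Is there a way to create a new column and fill it from whichever of several columns has a value? And if more than one of those columns is not NaN, can I choose which column the value is taken from?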
    Bike#    Bikenumber  Bike #  Selected_Num
    number1  number2     NaN     number2
When I fill from a single column, I can get this to work:
sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'])
print(sample)
    Bike #  Bikenumber  Bike#   SubscriberType  SubscriptionType  Bike_Num
    NaN     NaN         W20848  NaN             Subscriber        W20848
    NaN     NaN         W20231  NaN             Subscriber        W20231
    NaN     NaN         W00785  NaN             Subscriber        W00785
    NaN     NaN         W00126  NaN             Subscriber        W00126
    NaN     NaN         W20929  NaN             Casual            W20929
But this fails:
sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'], sample['Bikenumber'])
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Answer 0 (score: 1)
I would suggest fixing this while reading the CSVs, rather than trying to untangle the columns afterwards. One way to do that is to run each CSV file through a small parser before passing it to pandas.
The parser takes an open file handle for the csv and a dict that maps each desired column name to its possible synonyms.
Code:
    import csv

    def read_my_csv(file_handle, column_map):
        # reverse the column mapping dict to use for synonym lookup
        synonyms = {syn: k for k, v in column_map.items() for syn in v}
        # build csv reader
        reader = csv.reader(file_handle)
        # get the header, and map columns to desired names
        header = next(reader)
        header = [synonyms.get(c, c) for c in header]
        # yield the header
        yield header
        # yield the remaining rows
        for row in reader:
            yield row
Test code:
    import pandas as pd
    import csv

    column_map = {
        'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
        'Sub_Num': ('SubscriberType', 'SubscriptionType'),
    }

    # newline='' is what the csv module expects; the old 'rU' mode
    # is no longer accepted by recent Python 3 versions
    with open("sample.csv", newline='') as f:
        generator = read_my_csv(f, column_map)
        columns = next(generator)
        # build the frame while the file is still open (the generator is lazy)
        df = pd.DataFrame(generator, columns=columns)
    print(df)
sample.csv:
    Bike #,SubscriptionType
    W20848,Subscriber
    W20231,Subscriber
    W00785,Subscriber
    W00126,Subscriber
    W20929,Casual
Result:
      Bike_Num     Sub_Num
    0   W20848  Subscriber
    1   W20231  Subscriber
    2   W00785  Subscriber
    3   W00126  Subscriber
    4   W20929      Casual
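To see why normalizing the header at read time helps, here is a minimal sketch that applies the same header-mapping idea to two files with different spellings and then concatenates them. The helper `load_normalized` and the in-memory file contents are made up for illustration (`io.StringIO` stands in for `open(...)`); only the column names come from the question.

```python
import csv
import io

import pandas as pd

# Desired column name -> header spellings seen in the files.
column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}
# Reverse lookup: spelling found in a file -> desired column name.
synonyms = {syn: name for name, syns in column_map.items() for syn in syns}


def load_normalized(file_handle):
    """Hypothetical helper: read one CSV, renaming header synonyms on the way in."""
    reader = csv.reader(file_handle)
    header = [synonyms.get(c, c) for c in next(reader)]
    return pd.DataFrame(list(reader), columns=header)


# Made-up file contents with two different header spellings.
file_a = io.StringIO("Bike #,SubscriptionType\nW20848,Subscriber\nW20231,Subscriber\n")
file_b = io.StringIO("Bikenumber,SubscriberType\nW00785,Subscriber\nW20929,Casual\n")

# Both frames now share the same column names, so concat lines them up
# instead of creating redundant, mostly-NaN columns.
combined = pd.concat([load_normalized(f) for f in (file_a, file_b)],
                     ignore_index=True)
print(combined)
```

Because the names already agree before `pd.concat` runs, the combined frame should have only the two columns `Bike_Num` and `Sub_Num`.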
Solution #2
A cleaner, though less fun, solution is to rename the columns before performing the concat:
Code:
    def fix_column_names(df, column_map):
        # reverse the column mapping dict to use for synonym lookup
        synonyms = {syn: k for k, v in column_map.items() for syn in v}
        # rename columns in place
        df.columns = [synonyms.get(c, c) for c in df.columns]
Test code:
    import pandas as pd

    column_map = {
        'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
        'Sub_Num': ('SubscriberType', 'SubscriptionType'),
    }

    df = pd.read_csv('sample.csv', header=0)
    fix_column_names(df, column_map)
    print(df)
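If the frames have already been concatenated and the redundant columns exist, the values can still be coalesced afterwards. `fillna` takes one fill value per call, so a second fallback Series cannot be passed as an extra positional argument; the calls can be chained instead, or the row can be back-filled across a list of columns in priority order. The following is only a sketch: the data and the priority order are made up, and the column names are the ones from the question.

```python
import numpy as np
import pandas as pd

# Made-up frame shaped like the one in the question.
sample = pd.DataFrame({
    'Bike #':     [np.nan,   'W00001', 'W00003', np.nan],
    'Bikenumber': [np.nan,   'W00002', np.nan,   np.nan],
    'Bike#':      ['W20848', np.nan,   np.nan,   np.nan],
})

# Columns in priority order: the first non-NaN value in each row wins.
priority = ['Bikenumber', 'Bike#', 'Bike #']

# Option 1: chain fillna, one fallback column per call.
chained = sample['Bikenumber'].fillna(sample['Bike#']).fillna(sample['Bike #'])

# Option 2: back-fill across each row, then keep the first
# (highest-priority) column of the result.
sample['Bike_Num'] = sample[priority].bfill(axis=1).iloc[:, 0]

print(sample)
```

Reordering `priority` changes which column wins when several are populated; rows where every listed column is NaN stay NaN.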