KeyError on an object column after converting it to string in pandas?

Time: 2017-10-20 06:08:01

Tags: python pandas dataframe

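I have three csv files, which we can call a, b and c. File a has geographic information, including zip codes. File b has statistical data. File c has only zip codes.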

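I used pandas to convert a and b into dataframes (a_df and b_df), which I then joined on the columns the two dataframes share (intermediate_df). File c is read in and becomes a dataframe with zipcode as an integer type. I had to convert it to a string so that the zip codes would not be treated as integers. However, after that conversion c_df reports the column as object, which I assume is why I cannot join c_df and intermediate_df to produce final_df.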

To illustrate what I mean:

import pandas as pd

a_data = pd.read_csv("a.csv")
b_data = pd.read_csv("b.csv", dtype={'zipcode': 'str'})
a_df = pd.DataFrame(a_data)
b_df = pd.DataFrame(b_data)

# file c conversion
c_data = pd.read_csv("slcsp.csv", dtype={'zipcode': 'str'})
print ("This is c data types: ", c_data.dtypes)
c_conversion = c_data['zipcode'].apply(str)
print ("This is c_conversion data types: ", c_conversion.dtypes)
c_df = pd.DataFrame(c_conversion)
print ("This is c_df data types: ", c_df.dtypes)

# Joining on the two common columns to avoid duplicates
joined_ab_df = pd.merge(a_df, b_df, on=['state', 'area'])

# Dropping columns that are not needed anymore
ab_for_analysis_df = joined_ab_df.drop(['county_code', 'name', 'area'], axis=1)

# Time to analyze this dataframe. Let's pick out only the silver values for
# a specific attribute
silver_only_df = ab_for_analysis_df[ab_for_analysis_df.metal_name == 'Silver']

# Getting second lowest value of silver only
sorted_silver = silver_only_df.groupby('zipcode')['rate'].nsmallest(2)
sorted_silver_df = sorted_silver.to_frame()

print ("We cleaned up our data. Let's join the dataframes.")
print ("Final result...")
print (c_df.dtypes)
final_df = pd.merge(sorted_silver_df,c_df, on ='zipcode')

This is what we get after running it:

This is c_data types:  zipcode     object
rate       float64
dtype: object
This is c_conversion_data types:  object
This is c_df data types:  zipcode    object
dtype: object
zipcode    object
dtype: object

We cleaned up our data. Let's join the dataframes.
This is the final result...
KeyError: 'zipcode'

Any idea why the data type changes, and how to fix it so that the final join works?

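1 answer:

Answer 0: (score: 2)

If you convert to str, the resulting dtype is always object.

To check whether the values are strings, check the type of each value: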

print (c_data['zipcode'].apply(type))
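
A small sketch of the point above, using made-up CSV text (the `csv_text` contents are an assumption, not the asker's data). It shows that reading the column with `dtype` set to `str` keeps leading zeros, and that the values are Python strings even when the column dtype prints as `object`. Newer pandas releases may report a dedicated string dtype instead of `object`, so the sketch prints the dtype rather than assuming it.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for file c.
csv_text = "zipcode,rate\n01234,250.0\n98765,300.5\n"

# Default parsing infers integers, so the leading zero is lost.
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing str keeps the zip codes exactly as written in the file.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"zipcode": str})

print(as_int["zipcode"].tolist())
print(as_str["zipcode"].tolist())

# The column dtype describes the container; the element type is what
# tells you whether each value really is a string.
print(as_str["zipcode"].dtype)
print(as_str["zipcode"].apply(type).unique())
```

So an `object` dtype on a string column is expected and does not by itself block a merge.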

As for your last error:

You need reset_index, because after the groupby the zipcode in sorted_silver_df is part of the index, not a column:

sorted_silver_df = silver_only_df.groupby('zipcode')['rate'].nsmallest(2).reset_index()
final_df = pd.merge(sorted_silver_df,c_df, on ='zipcode')
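
To make the reset_index step concrete, here is a minimal sketch on toy data. The frames below are hypothetical stand-ins for `silver_only_df` and `c_df`, and the rates are invented. It shows that `nsmallest` on a grouped column leaves `zipcode` in the index, and that `reset_index` turns it back into a regular column that `merge` can use.

```python
import pandas as pd

# Hypothetical stand-ins for silver_only_df and c_df.
silver_only_df = pd.DataFrame({
    "zipcode": ["00123", "00123", "00123", "45678", "45678"],
    "rate": [300.0, 250.0, 275.0, 410.0, 400.0],
})
c_df = pd.DataFrame({"zipcode": ["00123", "45678"]})

# The result is a Series indexed by (zipcode, original row label).
sorted_silver = silver_only_df.groupby("zipcode")["rate"].nsmallest(2)
print(sorted_silver.index.names)

# reset_index moves both index levels into ordinary columns.
sorted_silver_df = sorted_silver.reset_index()
print(sorted_silver_df.columns.tolist())

# zipcode is now a column on both sides, so the merge key is found.
final_df = pd.merge(sorted_silver_df, c_df, on="zipcode")
print(final_df)
```

The extra column produced from the unnamed inner index level can be dropped afterwards if it is not needed.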

Thanks to Andy for an alternative (untested):

sorted_silver_df = silver_only_df.groupby('zipcode', as_index=False)['rate'].nsmallest(2)
final_df = pd.merge(sorted_silver_df,c_df, on ='zipcode')

Or use left_index=True and right_on in merge:

sorted_silver = silver_only_df.groupby('zipcode')['rate'].nsmallest(2)
sorted_silver_df = sorted_silver.to_frame()
final_df = pd.merge(sorted_silver_df,c_df, right_on ='zipcode', left_index=True)
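
One caveat that may be worth checking with the index-based merge: `nsmallest` leaves a two-level index (zipcode plus the original row label), and as far as I can tell a merge with `left_index=True` against a single `right_on` key can fail because the number of levels does not match. The sketch below is an assumption about how to work around that, not part of the original answer: it drops the inner level first so that the index holds only `zipcode`. The data is invented.

```python
import pandas as pd

# Hypothetical stand-ins for silver_only_df and c_df.
silver_only_df = pd.DataFrame({
    "zipcode": ["00123", "00123", "00123", "45678", "45678"],
    "rate": [300.0, 250.0, 275.0, 410.0, 400.0],
})
c_df = pd.DataFrame({"zipcode": ["00123", "45678"]})

sorted_silver = silver_only_df.groupby("zipcode")["rate"].nsmallest(2)

# Drop the inner (row label) level so only zipcode remains in the index.
by_zip = sorted_silver.reset_index(level=1, drop=True)
sorted_silver_df = by_zip.to_frame()

# Match the left index (zipcode) against the zipcode column on the right.
final_df = pd.merge(sorted_silver_df, c_df, left_index=True, right_on="zipcode")
print(final_df)
```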