Question

I have a column in a Pandas DataFrame whose value_counts looks like this:

1                      246804
2                      135272
5                        8983
8                        3459
4                        3177
6                        1278
9                         522
D                         314
E                          91
0                          29
F                          20    
Name: Admission_Source_Code, dtype: int64

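As you can see, it contains both integers and letters. I need to write a function that filters on and searches for the letter values.

I originally imported this dataset with pd.read_excel, but after reading several bug reports, it seems read_excel has no option to read a column explicitly as a string.

So I tried reading it with pd.read_csv and its dtype option. Initially this column was stored as float64 by default, and now even when I run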

Df_name['Admission_Source_Code'] = Df_name['Admission_Source_Code'].astype(int).astype('str')

I cannot get it converted to a string column.

So when I filter with

Accepted[Accepted['Admission_Source_Code']==1]

it works, but

Accepted[Accepted['Admission_Source_Code']=='E']

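still returns no results. When I try str(column_name) in the mask, it reports an invalid literal. Can someone help me with how to change the dtype, or how to filter the letter values?

Thanks.

P.S. Even converting the column to object did not help.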

Answer 1

I think you should be able to filter your value_counts Series with the .loc[] indexer, filtering (indexing) by strings.

Demo:

In [27]: df
Out[27]:
                        Count
Admission_Source_Code
1                      246804
2                      135272
5                        8983
8                        3459
4                        3177
6                        1278
9                         522
D                         314
E                          91
0                          29
F                          20

In [28]: df.index.dtype
Out[28]: dtype('O')

In [29]: df.loc['2']
Out[29]:
Count    135272
Name: 2, dtype: int64

In [30]: df.loc[['2','E','5','D']]
Out[30]:
                        Count
Admission_Source_Code
2                      135272
E                          91
5                        8983
D                         314

List of index values:

In [36]: df.index.values
Out[36]: array(['1', '2', '5', '8', '4', '6', '9', 'D', 'E', '0', 'F'], dtype=object)

UPDATE: starting from Pandas 0.20.1, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers.
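The same idea can be applied without typing out the labels by hand. The following is only a minimal sketch: the sample values are made up for illustration, and it assumes the column can simply be cast to strings before counting.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's column.
df = pd.DataFrame({"Admission_Source_Code": [1, 2, "E", "D", 1, "E", 5]})

# Cast every value to str so the value_counts index holds only strings.
counts = df["Admission_Source_Code"].astype(str).value_counts()

# Select specific labels by string with .loc.
letters = counts.loc[["E", "D"]]
print(letters)

# Keep only the alphabetic labels without naming them one by one.
alpha = counts[counts.index.str.isalpha()]
print(alpha)
```

Casting first means '1' and 'E' are both plain strings, so the index dtype is object and every lookup uses a string label, as in the demo above.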

Answer 2

I ran some tests using your example, and the filter works fine. For example:

import pandas
df = pandas.read_csv('Yourfile.csv')
df['Admission_Source_Code'].value_counts()

1                      246804
2                      135272
5                        8983
8                        3459
4                        3177
6                        1278
9                         522
D                         314
E                          91
0                          29
F                          20    
Name: Admission_Source_Code, dtype: int64

If I try:

print (df[(df['Admission_Source_Code']==1)])

I get:

Empty DataFrame
Columns: [Admission_Source_Code]
Index: []
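An empty result like this is what one would expect when the column holds the string '1' rather than the integer 1. A small sketch of that mismatch, using made-up values and assuming the column was read as text:

```python
import pandas as pd

# Hypothetical column read as text: every value is a str, including the digits.
df = pd.DataFrame({"Admission_Source_Code": ["1", "2", "E", "1"]}, dtype=object)

# Comparing str values to the integer 1 never matches.
int_match = df[df["Admission_Source_Code"] == 1]

# Comparing str values to the string "1" matches as expected.
str_match = df[df["Admission_Source_Code"] == "1"]

print(len(int_match), len(str_match))
```

Checking type(df['Admission_Source_Code'].iloc[0]) on the real data is a quick way to see which of the two cases applies.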

But with a list comprehension that converts every value to a string:

df['Admission_Source_Code'] = [str(i) for i in df['Admission_Source_Code']]

With the sample data, the column then holds strings and can be filtered by string value.

If the problem persists, have you considered cleaning the items in the CSV column (i.e. stray whitespace)?

For example, using the same list comprehension with strip():

df['Admission_Source_Code'] = [str(i).strip() for i in df['Admission_Source_Code']]
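Putting these pieces together, one possible approach is to read the column as text from the start and strip it with the vectorized .str accessor. This is a sketch under assumptions: the CSV contents below are invented, and the column is assumed to have no missing values.

```python
import io
import pandas as pd

# Hypothetical CSV contents with stray whitespace around some values.
csv_text = "Admission_Source_Code\n1\n 2\nE \n D\n1\n"

# Read the column as text so digits and letters share one type.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Admission_Source_Code": str})

# Strip surrounding whitespace from every value.
df["Admission_Source_Code"] = df["Admission_Source_Code"].str.strip()

# Rows whose code is exactly 'E', and rows whose code is any letter.
only_e = df[df["Admission_Source_Code"] == "E"]
letters = df[df["Admission_Source_Code"].str.isalpha()]
print(letters)
```

Once the column is text, the numeric codes must also be compared as strings, for example == '1' rather than == 1.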

How to filter string values in a mixed-datatype object column in a Python Pandas DataFrame
