Question

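I have ICD-9 codes that are a mix of integers and strings. I need to group the codes. Codes that are strings go into a group named 'abc', and the purely numeric codes are grouped according to the range of values they fall in. I have tried several approaches without luck; here are a couple of them.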

a=pd.Series(['v2',2,7,22,'v4'])
print (a.dtype)
a=a.apply(lambda x: 'abc' if x[0]=='v' else x)
a=a.apply(lambda x: 'def' if x>=1 and x<10 else x)
a=a.apply(lambda x: 'ghi' if x>=10 and x<30 else x)

This gives me the error message:

'int' object is not subscriptable

I also tried:

a=pd.Series(['v2',2,7,22,'v4'])
print (a.dtype)
a=a.apply(lambda x: 'abc' if x.astype(str).str[0]=='v' else x)
a=a.apply(lambda x: 'def' if x.astype(int)>=1 and x.astype(int)<10 else x)
a=a.apply(lambda x: 'ghi' if x.astype(int)>=10 and x.astype(int)<30 else x)

which gives the error message:

'str' object has no attribute 'astype'

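Thanks for your help. I need to use pandas, because this is part of a larger DataFrame. There is a further complication: some of my codes start with 'e' and some start with 'v', and they need to go into different categories. In addition, when I use to_numeric on my DataFrame, it does not convert the numeric elements in the column to a numeric dtype. (The code below refers to my actual data, where diag_1 etc. are column names and diabetic_data is the DataFrame.)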

    list_diag=['diag_1','diag_2','diag_3']
    for i in list_diag:
        pd.to_numeric(diabetic_data[i],errors='coerce').fillna(-1)
        print(diabetic_data[i].dtype)

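Any idea why the dtype is not being converted? At the moment the elements in the column are treated as strings, because when I try 'isinstance(x, str)', effectively the whole column gets converted to 'abc'.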

Answer 1

I would use the pd.cut() method:

In [15]: a
Out[15]:
0    v2
1     2
2     7
3    22
4    v4
dtype: object

In [16]: pd.cut(pd.to_numeric(a, errors='coerce').fillna(-1),
    ...:        bins=[-np.inf, -1, 9, np.inf],
    ...:        labels=['abc','def','ghi']
    ...: )
    ...:
Out[16]:
0    abc
1    def
2    def
3    ghi
4    abc
dtype: category
Categories (3, object): [abc < def < ghi]

NOTE: this solution assumes that you have no negative numbers in your Series.

Explanation

First, replace all non-numeric values with -1:

In [17]: pd.to_numeric(a, errors='coerce').fillna(-1)
Out[17]:
0    -1.0
1     2.0
2     7.0
3    22.0
4    -1.0
dtype: float64

Now we can use pd.cut() to categorize the values:

In [18]: pd.cut(pd.to_numeric(a, errors='coerce').fillna(-1),
    ...:        bins=[-np.inf, -1, 9, np.inf],
    ...:        labels=['abc','def','ghi']
    ...: )
    ...:
Out[18]:
0    abc
1    def
2    def
3    ghi
4    abc
dtype: category
Categories (3, object): [abc < def < ghi]
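
To see where the bin edges put each value, it can help to call pd.cut() without labels, so that it reports the intervals themselves. This is only a sketch on the same sample Series; by default the intervals are closed on the right, so the three bins are (-inf, -1], (-1, 9] and (9, inf].

```python
import numpy as np
import pandas as pd

a = pd.Series(['v2', 2, 7, 22, 'v4'])

# Non-numeric entries become NaN and then -1 (assumes no real negative codes).
numeric = pd.to_numeric(a, errors='coerce').fillna(-1)

# Without labels, pd.cut returns the interval each value fell into.
intervals = pd.cut(numeric, bins=[-np.inf, -1, 9, np.inf])
print(intervals.cat.categories)
print(intervals.cat.codes.tolist())
```

Note that with these edges every value above 9 lands in the last bin, including values of 30 or more. If the upper range has to stop at 30 as in the question, another edge would be needed.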

UPDATE: here is a more generic solution (thanks to @Boud for the hint!), which also works for negative numbers.

Source DF:

In [34]: x
Out[34]:
   val
0   v2
1  -10
2   -1
3    0
4   v5
5    9
6   10
7   13
8   22
9   v4

In [35]: x.assign(
    ...:    cat=pd.cut(pd.to_numeric(x['val'], errors='coerce').fillna(-np.inf),
    ...:        bins=[-np.inf, np.iinfo(np.int64).min, -1, np.inf],
    ...:        labels=['NaN','<0','>=0'],
    ...:        include_lowest=True))
    ...:
Out[35]:
   val  cat
0   v2  NaN
1  -10   <0
2   -1   <0
3    0  >=0
4   v5  NaN
5    9  >=0
6   10  >=0
7   13  >=0
8   22  >=0
9   v4  NaN
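
The question raises two more points: codes starting with 'e' and 'v' should go into different categories, and the to_numeric call seemed to have no effect. On the second point, pd.to_numeric() returns a new Series and does not modify the column in place, so the result has to be assigned. Below is a rough sketch of one way to handle both. The helper name group_codes, the labels 'e_codes', 'v_codes' and 'other', and the sample DataFrame are all made up for illustration.

```python
import numpy as np
import pandas as pd

def group_codes(col):
    """Hypothetical helper: label 'e'/'v' codes by prefix, bin numeric codes."""
    as_str = col.astype(str).str.lower()
    # to_numeric returns a NEW Series; non-numeric entries become NaN.
    numeric = pd.to_numeric(col, errors='coerce')
    conditions = [
        as_str.str.startswith('e'),
        as_str.str.startswith('v'),
        (numeric >= 1) & (numeric < 10),
        (numeric >= 10) & (numeric < 30),
    ]
    choices = ['e_codes', 'v_codes', 'def', 'ghi']
    # The first matching condition wins; everything else is labelled 'other'.
    labels = np.select(conditions, choices, default='other')
    return pd.Series(labels, index=col.index)

# Made-up sample standing in for the real data (all values stored as strings).
sample = pd.DataFrame({
    'diag_1': ['v2', '2', '7', '22', 'e4'],
    'diag_2': ['15', 'E909', '40', '9', '?'],
})

for name in ['diag_1', 'diag_2']:
    # Assign the result back; otherwise the DataFrame is left unchanged.
    sample[name + '_group'] = group_codes(sample[name])

print(sample)
```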

Answer 2

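You are trying to test the type, but you are using the wrong (nonexistent) function. Here is a way to do it while keeping the style of your algorithm: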

a.apply(lambda x: 'abc' if isinstance(x, str) else
                  'def' if x>=1 and x<10 else
                  'ghi' if x>=10 and x<30 else x)
Out[31]: 
0    abc
1    def
2    def
3    ghi
4    abc
dtype: object

Note that, for readability, I would recommend MaxU's approach using pd.cut.
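
One caveat: the question mentions that in the real data isinstance(x, str) turned every value into 'abc', which suggests the numeric codes are stored as strings there. A possible variation, sketched with a hypothetical helper named classify, is to try parsing each element first and only fall back to 'abc' when that fails:

```python
import pandas as pd

def classify(x):
    """Hypothetical helper: parse first, fall back to the string group."""
    try:
        value = float(x)
    except (TypeError, ValueError):
        return 'abc'   # not parseable as a number, e.g. 'v2'
    if 1 <= value < 10:
        return 'def'
    if 10 <= value < 30:
        return 'ghi'
    return x           # outside both ranges: left unchanged

# Sample with numeric codes stored partly as strings (made up for illustration).
a = pd.Series(['v2', '2', 7, '22', 'v4'])
result = a.apply(classify)
print(result.tolist())  # ['abc', 'def', 'def', 'ghi', 'abc']
```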

Answer 3

Without using pandas:

List comprehensions with isinstance() select the elements by type.

itertools.groupby() needs a key function; I just made one up.

a = ['v2',2,7,22,'v4', 77, 'fred']

a_strs = [e for e in a if isinstance(e, str)]

print('strings: ', a_strs)

a_ints = [e for e in a if isinstance(e, int)]

print('ints: ', a_ints)


from itertools import groupby

groups = [list(g) for k,g in groupby(a_ints, key=lambda x: x//10)]

print('group by decade ', groups)

strings:  ['v2', 'v4', 'fred']
ints:  [2, 7, 22, 77]
group by decade  [[2, 7], [22], [77]]
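
Keep in mind that itertools.groupby() only groups consecutive items with equal keys, so the example above works because the integers already appear in ascending order. For unsorted input, a sketch like the following sorts first (the sample list is made up):

```python
from itertools import groupby

a = ['v2', 22, 7, 'v4', 77, 2, 'fred', 25]

a_ints = [e for e in a if isinstance(e, int)]

# Sort before grouping so that items of the same decade are adjacent.
groups = [list(g) for _, g in groupby(sorted(a_ints), key=lambda x: x // 10)]

print('group by decade ', groups)  # group by decade  [[2, 7], [22, 25], [77]]
```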

How to apply a function to a Series with mixed numbers and integers

3 answers: