Question

I have a DataFrame as follows:

df

    id   val
0    1   21
1    2   35
2    2   45
3    3   55
4    1   10
5    4   90
6    3   45
7    2   78
8    3   23

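I want to create a new column cat based on the number of times each value occurs in id (written len(id) below).

If len(id) <= 1, the value in cat should be 'A'.

If len(id) < 3, the value should be 'B'.

If len(id) >= 3, the value should be 'C'.

Expected output: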

    id   val   cat
0    1   21     B
1    2   35     C
2    2   45     C
3    3   55     C
4    1   10     B
5    4   90     A
6    3   45     C
7    2   78     C
8    3   23     C

What I tried:

def test(series):
    if len(series) <= 1:
        return 'A'
    elif len(series) < 3:
        return 'B'
    else:
        return 'C'


df.groupby('id').apply(test)

The code above raises an error:

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

Answer 1

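Here is a simpler approach using pd.Series.value_counts and np.where.

Avoid pd.Series.apply where possible, since it is only a thin wrapper around a Python loop. The power of pandas only becomes apparent when you can vectorize the computation.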

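import numpy as np
import pandas as pd

# Map each id to the number of times it occurs in the column
df['count'] = df['id'].map(df['id'].value_counts())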

df['cat'] = np.where(df['count'] <= 1, 'A',
                     np.where(df['count'] < 3, 'B', 'C'))

#    id  val  count cat
# 0   1   21      2   B
# 1   2   35      3   C
# 2   2   45      3   C
# 3   3   55      3   C
# 4   1   10      2   B
# 5   4   90      1   A
# 6   3   45      3   C
# 7   2   78      3   C
# 8   3   23      3   C
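
The same vectorized idea can be sketched without the helper count column. This is not the answerer's code: it swaps in groupby(...).transform('size') for the map/value_counts step and np.select for the nested np.where, and it assumes the sample data from the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'id': [1, 2, 2, 3, 1, 4, 3, 2, 3],
    'val': [21, 35, 45, 55, 10, 90, 45, 78, 23],
})

# Number of rows sharing each id, broadcast back onto every row
counts = df.groupby('id')['id'].transform('size')

# Conditions are checked in order and the first match wins;
# rows matching neither condition fall through to the default 'C'
df['cat'] = np.select([counts <= 1, counts < 3], ['A', 'B'], default='C')
```

np.select keeps the conditions in a flat list, which tends to stay readable if more thresholds are added later.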

Answer 2

You can use map, value_counts, and pd.cut:

df['cat'] = df.id.map(pd.cut(df.id.value_counts(),
                             bins=[0, 1, 2, np.inf],
                             labels=['A', 'B', 'C']))

Output:

   id  val cat
0   1   21   B
1   2   35   C
2   2   45   C
3   3   55   C
4   1   10   B
5   4   90   A
6   3   45   C
7   2   78   C
8   3   23   C
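
To see why those bins reproduce the three rules, it may help to split the one-liner into steps. The sketch below assumes the question's sample data; the names counts and labels are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'id': [1, 2, 2, 3, 1, 4, 3, 2, 3],
    'val': [21, 35, 45, 55, 10, 90, 45, 78, 23],
})

# Step 1: id -> number of occurrences
counts = df['id'].value_counts()

# Step 2: pd.cut uses right-closed intervals by default, so the bins are
# (0, 1] -> 'A', (1, 2] -> 'B', (2, inf] -> 'C'
labels = pd.cut(counts, bins=[0, 1, 2, np.inf], labels=['A', 'B', 'C'])

# Step 3: look up each row's id in the id -> label Series
df['cat'] = df['id'].map(labels)
```

pd.cut returns categorical data, so calling .astype(str) on the new column is one way to get plain strings if that is what downstream code expects.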

Answer 3

You can get the value counts of the id column, convert them to a pandas DataFrame, and perform a left join:

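# Count occurrences of each id and turn the result into a two-column frame
x = df['id'].value_counts().rename_axis('id').reset_index(name='count')
# Left join so that every row of df picks up the count for its id
df = df.merge(x, on='id', how='left')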
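
A self-contained sketch of this join-based approach, extended to assign the labels, might look like the following. It assumes the question's sample data; the labelling step with np.where is borrowed from Answer 1 rather than taken from this answer, and the names counts and result are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'id': [1, 2, 2, 3, 1, 4, 3, 2, 3],
    'val': [21, 35, 45, 55, 10, 90, 45, 78, 23],
})

# One row per distinct id, with explicit column names 'id' and 'count'
counts = df['id'].value_counts().rename_axis('id').reset_index(name='count')

# Label each distinct id according to its count
counts['cat'] = np.where(counts['count'] <= 1, 'A',
                         np.where(counts['count'] < 3, 'B', 'C'))

# Left join keeps every row of df in its original order
result = df.merge(counts[['id', 'cat']], on='id', how='left')
```

Labelling the small per-id table before joining means the thresholds are evaluated once per distinct id rather than once per row.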

Answer 4

I could not reproduce your error using Python 3.6.2 and Pandas 0.21.0. Your original code works as expected and can be combined with pandas.merge to get the desired output:

In [2]: pandas.__version__
Out[2]: '0.21.0'

In [3]: df = pandas.DataFrame({
    'id': [1, 2, 2, 3, 1, 4, 3, 2, 3], 
    'val': [21, 35, 45, 55, 10, 90, 45, 78, 23]
})

In [4]: def test(series):
   ...:     if len(series) <= 1:
   ...:         return 'A'
   ...:     elif len(series) < 3:
   ...:         return 'B'
   ...:     else:
   ...:         return 'C'
   ...:         

In [5]: pandas.merge(
    df, 
    pandas.DataFrame({'cat': df.groupby('id').apply(test)}),
    left_on='id', 
    right_index=True, 
    how='right'
)
Out[5]: 
   id  val cat
0   1   21   B
4   1   10   B
1   2   35   C
2   2   45   C
7   2   78   C
3   3   55   C
6   3   45   C
8   3   23   C
5   4   90   A

In [6]: df.groupby('id').apply(test)
Out[6]: 
id
1    B
2    C
3    C
4    A
dtype: object

Create a new column in pandas based on the number of occurrences of each unique value in another column
