Fill missing categorical values based on other categorical values in a pandas DataFrame

Time: 2017-10-02 20:10:59

Tags: pandas

I want to fill missing categorical values in a pandas DataFrame with the most frequent value within another category. For example,


import pandas as pd
import numpy as np

# Rows 1 and 6 have a missing product.
data = {'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink'],
        'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange', 'grape', np.nan],
        'price': [25, 94, 57, 62, 70, 50, 60]}
df = pd.DataFrame(data)
df

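produces

    price | product | type
0   25    | coca    | softdrink
1   94    | NaN     | juice
2   57    | pepsi   | softdrink
3   62    | pepsi   | softdrink
4   70    | orange  | juice
5   50    | grape   | juice
6   60    | NaN     | softdrink

I want to fill the missing product in the 'softdrink' row with 'pepsi' (the most frequent product in that category), and the missing product in the 'juice' row with 'grape'. If there were no category groups, my solution would be to find the most frequent value in the column and assign it to the missing values.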
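The column-wide fallback described above can be sketched as follows. This is only an illustration of that idea, not the grouped fill the question is after; the variable names `most_common` and `filled` are assumptions introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange', 'grape', np.nan],
    'price': [25, 94, 57, 62, 70, 50, 60],
})

# mode() ignores NaN by default and returns the most frequent value(s), sorted.
most_common = df['product'].mode().iloc[0]

# Every missing product gets the same value, regardless of its type.
filled = df['product'].fillna(most_common)
print(filled.tolist())
```

Because the type is ignored, the missing 'juice' product is also filled with 'pepsi', which is why a per-group fill is needed.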

I first tried groupby, but I had difficulty completing the task because the return value of the command

df.groupby('type')['product'].value_counts()   

is a pandas Series indexed by type and product:

type      |   product    
juice     |    grape  |   1    
          |   orange  |   1    
softdrink | pepsi     |   2    
          | coca      |   1    
Name: product, dtype: int64    

How can I find out which product + type combination has the highest frequency?

1 Answer:

Answer 0: (score: 2)

IIUC

Use mode

Data input

import pandas as pd
import numpy as np
data = {'type': ['softdrink', 'juice', 'softdrink', 'softdrink',    'juice','juice','softdrink'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange','grape',np.nan],
    'price': [25, 94, 57, 62, 70,50,60]}
df = pd.DataFrame(data)

Solution

df.groupby('type').product.transform(lambda x: x.fillna(x.mode()[0]))

Out[28]: 
0      coca
1     grape
2     pepsi
3     pepsi
4    orange
5     grape
6     pepsi
Name: product, dtype: object
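To see which product is most frequent within each type (the lookup the question asks about), one option is to compute the per-type mode as a small lookup table and map it back onto the rows. This is a sketch rather than part of the original answer; the names `modes` and `filled` are illustrative, and it assumes every type has at least one non-missing product.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange', 'grape', np.nan],
    'price': [25, 94, 57, 62, 70, 50, 60],
})

# One row per type: the most frequent product in that type.
# mode() returns sorted values, so ties resolve to the first in sort order.
modes = df.groupby('type')['product'].agg(lambda x: x.mode().iloc[0])
print(modes)

# Look up each row's type in the table and use it only where product is missing.
filled = df['product'].fillna(df['type'].map(modes))
print(filled.tolist())
```

For 'juice', 'grape' and 'orange' each appear once; 'grape' is chosen only because it sorts first.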

New df

df['product']=df.groupby('type').product.transform(lambda x: x.fillna(x.mode()[0]))
df
Out[40]: 
   price product       type
0     25    coca  softdrink
1     94   grape      juice
2     57   pepsi  softdrink
3     62   pepsi  softdrink
4     70  orange      juice
5     50   grape      juice
6     60   pepsi  softdrink
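One caveat: `x.mode()[0]` raises if a group contains no observed value at all, because the mode of an all-NaN group is empty. A minimal sketch of a more defensive variant is below. The helper name `fill_with_group_mode`, the extra 'water' row, and the choice to fall back to the column-wide mode are all assumptions made for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd


def fill_with_group_mode(df, group_col, value_col):
    """Fill NaNs in value_col with the most frequent value of each group.

    Hypothetical helper: a group with no observed value falls back to the
    column-wide mode instead of raising.
    """
    overall = df[value_col].mode()
    fallback = overall.iloc[0] if not overall.empty else np.nan

    def fill(group):
        modes = group.mode()  # sorted; NaN is ignored by default
        return group.fillna(modes.iloc[0] if not modes.empty else fallback)

    return df.groupby(group_col)[value_col].transform(fill)


df = pd.DataFrame({
    'type': ['softdrink', 'juice', 'softdrink', 'softdrink',
             'juice', 'juice', 'softdrink', 'water'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi',
                'orange', 'grape', np.nan, np.nan],
})

# The 'water' group has no observed product, so it uses the column-wide mode.
filled = fill_with_group_mode(df, 'type', 'product')
print(filled.tolist())
```

Whether a column-wide fallback is appropriate depends on the data; leaving such rows as NaN may be the safer choice in some cases.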