I would like to fill missing categorical values in a pandas DataFrame with the most frequent value within another category. For example,
import pandas as pd
import numpy as np
# Row 6 is a softdrink with a missing product.
data = {'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink'],
        'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange', 'grape', np.nan],
        'price': [25, 94, 57, 62, 70, 50, 60]}
df = pd.DataFrame(data)
df
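results in
   price product       type
0     25    coca  softdrink
1     94     NaN      juice
2     57   pepsi  softdrink
3     62   pepsi  softdrink
4     70  orange      juice
5     50   grape      juice
6     60     NaN  softdrink
I want to fill the missing product in row 6 with 'pepsi', the most frequent product in the softdrink category, and the missing product in row 1 with 'grape' for the juice category. Without the type grouping, my solution would be to find the most frequent value in the column and assign it to the missing values.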
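For comparison, the ungrouped fallback mentioned above might look like the sketch below. It assumes the same example data and ignores `type` entirely, so both missing products receive the column-wide mode. The names `overall_mode` and `filled` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange', 'grape', np.nan],
    'price': [25, 94, 57, 62, 70, 50, 60],
})

# Most frequent product across the whole column (NaN is ignored by mode()).
overall_mode = df['product'].mode().iloc[0]

# Assign that single value to every missing entry, regardless of type.
filled = df['product'].fillna(overall_mode)
print(filled)
```

With this data the juice row would also be filled with 'pepsi', which is why a per-type fill is needed.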
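I am having trouble completing the task because the return value of the command
df.groupby('type')['product'].value_counts()
is a pandas Series indexed by type and product:
type       product
juice      grape      1
           orange     1
softdrink  pepsi      2
           coca       1
Name: product, dtype: int64
How can I tell which product has the highest frequency within each type?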
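One possible way to read the most frequent product per type out of that `value_counts()` result is sketched below. It flattens the counts into a table, sorts by count, and keeps the first row for each type. The column name `n` and the variable names are assumptions made for illustration, and when two products tie (as grape and orange do here) which one is kept is not guaranteed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange', 'grape', np.nan],
    'price': [25, 94, 57, 62, 70, 50, 60],
})

# Series of counts with a (type, product) MultiIndex.
counts = df.groupby('type')['product'].value_counts()

# Flatten to columns: type, product, n (the count).
table = counts.reset_index(name='n')

# Highest count first, then keep one row per type.
top = (table.sort_values('n', ascending=False)
            .drop_duplicates('type')
            .set_index('type')['product'])
print(top)
```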
Answer 0 (score: 2)
IIUC
Using mode
Data input
import pandas as pd
import numpy as np
data = {'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice','juice','softdrink'],
'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange','grape',np.nan],
'price': [25, 94, 57, 62, 70,50,60]}
df = pd.DataFrame(data)
Solution
# Bracket selection avoids any clash with the DataFrame.product method.
df.groupby('type')['product'].transform(lambda x: x.fillna(x.mode()[0]))
Out[28]:
0      coca
1     grape
2     pepsi
3     pepsi
4    orange
5     grape
6     pepsi
Name: product, dtype: object
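The juice row is filled with 'grape' even though grape and orange each appear once. A small sketch of why: `Series.mode()` returns every tied value in sorted order, so taking element `[0]` picks the alphabetically first one. The Series below is just the juice group's products copied from the data input.

```python
import numpy as np
import pandas as pd

# Products of the juice group: one missing value and a two-way tie.
juice = pd.Series([np.nan, 'orange', 'grape'], name='product')

modes = juice.mode()   # all tied modes, sorted; NaN is dropped by default
first = modes[0]       # the value that x.mode()[0] would use for filling
print(modes.tolist(), first)
```

If a different tie-breaking rule is wanted, the lambda passed to `transform` would need to choose among `modes` explicitly.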
New df
df['product'] = df.groupby('type')['product'].transform(lambda x: x.fillna(x.mode()[0]))
df
Out[40]:
   price product       type
0     25    coca  softdrink
1     94   grape      juice
2     57   pepsi  softdrink
3     62   pepsi  softdrink
4     70  orange      juice
5     50   grape      juice
6     60   pepsi  softdrink
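One caveat: `x.mode()[0]` raises an error when a group has no observed product at all, because the mode is then empty. A minimal sketch of a more defensive variant follows. The helper name `fill_with_group_mode` and the extra 'water' row are hypothetical and were added only to exercise that case.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(frame, group_col, value_col):
    """Fill missing values in value_col with the mode of their group_col group."""
    def _fill(values):
        modes = values.mode()      # empty when the group has no observed value
        if modes.empty:
            return values          # nothing to fill with, so leave NaN in place
        return values.fillna(modes.iloc[0])
    return frame.groupby(group_col)[value_col].transform(_fill)

# Same data as above plus a hypothetical 'water' row whose only product is missing.
df = pd.DataFrame({
    'type': ['softdrink', 'juice', 'softdrink', 'softdrink',
             'juice', 'juice', 'softdrink', 'water'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi',
                'orange', 'grape', np.nan, np.nan],
    'price': [25, 94, 57, 62, 70, 50, 60, 10],
})

filled = fill_with_group_mode(df, 'type', 'product')
print(filled)
```

Groups with at least one observed product behave as in the answer above; the 'water' row simply stays missing.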