Best way to bin (categorize) values based on multiple columns

Date: 2019-09-13 02:28:22

Tags: python python-3.x pandas dataframe

I need to bin the values of two columns into a third column.

Suppose the following is my pandas df:

import pandas as pd

data = {'material': ['Matl_A', 'Matl_B', 'Matl_B', 'Matl_A'],
        'strength': [10, 20, 30, 100]}
df = pd.DataFrame(data)

So my df is:

  material   strength  
 ---------- ---------- 
  Matl_A           10  
  Matl_B           20  
  Matl_B           30  
  Matl_A          100  

I would like to end up with something like this:

  material   strength    grade
 ---------- ---------- ---------
  Matl_A           10       1
  Matl_B           20       4
  Matl_B           80       5
  Matl_A          100       2

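What is the best way to do this?

Edit:

I used Michael Gardner's answer below and extended it, because we have many materials. Hopefully this revised version gives a clearer picture. If I have 20 materials, each with different condition ranges to classify, what would be a more elegant approach?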

    import numpy as np
    import pandas as pd

    strength = np.random.randint(low=1, high=30, size=20)
    material = ['matl_a', 'matl_b', 'matl_b', 'matl_a', 'matl_d',
                'matl_b', 'matl_d', 'matl_a', 'matl_a', 'matl_b',
                'matl_a', 'matl_b', 'matl_e', 'matl_a', 'matl_c',
                'matl_b', 'matl_c', 'matl_a', 'matl_a', 'matl_b']

    data = {'material':material, 
            'strength':strength } 
    df = pd.DataFrame(data)

    def grading(df):
        if df['material'] == 'matl_a':
            if 0 <= df['strength'] <=10:
                return 1
            elif 11 <= df['strength'] <= 20:
                return 2
            elif 21 <= df['strength'] <= 30:
                return 3
            elif 31 <= df['strength'] <= 40:
                return 4
            else:
                return 5
        elif df['material'] == 'matl_b':
            if 0 <= df['strength'] <=10:
                return 6
            elif 11 <= df['strength'] <= 20:
                return 7
            elif 21 <= df['strength'] <= 30:
                return 8
            elif 31 <= df['strength'] <= 40:
                return 9
            else:
                return 10
        elif df['material'] == 'matl_c':
            if 0 <= df['strength'] <=10:
                return 11
            elif 11 <= df['strength'] <= 20:
                return 12
            elif 21 <= df['strength'] <= 30:
                return 13
            elif 31 <= df['strength'] <= 40:
                return 14
            else:
                return 15        
        else:
            if 0 <= df['strength'] <=10:
                return 16
            elif 11 <= df['strength'] <= 20:
                return 17
            elif 21 <= df['strength'] <= 30:
                return 18
            elif 31 <= df['strength'] <= 40:
                return 19
            else:
                return 20

    df['grade'] = df.apply(grading, axis=1)

3 answers:

Answer 0 (score: 2)

Use np.select:

a = df.material.eq('Matl_A')
b = df.material.eq('Matl_B')

df['grade'] = np.select([a & df.strength.between(5,10),
                         a & df.strength.between(11,20),
                         b & df.strength.between(10,50),
                         b & df.strength.between(50,100)],
                         ['A', 'B', 'A', 'B'],
                         default='C')
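
The ranges for Matl_B in the snippet above overlap at 50, so it helps to see how np.select resolves that. The following is a minimal self-contained sketch, not the answerer's exact code: the strength values are illustrative (one row is set to 50 on purpose) and the names `conditions` and `choices` are introduced here for readability.

```python
import numpy as np
import pandas as pd

# Illustrative data; the 50 is chosen to hit both Matl_B ranges.
df = pd.DataFrame({
    'material': ['Matl_A', 'Matl_B', 'Matl_B', 'Matl_A'],
    'strength': [10, 20, 50, 100],
})

a = df['material'].eq('Matl_A')
b = df['material'].eq('Matl_B')

# Series.between is inclusive on both ends by default.
conditions = [
    a & df['strength'].between(5, 10),
    a & df['strength'].between(11, 20),
    b & df['strength'].between(10, 50),
    b & df['strength'].between(50, 100),  # overlaps the previous range at 50
]
choices = ['A', 'B', 'A', 'B']

# For each row, np.select takes the choice of the first condition that is True;
# rows that match no condition get the default.
df['grade'] = np.select(conditions, choices, default='C')
print(df)
```

Because the first matching condition wins, the Matl_B row with strength 50 is graded 'A', and the Matl_A row with strength 100 falls through to the default 'C'.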

Answer 1 (score: 1)

IN:

import pandas as pd

data = {'material': ['Matl_A', 'Matl_B', 'Matl_B', 'Matl_A'],
        'strength': [10, 20, 80, 100]}

df = pd.DataFrame(data)

def grading(df):
    if df['material'] == 'Matl_A':
        if 5 <= df['strength'] <= 10:
            return 'A'
        elif 11 <= df['strength'] <= 20:
            return 'B'
        else:
            return 'C'
    elif 10 <= df['strength'] <= 50:
        return 'A'
    elif 50 <= df['strength'] <= 100:
        return 'B'
    else:
        return 'C'

df['grade'] = df.apply(grading, axis=1)

df.head()

OUT:

| material | strength | grade |
|----------|----------|-------|
| Matl_A   | 10       | A     |
| Matl_B   | 20       | A     |
| Matl_B   | 80       | B     |
| Matl_A   | 100      | C     |
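
For many materials, the nested if/elif chain can be replaced by a lookup table while keeping the same row-wise apply. This is only a sketch under assumptions: `RULES`, `DEFAULT_GRADE`, and `grade_row` are hypothetical names, and unlike the function above, a material that is not listed gets the default grade instead of falling through to the Matl_B ranges.

```python
import pandas as pd

# Hypothetical rule table: material -> list of (low, high, grade), checked in order.
RULES = {
    'Matl_A': [(5, 10, 'A'), (11, 20, 'B')],
    'Matl_B': [(10, 50, 'A'), (50, 100, 'B')],
}
DEFAULT_GRADE = 'C'


def grade_row(row):
    """Return the grade of the first range that contains row['strength']."""
    for low, high, grade in RULES.get(row['material'], []):
        if low <= row['strength'] <= high:
            return grade
    # No range matched, or the material is not in RULES.
    return DEFAULT_GRADE


df = pd.DataFrame({
    'material': ['Matl_A', 'Matl_B', 'Matl_B', 'Matl_A'],
    'strength': [10, 20, 80, 100],
})
df['grade'] = df.apply(grade_row, axis=1)
print(df)
```

Adding a material then means adding one entry to `RULES` rather than another branch. It is still a Python-level loop per row, so for large frames a vectorized approach is likely to be faster.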

Answer 2 (score: 1)

Put the grade definitions in a DataFrame, one row per material and lower bound.

grades = pd.DataFrame([
    ('Matl_A', 5, 'A'),
    ('Matl_A', 11, 'B'),
    ('Matl_A', 21, 'C'),
    ('Matl_B', 10, 'A'),
    ('Matl_B', 51, 'B'),
    ('Matl_B', 101, 'C'),
], columns=('material', 'strength', 'grade'))
grades = grades.sort_values(['strength'])

Then use pd.merge_asof. Both frames must be sorted by the `on` key, so sort df as well:

pd.merge_asof(df.sort_values('strength'), grades, on='strength', by='material')

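The grade definitions can be loaded from an external source (CSV, a database, etc.).
This handles a large number of materials and grade bands without becoming messy.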
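
The merge_asof approach can be sketched end to end as follows, with the grade definitions read from CSV text. The CSV contents mirror the table above; `GRADES_CSV`, `left`, `merged`, and `result` are names made up for this example, and the input rows are deliberately unsorted to show the sorting that merge_asof needs.

```python
import io

import pandas as pd

# Hypothetical CSV contents; in practice this could be a file or a database query.
GRADES_CSV = """material,strength,grade
Matl_A,5,A
Matl_A,11,B
Matl_A,21,C
Matl_B,10,A
Matl_B,51,B
Matl_B,101,C
"""

# Each row is the lower bound of a grade band for one material.
grades = pd.read_csv(io.StringIO(GRADES_CSV)).sort_values('strength')

df = pd.DataFrame({
    'material': ['Matl_A', 'Matl_B', 'Matl_B', 'Matl_A'],
    'strength': [100, 80, 20, 10],  # deliberately unsorted
})

# merge_asof requires both frames to be sorted by the `on` key.
# reset_index keeps the original row position in an 'index' column.
left = df.reset_index().sort_values('strength')

# Default direction is 'backward': each row gets the last band whose
# lower bound is <= its strength, within the same material.
merged = pd.merge_asof(left, grades, on='strength', by='material')

# Restore the original row order.
result = merged.sort_values('index').set_index('index').rename_axis(None)
print(result)
```

A strength below the lowest bound for its material has no matching band and ends up with NaN in `grade`, so a catch-all row with a low bound may be worth adding to the definitions.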