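Python Pandas: categorize values in a column and create a new column

Time: 2018-02-08 08:54:52

Tags: python pandas dataframe

Quick question. I want to create a column in my df that categorizes the values of another column. See the code below.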

df['maker_grp'] = np.nan
for key in df[df['maker_nm'].str.contains("Sam|Mike")].index:
    df['maker_grp'][key] = 'Class1'
for key in df[df['maker_nm'].str.contains("Andy|John|Paul|Jay")].index:
    df['maker_grp'][key] = 'Class2'
df['maker_grp'] = df.maker_grp.fillna('Class3')

It works perfectly, but I feel there must be a more Pythonic way to do this with less processing. Please help. Thanks.

2 answers:

Answer 0: (score: 1)

Use numpy.select:

Example:

m1 = df['maker_nm'].str.contains("Sam|Mike")
m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")

df['maker_grp'] = np.select([m1,m2], ['Class1','Class2'], default='Class3')
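One detail worth knowing about np.select is how it treats a row that satisfies more than one condition. The following is a minimal sketch with made-up data (the name 'Sam and Paul' is a hypothetical row, not from the question) showing that the first matching condition in the list wins.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last name matches both patterns.
df = pd.DataFrame({'maker_nm': ['Sam 1', 'Joe 5', 'Paul 7', 'Sam and Paul']})

m1 = df['maker_nm'].str.contains("Sam|Mike")
m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")

# np.select checks the conditions in list order and uses the first match,
# so 'Sam and Paul' lands in Class1; rows matching nothing get the default.
df['maker_grp'] = np.select([m1, m2], ['Class1', 'Class2'], default='Class3')
print(df)
```

If Class2 should take priority instead, swapping the order of the two lists is enough.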

With sample data:

df = pd.DataFrame({'maker_nm':['Sam 1','Joe 5','Paul 7','Mike 0']})
#print (df)

m1 = df['maker_nm'].str.contains("Sam|Mike")
m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")

df['maker_grp'] = np.select([m1,m2], ['Class1','Class2'], default='Class3')
print (df)
  maker_nm maker_grp
0    Sam 1    Class1
1    Joe 5    Class3
2   Paul 7    Class2
3   Mike 0    Class1

With many conditions, a custom function with apply should be faster:

import re

def f(x):
    p1 = re.compile("Sam|Mike")
    p2 = re.compile("Andy|John|Paul|Jay")
    if p1.match(x):
        return 'Class1'
    elif p2.match(x):
        return 'Class2'
    else:
        return 'Class3'

df['maker_grp'] = df['maker_nm'].apply(f)
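Note that re.match only tests the beginning of the string, while str.contains looks anywhere in it, so the function above agrees with the mask version only when the name starts with the keyword. Below is a sketch of a variant that uses search and compiles the patterns once. The names P1, P2 and classify, and the sample row 'Mr. Paul 7', are assumptions made for illustration.

```python
import re
import pandas as pd

# Compile once, outside the function, so re.compile is not called per row.
P1 = re.compile("Sam|Mike")
P2 = re.compile("Andy|John|Paul|Jay")

def classify(name):
    # search() looks anywhere in the string, like str.contains;
    # match() would only look at the beginning.
    if P1.search(name):
        return 'Class1'
    if P2.search(name):
        return 'Class2'
    return 'Class3'

# Hypothetical data where one keyword is not at the start of the string.
df = pd.DataFrame({'maker_nm': ['Sam 1', 'Mr. Paul 7', 'Joe 5']})
df['maker_grp'] = df['maker_nm'].apply(classify)
print(df)
```

With match instead of search, 'Mr. Paul 7' would fall through to Class3.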

Timings:

df = pd.DataFrame({'maker_nm':['Sam 1','Joe 5','Paul 7','Mike 0']})
df = pd.concat([df] * 1000, ignore_index=True)
#print (df)

In [117]: %%timeit
     ...: df['maker_grp'] = np.nan
     ...: for key in df[df['maker_nm'].str.contains("Sam|Mike")].index:
     ...:     df['maker_grp'][key] = 'Class1'
     ...: for key in df[df['maker_nm'].str.contains("Andy|John|Paul|Jay")].index:
     ...:     df['maker_grp'][key] = 'Class2'
     ...: df['maker_grp'] = df.maker_grp.fillna('Class3')
     ...:

In [118]: %%timeit
     ...: m1 = df['maker_nm'].str.contains("Sam|Mike")
     ...: m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")
     ...:
     ...: df['maker_grp'] = np.select([m1,m2], ['Class1','Class2'], default='Class3')
     ...:
100 loops, best of 3: 5.98 ms per loop

In [119]: %%timeit
     ...: df['maker_grp'] = df['maker_nm'].apply(f)
     ...:
100 loops, best of 3: 7.38 ms per loop

Caveat:

Performance really depends on the data and the number of conditions.

Edit: when many conditions check for plain substrings, apply is faster. Both versions:
# regex=False does a plain substring test instead of a regex search
m1 = df['maker_nm'].str.contains("Sam", regex=False)
m2 = df['maker_nm'].str.contains("Mike", regex=False)
m3 = df['maker_nm'].str.contains("Andy", regex=False)
m4 = df['maker_nm'].str.contains("John", regex=False)
m5 = df['maker_nm'].str.contains("Paul", regex=False)
m6 = df['maker_nm'].str.contains("Jay", regex=False)

df['maker_grp'] = np.select([m1,m2,m3,m4,m5,m6], ['Class1','Class1', 'Class2','Class2','Class2','Class2'], default='Class3')
print (df)

def f(x):

    if 'Sam' in x:
        return 'Class1'
    elif 'Mike' in x:
        return 'Class1'
    elif 'Andy' in x:
        return 'Class2'
    elif 'John' in x:
        return 'Class2'
    elif 'Paul' in x:
        return 'Class2'
    elif 'Jay' in x:
        return 'Class2'  
    else:
        return 'Class3'

df['maker_grp'] = df['maker_nm'].apply(f)
print (df)
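When the list of keywords grows, writing one mask variable per keyword gets tedious. A possible sketch, assuming the keywords and labels are kept in a dictionary (keyword_to_class is a hypothetical name, not part of the answer), builds the np.select inputs in a loop.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from substring to class label.
keyword_to_class = {
    'Sam': 'Class1', 'Mike': 'Class1',
    'Andy': 'Class2', 'John': 'Class2', 'Paul': 'Class2', 'Jay': 'Class2',
}

df = pd.DataFrame({'maker_nm': ['Sam 1', 'Joe 5', 'Paul 7', 'Mike 0']})

# One plain-substring mask per keyword, in dictionary order.
masks = [df['maker_nm'].str.contains(k, regex=False) for k in keyword_to_class]
labels = list(keyword_to_class.values())

df['maker_grp'] = np.select(masks, labels, default='Class3')
print(df)
```

Adding a new keyword then only means adding one dictionary entry.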

Answer 1: (score: 1)

I think this can be done quite concisely with pandas. It should be faster than iterating over every key with a for loop.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'maker_nm':['Sam 1','Joe 5','Paul 7','Mike 0']})

In [3]: conditions = {'Sam|Mike': 'Class1', 'Andy|John|Paul|Jay': 'Class2'}

In [4]: df.join(pd.concat([df[df.maker_nm.str.contains(c)].assign(maker_grp=conditions[c])
   ...:                    for c in conditions]).maker_grp).fillna('Class3')
   ...:                    
Out[4]: 
  maker_nm maker_grp
0    Sam 1    Class1
1    Joe 5    Class3
2   Paul 7    Class2
3   Mike 0    Class1
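For comparison, the same conditions dictionary can also drive a plain boolean-mask assignment with .loc. This is only a sketch of an alternative, not the approach used in the answer above.

```python
import pandas as pd

df = pd.DataFrame({'maker_nm': ['Sam 1', 'Joe 5', 'Paul 7', 'Mike 0']})
conditions = {'Sam|Mike': 'Class1', 'Andy|John|Paul|Jay': 'Class2'}

# Start with the fallback label, then overwrite matching rows via .loc.
df['maker_grp'] = 'Class3'
for pattern, label in conditions.items():
    mask = df['maker_nm'].str.contains(pattern)
    df.loc[mask, 'maker_grp'] = label
print(df)
```

In this version a row that matches several patterns keeps the label of the last matching pattern, which is the opposite of np.select.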