Slice data from two columns and output new values in Pandas (advice)

Time: 2018-05-15 09:43:08

Tags: python pandas csv dataframe

Let's assume we have the following DataFrame:

import pandas as pd
df = pd.read_csv('subjects.csv')
Col A,              Interest, Col Start, Col Go, Col Learn,
Learn English Lit           
Go Mathematics      
Start Science       
Learn Science       
Go English          
Start Math          
Learn Math          
Go Biology          
Start English       

I have written some code to extract the interest from a similar dataset, as follows:

#Map Interests 
Mapper = ['English', 'Math', 'Maths', 'Mathematics', 'Biology', 'Science'] 
#Join Mapper to Interest Column
pat = '|'.join(r"\b{}\b".format(x) for x in Mapper)
df['interest'] = df['col A'].str.extract('('+ pat + ')', expand=False)


#Align Interest Names by creating a dict and replacing values
d = {'English Lit' : 'English', 'Biology' : 'Science', 'Mathematics' : 'Maths'} 
df['Interests'] = df['interest'].replace(d)

>>> Output:

Col A,              Interest, Col Start, Col Go, Col Learn,
    Learn English Lit   English         
    Go Mathematics      Maths
    Start Science       Science
    Learn Science       Science
    Go English          English
    Start Math          Maths
    Learn Math          Maths
    Go Biology          Science
    Start English       English 
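
The output above shows Math normalized to Maths, which the dict d as posted would not produce because it has no 'Math' key (and 'English Lit' is never extracted, so that key never matches). A minimal sketch of the extract-then-replace step, assuming a 'Math' alias is what was intended; the sample rows and the name aliases are illustrative, not from the original data file:

```python
import pandas as pd

# Hypothetical sample rows standing in for subjects.csv
df = pd.DataFrame({"col A": ["Learn English Lit", "Go Mathematics",
                             "Start Math", "Go Biology"]})

mapper = ["English", "Math", "Maths", "Mathematics", "Biology", "Science"]
# \b word boundaries stop "Math" from matching inside "Mathematics"
pat = "|".join(r"\b{}\b".format(x) for x in mapper)
df["interest"] = df["col A"].str.extract("(" + pat + ")", expand=False)

# Assumed alias table: 'Math' is added so it lines up with 'Maths'
aliases = {"Mathematics": "Maths", "Math": "Maths", "Biology": "Science"}
df["Interests"] = df["interest"].replace(aliases)

print(df)
```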

Now I need to measure how Col A engages with the keywords and the interests.

I have done this as follows, but I'm sure there is a better way to do it.

df['Col Start'][df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Science")] = 'Learn'

Also, what is the best way to append multiple values to one column? For example, if I have:

Col A                         
Learn Science, Math, Biology.

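I'd like to map keyword + interest to a new column, with the values separated by commas. This is where my current script breaks down: it overwrites the previous value with the new one, whereas I'm trying to capture all levels of engagement (if that makes sense..)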

Col A                         Col B
Learn Science, Math, Biology. Learn S, Learn M, Learn B

Any help would be appreciated. (Please be gentle, I started coding in February!)

Edit for clarity:

df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Science"), 'Col Start'] = 'Learn S'
df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("English"), 'Col Start'] = 'Learn E'
df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Math"), 'Col Start'] = 'Learn M'


Col A                Col Learn
Learn Science, Math  Learn S, Learn M
Learn Math, English  Learn M, Learn E
Learn Science        Learn S.

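In my DF, Col A and interest may overlap and produce recurring outputs. What I want is to capture them rather than overwrite them, appending any new input with a comma.

1 Answer:

Answer 0: (score: 1)

I think you need findall to extract every value that matches the list, and then join to prepend the string Learn to each one: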

# better to use loc to set values in the new column
df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Science"), 'Col Start'] = 'Learn'

df['new'] = df['col A'].str.findall('('+ pat + ')').apply(lambda x: ', '.join(['Learn ' + y for y in x]))
print (df)

                           col A     interest Interests Col Start  \
0              Learn English Lit      English   English       NaN   
1                 Go Mathematics  Mathematics     Maths       NaN   
2                  Start Science      Science   Science       NaN   
3                  Learn Science      Science   Science     Learn   
4                     Go English      English   English       NaN   
5                     Start Math         Math      Math       NaN   
6                     Learn Math         Math      Math       NaN   
7                     Go Biology      Biology   Science       NaN   
8  Learn Science, Math, Biology.      Science   Science     Learn   

                                        new  
0                             Learn English  
1                         Learn Mathematics  
2                             Learn Science  
3                             Learn Science  
4                             Learn English  
5                                Learn Math  
6                                Learn Math  
7                             Learn Biology  
8  Learn Science, Learn Math, Learn Biology  
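
For anyone who wants to run this without the CSV, here is a minimal self-contained sketch of the same two steps. The three sample rows are made up for illustration. It uses loc with a boolean mask rather than df['Col Start'][mask] = ..., because chained assignment can end up writing to a temporary copy instead of the original frame.

```python
import pandas as pd

# Hypothetical sample rows standing in for subjects.csv
df = pd.DataFrame({"col A": ["Learn English Lit",
                             "Go Mathematics",
                             "Learn Science, Math, Biology."]})

mapper = ["English", "Math", "Maths", "Mathematics", "Biology", "Science"]
pat = "|".join(r"\b{}\b".format(x) for x in mapper)

# First match per row
df["interest"] = df["col A"].str.extract("(" + pat + ")", expand=False)

# Single-step assignment through loc; rows outside the mask stay NaN
mask = (df["col A"].str.contains("Learn", na=False)
        & df["interest"].str.contains("Science", na=False))
df.loc[mask, "Col Start"] = "Learn"

# All matches per row, each prefixed with "Learn" and joined by commas
df["new"] = (df["col A"].str.findall(pat)
             .apply(lambda xs: ", ".join("Learn " + x for x in xs)))

print(df)
```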

Edit:

print (df)
                 col A         Col Learn
0  Learn Science, Math  Learn S, Learn M
1  Learn Math, English  Learn M, Learn E
2        Learn Science           Learn S
3              Science               val

import numpy as np

# dictionary mapping each subject to its short code
d = {'Science':'S', 'English':'E', 'Math':'M'}
# check whether the row contains Learn
mask = df['col A'].str.contains("Learn", na=False)
# extract every dict key found in the text, look up its code in the dict and join with Learn
s = (df['col A'].str.findall('('+ '|'.join(d.keys()) + ')')
                .apply(lambda x: ', '.join(['Learn ' + d[y] for y in x])))

# keep the original text where Learn is absent
df['new'] = np.where(mask, s, df['col A'])
print (df)
                 col A         Col Learn               new
0  Learn Science, Math  Learn S, Learn M  Learn S, Learn M
1  Learn Math, English  Learn M, Learn E  Learn M, Learn E
2        Learn Science           Learn S           Learn S
3              Science               val           Science
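
The code above hardcodes Learn. If the same treatment is wanted for Go and Start as well, one possible extension is to pull the keyword out of each row first and use it as the prefix. This is only a sketch under that assumption: the names keywords, codes and engagement, and the sample rows, are illustrative and do not come from the question.

```python
import pandas as pd

# Hypothetical sample rows
df = pd.DataFrame({"col A": ["Learn Science, Math",
                             "Go Math, English",
                             "Start Science",
                             "Science"]})

codes = {"Science": "S", "English": "E", "Math": "M"}   # subject -> short code
keywords = ["Learn", "Go", "Start"]                     # assumed keyword list

kw_pat = r"\b(" + "|".join(keywords) + r")\b"
subj_pat = r"\b(?:" + "|".join(codes) + r")\b"

keyword = df["col A"].str.extract(kw_pat, expand=False)   # NaN when no keyword
subjects = df["col A"].str.findall(subj_pat)              # list of subjects per row

# Prefix every subject code with the row's own keyword;
# rows without a keyword keep their original text.
df["engagement"] = [
    ", ".join(f"{kw} {codes[s]}" for s in subs) if pd.notna(kw) else text
    for text, kw, subs in zip(df["col A"], keyword, subjects)
]

print(df)
```

Because every match in a row is collected before the column is written, nothing gets overwritten by a later assignment.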