Let's assume we have the following DataFrame:
import pandas as pd
df = pd.read_csv('subjects.csv')
Col A, Interest, Col Start, Col Go, Col Learn,
Learn English Lit
Go Mathematics
Start Science
Learn Science
Go English
Start Math
Learn Math
Go Biology
Start English
I have written some code to extract the interest from a similar dataset, as shown below:
#Map Interests
Mapper = ['English', 'Math', 'Maths', 'Mathematics', 'Biology', 'Science']
#Join Mapper to Interest Column
pat = '|'.join(r"\b{}\b".format(x) for x in Mapper)
df['interest'] = df['col A'].str.extract('('+ pat + ')', expand=False)
#Align Interest Names by creating a dict and replacing values
d = {'English Lit' : 'English', 'Biology' : 'Science', 'Mathematics' : 'Maths'}
df['Interests'] = df['interest'].replace(d)
>>> Output:
Col A, Interest, Col Start, Col Go, Col Learn,
Learn English Lit English
Go Mathematics Maths
Start Science Science
Learn Science Science
Go English English
Start Math Maths
Learn Math Maths
Go Biology Science
Start English English
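For anyone who wants to reproduce this without subjects.csv, here is a minimal sketch that builds the column inline. The inline data is an assumption copied from the table above, and the dict adds a 'Math' key (not in my original code) so that 'Math' is also normalised to 'Maths', as in the output shown.

```python
import pandas as pd

# Inline stand-in for subjects.csv (assumed contents, taken from the table above)
df = pd.DataFrame({'col A': ['Learn English Lit', 'Go Mathematics', 'Start Science',
                             'Learn Science', 'Go English', 'Start Math',
                             'Learn Math', 'Go Biology', 'Start English']})

mapper = ['English', 'Math', 'Maths', 'Mathematics', 'Biology', 'Science']
pat = '|'.join(r"\b{}\b".format(x) for x in mapper)

# str.extract keeps only the first matching subject in each row
df['interest'] = df['col A'].str.extract('(' + pat + ')', expand=False)

# Assumption: 'Math' is added here so it is normalised to 'Maths' as well
d = {'Biology': 'Science', 'Mathematics': 'Maths', 'Math': 'Maths'}
df['Interests'] = df['interest'].replace(d)
print(df)
```

Because of the word boundaries in the pattern, 'English Lit' is already extracted as 'English', so it needs no entry in the dict.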
Now I need to measure how Col A engages with the keywords and the interests.
I have done this as follows, but I am sure there is a better way to do it.
df['Col Start'][df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Science")] = 'Learn'
Also, what is the best way to append multiple values to a single column? For example, if I have:
Col A
Learn Science, Math, Biology.
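I want to map keyword + interest into a new column, with the values separated by commas. This is where my current script breaks down: it overwrites the previous value with the new one, whereas I am trying to capture every level of engagement (if that makes sense).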
Col A Col B
Learn Science, Math, Biology. Learn S, Learn M, Learn B
Any help would be appreciated. (Please be gentle, I started coding in February!)
Edit for clarity:
df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Science"), 'Col Start'] = 'Learn S'
df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("English"), 'Col Start'] = 'Learn E'
df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Math"), 'Col Start'] = 'Learn M'
Col A Col Learn
Learn Science, Math Learn S, Learn M
Learn Math, English Learn M, Learn E
Learn Science Learn S.
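In my DataFrame, Col A and the interest can overlap and produce repeated matches. What I want is to capture them all rather than overwrite them, appending each new match after a comma.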
Answer 0 (score: 1)
I think you need findall to extract all values that match the list, then join them with the string Learn prepended to each:
# better to use loc when assigning to a column
df.loc[df['col A'].str.contains("Learn", na=False) & df['interest'].str.contains("Science"), 'Col Start'] = 'Learn'
df['new'] = df['col A'].str.findall('('+ pat + ')').apply(lambda x: ', '.join(['Learn ' + y for y in x]))
print (df)
col A interest Interests Col Start \
0 Learn English Lit English English NaN
1 Go Mathematics Mathematics Maths NaN
2 Start Science Science Science NaN
3 Learn Science Science Science Learn
4 Go English English English NaN
5 Start Math Math Math NaN
6 Learn Math Math Math NaN
7 Go Biology Biology Science NaN
8 Learn Science, Math, Biology. Science Science Learn
new
0 Learn English
1 Learn Mathematics
2 Learn Science
3 Learn Science
4 Learn English
5 Learn Math
6 Learn Math
7 Learn Biology
8 Learn Science, Learn Math, Learn Biology
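To make the difference concrete, here is a small sketch on a single row. The series `s` and the shortened pattern are made up for illustration; the point is only that `str.extract` stops at the first match while `str.findall` returns every match as a list.

```python
import pandas as pd

# One illustrative row containing several subjects
s = pd.Series(['Learn Science, Math, Biology.'])
pat = r'\b(?:Science|Math|Biology)\b'

# extract: only the first match per row
first_only = s.str.extract('(' + pat + ')', expand=False)

# findall: a list with every match per row
all_matches = s.str.findall(pat)

# prefix each match with the keyword and join into one comma-separated string
joined = all_matches.apply(lambda xs: ', '.join('Learn ' + x for x in xs))

print(first_only[0])
print(all_matches[0])
print(joined[0])
```

This is why the original loc assignments kept overwriting each other: each one writes a single value, whereas joining the findall list builds the whole string in one step.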
Edit:
print (df)
col A Col Learn
0 Learn Science, Math Learn S, Learn M
1 Learn Math, English Learn M, Learn E
2 Learn Science Learn S
3 Science val
import numpy as np

# dictionary mapping each subject to its short code
d = {'Science':'S', 'English':'E', 'Math':'M'}
# rows that contain the keyword Learn
mask = df['col A'].str.contains("Learn", na=False)
# find every dict key in the text, look up its code, prefix with Learn and join
s = (df['col A'].str.findall('('+ '|'.join(d.keys()) + ')')
                .apply(lambda x: ', '.join(['Learn ' + d[y] for y in x])))
# keep the original text in rows without Learn
df['new'] = np.where(mask, s, df['col A'])
print (df)
col A Col Learn new
0 Learn Science, Math Learn S, Learn M Learn S, Learn M
1 Learn Math, English Learn M, Learn E Learn M, Learn E
2 Learn Science Learn S Learn S
3 Science val Science
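If the other keywords (Go, Start) should be handled the same way, one possible extension is to read the keyword from the start of each row instead of hard-coding Learn. This is only a sketch under assumptions: the `keywords` list, the sample rows and the helper `engagement` are hypothetical, and it assumes the keyword is always the first word of the text.

```python
import re
import pandas as pd

# Illustrative rows (assumed data, not from the original file)
df = pd.DataFrame({'col A': ['Learn Science, Math', 'Go Math, English',
                             'Start Science', 'Science']})

keywords = ['Learn', 'Go', 'Start']                      # assumed keyword list
codes = {'Science': 'S', 'English': 'E', 'Math': 'M'}    # subject -> short code

kw_pat = r'^\s*(' + '|'.join(keywords) + r')\b'
subj_pat = r'\b(?:' + '|'.join(codes) + r')\b'

def engagement(text):
    """Return 'Keyword Code' pairs joined by commas, or the text unchanged."""
    if not isinstance(text, str):
        return text                      # leave missing values untouched
    m = re.match(kw_pat, text)
    if m is None:
        return text                      # no leading keyword: keep original
    kw = m.group(1)
    return ', '.join(f'{kw} {codes[subj]}' for subj in re.findall(subj_pat, text))

df['new'] = df['col A'].apply(engagement)
print(df)
```

A plain Python function with apply is slower than the vectorised string methods on large frames, but it keeps the keyword and subject logic in one place.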