Extract 2-gram strings from a column that are present in a list

Time: 2019-03-20 20:07:16

Tags: python pandas data-cleaning

I have a dataframe called df

Gender  Country      Comments
male    USA        machine learning and fraud detection are a must learn
male    Canada     monte carlo method is great and so is hmm,pca, svm and neural net
female  USA        clustering and cloud computing
female  Germany    logistical regression and data management and fraud detection
female  Nigeria    nltk and supervised machine learning
male    Ghana      financial engineering and cross validation and time series

and a list called algorithms

algorithms = ['machine learning','fraud detection', 'monte carlo method', 'time series', 'cross validation', 'supervised machine learning', 'logistical regression', 'nltk','clustering', 'data management','cloud computing','financial engineering']

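So technically, for each row of the Comments column, I am trying to extract the terms that appear in the algorithms list. This is what I am trying to achieve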

Gender  Country      algorithms
male    USA        machine learning, fraud detection 
male    Canada     monte carlo method, hmm,pca, svm, neural net
female  USA        clustering, cloud computing
female  Germany    logistical regression, data management, fraud detection
female  Nigeria    nltk, supervised machine learning
male    Ghana      financial engineering, cross validation, time series

However, this is what I am getting

Gender  Country      algorithms
male    USA         
male    Canada     hmm pca svm  
female  USA        clustering
female  Germany    
female  Nigeria    nltk
male    Ghana      

Terms like machine learning and fraud detection do not appear. Basically, none of the 2-gram terms do.

This is the code I used

df['algorithms'] = df['Comments'].apply(lambda x: " ".join(x for x in x.split() if x in algorithms)) 

4 Answers:

Answer 0: (score: 2)

You can use pandas.Series.str.findall in combination with join.

import pandas as pd
import re

# In this answer's test frame the text column is `algo` and the list of terms is `ml`.
df['algo_new'] = df.algo.str.findall(f"({ '|'.join(ml) })")

>> out

    col1    gender  algo                                                algo_new
0   usa     male    machine learning and fraud detection are a mus...   [machine learning, fraud detection, clustering]
1   fr      female  monte carlo method is great and so is hmm,pca,...   [monte carlo method]
2   arg     male    logistical regression and data management and ...   [logistical regression, data management, fraud..

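We use join to combine the strings in your ml list, adding a | between each string so that the pattern captures value 1 OR value 2, and so on. Then we use findall to find all occurrences.

Note that this uses an f-string, so you need Python 3.6+. Let me know if you are on a lower version of Python.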
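The same idea can be sketched against the column and list names from the question. This is a minimal sketch, not the answer's original code: it assumes the terms should match on word boundaries, escapes each term with re.escape, and builds the pattern with plain concatenation rather than an f-string. The two-row frame and the shortened list are made up for illustration.

```python
import re
import pandas as pd

# Illustrative data only: two of the question's comments and a shortened term list.
df = pd.DataFrame({
    "Comments": [
        "machine learning and fraud detection are a must learn",
        "nltk and supervised machine learning",
    ]
})
algorithms = ["machine learning", "fraud detection",
              "supervised machine learning", "nltk"]

# Longest terms first: if one term were a prefix of another
# (say a hypothetical "cloud" next to "cloud computing"), the longer one is tried first.
terms = sorted(algorithms, key=len, reverse=True)

# re.escape guards against regex metacharacters inside a term;
# \b restricts matches to word boundaries (an assumption about the desired behaviour).
pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"

# findall returns a list per row; str.join turns each list into one string.
df["algorithms"] = df["Comments"].str.findall(pattern).str.join(", ")
```

Because the regex scans each comment from left to right, the matches come back in the order they occur in the text, and non-overlapping matching means "supervised machine learning" is not reported a second time as "machine learning".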


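For anyone interested in benchmarks (since we have 3 answers), running each solution on 9.6 million rows 10 times in a row gives the following results:

  • AlexK:
    • mean: 14.94 s
    • min: 12.43 s
    • max: 17.08 s
  • Teddy Bear:
    • mean: 22.67 s
    • min: 18.25 s
    • max: 27.64 s
  • Absolute Space:
    • mean: 24.12 s
    • min: 21.25 s
    • max: 27.53 s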
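A timing run of this shape could be set up roughly as follows. This is only a sketch of the procedure, not the harness that produced the numbers above: the frame is tiny, the function name `substring_solution` is hypothetical, and only one of the approaches is timed.

```python
import statistics
import timeit

import pandas as pd

algorithms = ["machine learning", "fraud detection", "nltk"]
comments = [
    "machine learning and fraud detection are a must learn",
    "nltk and supervised machine learning",
]
# Small frame for illustration; raise the multiplier to approach the row counts quoted above.
df = pd.DataFrame({"Comments": comments * 1000})

def substring_solution():
    # One candidate under test: plain substring lookup for every term.
    return df["Comments"].apply(lambda c: [a for a in algorithms if a in c])

# Ten single runs, mirroring the "10 times in a row" procedure.
times = timeit.repeat(substring_solution, number=1, repeat=10)
summary = {
    "mean": statistics.mean(times),
    "min": min(times),
    "max": max(times),
}
```

Timings depend heavily on the machine, the pandas version and the data, so absolute figures from such a run are not comparable to the ones listed above.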

Answer 1: (score: 0)

This might work for you:

def f(stringy):
    contained = filter((lambda x: x in stringy), algorithms)
    return ",".join(contained)

df['algorithms'] = df['Comments'].apply(f)

This runs the function over every input string in the column.
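One thing to be aware of with a plain substring test: a short term also matches inside a longer one, so a comment containing "supervised machine learning" reports "machine learning" as well, and the results follow the order of the list rather than the order in the comment. If the nested matches are unwanted, one possible way to drop them is sketched below. The helper name `f_longest` and the shortened list are made up for illustration.

```python
# Shortened term list, for illustration only.
algorithms = ["machine learning", "supervised machine learning", "nltk"]

def f_longest(stringy):
    # Same substring test as in f above.
    contained = [a for a in algorithms if a in stringy]
    # Keep a term only if it is not part of a longer term that also matched.
    kept = [a for a in contained
            if not any(a != b and a in b for b in contained)]
    return ",".join(kept)

result = f_longest("nltk and supervised machine learning")
# result == "supervised machine learning,nltk"
```

The trade-off is that a genuinely separate mention of the shorter term in the same comment is dropped too, since the filter looks only at the matched terms and not at their positions.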

Answer 2: (score: 0)

Another possible solution:

#function to convert Comments field into a list of terms found in Algorithms list
#it searches Comments field for occurrences of algorithm substrings
def make_algo_list(comment):
    algo_list = []
    for algo in algorithms:
        if algo in comment:
            algo_list.append(algo)
    return algo_list

#apply function to create new column
df['algorithms'] = df['Comments'].apply(make_algo_list)

Answer 3: (score: 0)

Flashtext can also be used for this kind of keyword extraction, whether the terms are bigrams or any other n-grams...

import pandas as pd
from flashtext import KeywordProcessor
df=pd.DataFrame(data = [['male', 'USA', 'machine learning and fraud detection are a must learn'],
                  ['male', 'Canada','monte carlo method is great and so is hmm,pca, svm and neural net'],
                  ['female','USA','clustering and cloud'],
                  ['female','Germany', 'logistical regression and data management and fraud detection']] ,columns = ['Gender', 'Country','Comments'])


algorithms = ['machine learning','fraud detection', 'monte carlo method', 'time series', 'cross validation', 'supervised machine learning', 'logistical regression', 'nltk','clustering', 'data management','cloud computing','financial engineering']


# Build the keyword processor and register every term from the list
kp = KeywordProcessor()
kp.add_keywords_from_list(algorithms)


df['algorithm'] = df['Comments'].apply(lambda x : kp.extract_keywords(x))

#o/p
df['algorithm']
Out[20]: 
0                  [machine learning, fraud detection]
1                                 [monte carlo method]
2                                         [clustering]
3    [logistical regression, data management, fraud...
Name: algorithm, dtype: object