如何使用pandas中的输入语料库/列表从列中提取所有字符串匹配项?

时间:2016-11-21 08:45:42

标签: python regex pandas nltk text-mining

例如,我将以下字符串列表作为输入语料库(实际上是一个包含100个值的大列表)。 行动= ['跳转''飞''运行''游']

数据包含一个名为action_description的列。如何使用动作列表作为输入语料库提取action_description中的所有字符串匹配?

注意:我已经完成了lemmitization description_action,所以如果列中有跳跃或跳跃等单词已经转换为跳转。

示例输入&输出

"I love to run and while my friend prefer to swim" --> "run swim"
"Allan excels at high jump but he is not a good at running" --> "jump run"

注意:我发现了下面的pandas功能,但它没有很好地记录,因此无法弄清楚如何使用它。

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extractall.html

请推荐一个最佳解决方案,因为输入数据帧有200K行。

EDIT 像跳线和放大器这样的词跑道应该被算法忽略,即不应该归类为跳跃&运行

3 个答案:

答案 0 :(得分:2)

<强> 步骤:

  1. 我们仅通过提供pos='v'对动词进行词形还原,并通过迭代str.split操作获得的列表中的每个单词,让名词保持原样。
  2. 然后,使用set获取查找列表和词形列表中出现的所有单词匹配项。
  3. 最后,加入它们以返回字符串作为输出。
  4. from nltk.stem.wordnet import WordNetLemmatizer
    
    action = ['jump','fly','run','swim']     # lookup list
    lem = WordNetLemmatizer() 
    fcn = lambda x: " ".join(set([lem.lemmatize(w, 'v') for w in x]).intersection(set(action)))
    df['action_description'] = df['action_description'].str.split().apply(fcn)
    df
    

    enter image description here

    开始使用DF

    df = pd.DataFrame(dict(action_description=["I love to run and while my friend prefer to swim", 
                                               "Allan excels at high jump but he is not a good at running"]))
    

    要生成二进制标志(0/1),我们可以通过在空格上拆分字符串并计算它的指示符变量来使用str.get_dummies方法,如下所示:

    bin_flag = df['action_description'].str.get_dummies(sep=' ').add_suffix('_flag')
    pd.concat([df['action_description'], bin_flag], axis=1)
    

    enter image description here

答案 1 :(得分:1)

<?for-each:root[category-id='1001]?>

答案 2 :(得分:1)

使用re.findall匹配字符串和operator.add来匹配匹配

这确实是一个正则表达式问题
import pandas as pd
import re
import operator as op


action=['jump','fly','run','swim']

str1="I    love to run and while my friend prefer to swim" ##--> "run swim"
str2="Allan excels at high jump but he is not a good at running" ##--> "jump run



df=pd.DataFrame({'A':[1,2,3,4],
                  'B':['I    love to run and while my friend prefer to swim',
                  'Allan excels at high jump but he is not a good at running',
                  'Ostrich can run very fast but cannot fly',
                  'The runway was wet hence the Jumper flew over it'] })


df['ApproxMatch']=df['B'].apply(lambda x: [reduce(op.add, re.findall(act,x)) for act in action if re.findall(act,x) <> []] )

#using r'\b'+jump+r'\b' to match jump exactly, where \b stands for word boundaries

df['ExactMatch']=df['B'].apply(lambda x: [reduce(op.add, re.findall(r"\b"+act+r"\b",x)) for act in action if re.findall(r"\b"+act+r"\b",x) <> []] )

<强>输出:

df

#   A                                                  B  ApproxMatch  \
#0  1  I    love to run and while my friend prefer to...  [run, swim]   
#1  2  Allan excels at high jump but he is not a good...  [jump, run]   
#2  3           Ostrich can run very fast but cannot fly   [fly, run]   
#3  4   The runway was wet hence the Jumper flew over it        [run]   
#
#    ExactMatch  
#0  [run, swim]  
#1       [jump]  
#2   [fly, run]  
#3           [] 

请注意,对于第2行的完全匹配,&#34;运行&#34;与&#34;运行&#34;

不匹配