Compound-word pattern detection on large datasets using pandas

Time: 2014-03-14 07:04:18

Tags: python pandas pattern-matching iteration

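Let's say I have two lists of words, where a word from one list follows a word from the other. They are joined by either a space or a dash. For simplicity, both lists contain the same words: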

First=['Derp','Foo','Bar','Python','Monte','Snake']
Second=['Derp','Foo','Bar','Python','Monte','Snake'] 

The following word combinations exist (marked Yes):

            Derp    Foo  Bar    Python  Monte   Snake
Derp        No      No   Yes    Yes     Yes     Yes
Foo         Yes     No   No     Yes     Yes     Yes
Bar         Yes     Yes  No     Yes     Yes     Yes
Python      No      Yes  Yes    No      Yes     Yes
Monte       No      Yes  Yes    No      No      No
Snake       Yes     No   Yes    Yes     Yes     No

I have a dataset like this, in which I want to detect particular words:

df=pd.DataFrame({'Name': [ 'Al Gore', 'Foo-Bar', 'Monte-Python', 'Python Snake', 'Python Anaconda', 'Python-Pandas', 'Derp Bar', 'Derp Python', 'JavaScript', 'Python Monte'],
                 'Class': ['Politician','L','H','L','L','H', 'H','L','L','Circus']})

If I use regex and flag every row that matches one of the patterns, it looks like this:

import pandas as pd


df=pd.DataFrame({'Name': [ 'Al Gore', 'Foo-Bar', 'Monte-Python', 'Python Snake', 'Python Anaconda', 'Python-Pandas', 'Derp Bar', 'Derp Python', 'JavaScript', 'Python Monte'],
                 'Class': ['Politician','L','H','L','L','H', 'H','L','L','Circus']})
df['status']=''

patterns=[r'^Derp(-|\s)(Foo|Bar|Snake)$', r'^Foo(-|\s)(Bar|Python|Monte)$', r'^Python(-|\s)(Derp|Foo|Bar|Snake)', r'^Monte(-|\s)(Derp|Foo|Bar|Python|Snake)$']


for pattern in patterns:
    df.loc[df.Name.str.contains(pattern),'status'] = 'Found'

print(df)

This prints:

>>> 

        Class             Name status
0  Politician          Al Gore       
1           L          Foo-Bar  Found
2           H     Monte-Python  Found
3           L     Python Snake  Found
4           L  Python Anaconda       
5           H    Python-Pandas       
6           H         Derp Bar  Found
7           L      Derp Python       
8           L       JavaScript       
9      Circus     Python Monte       

[10 rows x 3 columns]

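For larger datasets, writing out all the regex patterns by hand doesn't seem feasible. So is there a way to write a loop or something similar that builds the patterns from the combination matrix, keeping the combinations that exist (marked Yes in the table above) and skipping those that don't (marked No)? I know the itertools library has a function called combinations that can be looped over to generate all possible patterns.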

1 Answer:

Answer 0 (score: 1)

I don't think it's too hard to generate these regexes from the combination matrix you've got:

import pandas as pd

# Reading in your combination matrix (copy the table to the clipboard first):
pattern_mat = pd.read_clipboard()
# Map from first words to following words:
w2_dict = {}
for w1, row in pattern_mat.iterrows():
    w2_dict[w1] = list(row.loc[row == 'Yes'].index)
# Print all the resulting regexes:
# (raw string, so the backslash in \s does not need to be escaped)
for w1, w2_list in w2_dict.items():
    pattern = r"^{w1}(-|\s)({w2s})$".format(w1=w1, w2s='|'.join(w2_list))
    print(pattern)

Output:

^Monte(-|\s)(Foo|Bar)$
^Snake(-|\s)(Derp|Bar|Python|Monte)$
^Bar(-|\s)(Derp|Foo|Python|Monte|Snake)$
^Foo(-|\s)(Derp|Python|Monte|Snake)$
^Python(-|\s)(Foo|Bar|Monte|Snake)$
^Derp(-|\s)(Bar|Python|Monte|Snake)$
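To go from the generated patterns to the flagged DataFrame, something like the following might work. It is only a sketch under a few assumptions: the matrix is typed in as a dict (the name `matrix_rows` is made up here) instead of being read from the clipboard, rows are taken as the first word and columns as the second, and the per-word patterns are merged into one regex so the column is scanned once.

```python
import re

import pandas as pd

# Combination matrix from the question, one row per first word.
# (matrix_rows / second_words are illustrative names, not from the question.)
second_words = ["Derp", "Foo", "Bar", "Python", "Monte", "Snake"]
matrix_rows = {
    "Derp":   "No  No  Yes Yes Yes Yes",
    "Foo":    "Yes No  No  Yes Yes Yes",
    "Bar":    "Yes Yes No  Yes Yes Yes",
    "Python": "No  Yes Yes No  Yes Yes",
    "Monte":  "No  Yes Yes No  No  No",
    "Snake":  "Yes No  Yes Yes Yes No",
}
pattern_mat = pd.DataFrame.from_dict(
    {w1: flags.split() for w1, flags in matrix_rows.items()},
    orient="index",
    columns=second_words,
)

# Build one alternative per first word, e.g. Derp[-\s](?:Bar|Python|Monte|Snake).
# Non-capturing groups avoid the pandas warning about match groups.
alternatives = []
for w1, row in pattern_mat.iterrows():
    w2_list = list(row[row == "Yes"].index)
    if w2_list:
        alternatives.append(
            r"{}[-\s](?:{})".format(
                re.escape(w1), "|".join(re.escape(w2) for w2 in w2_list)
            )
        )

# Single anchored regex covering every allowed combination.
combined = r"^(?:" + "|".join(alternatives) + r")$"

df = pd.DataFrame({
    "Name": ["Al Gore", "Foo-Bar", "Monte-Python", "Python Snake",
             "Python Anaconda", "Python-Pandas", "Derp Bar", "Derp Python",
             "JavaScript", "Python Monte"],
    "Class": ["Politician", "L", "H", "L", "L", "H", "H", "L", "L", "Circus"],
})

# Flag rows whose Name matches any allowed combination.
found = df["Name"].str.contains(combined, regex=True)
df["status"] = found.map({True: "Found", False: ""})
print(df)
```

With the matrix exactly as posted, this flags Python Snake, Derp Bar, Derp Python and Python Monte. That differs from the output in the question because the hand-written patterns there do not follow the matrix exactly (for example, Foo followed by Bar is marked No).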