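Let's say I have two lists of words, where a word from the first list is followed by a word from the second. The two words are joined by a space or a hyphen. For simplicity, both lists contain the same words: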
First=['Derp','Foo','Bar','Python','Monte','Snake']
Second=['Derp','Foo','Bar','Python','Monte','Snake']
The following word combinations exist (marked Yes; rows are the first word, columns the second):
        Derp  Foo  Bar  Python  Monte  Snake
Derp    No    No   Yes  Yes     Yes    Yes
Foo     Yes   No   No   Yes     Yes    Yes
Bar     Yes   Yes  No   Yes     Yes    Yes
Python  No    Yes  Yes  No      Yes    Yes
Monte   No    Yes  Yes  No      No     No
Snake   Yes   No   Yes  Yes     Yes    No
I have a dataset like this, in which I want to detect specific word pairs:
df=pd.DataFrame({'Name': [ 'Al Gore', 'Foo-Bar', 'Monte-Python', 'Python Snake', 'Python Anaconda', 'Python-Pandas', 'Derp Bar', 'Derp Python', 'JavaScript', 'Python Monte'],
'Class': ['Politician','L','H','L','L','H', 'H','L','L','Circus']})
If I use regex and flag every row that matches one of the patterns, it looks like this:
import pandas as pd
df=pd.DataFrame({'Name': [ 'Al Gore', 'Foo-Bar', 'Monte-Python', 'Python Snake', 'Python Anaconda', 'Python-Pandas', 'Derp Bar', 'Derp Python', 'JavaScript', 'Python Monte'],
'Class': ['Politician','L','H','L','L','H', 'H','L','L','Circus']})
df['status']=''
# Raw strings keep the backslash in \s from being treated as a string escape
patterns=[r'^Derp(-|\s)(Foo|Bar|Snake)$', r'^Foo(-|\s)(Bar|Python|Monte)$', r'^Python(-|\s)(Derp|Foo|Bar|Snake)', r'^Monte(-|\s)(Derp|Foo|Bar|Python|Snake)$']
# Mark every row whose Name matches any of the patterns
for pattern in patterns:
    df.loc[df.Name.str.contains(pattern),'status'] = 'Found'
print(df)
This is the output:
>>>
        Class             Name  status
0  Politician          Al Gore
1           L          Foo-Bar   Found
2           H     Monte-Python   Found
3           L     Python Snake   Found
4           L  Python Anaconda
5           H    Python-Pandas
6           H         Derp Bar   Found
7           L      Derp Python
8           L       JavaScript
9      Circus     Python Monte

[10 rows x 3 columns]
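For a larger dataset, writing out every regex pattern by hand does not seem feasible. Is there a way to write a loop or something similar that reads the combination matrix and builds the patterns for the combinations that exist (marked Yes in the table above) while skipping those that do not (marked No)? I know the itertools library has a function called combinations that can be looped over to generate all possible patterns.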
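A note on the itertools idea: combinations yields each unordered pair only once, so for ordered first/second pairs permutations (or product) is probably the closer fit. Below is a minimal sketch, assuming the excluded set is transcribed by hand from the off-diagonal No cells of the table above; the names all_pairs, excluded and valid_pairs are only illustrative.

```python
from itertools import permutations

words = ['Derp', 'Foo', 'Bar', 'Python', 'Monte', 'Snake']

# Every ordered (first, second) pair of distinct words
all_pairs = list(permutations(words, 2))

# Assumption: pairs marked "No" in the table, excluding the diagonal
excluded = {
    ('Derp', 'Foo'), ('Foo', 'Bar'), ('Python', 'Derp'),
    ('Monte', 'Derp'), ('Monte', 'Python'), ('Monte', 'Snake'),
    ('Snake', 'Foo'),
}

# Keep only the pairs that the table marks "Yes"
valid_pairs = [pair for pair in all_pairs if pair not in excluded]
print(len(all_pairs), len(valid_pairs))
```

This still requires listing the exclusions by hand, so reading them straight from the matrix is likely the less error-prone route.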
Answer 0 (score: 1)
I don't think it is too hard to generate these regexes from the combination matrix you have:
import pandas as pd
# Read in your combination matrix (copy the table to the clipboard first):
pattern_mat = pd.read_clipboard()
# Map each first word to the words that may follow it:
w2_dict = {}
for w1, row in pattern_mat.iterrows():
    w2_dict[w1] = list(row.loc[row == 'Yes'].index)
# Print all the resulting regexes:
# (a raw string keeps the backslash in \s literal, so it needs no extra escaping)
for w1, w2_list in w2_dict.items():
    pattern = r"^{w1}(-|\s)({w2s})$".format(w1=w1, w2s='|'.join(w2_list))
    print(pattern)
Output:
^Monte(-|\s)(Foo|Bar)$
^Snake(-|\s)(Derp|Bar|Python|Monte)$
^Bar(-|\s)(Derp|Foo|Python|Monte|Snake)$
^Foo(-|\s)(Derp|Python|Monte|Snake)$
^Python(-|\s)(Foo|Bar|Monte|Snake)$
^Derp(-|\s)(Bar|Python|Monte|Snake)$
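To tie this back to the DataFrame in the question, here is a minimal end-to-end sketch. It makes a few assumptions that are not in the code above: the matrix is parsed from a string with pd.read_csv instead of the clipboard (so the example is reproducible), the words are passed through re.escape in case they contain regex metacharacters, and all per-word patterns are merged into one alternation so str.contains runs only once. The names matrix_text, alternatives and combined are illustrative.

```python
import io
import re
import pandas as pd

# The combination matrix from the question (rows = first word, columns = second word)
matrix_text = """\
Derp Foo Bar Python Monte Snake
Derp No No Yes Yes Yes Yes
Foo Yes No No Yes Yes Yes
Bar Yes Yes No Yes Yes Yes
Python No Yes Yes No Yes Yes
Monte No Yes Yes No No No
Snake Yes No Yes Yes Yes No
"""
# The header has one field fewer than the data rows, so the first column becomes the index
pattern_mat = pd.read_csv(io.StringIO(matrix_text), sep=r"\s+")

# Build one alternative per first word: First[-\s](?:Second1|Second2|...)
alternatives = []
for w1, row in pattern_mat.iterrows():
    followers = [re.escape(w2) for w2 in row[row == 'Yes'].index]
    if followers:
        alternatives.append(r"{}[-\s](?:{})".format(re.escape(w1), "|".join(followers)))

# Non-capturing groups avoid the match-group warning from str.contains
combined = r"^(?:{})$".format("|".join(alternatives))

df = pd.DataFrame({
    'Name': ['Al Gore', 'Foo-Bar', 'Monte-Python', 'Python Snake', 'Python Anaconda',
             'Python-Pandas', 'Derp Bar', 'Derp Python', 'JavaScript', 'Python Monte'],
    'Class': ['Politician', 'L', 'H', 'L', 'L', 'H', 'H', 'L', 'L', 'Circus'],
})
df['status'] = ''
df.loc[df['Name'].str.contains(combined), 'status'] = 'Found'
print(df)
```

With the matrix as given, this marks Python Snake, Derp Bar, Derp Python and Python Monte. That differs from the printout in the question because the hand-written patterns there do not follow the table exactly.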