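I have a dataset with the columns tweetID, tweet-text, RegExp1, RegExp2, RegExp3, and RegExp4, plus a list of 4 regular expressions. I want to apply the regular expressions to the tweet-text column one by one: if tweet-text matches a regular expression, I want to set the value in the corresponding RegExp column to 1, and if it does not match, I want to set it to 0.
For example, if tweet-text matches regular expression 1, the corresponding RegExp1 column should be set to 1; if it does not match regular expression 2, the RegExp2 column should be set to 0, and so on. I tried the code given below, but it did not work for me.
My dataset looks like this: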
tweetID | tweet-text          | RegExp1 | RegExp2 | RegExp3 | RegExp4
---------------------------------------------------------------------
10001   | to get it or?       |         |         |         |
10333   | I just wonder :)    |         |         |         |
10933   | is it possible dude |         |         |         |
14633   | he is good at       |         |         |         |
Code:
`regexes = [
    re.compile('i asked .* said'),
    re.compile('you asked me what .*'),
    re.compile('(to get|to see|to look|is it true|is it possible) .*'),
    re.compile('I .* wonder .*')
]
for regex, i in zip(regexes, range(4)):
    columnName = "RegExp"+str(i+1)
    for row in df['tweet-text']:
        if(regex.search(row) != None):
            df[columnName] = 1
        else:
            df[columnName] = 0`
(A pandas solution is preferred.) Thanks.
Answer 0: (score: 1)
You can use str.contains in a loop. You need to pass the regex pattern strings (not compiled regex objects).
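If the patterns already exist as compiled objects, as in the question, one way to recover the plain strings is the `.pattern` attribute of Python's compiled regex objects. This is a minimal sketch; the names `compiled` and `patterns` are chosen here for illustration and do not come from the original post.

```python
import re

# Compiled patterns, as in the question (shortened to two for brevity).
compiled = [
    re.compile('i asked .* said'),
    re.compile('I .* wonder .*'),
]

# Each compiled object keeps the string it was built from in `.pattern`.
patterns = [c.pattern for c in compiled]
```

The resulting list of strings can then be passed to str.contains one item at a time.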
This is what I am starting with:
In [1062]: df.head()
Out[1062]:
   tweetID           tweet-text RegExp1 RegExp2 RegExp3 RegExp4
0    10001        to get it or?
1    10333     I just wonder :)
2    10933  is it possible dude
3    14633        he is good at
In [1063]: regexes = [
      ...:     'i asked .* said',
      ...:     'you asked me what .*',
      ...:     '(?:to get|to see|to look|is it true|is it possible) .*',
      ...:     'I .* wonder .*'
      ...: ]
Next, run a loop over the regex patterns. Call str.contains with each pattern and assign the result to the corresponding column in turn:
In [1090]: for i, r in enumerate(regexes):
      ...:     df['RegExp%d' %(i + 1)] = df['tweet-text'].str.contains(r).astype(int)
      ...:
In [1091]: df.head()
Out[1091]:
   tweetID           tweet-text  RegExp1  RegExp2  RegExp3  RegExp4
0    10001        to get it or?        0        0        1        0
1    10333     I just wonder :)        0        0        0        1
2    10933  is it possible dude        0        0        1        0
3    14633        he is good at        0        0        0        0
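For reference, the approach above can be put together as a standalone script. This is a sketch under a few assumptions: the DataFrame is rebuilt by hand from the four sample rows in the question, and `na=False` is added (it is not in the original answer) so that missing tweet text would count as "no match" instead of breaking the integer conversion. The non-capturing group `(?:...)` is kept from the answer; with a plain capturing group, pandas emits a warning about match groups.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed to be representative).
df = pd.DataFrame({
    "tweetID": [10001, 10333, 10933, 14633],
    "tweet-text": [
        "to get it or?",
        "I just wonder :)",
        "is it possible dude",
        "he is good at",
    ],
})

# Pattern strings, not compiled objects.
regexes = [
    r"i asked .* said",
    r"you asked me what .*",
    r"(?:to get|to see|to look|is it true|is it possible) .*",
    r"I .* wonder .*",
]

for i, pattern in enumerate(regexes, start=1):
    # Vectorized search over the whole column: one boolean per row.
    matched = df["tweet-text"].str.contains(pattern, regex=True, na=False)
    # Convert True/False to 1/0 and store it in RegExp1 .. RegExp4.
    df[f"RegExp{i}"] = matched.astype(int)

print(df)
```

Note that the loop in the question assigns `df[columnName] = 1` (or `0`) to the entire column on every row iteration, so each column ends up holding whatever the last row produced. Building a per-row boolean Series with str.contains avoids that.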