Using a regular expression to strip predictable blocks of text from a data frame

Date: 2019-05-02 02:41:07

Tags: python regex text nlp

I have a data frame of inspection results and violations that looks like this:

Results                 Violations
Pass w/ Conditions  3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E

Pass                    36. THERMOMETERS PROVIDED & ACCURATE Comment...

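What I need is to have Python loop through this pandas data frame, specifically the "Violations" column, and identify the text that "starts with a number and ends with Comments:".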

I was able to remove the leading numbers with this line of code:

df_new['Violations'] = df_new['Violations'].map(lambda x: 
    x.lstrip('0123456789.- ').rstrip('[^a-zA-Z]Comments[^a-zA-Z]'))

As you can see, I tried to remove the trailing comments by passing a regex to rstrip, but that seems to have no effect. The output looks like this:

Results Violations
0   Pass w/ Conditions  MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPL...
1   Pass    THERMOMETERS PROVIDED & ACCURATE - Comments: 4...

In plain terms, the regex should: find the number and remove everything between the number and "Comments:".

Is there a simple way to do this?

1 Answer:

Answer 0 (score: 0)

"In plain terms, the regex should: find the number and remove everything between the number and "Comments:"."
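A likely reason the rstrip call in the question has no effect: str.lstrip and str.rstrip take a set of characters to strip, not a pattern. The short sketch below illustrates this; the sample string is a made-up stand-in for one row of the Violations column, not real data.

```python
# str.rstrip / str.lstrip treat their argument as a SET of characters,
# not as a regular expression.

# Hypothetical sample row, modelled on the output shown in the question.
text = "36. THERMOMETERS PROVIDED & ACCURATE - Comments: 4"

# The "pattern" from the question is read as the individual characters
# [ ^ a - z A Z ] C o m e n t s. The string ends in "4", which is not in
# that set, so nothing is stripped.
chars = "[^a-zA-Z]Comments[^a-zA-Z]"
stripped = text.rstrip(chars)
print(stripped == text)  # True

# Same semantics on a smaller example: trailing characters are removed
# one at a time for as long as they belong to the set.
print("Comments".rstrip("stnem"))  # Co

# lstrip with digits, dot, dash and space works for the leading number
# only because a character set happens to be enough there.
print(text.lstrip("0123456789.- "))
```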



foo = '''\
Results                 Violations
Pass w/ Conditions  3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E
Pass                    36. THERMOMETERS PROVIDED & ACCURATE Comment...'''


>>> print(foo)
    Results                 Violations
    Pass w/ Conditions  3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E
    Pass                    36. THERMOMETERS PROVIDED & ACCURATE Comment...
>>>


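import re
# Raw strings keep the backslashes intact; \1 is the captured number (e.g. "36.").
# "." does not match newlines, so only a line containing "Comment" is changed.
bar = re.sub(r'(\d+\.).*(Comment.*)', r'\1', foo)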
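If the aim is instead to keep the violation text and drop both the leading number and the trailing comments, one possible sketch follows. It assumes the comment part always starts with "Comments:" (optionally preceded by a dash), which is a guess based on the output shown in the question. The helper name clean_violation and the sample strings are hypothetical.

```python
import re

# Leading item number such as "3. " or "36. ".
LEADING_NUMBER = re.compile(r"^\s*\d+\.\s*")

# Assumed shape of the trailing part: optional dash, then "Comment:" or
# "Comments:", then everything up to the end of the string.
TRAILING_COMMENTS = re.compile(r"\s*-?\s*Comments?:.*$", flags=re.DOTALL)


def clean_violation(text):
    """Return the violation title without its number or trailing comments."""
    text = LEADING_NUMBER.sub("", text)
    return TRAILING_COMMENTS.sub("", text)


# Hypothetical rows shaped like the Violations column in the question.
samples = [
    "3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E",
    "36. THERMOMETERS PROVIDED & ACCURATE - Comments: sample note",
]

for row in samples:
    print(clean_violation(row))
# MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E
# THERMOMETERS PROVIDED & ACCURATE
```

On the data frame itself, the same function could be applied the way the question already does, for example df_new['Violations'].map(clean_violation).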

Reference:

Last occurrence of a substring in a string