Question

I need to filter sentences and select only a few words from the whole sentence.

For example, I have this sample text:


ID: a9000006        
NSF Org     : DMI
Total Amt.  : $225024

Abstract    :This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals with ultra-high polarization,                    
             chemical stability and low viscosity

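I have token = re.compile('a90[0-9][0-9][0-9][0-9][0-9]| [$][\d]+ |') and re.findall(token, filetext), which gives me 'a9000006' and '$225024', but I don't know how to write a regex for the three uppercase letters after "NSF Org:" (that is, "DMI") and for all the text after "Abstract:".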

Answer 1

If you want to match everything after the :, use :\s?(.*) and take capture group 1.
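As a rough illustration of that pattern in Python, here is a minimal sketch. It assumes the sample text is held in a string named filetext (the name used in the question), with the whitespace approximated by hand; values is a hypothetical name introduced only for this example.

```python
import re

# Sample text from the question (whitespace approximated).
filetext = (
    "ID: a9000006        \n"
    "NSF Org     : DMI\n"
    "Total Amt.  : $225024\n"
    "\n"
    "Abstract    :This SBIR proposal is aimed at (1) the synthesis of new "
    "ferroelectric liquid crystals with ultra-high polarization,\n"
    "             chemical stability and low viscosity\n"
)

# The pattern has one capturing group, so re.findall returns the
# contents of group 1 for every match. Without re.DOTALL, .* stops
# at the end of each line.
values = [v.strip() for v in re.findall(r":\s?(.*)", filetext)]

for v in values:
    print(v)
```

Because . does not cross line breaks by default, only the first line of the abstract is captured here; the continuation line has no colon and is skipped.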


Answer 2

If you want a single regex that matches each of these 4 fields, with an explicit check for each one, you can use this regex: :\s?(a90[\d]+|[$][\d]+|[A-Z]{3}|.*$)

>>> import re
>>> token = re.compile(r':\s?(a90[\d]+|[$][\d]+|[A-Z]{3}|.*$)', re.DOTALL)  # re.DOTALL lets .* run across the line break inside the abstract
>>> re.findall(token, filetext)
['a9000006', 'DMI', '$225024', 'This SBIR proposal is aimed at (1) the synthesis of new ferroelectric liquid crystals with ultra-high polarization,                    \n             chemical stability and low viscosity']

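However, since you are searching for everything at once, you would probably be better off with a pattern that matches all 4 fields generically, such as the one in this answer here.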
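The linked answer is not reproduced here, so the following is only one way such a generic match might look, not that answer's method. It is a sketch that treats every "key : value" line as a field and appends colon-free lines to the previous field. The names FIELD and parse_fields are hypothetical, and the sketch assumes continuation lines never contain a colon.

```python
import re

# Sample text from the question (whitespace approximated).
filetext = (
    "ID: a9000006        \n"
    "NSF Org     : DMI\n"
    "Total Amt.  : $225024\n"
    "\n"
    "Abstract    :This SBIR proposal is aimed at (1) the synthesis of new "
    "ferroelectric liquid crystals with ultra-high polarization,\n"
    "             chemical stability and low viscosity\n"
)

# Hypothetical generic pattern: a key (anything up to the first colon),
# optional whitespace around the colon, then the rest of the line.
FIELD = re.compile(r"^(?P<key>[^:\n]+?)\s*:\s*(?P<value>.*)$")


def parse_fields(text):
    """Collect 'key : value' lines into a dict, joining continuation lines."""
    fields = {}
    last_key = None
    for line in text.splitlines():
        m = FIELD.match(line)
        if m:
            last_key = m.group("key").strip()
            fields[last_key] = m.group("value").strip()
        elif line.strip() and last_key is not None:
            # Assumption: a non-empty line without a colon continues
            # the previous field (here, the abstract).
            fields[last_key] += " " + line.strip()
    return fields


fields = parse_fields(filetext)
print(fields["ID"], fields["NSF Org"], fields["Total Amt."])
print(fields["Abstract"])
```

Keying the results by field name avoids depending on the order of alternatives in one large alternation, at the cost of a small loop instead of a single findall call.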

Answer 3

This should do it:

: .*
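A quick check of this pattern in Python, as a sketch that assumes the sample text is stored in a string named filetext with whitespace approximated by hand. Two things are worth noting: the pattern requires a literal space after the colon, and each match includes the leading ": " because there is no capturing group.

```python
import re

# Sample text from the question (whitespace approximated).
filetext = (
    "ID: a9000006        \n"
    "NSF Org     : DMI\n"
    "Total Amt.  : $225024\n"
    "\n"
    "Abstract    :This SBIR proposal is aimed at (1) the synthesis of new "
    "ferroelectric liquid crystals with ultra-high polarization,\n"
    "             chemical stability and low viscosity\n"
)

# No capturing group, so re.findall returns the whole match,
# including the colon and the space that follows it.
matches = re.findall(r": .*", filetext)

# Drop the leading ": " and any trailing padding to get the bare values.
values = [m[2:].strip() for m in matches]
print(values)
```

With the sample formatted as in the question, the Abstract line has no space after its colon (":This"), so this pattern does not match it; making the space optional, as in : ?(.*) or :\s?(.*), would cover that line as well.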

You can check it here.

How to write a regex for all text after ":"
