I am using Python regular expressions to read documents.
I have the following line in many of the documents:
Dated: February 4, 2011 THE REAL COMPANY, INC
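I can easily find the lines containing "Dated" with a Python text search, but I want to pull THE REAL COMPANY, INC out of the text without also capturing the "February 4, 2011" text.
I tried the following: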
[A-Z\s]{3,}.*INC
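My understanding of this regex is that it should give me all the capital letters and spaces up to INC, but instead it pulls the whole line.
This suggests to me that I am fundamentally missing something about how regular expressions work with capital letters. Is there a simple, clear explanation that I am missing?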
Answer 0 (score: 0)
Use:
>>> import re
>>> txt = 'Dated: February 4, 2011 THE REAL COMPANY, INC'
>>> re.findall('([A-Z][A-Z]+)', txt)
['THE', 'REAL', 'COMPANY', 'INC']
Another solution, suggested by @davedwards, is as follows:
>>> re.findall(r'[A-Z\s]{3,}.*', txt)
[' THE REAL COMPANY, INC']
Explanation:
[A-Z\s]{3,}.*
[A-Z\s]{3,} matches a single character present in the list below
    {3,} quantifier: matches between 3 and unlimited times, as many times as possible, giving back as needed (greedy)
    A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
    \s matches any whitespace character (equal to [\r\n\t\f\v ])
.* matches any character (except for line terminators)
    * quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
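The leading space in the result above is there because \s is part of the character class, so the match is allowed to begin on the space before THE. A minimal sketch of one way around that, assuming the company name always starts with an uppercase letter (the variable names loose and tight are illustrative only):

```python
import re

txt = 'Dated: February 4, 2011 THE REAL COMPANY, INC'

# Pattern from the answer: the class includes \s, so the match may start on a space.
loose = re.search(r'[A-Z\s]{3,}.*', txt)

# Variant: require an uppercase letter first, then at least two more
# uppercase letters or whitespace characters, then the rest of the line.
tight = re.search(r'[A-Z][A-Z\s]{2,}.*', txt)

print(repr(loose.group()))  # ' THE REAL COMPANY, INC'
print(repr(tight.group()))  # 'THE REAL COMPANY, INC'
```

Calling .strip() on the first result would give the same string; the tighter pattern just avoids the extra step.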
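Answer 2 (score: 0)
Your regex [A-Z\s]{3,}.*INC
matches 3 or more uppercase or whitespace characters, followed by any character 0 or more times, followed by INC. On the sample line that gives " THE REAL COMPANY, INC", including the leading space.
What you could also do is match Dated: from the start of the string, followed by a date-like format, and then capture what follows in a group. Your value will be in the first capturing group: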
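To check exactly which part of the sample line this pattern covers, a quick sketch with re.search can print the span of the match. It assumes the line is stored in a variable named txt, as in the first answer:

```python
import re

txt = 'Dated: February 4, 2011 THE REAL COMPANY, INC'
m = re.search(r'[A-Z\s]{3,}.*INC', txt)

print(m.span())               # (23, 45)
print(repr(m.group()))        # ' THE REAL COMPANY, INC'
print(repr(txt[:m.start()]))  # 'Dated: February 4, 2011' (not part of the match)
```

The words "Dated" and "February" are skipped because each has only one uppercase letter before a lowercase one, so the {3,} requirement is not met until the space before THE.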
^Dated:\s+\S+\s+\d{1,2},\s+\d{4}\s+(.*)$
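Explanation
^Dated:\s+
    Match Dated: from the start of the string, followed by 1 or more whitespace characters
\S+\s+
    Match 1 or more non-whitespace characters, then 1 or more whitespace characters (this matches February in this case)
\d{1,2},
    Match 1 or 2 digits followed by a comma
\s+\d{4}\s+
    Match 1 or more whitespace characters, 4 digits, then 1 or more whitespace characters
(.*)
    Capture any character 0 or more times in a group
$
    Assert the end of the string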
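Put together, the pattern above could be used along these lines. This is only a sketch: the helper name extract_company and the constant DATED_RE are made up for illustration, and it assumes each "Dated:" line is passed in on its own.

```python
import re

# Pattern from this answer: "Dated:", a month word, a day, a year, then the rest.
DATED_RE = re.compile(r'^Dated:\s+\S+\s+\d{1,2},\s+\d{4}\s+(.*)$')

def extract_company(line):
    """Return the text after the date on a 'Dated:' line, or None if no match."""
    m = DATED_RE.match(line)
    return m.group(1) if m else None

print(extract_company('Dated: February 4, 2011 THE REAL COMPANY, INC'))
# THE REAL COMPANY, INC
print(extract_company('No date on this line'))
# None
```

When searching a whole document at once rather than line by line, compiling the pattern with re.MULTILINE makes ^ and $ anchor at every line instead of only at the start and end of the text.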