Question

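I have a block of text that I need to scrape and split. I need to find "Review frequency" in a large body of text and, once it is found, grab everything after it, stopping at the final '.'. The sample text is: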

No. of components Variable
Review frequency Quarterly (Mar., Jun., Sep., Dec.)
Quick facts
To learn more about the

What I need is 'Quarterly' and 'Mar., Jun., Sep., Dec.'

My current regex is:

((?=.*?\bReview frequency\b)(\b(Q|q)uarterly|(A|a)nnually|(S|s)emi-(A|a)nnually))

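But this does not work. Essentially, 'Review frequency' needs to be the qualifier before we start grabbing the other information, because there may be other dates/periods in the file. Thanks!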

Answer 1

Your pattern does not match the rest of the data on that line.
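There is also a second issue worth noting: a lookahead scans forward from the current position, whereas "Review frequency" sits before "Quarterly" on the line. A minimal sketch with the sample text from the question (the variable names `original`, `text` and `original_match` are only illustrative):

```python
import re

# Pattern from the question, unchanged.
original = r"((?=.*?\bReview frequency\b)(\b(Q|q)uarterly|(A|a)nnually|(S|s)emi-(A|a)nnually))"

text = (
    "No. of components Variable\n"
    "Review frequency Quarterly (Mar., Jun., Sep., Dec.)\n"
    "Quick facts\n"
    "To learn more about the"
)

# The lookahead is evaluated at the position where "Quarterly" starts, and
# "Review frequency" does not occur after that point on the same line,
# so the search finds nothing for this sample.
original_match = re.search(original, text)
print(original_match)
```

Consuming the label as ordinary text, rather than asserting it in a lookahead, avoids this problem.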

I suggest using:

(?m)^Review frequency[ \t]+(\w+)[ \t]+(.+)

See the regex demo.
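As a rough illustration of what the two capture groups hold, here is a small sketch that runs the pattern on the sample text. The second input, `semi_line`, is a made-up line (not from the question) added to show that `\w+` does not cross the hyphen in "Semi-annually":

```python
import re

pattern = r"(?m)^Review frequency[ \t]+(\w+)[ \t]+(.+)"

text = (
    "No. of components Variable\n"
    "Review frequency Quarterly (Mar., Jun., Sep., Dec.)\n"
    "Quick facts\n"
    "To learn more about the"
)

match = re.search(pattern, text)
# Group 1 is the single word after the label, group 2 the rest of the line.
frequency, rest = match.groups()
print(frequency)
print(rest)

# Hypothetical input: \w+ matches "Semi" and then needs a space or tab,
# but finds "-", so this line is not matched by the generic pattern.
semi_line = "Review frequency Semi-annually (Jun., Dec.)"
semi_match = re.search(pattern, semi_line)
print(semi_match)
```

Note that the second group keeps the surrounding parentheses, so they need to be stripped separately if only the month list is wanted.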

If the first capture group may contain only the three words listed in your pattern, use

(?m)^Review frequency[ \t]+([Qq]uarterly|(?:[Ss]emi-)?[Aa]nnually)[ \t]+(.*)

See another regex demo.

Use these patterns with re.findall:

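import re

# (?m) turns on multiline mode so that ^ matches at the start of every line.
regex = r"(?m)^Review frequency[ \t]+([Qq]uarterly|(?:[Ss]emi-)?[Aa]nnually)[ \t]+(.*)"
test = "No. of components Variable\nReview frequency Quarterly (Mar., Jun., Sep., Dec.)\nQuick facts\nTo learn more about the"
# findall returns one (frequency, rest-of-line) tuple per matching line.
print(re.findall(regex, test))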
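The question asks for 'Quarterly' and 'Mar., Jun., Sep., Dec.' without the parentheses, so some post-processing is still needed. One possible sketch, assuming the month list is always wrapped in a single pair of parentheses; the helper name `parse_review_frequency` and the constant `REVIEW_RE` are hypothetical and not part of the original answer:

```python
import re

# Same pattern as above, with the multiline flag passed explicitly.
REVIEW_RE = re.compile(
    r"^Review frequency[ \t]+([Qq]uarterly|(?:[Ss]emi-)?[Aa]nnually)[ \t]+(.*)",
    re.MULTILINE,
)


def parse_review_frequency(text):
    """Return (frequency, months) or None when the label is absent."""
    match = REVIEW_RE.search(text)
    if match is None:
        return None
    frequency, rest = match.groups()
    # Assumption: the months are enclosed in one pair of parentheses.
    inner = rest.strip().strip("()")
    months = [part.strip() for part in inner.split(",") if part.strip()]
    return frequency, months


sample = (
    "No. of components Variable\n"
    "Review frequency Quarterly (Mar., Jun., Sep., Dec.)\n"
    "Quick facts\n"
    "To learn more about the"
)
result = parse_review_frequency(sample)
print(result)
```

Returning None when the label is missing keeps other dates or periods elsewhere in the file from being picked up by accident.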
