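I am scraping quotes from Twitter and want to split each one into the actual quote and its author.
How can I handle tweets that are not formatted consistently?
I am new to regex, but this is my best attempt on regex101: https://regex101.com/r/m3WtmX/5
Below is my code. I expect every loop iteration to print an sre.SRE_Match object, but the last iteration prints None.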
import re
QUOTE_PATTERN = re.compile(r'^(?P<actual_quote>.*)\s+?-\s*(?P<author>.*)$')
# actual_quote is separated from author by space and dash
format_1 = "Any form of exercise, if pursued continuously, will help train us in perseverance -Mao Tse-Tung"
# separated by one space, dash and another space
format_2 = "Any form of exercise, if pursued continuously, will help train us in perseverance - Mao Tse-Tung"
# actual_quote is surrounded with double quotes character and
# is separated from author by space, dash and another space
format_3 = '"Any form of exercise, if pursued continuously, will help train us in perseverance" - Mao Tse-Tung'
# separated only with dash (no space)
format_4 = "Any form of exercise, if pursued continuously, will help train us in perseverance-Mao Tse-Tung"
for text in [format_1, format_2, format_3, format_4]:
    print(QUOTE_PATTERN.match(text))
Answer 0 (score: 0)
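This is genuinely tricky, because the data has no regular structure.
Making the first group non-greedy, so that it captures everything up to the first dash, works for the quotes you provided.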
^(?P<actual_quote>.*?)-(?P<author>.*)$
https://regex101.com/r/rcGzzK/2
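As a rough check, the sketch below runs both patterns over the four sample strings from the question. The original pattern fails on the fourth one because `\s+?` is lazy but still requires at least one whitespace character before the dash. The names `ORIGINAL_PATTERN`, `LAZY_PATTERN`, `QUOTE` and `samples` are only illustrative and do not come from the question.

```python
import re

# The question's pattern: "\s+?" still needs at least one whitespace before the dash.
ORIGINAL_PATTERN = re.compile(r'^(?P<actual_quote>.*)\s+?-\s*(?P<author>.*)$')
# The pattern from this answer: lazy first group, so the split is at the first dash.
LAZY_PATTERN = re.compile(r'^(?P<actual_quote>.*?)-(?P<author>.*)$')

QUOTE = "Any form of exercise, if pursued continuously, will help train us in perseverance"
samples = [
    QUOTE + " -Mao Tse-Tung",           # space, dash
    QUOTE + " - Mao Tse-Tung",          # space, dash, space
    '"' + QUOTE + '" - Mao Tse-Tung',   # quoted, then space, dash, space
    QUOTE + "-Mao Tse-Tung",            # dash only
]

# Record which samples the original pattern matches at all.
original_hits = [ORIGINAL_PATTERN.match(s) is not None for s in samples]
print(original_hits)

# Every sample contains a dash, so the lazy pattern always matches here.
lazy_results = [
    (m.group("actual_quote"), m.group("author"))
    for m in (LAZY_PATTERN.match(s) for s in samples)
]
for quote, author in lazy_results:
    # repr() makes the whitespace left around the dash visible.
    print(repr(quote[-13:]), repr(author))
```

The lazy pattern matches all four strings, but it keeps any whitespace that surrounded the dash in the captured groups.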
If you do not want the surrounding whitespace captured in the groups:
^(?P<actual_quote>.*?)\s*-\s*(?P<author>.*)$
https://regex101.com/r/rcGzzK/3
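A minimal sketch of how this second pattern could be wrapped in a helper. The function name `split_quote` is hypothetical, and stripping the surrounding double quotes is an extra step I am assuming you want; the regex itself does not do it.

```python
import re

TRIM_PATTERN = re.compile(r'^(?P<actual_quote>.*?)\s*-\s*(?P<author>.*)$')


def split_quote(text):
    """Return (quote, author), or None when the text contains no dash."""
    match = TRIM_PATTERN.match(text)
    if match is None:
        return None
    # Assumption: surrounding double quotes are not wanted in the result.
    quote = match.group("actual_quote").strip('"')
    return quote, match.group("author")


QUOTE = "Any form of exercise, if pursued continuously, will help train us in perseverance"
samples = [
    QUOTE + " -Mao Tse-Tung",
    QUOTE + " - Mao Tse-Tung",
    '"' + QUOTE + '" - Mao Tse-Tung',
    QUOTE + "-Mao Tse-Tung",
]

for sample in samples:
    print(split_quote(sample))
```

With this helper all four sample formats should reduce to the same quote and author pair.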
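Unfortunately, the regexes above will not work if the quote itself contains a dash, because they split at the first dash in the string.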
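One possible workaround, offered only as a heuristic sketch: first look for the last dash that is preceded by whitespace (which is close to what your original pattern does), and fall back to the first dash only when no such separator exists. The name `split_quote_with_dashes`, the pattern list and the ordering are my own assumptions, not something the data guarantees.

```python
import re

# Tried in order. Both the patterns and their ordering are heuristic assumptions.
SEPARATOR_PATTERNS = [
    # Greedy first group: split at the last dash that follows whitespace.
    re.compile(r'^(?P<actual_quote>.*)\s+-\s*(?P<author>.*)$'),
    # Fallback: lazy first group, split at the first dash of any kind.
    re.compile(r'^(?P<actual_quote>.*?)\s*-\s*(?P<author>.*)$'),
]


def split_quote_with_dashes(text):
    """Return (quote, author) from the first pattern that matches, else None."""
    for pattern in SEPARATOR_PATTERNS:
        match = pattern.match(text)
        if match:
            quote = match.group("actual_quote").strip()
            author = match.group("author").strip()
            return quote, author
    return None


print(split_quote_with_dashes("A well-known saying - Some Author"))
print(split_quote_with_dashes("Perseverance pays off-Mao Tse-Tung"))
```

This still cannot disambiguate a tweet such as `well-known saying-Author`, where the quote contains a dash and the separator has no whitespace around it; the fallback would split at the first dash.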