I have a debate transcript file that looks like this (bold added for readability):
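TAPPER: Senator Rubio?
RUBIO: Every election is important. I believe this is the most important election in a generation. Because what's at stake in this election is not simply what party is going to be in charge or which candidate wins. What's at stake is our identity as a nation and as a people.
RUBIO: For over two centuries, America has been an exceptional nation. And now the time has come for this generation to do what it must do to keep it that way. If we make the right choice in this election, our children are going to be the freest and most prosperous Americans that have ever lived. And the 21st century is going to be a new American century.
(APPLAUSE)
TAPPER: Senator Cruz?
CRUZ: Fifty-nine years ago, Florida welcomed my father to America as he stepped off the ferry boat from Cuba onto Key West. He was 18. He was filled with hopes and dreams, and yet he was in the freest land on the face of the earth.
This election, this debate is not about insults. It's not about attacks. It's not about any of the individuals on this stage. This election is about you and your children. It's about the freedom America has always had and making sure that that freedom is there for the next generation, that we stop Washington from standing in the way of the hard-working taxpayers of America.
(APPLAUSE)
TAPPER: Mr. Trump?
TRUMP: One of the biggest political events anywhere in the world is happening right now with the Republican Party. Millions and millions of people are going out to the polls and they're voting. They're voting out of enthusiasm. They're voting out of love. Some of these people, frankly, have never voted before - 50 years old, 60 years old, 70 years old - never voted before. We're taking people from the Democrat Party. We're taking people as independents, and they're all coming out and the whole world is talking about it. It's very exciting. I think, frankly, the Republican establishment, or whatever you want to call it, should embrace what's happening. We're having millions of extra people join. We are going to beat the Democrats. We are going to beat Hillary or whoever it may be. And we're going to beat them soundly.
(APPLAUSE)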
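I want to create two files from this text file: one containing every statement Cruz made, and another containing only what Trump said. Any idea how? I tried the following regular expression, which finds the text on each line where a candidate starts speaking, but when a statement is split by a newline, the lines that follow are not captured.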
import re
with open('RepDebate_FL.txt') as f:
    for line in f:
        cruz_regex = str(re.findall(r'CRUZ:.*', line))
        trump_regex = str(re.findall(r'TRUMP:.*', line))
        if cruz_regex is not None:
            print(cruz_regex)
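Meaning I get this line:
['CRUZ: Fifty-nine years ago, Florida welcomed my father to America as he stepped off the ferry boat from Cuba onto Key West. He was 18. He was filled with hopes and dreams, and yet he was in the freest land on the face of the earth.']
But the next result is empty, because that text sits on a new line that does not begin with 'CRUZ:':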
[]
Any and all help is appreciated, TIA.
Answer 0 (score: 1)
You can use re.split and the itertools grouper recipe.
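import itertools
import re

def grouper(iterable, n, fillvalue=None):
    # itertools recipe: collect items into fixed-length chunks of size n
    iters = [iter(iterable)] * n
    return itertools.zip_longest(*iters, fillvalue=fillvalue)

# the_text is assumed to hold the whole transcript as one string
s = filter(None, re.split(r"([A-Z]+:)", the_text))
pairs = grouper(s, 2)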
This leaves pairs as:
('TAPPER:', ' Senator Rubio?\n')
('RUBIO:', " Every election is important. I believe this is the most important election in a generation. Because what's at stake in this election is not simply what party is going to be in charge or which candidate wins. What's at stake is our identity as a nation and as a people.\n")
('RUBIO:', ' For over two centuries, America has been an exceptional nation. And now the time has come for this generation to do what it must do to keep it that way. If we make the right choice in this election, our children are going to be the freest and most prosperous Americans that have ever lived. And the 21st century is going to be a new American century.\n(APPLAUSE)\n')
('TAPPER:', ' Senator Cruz?\n')
('CRUZ:', " Fifty-nine years ago, Florida welcomed my father to America as he stepped off the ferry boat from Cuba onto Key West. He was 18. He was filled with hopes and dreams, and yet he was in the freest land on the face of the earth.\nThis election, this debate is not about insults. It's not about attacks. It's not about any of the individuals on this stage. This election is about you and your children. It's about the freedom America has always had and making sure that that freedom is there for the next generation, that we stop Washington from standing in the way of the hard-working taxpayers of America.\n(APPLAUSE)\n")
('TAPPER:', ' Mr. Trump?\n')
('TRUMP:', " One of the biggest political events anywhere in the world is happening right now with the Republican Party. Millions and millions of people are going out to the polls and they're voting. They're voting out of enthusiasm. They're voting out of love. Some of these people, frankly, have never voted before - 50 years old, 60 years old, 70 years old - never voted before. We're taking people from the Democrat Party. We're taking people as independents, and they're all coming out and the whole world is talking about it. It's very exciting. I think, frankly, the Republican establishment, or whatever you want to call it, should embrace what's happening. We're having millions of extra people join. We are going to beat the Democrats. We are going to beat Hillary or whoever it may be. And we're going to beat them soundly.\n(APPLAUSE)")
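The alternating speaker/body items come from how re.split treats a capturing group: the text matched by the group is kept in the result instead of being thrown away. A minimal sketch on a made-up two-speaker sample (the sample text is an assumption, not taken from the debate file):

```python
import re

# Made-up sample; the second CRUZ line has no speaker label of its own.
sample = "TAPPER: Senator Cruz?\nCRUZ: First line.\nSecond line.\n"

# Because the pattern is wrapped in parentheses, the labels are kept.
parts = re.split(r"([A-Z]+:)", sample)

# parts[0] is the (empty) text before the first label, which is why
# the answer passes the result through filter(None, ...).
print(parts)
# ['', 'TAPPER:', ' Senator Cruz?\n', 'CRUZ:', ' First line.\nSecond line.\n']
```

Note that the pattern matches any run of uppercase letters followed by a colon, so an uppercase word such as "NOTE:" inside a statement would also be treated as a speaker label.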
Then we just iterate and check the speaker name.
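# The output file names below are placeholders; pick whatever names you like.
with open('trump.txt', 'w') as trump_file, open('cruz.txt', 'w') as cruz_file:
    for speaker, body in pairs:
        if "TRUMP" in speaker:
            trump_file.write(body)  # write body to trump file
        elif "CRUZ" in speaker:
            cruz_file.write(body)  # write body to cruz file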
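For reference, here is a possible variation on the same idea rather than the exact code above. It swaps filter plus grouper for slicing, so an empty body cannot shift the pairing, and it anchors the pattern to the start of a line. The function name statements_by_speaker and the sample text are hypothetical.

```python
import re
from collections import defaultdict

def statements_by_speaker(text):
    """Group transcript text by speaker label (uppercase word + ':' at line start)."""
    # With a capturing group, re.split returns
    # [text_before_first_label, label, body, label, body, ...].
    parts = re.split(r"^([A-Z]+):", text, flags=re.MULTILINE)
    grouped = defaultdict(list)
    # Skip parts[0], then walk the label/body pairs.
    for speaker, body in zip(parts[1::2], parts[2::2]):
        grouped[speaker].append(body.strip())
    return grouped

# Made-up sample standing in for the real transcript.
sample = (
    "TAPPER: Senator Cruz?\n"
    "CRUZ: First line.\n"
    "Second line.\n"
    "(APPLAUSE)\n"
    "TAPPER: Mr. Trump?\n"
    "TRUMP: Only line.\n"
)
grouped = statements_by_speaker(sample)
print(grouped["CRUZ"])   # ['First line.\nSecond line.\n(APPLAUSE)']
print(grouped["TRUMP"])  # ['Only line.']
```

Each candidate's file could then be produced by writing "\n".join(grouped["CRUZ"]) (or "TRUMP") to a file opened for writing. Stage directions such as (APPLAUSE) stay attached to the preceding statement, so strip them separately if they are unwanted.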