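My text contains different news articles about terrorist attacks. Each article starts with an HTML tag (<p>Advertisement), and I want to extract one specific piece of information from each article: the number of people wounded in the attack.
This is a sample of the text file and how the articles are separated: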
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” , The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,, By KAREEM FAHIM and MOHAMAD FAHIM ABED JUNE 30, 2016
, At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. , KABUL, Afghanistan — Taliban insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. , During a year...]
This is my code so far:
import re

# read the whole file; every article starts with "<p>Advertisement"
with open("News_cleaned_definitive.csv") as text_open:
    text_read = text_open.read()
splitted = text_read.split("<p>")
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"
for article in splitted:
    result = re.findall(pattern, article)
    print(result)
The output I get is:
[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]
I would like to make the output more readable and then save it as a csv file:
article_1,0
article_2,0
article_3,40
article_3,150
article_3,94
Any suggestions on how to make it more readable?
Answer 0 (score: 1)
I rewrote the loop like this and, as you requested, merged it with writing the csv file:
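import csv
import re

# splitted and pattern are the variables defined in the question
with open("wounded.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for i, article in enumerate(splitted):
        result = re.findall(pattern, article)
        # sum the non-empty groups of the first match, or use 0 if nothing matched
        nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
        row = ["article_{}".format(i + 1), nb_casualties]
        writer.writerow(row)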
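This uses enumerate to get the article index, sums the number of victims with sum (in case more than one group matches) only if something matched (the ternary expression checks that), and writes each result as a row (one row per iteration) with the csv.writer object.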
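To try the approach without the original file, here is a minimal self-contained sketch. It assumes a few made-up sample snippets in place of the real articles and writes to an in-memory buffer instead of wounded.csv; the helper names count_wounded and rows_for are hypothetical and only used for illustration.

```python
import csv
import io
import re

# Same pattern as in the question: per match, exactly one of the three groups is non-empty.
PATTERN = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"


def count_wounded(article):
    """Return the number captured by the first match, or 0 if nothing matched."""
    result = re.findall(PATTERN, article)
    # result is a list of 3-tuples; empty strings mark groups that did not take part
    return sum(int(x) for x in result[0] if x) if result else 0


def rows_for(articles):
    """Build one [label, count] row per article, numbering articles from 1."""
    return [["article_{}".format(i + 1), count_wounded(a)]
            for i, a in enumerate(articles)]


# Hypothetical sample snippets (not the real data file)
sample_articles = [
    "A man wounded 2 police officers with a knife in Brussels.",
    "At least 33 people were killed and 25 were injured near Kabul.",
    "No figures were reported.",
]

# Write to an in-memory buffer; swap in open("wounded.csv", "w", newline="") for a real file
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",")
writer.writerows(rows_for(sample_articles))
csv_text = buffer.getvalue()
print(csv_text)
```

Note that only result[0], the first match in each article, is counted here, as in the loop above. If an article may contain several matching phrases, iterating over every tuple in result would be one possible variation, at the risk of counting the same figure twice when an article repeats it.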