Question

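My text contains different news articles about terrorist attacks. Each article starts with an html tag (<p>Advertisement), and I would like to extract one specific piece of information from each article: the number of people wounded in the attack.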

This is a sample of the text file, showing how the articles are separated:

[<p>Advertisement ,   By  MILAN SCHREUER  and     ALISSA J. RUBIN    OCT. 5, 2016 
 ,  BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” ,  The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,,   By   KAREEM FAHIM   and    MOHAMAD FAHIM ABED   JUNE 30, 2016 
 ,  At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. ,  KABUL, Afghanistan —  Taliban  insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. ,  During a year...]

This is my code so far:

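import re

text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
# Split the whole file into articles on the opening <p> tag
splitted = text_read.split("<p>")
# Raw string so that \d reaches the regex engine unchanged
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"
for article in splitted:
    result = re.findall(pattern, article)
    print(result)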

The output I get is:

[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]

I would like to make the output more readable and then save it as a csv file:

article_1,0
article_2,0
article_3,40
article_3,150
article_3,94

Any suggestions on how to make it more readable?

Answer 1

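I would rewrite the loop like this, merging in the csv writing since you asked for it:

import csv

with open("wounded.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for i, article in enumerate(splitted):
        result = re.findall(pattern, article)
        nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
        row = ["article_{}".format(i + 1), nb_casualties]
        writer.writerow(row)

Get the index of each article with enumerate.
Sum the number of victims (in case more than one group matches) by using a generator expression that converts the captures to integers and passes them to sum, and only when there is a match (the ternary expression checks that).
Create the row.
Print it, or optionally write it as a row of a csv.writer object (one row per iteration).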
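If you want to try this without your file, here is a minimal self-contained sketch of the same idea. The sample text, the helper names count_wounded and write_rows, and the in-memory buffer are my own assumptions for illustration, not part of your data. I also drop the empty chunk that split("<p>") produces before the first tag, which the loop above does not do. Note that, like the loop above, it only looks at the first match in each article.

```python
import csv
import io
import re

# Same pattern as in the question, written as a raw string.
PATTERN = re.compile(r"wounded (\d+)|(\d+) were wounded|(\d+) were injured")


def count_wounded(article):
    """Return the number captured by the first match, or 0 if nothing matched."""
    matches = PATTERN.findall(article)
    if not matches:
        return 0
    # findall returns one tuple per match; groups that did not take part are ''.
    return sum(int(group) for group in matches[0] if group)


def write_rows(articles, out):
    """Write one 'article_N,count' row per article to a file-like object."""
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    for i, article in enumerate(articles, start=1):
        writer.writerow(["article_{}".format(i), count_wounded(article)])


# Hypothetical sample text, shaped like the excerpt in the question.
sample = (
    "<p>Advertisement , A man wounded 2 police officers with a knife."
    "<p>Advertisement , At least 33 people were killed and 25 were injured."
    "<p>Advertisement , No figures were reported."
)

# Skip the empty chunk before the first <p> (an assumption about what you want).
articles = [chunk for chunk in sample.split("<p>") if chunk.strip()]

buffer = io.StringIO()  # stands in for open("wounded.csv", "w", newline="")
write_rows(articles, buffer)
print(buffer.getvalue())
```

With this sample the buffer holds the rows article_1,2, article_2,25 and article_3,0. To write a real file, pass the object returned by open("wounded.csv", "w", newline="") instead of the buffer.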
