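Parsing text with regular expressions for sentiment analysis

Date: 2019-08-02 14:21:52

Tags: python regex

I am parsing a text file that contains thousands of articles in the following format; all of them follow exactly the same pattern. The text lies between the dashed lines.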


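I would like to process these articles and keep only:

a) line 1 with the document number, b) the title, and c) the body text

Since the body may also contain dates that I want to keep, how can I express this with a regular expression? Any other suggestions are also welcome. Thank you for your help.

I would like each article to have the following format, with the text between the dashed lines.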


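1 answer:

Answer 0 (score: 0)

I think that regular expressions might not be the best way to solve this problem.

Here is a rough idea of how to go about it. The function `transform` expects to be passed an iterator that returns the input lines one at a time. This can just be an open file. For testing purposes, I split a test string into a list of lines and passed an iterator over that list. The function (a generator) might need some tweaking, depending on how many lines you want to remove from the input. For testing, I added a second article to the input as if it were the last article; I am guessing at how the file might terminate.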
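One detail worth illustrating about that iterator contract: lines read from an open file keep their trailing newline, while lines produced by `str.split('\n')` do not, so any exact comparison against the dashed separator line behaves differently for the two sources. The sketch below is a minimal illustration, not part of the original answer. The helper name `stripped_lines` is hypothetical, `io.StringIO` stands in for an open file, and the separator is shortened to five dashes for readability.

```python
import io

def stripped_lines(source):
    """Yield each line from `source` without its trailing newline."""
    for raw in source:
        yield raw.rstrip('\n')

# io.StringIO stands in for a file object returned by open(path).
fake_file = io.StringIO("-----\n 1 of 40 DOCUMENTS\n")
from_file = list(stripped_lines(fake_file))

# A list iterator built the same way the test code builds its input.
from_list = list(stripped_lines(iter("-----\n 1 of 40 DOCUMENTS".split('\n'))))

# After normalization both sources produce identical lines.
print(from_file == from_list)  # True
```

Wrapping the file in such a generator before handing it to a line-based parser keeps the parser itself free of newline handling.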

The generator function walks through all the lines in the variable `lines`, which must be an iterator, by executing `next(lines)` and assigning the result to the variable `line`. If the current line is to be included in the output, the statement `yield line` is executed.

I have implemented the solution based on what you want to remove rather than what you want to keep, because your limited example does not make clear all the possibilities for the title and the body text. It appears that you want to remove lines 6 and 14, counting the '-----------' line as line 1, as well as the date that appears two lines before the next '-----------' line. If the date on line 6 and the author list on line 14 are not always in those fixed positions, then all bets are off.

```python
import re


def transform(lines):
    try:
        line = None
        while True:
            if line is None:
                line = next(lines)  # ---------------
            yield line
            line = next(lines)  # 1 of 40 documents
            yield line
            line = next(lines)  # blank line
            yield line
            line = next(lines)  # blank line
            yield line
            line = next(lines)  # blank line
            yield line
            line = next(lines)  # July 22, 2016 9:42 - Do not yield this line
            line = next(lines)  # blank line
            yield line
            line = next(lines)  # blank line
            yield line
            line = next(lines)  # blank line
            yield line
            line = next(lines)  # This is the title of the document.
            yield line
            line = next(lines)  # blank line
            yield line
            line = next(lines)  # blank line
            yield line
            line = next(lines)  # blank line
            yield line
            line = next(lines)  # Author 1 and Author 2 in London - Do not yield this line
            while True:
                line = next(lines)
                if not re.match(r'\s*[A-Za-z]+\s+\d\d?,\s+\d{4}\s*$', line):  # date?
                    yield line
                else:
                    line2 = next(lines)  # blank ?
                    line3 = next(lines)  # ------------------------------- ?
                    if line3 != '-------------------------------':
                        yield line
                        yield line2
                        yield line3
                    else:
                        line = line3
                        break  # start of new document
    except StopIteration:
        pass


if __name__ == '__main__':
    text = """-------------------------------
 1 of 40 DOCUMENTS



July 22, 2016 9:42



This is the title of the document.



Author 1 and Author 2 in London



This is the body of the text. This paragraph has four sentences. There are 25 words in total. The meaning of the words is not important.



July 23, 2016

-------------------------------
 1 of 40 DOCUMENTS



July 22, 2016 9:42



This is the title of the document.



Author 1 and Author 2 in London



This is the body of the text. This paragraph has four sentences. There are 25 words in total. The meaning of the words is not important.



July 23, 2016
"""
    for line in transform(iter(text.split('\n'))):
        print(line)
```
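To make the date check concrete, the sketch below applies the pattern used in `transform` to a few sample lines. The first three strings come from the test input; the last is an invented body sentence that mentions a date. The names `DATE_LINE`, `samples`, and `results` are illustrative only.

```python
import re

# Same pattern as in transform(): a line that consists only of "Word D, YYYY".
DATE_LINE = re.compile(r'\s*[A-Za-z]+\s+\d\d?,\s+\d{4}\s*$')

samples = [
    "July 23, 2016",                                # trailing date line
    "July 22, 2016 9:42",                           # header date followed by a time
    "This is the body of the text.",                # ordinary body line
    "The meeting on July 23, 2016 was postponed.",  # invented: date inside a sentence
]

# Map each sample to whether the pattern treats it as a date-only line.
results = {s: bool(DATE_LINE.match(s)) for s in samples}
for s, is_date in results.items():
    print(f"{is_date!s:5}  {s}")
```

Only the first sample matches. The header date with a time is rejected by the trailing `$`, which is why the generator drops it by position instead, and a date embedded in a sentence survives because `re.match` anchors at the start of the line. Note that the pattern does not validate month names, so any word followed by a day and a year would also match.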

Can you describe the input format precisely?

-------------------------------
 1 of 40 DOCUMENTS






This is the title of the document.






This is the body of the text. This paragraph has four sentences. There are 25 words in total. The meaning of the words is not important.



-------------------------------
  1 of 40 DOCUMENTS






This is the title of the document.






This is the body of the text. This paragraph has four sentences. There are 25 words in total. The meaning of the words is not important.

The listing above is the output that the test code yields.
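For comparison, here is a rough sketch of the opposite strategy: selecting what to keep instead of what to drop. It is not the approach taken in the answer and rests on assumptions that the question does not confirm: that articles are separated by a line of at least five dashes, and that the non-blank paragraphs always appear in the order document number, header date, title, authors, body, optional trailing date. The names `parse_articles`, `SEPARATOR`, and `DATE_ONLY` are hypothetical.

```python
import re

SEPARATOR = re.compile(r'^-{5,}[ \t]*$', re.MULTILINE)  # assumed separator shape
DATE_ONLY = re.compile(r'\s*[A-Za-z]+\s+\d\d?,\s+\d{4}\s*$')

def parse_articles(text):
    """Return one dict per article with 'number', 'title' and 'body'."""
    articles = []
    for chunk in SEPARATOR.split(text):
        # Paragraphs are runs of non-blank lines separated by blank lines.
        paragraphs = [p.strip() for p in re.split(r'\n\s*\n', chunk) if p.strip()]
        if len(paragraphs) < 5:
            continue  # empty chunk or incomplete article
        # Assumed order: number, header date, title, authors, body...
        number, _date, title, _authors, *body = paragraphs
        if DATE_ONLY.match(body[-1]):
            body = body[:-1]  # drop only the trailing date-only paragraph
        articles.append({'number': number, 'title': title,
                         'body': '\n\n'.join(body)})
    return articles

sample = "\n".join([
    "-------------------------------",
    " 1 of 40 DOCUMENTS", "", "", "",
    "July 22, 2016 9:42", "", "", "",
    "This is the title of the document.", "", "", "",
    "Author 1 and Author 2 in London", "", "", "",
    "This is the body of the text. It mentions July 4, 2016.",  # invented sentence
    "", "", "",
    "July 23, 2016", "",
    "-------------------------------",
])

articles = parse_articles(sample)
for article in articles:
    print(article)
```

Because only a final paragraph that consists solely of a date is removed, dates mentioned inside the body are kept, which is the behavior the question asks for. If the paragraph order varies between articles, this sketch fails in the same way the position-based generator does.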
