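Cleaning text with the regular expression library does not work properly

Time: 2019-06-21 03:36:52

Tags: python regex

I have a text that I need to clean up for further processing.

Here is the sample text: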

  

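Nigel Reuben Rook Williams (15 July 1944 - 21 April 1992) was an English conservator and expert on the restoration of ceramics and glass. From 1961 until his death he worked at the British Museum, where he became the Chief Conservator of Ceramics and Glass in 1983. There his work included the successful restorations of the Sutton Hoo helmet and the Portland Vase.

Joining as an assistant at age 16, Williams spent his entire career, and most of his life, at the British Museum. He was one of the first people to study conservation, not yet recognised as a profession, and from an early age was given responsibility over high-profile objects. In the 1960s he assisted with the re-excavation of the Sutton Hoo ship-burial, and in his early to mid twenties he conserved many of the objects found therein: most notably the Sutton Hoo helmet, which occupied a year of his time. He likewise reconstructed other objects from the find, including the shield, drinking horns, and maplewood bottles.

The "abiding passion of his life" was ceramics,[4] and the 1970s and 1980s gave Williams ample opportunities in that field. After nearly 31,000 fragments of shattered Greek vases were found in 1974 amidst the wreck of HMS Colossus, Williams set to work piecing them together. The process was televised, and turned him into a television personality. A decade later, in 1988 and 1989, Williams's crowning achievement came when he took to pieces the Portland Vase, one of the most famous glass objects in the world, and put it back together. The reconstruction was again televised for a BBC programme, and as with the Sutton Hoo helmet, took nearly a year to complete.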

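I need to:

  • split the text into sentences (separated by the period "."), removing the periods

  • split the sentences into words (Latin letters only); all other symbols should be replaced with space characters, and the words should be separated by single spaces only

  • convert all of the text to lowercase

I am on a Mac and ran the following code: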

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
import re

fread = open('source.txt')
fwrite = open('result.txt','w+')

for line in fread:
    new_line = line    
    # split the text into sentences
    new_line = re.sub(r"\."  , "\r", new_line)

    # change all uppercase letters to lowercase
    new_line = new_line.lower()

    # only latin letters 
    new_line = re.sub("[^a-z\s]", " ", new_line)

    # The words should be separated by single spaces.
    new_line = re.sub(r" +"," ", new_line)

    # Getting rid of space in the beginning of the sentence 
    new_line = re.sub(r"ˆ\s+", "", new_line)
    fwrite.write(new_line)

fread.close()
fwrite.close()

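The result is not what I expected. The space at the beginning of each line is not removed. I ran the same code on a Windows machine and noticed that the periods were replaced inconsistently, sometimes with one character and sometimes with another. So I am not sure what is going on.

Here is a sample of the result. Since leading whitespace does not show up on Stack Overflow, I had to format the text as code: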

nigel reuben rook williams july april was an english conservator and expert on the restoration of ceramics and glass
 from until his death he worked at the british museum where he became the chief conservator of ceramics and glass in 
 there his work included the successful restorations of the sutton hoo helmet and the portland vase

joining as an assistant at age williams spent his entire career and most of his life at the british museum
 he was one of the first people to study conservation not yet recognised as a profession and from an early age was given responsibility over high profile objects
 in the s he assisted with the re excavation of the sutton hoo ship burial and in his early to mid twenties he conserved many of the objects found therein most notably the sutton hoo helmet which occupied a year of his time
 he likewise reconstructed other objects from the find including the shield drinking horns and maplewood bottles

the abiding passion of his life was ceramics and the s and s gave williams ample opportunities in that field
 after nearly fragments of shattered greek vases were found in amidst the wreck of hms colossus williams set to work piecing them together
 the process was televised and turned him into a television personality
 a decade later in and williams s crowning achievement came when he took to pieces the portland vase one of the most famous glass objects in the world and put it back together
 the reconstruction was again televised for a bbc programme and as with the sutton hoo helmet took nearly a year to complete

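The differing characters may not show up here. For example, before "joining" I see two ?? characters when I open the file in TextWrangler.

By the way, using the lstrip() function does remove the space at the beginning of each sentence.

Why doesn't new_line = re.sub(r"ˆ\s+", "", new_line) work?

I suspect that the '\n' used to mark the end of the line is causing some of the problems.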

2 Answers:

Answer 0: (score: 1)

# split the sentences into words 
new_line = re.sub("[^a-z\s]", " ", new_line)

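This does not do what the comment says. It actually replaces every character that is neither a lowercase letter nor whitespace with a space, which is why your output is missing the digits and punctuation.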
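To make the difference concrete, here is a minimal sketch (the sample sentence is adapted from the question's text, and the `re.findall` alternative is only one possible way to get a word list, not something the original code used):

```python
import re

sample = "From 1961 until his death, he worked at the British Museum."
lowered = sample.lower()

# What the line in question really does: every character that is not
# a-z and not whitespace is swapped for a single space, one for one.
replaced = re.sub(r"[^a-z\s]", " ", lowered)
print(repr(replaced))  # repr() makes the runs of spaces visible

# One possible way to actually split into words: collect runs of a-z.
words = re.findall(r"[a-z]+", lowered)
print(words)
```

Note that the digits and the comma are replaced character by character, so several spaces in a row are left behind until a later step collapses them.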

# Getting rid of space in the beginning of the sentence 
new_line = re.sub(r"ˆ\s+", "", new_line)

I don't know what the character at the start of that regex is, but it is not the start-of-line anchor ^.
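One way to check a suspicious character is to ask Python for its code point and Unicode name. The sketch below assumes the character in the question's pattern is U+02C6, which is how it appears when copied from the post; if the original file contains something else, the first line of output will say so.

```python
import re
import unicodedata

suspect = "\u02c6"  # assumed: the look-alike character from the question
caret = "^"         # the real regex anchor

for ch in (suspect, caret):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

line = "  from until his death"

# With the look-alike, the pattern searches for that literal character
# followed by whitespace, so nothing matches and the line is unchanged.
with_suspect = re.sub(suspect + r"\s+", "", line)

# With the real anchor, leading whitespace is removed.
with_caret = re.sub(r"^\s+", "", line)

# Even the real anchor only matches at the start of the string, or
# after "\n" when re.MULTILINE is set. A space that follows "\r"
# (which the question's code inserts for each period) is left alone.
after_cr = re.sub(r"^\s+", "", "first\r second", flags=re.MULTILINE)

print(repr(with_suspect), repr(with_caret), repr(after_cr))
```

This would also explain why lstrip() appeared to work where the regex did not: lstrip() does not depend on the anchor character at all.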

Answer 1: (score: 0)

A few points worth mentioning:

  1. Use a context manager for the input/output files, since it takes care of closing them after use.

  2. The character in your regex is incorrect, as John Gordon pointed out.

  3. I suggest using a regex visualization tool (e.g. https://jex.im/regulex/).

  4. The basic way to replace content with just a single space is to use the plus operator: (non-letter character) + (one or more).

So I ended up with this final snippet:

[^a-z]+
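Putting these points together, a minimal sketch of the whole cleanup might look like the following. It assumes Python 3 and UTF-8 input; the helper names `clean_text` and `clean_file` are made up for illustration, and splitting on every "." is a simplification that would also break on abbreviations.

```python
import re


def clean_text(text):
    """Return a list of lowercase sentences with single-spaced words."""
    sentences = []
    for raw in text.split("."):
        # Any run of characters outside a-z collapses to one space,
        # then strip() drops the space left at either end.
        cleaned = re.sub(r"[^a-z]+", " ", raw.lower()).strip()
        if cleaned:
            sentences.append(cleaned)
    return sentences


def clean_file(src_path, dst_path):
    """Write one cleaned sentence per line (hypothetical helper)."""
    # The context manager closes both files, even if an error occurs.
    with open(src_path, encoding="utf-8") as fread, \
            open(dst_path, "w", encoding="utf-8") as fwrite:
        for sentence in clean_text(fread.read()):
            fwrite.write(sentence + "\n")


sample = ("From 1961 until his death he worked at the British Museum. "
          "There his work included the Sutton Hoo helmet.")
print(clean_text(sample))
```

With the file names from the question, the call would be clean_file('source.txt', 'result.txt'). Writing "\n" in text mode lets Python pick the platform's line ending, which avoids the stray "\r" characters that the original substitution introduced.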