Question

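I'm doing the following on a list of words. I read lines from a Project Gutenberg text file, split each line on spaces, perform general punctuation substitution, and then print each word and punctuation tag on its own line for further processing later. I'm unsure how to replace every single quote with a tag while leaving all apostrophes alone. My current method is to use a compiled regex: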

apo = re.compile("[A-Za-z]'[A-Za-z]")

and do the following:

if "'" in word and not apo.search(word):
    word = word.replace("'","\n<singlequote>")

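But this ignores cases where single quotes are used around a word that itself contains an apostrophe. It also doesn't tell me whether the single quote is adjacent to the start of a word or to the end of one.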

Example input:

don't
'George
ma'am
end.'
didn't.'
'Won't

Example output (after processing and printing to a file):

don't
<opensingle>
George
ma'am
end
<period>
<closesingle>
didn't
<period>
<closesingle>
<opensingle>
Won't

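I have one more question about this task: since telling <opensingle> apart from <closesingle> seems rather difficult, would it be wiser to perform substitutions like

word = word.replace('.','\n<period>')
word = word.replace(',','\n<comma>')

after performing the single-quote replacement?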

Answer 1

I suggest working smart here: use nltk or another NLP toolkit instead.

Tokenize words like this:

import nltk
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)

You may not like the fact that contractions like don't are split apart. Actually, that is the expected behavior. See Issue 401.

However, TweetTokenizer can help with that:

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
tknzr.tokenize("The code didn't work!")

If things get more involved, RegexpTokenizer may help:

from nltk.tokenize import RegexpTokenizer
s = "Good muffins cost $3.88\nin New York.  Please don't buy me\njust one of them."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokenizer.tokenize(s)

After that, annotating the tokenized words correctly should be much easier.
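To make that last step concrete, here is a minimal sketch that uses only the standard re module instead of nltk, so it is a stand-in for the tokenizers above rather than their actual behavior. The names TAGS, TOKEN, and annotate are made up for illustration, and the token pattern is just one assumption about what counts as a word.

```python
import re

# Hypothetical tag table; extend it with whatever punctuation you need.
TAGS = {".": "<period>", ",": "<comma>", "!": "<exclamation>"}

# A word is a run of letters, optionally joined by internal apostrophes
# (don't, ma'am). Any other non-whitespace character is its own token.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\S")

def annotate(text):
    """Split text into tokens and swap known punctuation for its tag."""
    return [TAGS.get(tok, tok) for tok in TOKEN.findall(text)]

print(annotate("Arthur didn't feel good, at all."))
```

Because the apostrophe is only accepted between letters, a quote at the edge of a word comes out as a separate ' token, which can then be tagged in a later pass.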

Answer 2

All you really need to replace the opening and closing ' correctly is a regex. To match them, you should use:

^' for a starting ' (opensingle),
'$ for an ending ' (closesingle).

Unfortunately, the replace method does not support regexes, so you should use re.sub instead.

Below is an example program that prints your desired output (in Python 3):

import re

# Named text rather than str, which would shadow the builtin type.
text = "don't 'George ma'am end.' didn't.' 'Won't"
words = text.split(" ")
for word in words:
    word = re.sub(r"^'", '<opensingle>\n', word)   # quote at the start of the word
    word = re.sub(r"'$", '\n<closesingle>', word)  # quote at the end of the word
    word = word.replace('.', '\n<period>')
    word = word.replace(',', '\n<comma>')
    print(word)
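If you want to check the result against the sample in the question rather than eyeball the printed text, the same substitutions can be wrapped in a function. This is only a sketch; tag_word is a hypothetical helper name, not something from the answer itself.

```python
import re

def tag_word(word):
    """Return the output lines for one whitespace-separated word."""
    word = re.sub(r"^'", "<opensingle>\n", word)   # quote at the very start
    word = re.sub(r"'$", "\n<closesingle>", word)  # quote at the very end
    word = word.replace(".", "\n<period>")
    word = word.replace(",", "\n<comma>")
    return word.split("\n")

sample = ["don't", "'George", "ma'am", "end.'", "didn't.'", "'Won't"]
lines = [line for word in sample for line in tag_word(word)]
print("\n".join(lines))
```

One limitation worth knowing: a trailing possessive such as boys' is indistinguishable from a closing quote under these two patterns, so tag_word("boys'") yields boys followed by <closesingle>.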

Answer 3

I think this could benefit from lookahead or lookbehind assertions. The Python reference is https://docs.python.org/3/library/re.html, and a general regex site I often consult is https://www.regular-expressions.info/lookaround.html.

Your data:

words = ["don't",
         "'George",
         "ma'am",
         "end.'",
         "didn't.'",
         "'Won't",]

Now I'll define a tuple of regexes paired with their replacements.

In [230]: import re
     ...: from functools import reduce
     ...:
     ...: apo = (
     ...:     (re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>"),
     ...:     (re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>"),
     ...:     (re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>"),
     ...:     (re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])"), "<period>"),
     ...: )

In [231]: words = ["don't",
     ...:          "'George",
     ...:          "ma'am",
     ...:          "end.'",
     ...:          "didn't.'",
     ...:          "'Won't"]

In [232]: reduce(lambda w2, x: [x[0].sub(x[1], w) for w in w2], apo, words)
Out[232]: 
['don<apostrophe>t',
 '<opensingle>George',
 'ma<apostrophe>am',
 'end<period><closesingle>',
 'didn<apostrophe>t<period><closesingle>',
 '<opensingle>Won<apostrophe>t']

Here is what the regexes do:

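(?<=[A-Za-z]) is a lookbehind: match only if the preceding character is a letter, without consuming that letter.
(?=[A-Za-z]) is a lookahead: match only if the following character is a letter, again without consuming it.
(?<![A-Za-z]) is a negative lookbehind: do not match if a letter comes immediately before.
(?![A-Za-z]) is a negative lookahead: do not match if a letter comes immediately after.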
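The "does not consume" part is the reason lookarounds are used here instead of the question's original pattern. A small sketch of the difference (the variable names consuming and lookaround are mine, chosen for illustration):

```python
import re

# Without lookarounds the neighbouring letters are part of the match,
# so they get replaced along with the quote.
consuming = re.compile(r"[A-Za-z]'[A-Za-z]")
# With lookarounds only the quote itself is matched.
lookaround = re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])")

print(consuming.sub("<apostrophe>", "ma'am"))   # m<apostrophe>m
print(lookaround.sub("<apostrophe>", "ma'am"))  # ma<apostrophe>am
print(lookaround.search("ma'am").span())        # (2, 3)
```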

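Note that I added a . to the lookbehind of the <closesingle> pattern, and that the order within apo matters: if . were replaced with <period> first, the closing quote would follow > instead of . and the <closesingle> pattern would no longer match.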
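The ordering point is easy to verify on a single word. The sketch below pulls the last two patterns out of apo under the assumed names close_single and period and applies them in both orders:

```python
import re

close_single = re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])")
period = re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])")

# Order used in apo: quote first, then period.
good = period.sub("<period>", close_single.sub("<closesingle>", "end.'"))

# Reversed order: after the period is tagged, the quote follows ">",
# which the lookbehind [.A-Za-z] does not accept, so it is left alone.
bad = close_single.sub("<closesingle>", period.sub("<period>", "end.'"))

print(good)  # end<period><closesingle>
print(bad)   # end<period>'
```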

This operates on individual words, but it should work on whole sentences too.

In [233]: onelong = """
     ...: don't
     ...: 'George
     ...: ma'am
     ...: end.'
     ...: didn't.'
     ...: 'Won't
     ...: """

In [235]: print(
     ...:     reduce(lambda sentence, x: x[0].sub(x[1], sentence), apo, onelong)
     ...: )

don<apostrophe>t
<opensingle>George
ma<apostrophe>am
end<period><closesingle>
didn<apostrophe>t<period><closesingle>
<opensingle>Won<apostrophe>t

(reduce is used here to make it easy to apply one regex's .sub to the words or string, then pass that output to the next regex's .sub, and so on.)
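If the reduce call reads as too dense, an explicit loop does the same thing. This is a sketch with hypothetical function names and a shortened two-rule table; note that in Python 3 reduce has to be imported from functools.

```python
import re
from functools import reduce  # reduce is not a builtin in Python 3

rules = (
    (re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>"),
    (re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>"),
)

def apply_with_loop(rules, text):
    # Feed the result of each substitution into the next one.
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

def apply_with_reduce(rules, text):
    # Same chaining, with text as the initial accumulator value.
    return reduce(lambda acc, rule: rule[0].sub(rule[1], acc), rules, text)

print(apply_with_loop(rules, "'Won't"))  # <opensingle>Won<apostrophe>t
```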

Python: replace single quotes except apostrophes
