Question

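I had some code that worked fine for removing punctuation and numbers using regular expressions in Python. I had to change the code a bit so that a stop list would work, which is not particularly important. Anyway, now the punctuation isn't being removed and, quite frankly, I'm stumped as to why.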

import re
import nltk

# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')

# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list:
    word = punctuation.sub("", word)
print word_list

Any pointers on why it isn't working would be great. I'm no expert at Python, so it's probably something ridiculously stupid. Thanks.

Answer 1

Change

for word in word_list:
    word = punctuation.sub("", word)

to

word_list = [punctuation.sub("", word) for word in word_list]

Assigning to word inside the for-loop above only changes the value that this temporary variable refers to. It does not alter word_list.
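To make the difference concrete, here is a minimal sketch that runs both versions side by side. The sample list is made up for illustration (the question reads its words from a file), and the names sample, looped and cleaned are hypothetical.

```python
import re

# Same pattern as in the question.
punctuation = re.compile(r'[-.?!,":;()|0-9]')

# Hypothetical sample words; the question builds this list from a file.
sample = ["in", "the", "name,", "of", "(114)", "chapters."]

# Version from the question: the result is bound to the loop variable
# and thrown away on the next iteration, so the list never changes.
looped = list(sample)
for word in looped:
    word = punctuation.sub("", word)

# Version from this answer: build a new list from the substituted strings.
cleaned = [punctuation.sub("", word) for word in sample]

print(looped)
print(cleaned)
```

Note that a token made up entirely of punctuation and digits, such as "(114)", becomes an empty string rather than disappearing from the list.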

Answer 2

You're not updating your word list. Try

for i, word in enumerate(word_list):
    word_list[i] = punctuation.sub("", word)

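Remember that although word starts off as a reference to the string object in word_list, the assignment rebinds the name word to the new string object returned by the sub function. It doesn't change the originally referenced object.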
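The rebinding can be seen in isolation with a rough sketch like the following. The two-word list is a made-up sample, and the names original and w are hypothetical helpers used only to keep a handle on the old string.

```python
import re

# Same pattern as in the question.
punctuation = re.compile(r'[-.?!,":;()|0-9]')

# Hypothetical sample words.
word_list = ["hello,", "world!"]

word = word_list[0]               # word and word_list[0] name the same string
original = word                   # keep a second name for the old object
word = punctuation.sub("", word)  # rebinds word to a new string

# The old string is untouched, and so is the list slot that refers to it.
unchanged_after_rebinding = (word_list[0] == original == "hello,")

# Assigning through the index is what actually updates the list.
for i, w in enumerate(word_list):
    word_list[i] = punctuation.sub("", w)

print(word_list)
```

Strings in Python are immutable, so sub could not modify the original in place even in principle; writing the result back by index (or building a new list) is the only way to get the cleaned words into word_list.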

Removing punctuation/numbers from text problem
