Question

I have tried several times, and several different ways, to remove extra punctuation from a string.

import string

class NLP:

    def __init__(self,sentence):

        self.sentence  = sentence.lower()

        self.tokenList = []


    #problem: the punctuation is still included in the word
    def tokenize(self, sentence):

        for word in sentence.split():
            self.tokenList.append(word)

            for i in string.punctuation:
                if(i in word):
                    word.strip(i)
                    self.tokenList.append(i)

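A quick explanation of the code: it is supposed to split out every word and punctuation mark and store them in a list. However, when a punctuation mark sits right next to a word, it stays attached to that word. Here is an example where the comma is still joined to the word "hello":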

['hello,' , ',' , 'my' , 'name' , 'is' , 'freddy']
      #^
     #there's the problem

Answer 1

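Python strings are immutable. So word.strip(i) does not change word in place, as you seem to assume; instead, it returns a copy of word with the .strip(i) operation applied. That operation removes characters only from the two ends of the string, so it is not what you want anyway (unless you know the punctuation appears in a particular order).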
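A small check makes both points concrete. This is only an illustrative sketch; the sample strings are made up for the demonstration and are not from the question.

```python
word = "hello,"

# strip() returns a new string; the original object is left untouched.
result = word.strip(",")
print(word)    # hello,
print(result)  # hello

# strip() only looks at the two ends of the string, never the middle.
inner = "don't,"
print(inner.strip("'"))  # don't,  (the apostrophe is not at either end)
```

To keep the stripped value, the result has to be assigned back, for example word = word.strip(i).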

def tokenize(self, sentence):
    for word in sentence.split():
        punc = []
        for i in string.punctuation:
            howmany = word.count(i)
            if not howmany: continue
            word = word.replace(i, '')
            punc.extend(howmany*[i])
        self.tokenList.append(word)
        self.tokenList.extend(punc)

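This assumes it is acceptable for all the punctuation to appear, one item per character, after the cleaned-up word, regardless of where in the word the punctuation occurred.

For example, if sentence is (here), the list will be ['here', '(', ')'].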
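To try this outside the class, the same loop can be wrapped in a standalone function. This is a sketch under assumptions: the name tokenize_words and the returned list are hypothetical conveniences, not part of the original method, which appends to self.tokenList.

```python
import string

def tokenize_words(sentence):
    """Hypothetical standalone variant of the method above; returns a new list."""
    tokens = []
    for word in sentence.split():
        punc = []
        for ch in string.punctuation:
            count = word.count(ch)
            if not count:
                continue
            # Remove every occurrence of ch and remember how many there were.
            word = word.replace(ch, "")
            punc.extend(count * [ch])
        tokens.append(word)
        tokens.extend(punc)
    return tokens

print(tokenize_words("(here)"))          # ['here', '(', ')']
print(tokenize_words("hello, my name"))  # ['hello', ',', 'my', 'name']
```

One edge case worth knowing: a token made up only of punctuation leaves an empty string behind, so tokenize_words("a - b") gives ['a', '', '-', 'b'].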

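If there are stricter constraints on the ordering of items in the list, please edit your question to state them clearly, ideally with examples of the desired input and output!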

Answer 2

I would suggest a different approach:

import string
import itertools

def tokenize(s):
    tokens = []
    for k,v in itertools.groupby(s, lambda c: c in string.punctuation):
        tokens.extend("".join(v).split())
    return tokens

Test:

>>> tokenize("this is, a test, you know")
['this', 'is', ',', 'a', 'test', ',', 'you', 'know']
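To see why this works, it can help to look at the runs that itertools.groupby produces before they are split on whitespace. The helper name punctuation_runs below is hypothetical and exists only for this illustration.

```python
import itertools
import string

def punctuation_runs(s):
    """Return (is_punctuation, run_text) pairs for consecutive runs in s."""
    return [
        (is_punct, "".join(chars))
        for is_punct, chars in itertools.groupby(s, lambda c: c in string.punctuation)
    ]

print(punctuation_runs("this is, a test, you know"))
# [(False, 'this is'), (True, ','), (False, ' a test'), (True, ','), (False, ' you know')]

print(punctuation_runs("wait..."))
# [(False, 'wait'), (True, '...')]
```

Each non-punctuation run still contains spaces, which is why the answer calls split() on every run. Note that adjacent punctuation characters form a single run, so "..." comes out as one token rather than three.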

Python not removing characters from a string
