Question

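I'm trying to segment Chinese sentences with the Stanford Word Segmenter in Python, but the results currently include punctuation. I want the output without punctuation, only the words. What is the best way to do this? I tried Googling for an answer but found nothing.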

Answer 1

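I think it is better to remove the punctuation after the text has been segmented; I'm fairly sure the Stanford segmenter takes cues from punctuation while doing its job, so you don't want to strip it beforehand. The following works for UTF-8 text. For Chinese punctuation, use the Zhon library with a regular expression: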

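import re

import zhon.hanzi  # third-party package: pip install zhon

# Character class matching any Chinese punctuation mark listed by Zhon
h_regex = re.compile('[%s]' % zhon.hanzi.punctuation)
intxt = '我 喜欢 北京 ， 你 呢 ？'  # placeholder: your segmented text with punctuation
outtxt = h_regex.sub('', intxt)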
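
One side effect worth noting: deleting punctuation characters from space-delimited text leaves doubled or trailing spaces behind. The following is a minimal sketch of tidying that up, assuming the segmented text is a space-delimited string. To keep it runnable without third-party packages, `HANZI_PUNCT` is a short hypothetical stand-in for `zhon.hanzi.punctuation`, and `strip_and_tidy` is an illustrative helper name, not part of any library.

```python
import re

# Assumption: a small stand-in for zhon.hanzi.punctuation so this sketch
# needs only the standard library. It is NOT the full Zhon list.
HANZI_PUNCT = '，。！？、；：'

h_regex = re.compile('[%s]' % HANZI_PUNCT)


def strip_and_tidy(segmented):
    """Remove punctuation, then collapse the leftover runs of whitespace."""
    no_punct = h_regex.sub('', segmented)
    # split() with no arguments drops empty fields, so doubled and
    # trailing spaces disappear when the tokens are joined again.
    return ' '.join(no_punct.split())


sample = '我 喜欢 北京 ， 你 呢 ？'
print(strip_and_tidy(sample))  # 我 喜欢 北京 你 呢
```

In real use you would probably swap `HANZI_PUNCT` for `zhon.hanzi.punctuation`; the whitespace step stays the same.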

Depending on the text you are working with, you may also need to remove non-Chinese punctuation:

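import string

# Continues from the snippet above: reuses re and outtxt.
# re.escape is needed because string.punctuation contains ], \ and ^
p_regex = re.compile('[%s]' % re.escape(string.punctuation))
outtext2 = p_regex.sub('', outtxt)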

After that, you should be golden.
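
If you would rather avoid the extra dependency, a different technique (not the regex approach above) is to drop whole tokens that consist only of Unicode punctuation, using the standard library's `unicodedata` module. This is a rough sketch that assumes space-delimited segmenter output; `is_punct_token` and `words_only` are hypothetical helper names.

```python
import unicodedata


def is_punct_token(token):
    """Return True if every character is Unicode punctuation (category P*)."""
    return all(unicodedata.category(ch).startswith('P') for ch in token)


def words_only(segmented):
    """Split space-delimited text and keep only the non-punctuation tokens."""
    return [tok for tok in segmented.split() if not is_punct_token(tok)]


sample = '我 喜欢 Python ， 你 呢 ? !'
print(words_only(sample))  # ['我', '喜欢', 'Python', '你', '呢']
```

Note that characters such as `$`, `+`, or `~` fall under Unicode symbol categories (`S*`) rather than punctuation, so this check keeps them; pass `('P', 'S')` to `startswith` if you want them removed too.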
