Question

I want to take a string of text and build a tokenizer that splits it into words. I want my tokenizer to recognize email addresses, URLs, numbers, and punctuation.

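I used regular expressions and was able to write a pattern for each category. What I am still trying to figure out is how to split the words that do not follow any of these patterns into tokens. For example: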

patternpunctuation="[^\w\s]\s+"
patternnumber="[0-9]+"
patternmail="\w+@{1}[^\s<>()@]+"
patternurl="https?://[^\s<>()\"]+|www\.[^\s<>\"]+"

The string I want to tokenize is:

line=" John 32 Smith global@hotmail.com.gr ddfdwww.google.com        fdfdhttp://google.com/index/agroup.html peter murphy alexis xronis 54^ & ^ & ^ & % % $ % % ^ ^ ! 68! @ @ # https://facebook.com.edu  re@dfdffe.gov.gr ! @ ^ "

Answer 1

You probably don't want to use word_tokenize, because it gets all the email addresses wrong:

>>> line=" John 32 Smith global@hotmail.com.gr ddfdwww.google.com        fdfdhttp://google.com/index/agroup.html peter murphy alexis xronis 54^ & ^ & ^ & % % $ % % ^ ^ ! 68! @ @ # https://facebook.com.edu  re@dfdffe.gov.gr ! @ ^ "
>>> from nltk import word_tokenize
>>> word_tokenize(line)
['John', '32', 'Smith', 'global', '@', 'hotmail.com.gr', 'ddfdwww.google.com', 'fdfdhttp', ':', '//google.com/index/agroup.html', 'peter', 'murphy', 'alexis', 'xronis', '54^', '&', '^', '&', '^', '&', '%', '%', '$', '%', '%', '^', '^', '!', '68', '!', '@', '@', '#', 'https', ':', '//facebook.com.edu', 're', '@', 'dfdffe.gov.gr', '!', '@', '^']

If everything were separated by whitespace, a plain str.split(), i.e. line.split(), would be enough:

['John', '32', 'Smith', 'global@hotmail.com.gr', 'ddfdwww.google.com', 'fdfdhttp://google.com/index/agroup.html', 'peter', 'murphy', 'alexis', 'xronis', '54^', '&', '^', '&', '^', '&', '%', '%', '$', '%', '%', '^', '^', '!', '68!', '@', '@', '#', 'https://facebook.com.edu', 're@dfdffe.gov.gr', '!', '@', '^']
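Once the string has been split on whitespace, each piece can be labelled with the categories from the question. The following is only a minimal sketch: the patterns are adapted from the question, while the names PATTERNS, classify and sample are made up for illustration. Note that glued or mixed pieces such as '68!' or 'ddfdwww.google.com' fall through to "other".

```python
import re

# Patterns adapted from the question; labels and names here are illustrative.
PATTERNS = [
    ("email", re.compile(r"\w+@[^\s<>()@]+")),
    ("url", re.compile(r'https?://[^\s<>()"]+|www\.[^\s<>"]+')),
    ("number", re.compile(r"[0-9]+")),
]


def classify(token):
    """Return the label of the first pattern that matches the whole token."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "other"


# A shortened version of the string from the question.
sample = "John 32 global@hotmail.com.gr https://facebook.com.edu 68! ddfdwww.google.com"
labelled = [(tok, classify(tok)) for tok in sample.split()]
```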

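But recognizing words that are glued together is not easy...

You can try the following tricks to pull out the http, www and e@mail.whatever parts, but note that you will need to adapt the regexes to fit your data.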

import re

# i is assumed to be one whitespace-separated token, e.g. an item of line.split()
re.findall(r'www\.[a-z].*\.com', i)  # www
re.findall(r'http:[a-z/].*\.html', i)  # http
re.findall(r'https:[a-z/].*', i)  # https
re.findall(r'[a-z].*@[a-z].*\.gov\.[a-z].*', i)  # xxx@xxx.gov.xxx
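
A different way to attack the glued-together words is to combine all categories into one alternation and scan the string once. This is only a sketch, not something nltk provides: the email, URL and number patterns come from the question, while TOKEN_RE, tokenize and the lazy "word" pattern (which stops in front of http://, https:// or www.) are assumptions made for this example. Underscores and non-ASCII letters are not covered and would be skipped silently.

```python
import re

# Alternatives are tried in order, so e-mail and URL come before plain words.
TOKEN_RE = re.compile(
    r'''
    (?P<email>\w+@[^\s<>()@]+)
  | (?P<url>https?://[^\s<>()"]+|www\.[^\s<>"]+)
  | (?P<number>[0-9]+)
  | (?P<word>[A-Za-z]+?(?=https?://|www\.|[^A-Za-z]|$))
  | (?P<punct>[^\w\s])
    ''',
    re.VERBOSE,
)


def tokenize(text):
    """Return (category, token) pairs in the order they appear in text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


line = " John 32 Smith global@hotmail.com.gr ddfdwww.google.com        fdfdhttp://google.com/index/agroup.html peter murphy alexis xronis 54^ & ^ & ^ & % % $ % % ^ ^ ! 68! @ @ # https://facebook.com.edu  re@dfdffe.gov.gr ! @ ^ "
pairs = tokenize(line)
tokens = [tok for _, tok in pairs]
```

With the sample string, 'ddfdwww.google.com' comes out as 'ddfd' followed by 'www.google.com', and '68!' as '68' followed by '!'. A URL followed directly by a full stop or comma would keep that character, so the patterns still need tuning for real data.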
