Tokenizing an array made up of strings

Time: 2017-08-22 14:15:08

Tags: python for-loop nltk tokenize

I have an array named allchats made up of long strings. Some positions in the array look like this:

allchats[5,0] = "Hi, have you ever seen something like that? no?"
allchats[106,0] = "some word blabla some more words yes"
allchats[410,0] = "I don't know how we will ever get through this..."

I would like to tokenize every string in the array. In addition, I want to use a regular-expression tool to strip out question marks, commas, and so on.

I tried the following:

import nltk
from nltk.tokenize import RegexpTokenizer

tknzr = RegexpTokenizer(r'\w+')
allchats1 = [[tknzr.tokenize(chat) for chat in str] for str in allchats]

I hope to end up with:

allchats[5,0] = ['Hi', 'have', 'you', 'ever', 'seen', 'something', 'like', 'that', 'no']
allchats[106,0] = ['some', 'word', 'blabla', 'some', 'more', 'words', 'yes']
allchats[410,0] = ['I', 'dont', 'know', 'how', 'we', 'will', 'ever', 'get', 'through', 'this']

I am fairly sure I am doing something wrong with the strings (str) in my for loop, but I cannot figure out what I need to correct to make it work.

Thanks in advance for your help!

1 answer:

Answer 0 (score: 0)

There is a typo in your list comprehension: it should not be a nested comprehension, but a single comprehension with chained for clauses:

allchats1 = [tknzr.tokenize(chat) for str in allchats for chat in str]
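The difference between the two comprehension shapes can be sketched with plain str.split() (no nltk needed): chained for clauses flatten every token into one list, while mapping over the outer list keeps one token list per chat, which is what the desired output above looks like.

```python
# Sketch contrasting the two comprehension shapes, using str.split()
# for brevity instead of an nltk tokenizer.
allchats = ["Hi there friend", "more words here"]

# Chained for clauses: one flat list of tokens across all chats
flat = [word for chat in allchats for word in chat.split()]

# Single mapping: one token list per chat
nested = [chat.split() for chat in allchats]

print(flat)    # ['Hi', 'there', 'friend', 'more', 'words', 'here']
print(nested)  # [['Hi', 'there', 'friend'], ['more', 'words', 'here']]
```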

Note, however, that iterating over a string yields characters, not words. If you want to iterate over words, you are looking for the str.split() method. So here is a fully working example:

allchats = ["Hi, have you ever seen something like that? no?", "some word blabla some more words yes", "I don't know how we will ever get through this..."]

def tokenize(word):
    # use real logic here
    return word + 'tokenized'

tokenized = [tokenize(word) for sentence in allchats for word in sentence.split()]

print(tokenized)

If you are not sure that the list contains only strings, and you want to process only the strings, you can check with isinstance (example here):

tokenized = [tokenize(word) for sentence in allchats if isinstance(sentence, str) for word in sentence.split()]
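Putting the pieces together, here is a standard-library sketch of the whole pipeline: re.findall(r'\w+', s) behaves like RegexpTokenizer(r'\w+') for this purpose, and mapping over the outer list keeps one token list per chat. Note that the \w+ pattern splits "don't" into 'don' and 't' rather than producing 'dont'.

```python
import re

# Sample strings taken from the question; allchats here is assumed to be
# a flat list of strings (adjust the indexing if yours is 2-D).
allchats = [
    "Hi, have you ever seen something like that? no?",
    "I don't know how we will ever get through this...",
]

# Keep only string entries, then extract word characters, dropping
# punctuation such as '?', ',' and '.'.
tokenized = [re.findall(r'\w+', chat)
             for chat in allchats if isinstance(chat, str)]

print(tokenized[0])
# ['Hi', 'have', 'you', 'ever', 'seen', 'something', 'like', 'that', 'no']
```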