I have an array called allchats, made up of long strings. Some entries in the array look like this:
allchats[5,0] = "Hi, have you ever seen something like that? no?"
allchats[106,0] = "some word blabla some more words yes"
allchats[410,0] = "I don't know how we will ever get through this..."
I would like to tokenize every string in the array. In addition, I would like to use a regex tool to eliminate question marks, commas, and so on.
I tried the following:
import nltk
from nltk.tokenize import RegexpTokenizer
tknzr = RegexpTokenizer(r'\w+')
allchats1 = [[tknzr.tokenize(chat) for chat in str] for str in allchats]
I would like to end up with:
allchats[5,0] = ['Hi', 'have', 'you', 'ever', 'seen', 'something', 'like', 'that', 'no']
allchats[106,0] = ['some', 'word', 'blabla', 'some', 'more', 'words', 'yes']
allchats[410,0] = ['I', 'dont', 'know', 'how', 'we', 'will', 'ever', 'get', 'through', 'this']
I'm fairly sure I'm doing something wrong with the strings (str) in the for loop, but I can't figure out what I need to correct to make this work.
Thanks in advance for your help!
Answer 0 (score: 0)
Your list comprehension has a bug: it should not build a nested list, but a chained (flattened) one:
allchats1 = [tknzr.tokenize(chat) for str in allchats for chat in str]
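For context: iterating directly over a string in Python yields single characters, so both the nested and the chained form above still feed one letter at a time to the tokenizer. A quick illustration in plain Python:
>>> [c for c in "yes no"]
['y', 'e', 's', ' ', 'n', 'o']
>>> "yes no".split()
['yes', 'no']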
If you want to iterate over words and not just over characters, you are looking for the str.split()
method. So here is a fully working example:
allchats = ["Hi, have you ever seen something like that? no?", "some word blabla some more words yes", "I don't know how we will ever get through this..."]
def tokenize(word):
    # use real logic here
    return word + 'tokenized'
tokenized = [tokenize(word) for sentence in allchats for word in sentence.split()]
print(tokenized)
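Running this prints one flat list of 'tokenized' words for all chats combined, not one list per chat as in the question's expected output:
['Hi,tokenized', 'havetokenized', 'youtokenized', 'evertokenized', ...]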
If you are not sure the list contains only strings and want to process only the strings, you can check with the built-in isinstance
function (example here):
tokenized = [tokenize(word) for sentence in allchats if isinstance(sentence, str) for word in sentence.split()]
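If the goal is instead to keep one token list per chat message, as in the expected output in the question, each string can be tokenized directly with NLTK's RegexpTokenizer rather than iterated character by character. A minimal sketch under that assumption (the \w+ pattern is taken from the question; note that \w+ alone splits "don't" into 'don' and 't', so apostrophes are stripped first to get 'dont'):
from nltk.tokenize import RegexpTokenizer

allchats = ["Hi, have you ever seen something like that? no?", "some word blabla some more words yes", "I don't know how we will ever get through this..."]

tknzr = RegexpTokenizer(r'\w+')  # keep runs of word characters, drop ? , . etc.
# Tokenize each whole chat string; remove apostrophes first so "don't" -> "dont"
allchats1 = [tknzr.tokenize(chat.replace("'", "")) for chat in allchats]
print(allchats1[0])  # ['Hi', 'have', 'you', 'ever', 'seen', 'something', 'like', 'that', 'no']
print(allchats1[2])  # ['I', 'dont', 'know', 'how', 'we', 'will', 'ever', 'get', 'through', 'this']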