I have a DataFrame (called corpus) with one column (tweet) and 2 rows:
['check, tihs, out, this, bear, love, jumping, on, this, plant']
['i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump']
I also have a list of the unique words in that column (called vocab):
['check',
'tihs',
'out',
'this',
'bear',
'love',
'jumping',
'on',
'plant',
'i',
'can',
't',
'the',
'noise',
'from',
'that',
'power',
'it',
'make',
'me',
'jump']
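I want to add a new column for each word in vocab. Every value in a new column should be 0, unless the tweet contains that word, in which case the value in that word's column should be 1.
So I tried running the following code: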
for word in vocab:
    corpus[word] = 0
    corpus.loc[corpus["tweet"].str.contains(word), word] = 1
... and it fails with the following error:
"None of [Float64Index([nan, nan], dtype='float64')] are in the [index]"
How can I check whether a tweet contains the word and, if so, set the value to 1?
Answer 0 (score: 1)
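Each entry in corpus['tweet'] is a list holding a single string, not a string itself, so .str.contains returns NaN for every row. You probably want something like this: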
# turn tweets into strings
corpus["tweet"] = [x[0] for x in corpus['tweet']]
# one-hot-encode
for word in vocab:
    corpus[word] = 0
    corpus.loc[corpus["tweet"].str.contains(word), word] = 1
But this may not be what you want, because contains matches any substring. For example, 'this girl goes to school' would get a 1 in both the is column and the this column.
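If you want to keep the loop, one way to avoid substring hits is to match on word boundaries. The following is a minimal sketch, not part of the original answer; the two sample sentences and the two-word vocab are made up for illustration, and it assumes the tweets are already plain strings.

```python
import re

import pandas as pd

# Hypothetical sample data: the first row is the example sentence above.
corpus = pd.DataFrame({"tweet": ["this girl goes to school", "that is all"]})
vocab = ["this", "is"]

for word in vocab:
    # \b anchors the match at word boundaries, so "is" no longer matches
    # inside "this"; re.escape protects words with regex metacharacters.
    pattern = rf"\b{re.escape(word)}\b"
    corpus[word] = corpus["tweet"].str.contains(pattern, regex=True).astype(int)

print(corpus)
```

With this pattern, the is column is set only for rows where "is" appears as a standalone word.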
Depending on your data, you could do the following instead:
corpus["tweet"] = [x[0] for x in corpus['tweet']]
corpus = corpus.join(corpus['tweet'].str.get_dummies(', ')
.reindex(vocab, axis=1, fill_value=0)
)
Answer 1 (score: 0)
This will do it:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

l = ['check, this, out, this, bear, love, jumping, on, this, plant',
     'i, can, t, bear, the, noise, from, that, power, plant, it, make, me, jump']
vect = CountVectorizer()
# Rows are tweets, columns are vocabulary words, values are word counts
X = pd.DataFrame(vect.fit_transform(l).toarray())
# get_feature_names() was removed in scikit-learn 1.2
X.columns = vect.get_feature_names_out()
Output:
bear can check from it jump ... out plant power that the this
0 1 0 1 0 0 0 ... 1 1 0 0 0 3
1 1 1 0 1 1 1 ... 0 1 1 1 1 0