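Suppose I have several documents and a df column containing specific words I need to search for. How can I count the number of times each word appears in each document?
An example makes this clearer.
Example: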
doc1 = "I am a cat that barks. I like dog food instead of cat food. Roff"
doc2 = "Frog that barks. Frog like cats."
df['words'] = ["dog","cat","frog"]
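I would like to end up with a df that holds the count of each word for each document.
My attempt looks like this, but I realized it just keeps looping over the same cell, so I always get zero.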
for i in range(len(doc)):
    for key, value in doc.items():
        for word in df['word']:
            df['doc_' + str(i)] = value.count(word)
Answer 0 (score: 0)
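import numpy as np
import pandas as pd

doc1 = "I am a cat that barks. I like dog food instead of cat food. Roff"
doc2 = "Frog that barks. Frog like cats."
strings = [doc1, doc2]
words = ["dog", "cat", "frog"]

def count_occ(word, sentence):
    # Case-insensitive count of exact whitespace-delimited tokens.
    # Punctuation stays attached, so "cats." does not match "cat".
    return sentence.lower().split().count(word)

def counts_df(strings, words):
    # Keep the accumulator local so repeated calls start from an empty list.
    cts = []
    for w in words:
        for s in strings:
            cts.append(count_occ(w, s))
    # Rows are words, columns are documents (doc1, doc2, ...).
    df = pd.DataFrame(np.array(cts).reshape((len(words), len(strings))),
                      index=words,
                      columns=['doc' + str(i) for i in range(1, len(strings) + 1)])
    return df

counts_df(strings, words)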
Out[61]:
      doc1  doc2
dog      1     0
cat      2     0
frog     0     2
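
One caveat with the approach above: split() leaves punctuation attached to tokens, so a search word that is directly followed by a period or comma (such as "food." in doc1) would not be counted. The following is a minimal sketch of a variant that tokenizes with a regular expression instead. It is not the method used above; the function name count_words is hypothetical, and it assumes that "a word" means a run of letters, digits, or underscores.

```python
import re
from collections import Counter

import pandas as pd


def count_words(docs, words):
    """Whole-word, case-insensitive counts (rows: words, columns: docs)."""
    columns = {}
    for i, text in enumerate(docs, start=1):
        # \w+ extracts runs of letters/digits/underscores, so "food." yields "food".
        tokens = Counter(re.findall(r"\w+", text.lower()))
        # A Counter returns 0 for words that never occur in the document.
        columns["doc" + str(i)] = [tokens[w.lower()] for w in words]
    return pd.DataFrame(columns, index=words)


doc1 = "I am a cat that barks. I like dog food instead of cat food. Roff"
doc2 = "Frog that barks. Frog like cats."
result = count_words([doc1, doc2], ["dog", "cat", "frog"])
print(result)
```

For the example documents this gives the same counts as the table above. Each column is built as a complete list before the DataFrame is created, which avoids the problem in the question, where assigning a single number to df['doc_' + str(i)] overwrites the whole column on every iteration.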