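What is the best way to use the tm library to compare a text against a reference list of positive words and return the number of positive-word occurrences? I would like to get the total count of positive words in the reference text.
Question: what is the best way to do this?
For example: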
positiveword_list <- c("happy", "great", "fabulous", "great")
Reference text:
exampleText <- c("ON A BRIGHT SPRING DAY in the year 1677, “the good ship
Kent,” Captain Gregory Marlowe, Master, set sail from the great docks of London. She carried 230 English Quakers, outward bound for a new home in British North America. As the ship dropped down the Thames she was hailed by King Charles II, who happened to be sailing on the river. The two vessels made a striking contrast. The King’s yacht was sleek and proud in gleaming paintwork, with small cannons peeping through wreaths of gold leaf, a wooden unicorn prancing high above her prow, and the royal arms emblazoned upon her stern. She seemed to dance upon the water— new sails shining white in the sun, flags streaming bravely from her mastheads, officers in brilliant uniform, ladies in court costume, servants in livery, musicians playing, and spaniels yapping. At the center of attention was the saturnine figure of the King himself in all his regal splendor. On the other side of the river came the emigrant ship. She would have been bluff-bowed and round-sided, with dirty sails and a salt-stained hull, and a single ensign drooping from its halyard. Her bulwarks were lined with apprehensive passengers— some dressed in the rough gray homespun of the northern Pen-nines, others in the brown drab of London tradesmen, several in the blue suits of servant-apprentices, and a few in the tattered motley of the country poor.")
Here is some background:
What I want to do is count the positive words and store the result in a data frame as a new column.
count <- length(which(lapply(positiveword_list, grepl, x = exampleText) == TRUE))
So:
dataFrameIn %>% mutate(posCount = length(which(lapply(positiveword_list, grepl, x = text) == TRUE)))
where text is a column in dataFrameIn (i.e. dataFrameIn$text).
Answer 0: (score: 1)
You can do this without the tm package.
Try this:
contained <- lapply(positiveword_list, grepl, x = exampleText)
lapply returns a list with one logical value per word.
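Because lapply returns a list, comparisons such as contained == T rely on implicit coercion. A minimal sketch of an alternative, assuming a shortened stand-in sentence (shortText is hypothetical, not the original exampleText), uses vapply to get a logical vector directly:

```r
positiveword_list <- c("happy", "great", "fabulous", "great")

# Hypothetical shortened text standing in for exampleText
shortText <- "Set sail from the great docks of London."

# vapply returns a logical vector rather than a list;
# fixed = TRUE treats each word as a literal string, not a regex
contained_vec <- vapply(positiveword_list, grepl, logical(1),
                        x = shortText, fixed = TRUE)

# Words found in the text, and how many list entries matched
present_words <- positiveword_list[contained_vec]
n_present <- sum(contained_vec)
```

Note that "great" appears twice in positiveword_list, so it is counted twice here; wrapping the list in unique() would avoid that.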
Words that are present:
>positiveword_list[contained == T]
"great" "great"
>length(contained[contained==T])
2
Words that are not present:
>positiveword_list[contained == F]
"happy" "fabulous"
>length(contained[contained==F])
2
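To store the count as a new data frame column, as the question asks, the same idea can be applied row by row. This is only a sketch in base R: the data frame contents and the helper count_present are made up for illustration, and it counts how many distinct list words are present in each text rather than total occurrences.

```r
positiveword_list <- c("happy", "great", "fabulous", "great")

# Hypothetical data frame standing in for dataFrameIn
dataFrameIn <- data.frame(
  text = c("a great day and a happy crew",
           "dirty sails and a salt-stained hull",
           "the great docks of London"),
  stringsAsFactors = FALSE
)

# Drop duplicate entries so "great" is not counted twice
pos_words <- unique(positiveword_list)

# Hypothetical helper: number of distinct positive words found in one text
count_present <- function(txt, words) {
  sum(vapply(words, grepl, logical(1), x = txt, fixed = TRUE))
}

# One count per row, stored as a new column
dataFrameIn$posCount <- vapply(dataFrameIn$text, count_present,
                               integer(1), words = pos_words,
                               USE.NAMES = FALSE)
```

Since grepl matches substrings, a word such as "greatly" would also count as a hit for "great" with this approach.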
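Answer 1: (score: 1)
Here is another approach using a purpose-built tool, in which you define a dictionary of positive words and apply it to any number of texts to count positive keywords. This uses the quanteda package and its dfm() method, which creates a document-feature matrix, together with the dictionary = argument. (See ?dictionary.)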
require(quanteda)
posDic <- dictionary(list(positive = positiveword_list))
myDfm <- dfm(exampleText, dictionary = posDic)
# Creating a dfm from a character vector ...
# ... lowercasing
# ... tokenizing
# ... indexing documents: 1 document
# ... indexing features: 157 feature types
# ... applying a dictionary consisting of 1 key
# ... created a 1 x 1 sparse dfm
# ... complete.
# Elapsed time: 0.014 seconds.
as.data.frame(myDfm)
# positive
# text1 1
# produces a data frame with the text and the positive count
cbind(text = exampleText, as.data.frame(myDfm))
Note: it may not matter for this example, but "great" as used in exampleText is not a positive word, which illustrates the dangers of polysemy and dictionaries.
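No word list can tell the senses of "great" apart, but matching whole tokens at least avoids accidental substring hits. The following is a rough base R approximation of the lowercase, tokenize, and look-up steps, not quanteda's implementation; sampleText is a made-up sentence.

```r
positiveword_list <- c("happy", "great", "fabulous", "great")

# Hypothetical sentence for illustration
sampleText <- "The Great ship left the great docks; the crew greatly missed home."

# Lowercase, then split on anything that is not a letter
tokens <- strsplit(tolower(sampleText), "[^a-z]+")[[1]]
tokens <- tokens[nzchar(tokens)]

# Count tokens that exactly match a dictionary word
pos_total <- sum(tokens %in% unique(positiveword_list))

# Substring matching, by contrast, also hits "greatly"
substring_hit <- grepl("great", "greatly", fixed = TRUE)
```

Here both "Great" and "great" are counted, while "greatly" is not, even though none of them necessarily expresses a positive sentiment.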