I have a dataset with a column that contains text, as shown below:
Column1
----------------------------------------------------------
dapagliflozin 10 MG / metFORMIN hydrochloride
dapagliflozin 5 MG / metFORMIN hydrochloride
Fortamet
Glucophage
Glumetza
metFORMIN hydrochloride
metFORMIN hydrochloride / pioglitazone 15 MG
metFORMIN hydrochloride / pioglitazone 30 MG
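I am trying to get a count for each unique word, e.g. a count for metFORMIN, a count for hydrochloride, and so on. I tried the table function, but it treats each whole row as a single word, which doesn't help.
Answer 0 (score: 2)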
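We can use a combination of strsplit/unlist/table. Use strsplit to split the column strings, specifying split as whitespace (\\s+). The output will be a list. Use unlist to turn the list into a vector, then use table to get the counts.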
table(unlist(strsplit(yourdf$Column1, '\\s+')))
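As a fuller sketch of the same strsplit/unlist/table approach, the example below rebuilds the sample data from the question and counts words case-insensitively. The lowercasing, the filter that drops tokens without letters (the "/" separators and dose numbers), and the names words and word_counts are assumptions added for illustration, not part of the original answer. The as.character() call is there because strsplit only accepts character input, and the column may have been read in as a factor.

```r
# Sample data reconstructed from the question
yourdf <- data.frame(
  Column1 = c(
    "dapagliflozin 10 MG / metFORMIN hydrochloride",
    "dapagliflozin 5 MG / metFORMIN hydrochloride",
    "Fortamet",
    "Glucophage",
    "Glumetza",
    "metFORMIN hydrochloride",
    "metFORMIN hydrochloride / pioglitazone 15 MG",
    "metFORMIN hydrochloride / pioglitazone 30 MG"
  )
)

# strsplit needs a character vector, so convert in case the column is a factor
pieces <- strsplit(as.character(yourdf$Column1), "\\s+")

# Flatten the list into one vector and lowercase it (assumption: counts
# should be case-insensitive, so "metFORMIN" and "metformin" are one word)
words <- tolower(unlist(pieces))

# Assumption: keep only tokens containing a letter, which drops "/" and doses
words <- words[grepl("[a-z]", words)]

# Count each unique word and sort from most to least frequent
word_counts <- sort(table(words), decreasing = TRUE)
print(word_counts)
```

If the separators and numbers should be counted as well, the grepl filter can simply be removed.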
Answer 1 (score: 1)
Or use a text analysis package designed for this:
> require(quanteda)
> dfm(myColumn)
Creating a dfm from a character vector ...
... lowercasing
... tokenizing
... indexing 1 document
... shaping tokens into data.table, found 21 total tokens
... summing tokens by document
... indexing 8 feature types
... building sparse matrix
... created a 1 x 8 sparse dfm
... complete. Elapsed time: 0.047 seconds.
Document-feature matrix of: 1 document, 8 features.
1 x 8 sparse Matrix of class "dfmSparse"
       features
docs    dapagliflozin fortamet glucophage glumetza hydrochloride metformin mg pioglitazone
  text1             2        1          1        1             5         5  4            2
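
The transcript above comes from an older quanteda release. In more recent releases, as far as I know, dfm() expects a tokens object rather than a raw character vector, so a rough equivalent might look like the sketch below. It assumes quanteda is installed; myColumn is rebuilt here from the question's data, and removing punctuation and numbers is a choice made to mirror the features shown above.

```r
library(quanteda)

# Sample data reconstructed from the question (one element per row)
myColumn <- c(
  "dapagliflozin 10 MG / metFORMIN hydrochloride",
  "dapagliflozin 5 MG / metFORMIN hydrochloride",
  "Fortamet",
  "Glucophage",
  "Glumetza",
  "metFORMIN hydrochloride",
  "metFORMIN hydrochloride / pioglitazone 15 MG",
  "metFORMIN hydrochloride / pioglitazone 30 MG"
)

# Tokenize first; dropping "/" and the dose numbers is an assumption
toks <- tokens(myColumn, remove_punct = TRUE, remove_numbers = TRUE)

# Build the document-feature matrix; dfm() lowercases features by default
dfmat <- dfm(toks)

# Total count of every feature across all rows, most frequent first
word_counts <- topfeatures(dfmat, n = nfeat(dfmat))
print(word_counts)
```

Each row becomes its own document here, so topfeatures() is used to sum the counts across documents.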