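Flagging a text dataset by specific words in the text in R

Time: 2015-11-03 08:19:00

Tags: r text-analysis

I am new to R, and I have a situation where I need to create new flag variables that are set to 1 when a specific word appears in the text. For example, given this data frame: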

Text                                        flag_USA    flag_Canada
Canada has 1.6% more total area                  0         1
USA has 0.7% more land                           1         0
USA has 4 times more arable land in total        1         0
Canada has 27.5% more forested and wooded land   0         1
USA has 26.9 times more irrigated land           1         0

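So I want to create a flag variable that is 1 when the text contains USA or Canada. Could you please help me with the code for this? Thanks in advance for your valuable suggestions.

3 answers:

Answer 0: (score: 1)

Use the grepl function: grepl returns TRUE if the pattern is found in the string and FALSE if it is not.

Your code might look like this: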

df$flag_USA    <- grepl("USA",    df$Text)
df$flag_Canada <- grepl("Canada", df$Text)

If you really need numbers rather than TRUE/FALSE, you can use as.integer to convert TRUE/FALSE to 1/0.

Answer 1: (score: 1)
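As a minimal sketch of the as.integer conversion described above: the data frame df below is rebuilt from the question's example rows (its construction is an assumption, since the question does not show how the data was created), and fixed = TRUE is an optional addition that treats the keyword as a literal string rather than a regular expression.

```r
# Rebuild the example data from the question (assumed construction)
df <- data.frame(
  Text = c("Canada has 1.6% more total area",
           "USA has 0.7% more land",
           "USA has 4 times more arable land in total",
           "Canada has 27.5% more forested and wooded land",
           "USA has 26.9 times more irrigated land"),
  stringsAsFactors = FALSE
)

# grepl() returns a logical vector; as.integer() maps TRUE/FALSE to 1/0
df$flag_USA    <- as.integer(grepl("USA",    df$Text, fixed = TRUE))
df$flag_Canada <- as.integer(grepl("Canada", df$Text, fixed = TRUE))

print(df)
```

Note that grepl is case-sensitive by default, so "usa" would not be flagged unless ignore.case = TRUE is used (which requires dropping fixed = TRUE).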

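We can also use regmatches to extract the words and convert them to binary columns with table. This is useful when there are many keywords and you do not want to repeat the grepl call for each one. Note that regmatches drops elements with no match, so this assumes every row contains one of the keywords.

df1[c('flag_USA', 'flag_Canada')] <- table(1:nrow(df1),
     factor(regmatches(df1$Text, regexpr('USA|Canada', df1$Text)),
            levels=c('USA', 'Canada')))
df1
#                                            Text flag_USA flag_Canada
#1                Canada has 1.6% more total area        0           1
#2                         USA has 0.7% more land        1           0
#3      USA has 4 times more arable land in total        1           0
#4 Canada has 27.5% more forested and wooded land        0           1
#5         USA has 26.9 times more irrigated land        1           0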
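For the many-keywords case, a different route from the regmatches/table approach in this answer is to loop grepl over a keyword vector. This is only a sketch under assumptions: df1 is rebuilt from the question's rows, and the keywords vector and the flag_ naming scheme are illustrative choices, not something the answer specifies. Unlike the regmatches version, it also tolerates rows that match none or several of the keywords.

```r
# Rebuild the example data from the question (assumed construction)
df1 <- data.frame(
  Text = c("Canada has 1.6% more total area",
           "USA has 0.7% more land",
           "USA has 4 times more arable land in total",
           "Canada has 27.5% more forested and wooded land",
           "USA has 26.9 times more irrigated land"),
  stringsAsFactors = FALSE
)

# Hypothetical keyword list; extend as needed
keywords <- c("USA", "Canada")

# One grepl() call per keyword, each converted to integer 0/1;
# the resulting list is assigned to one new column per keyword
df1[paste0("flag_", keywords)] <- lapply(keywords, function(k) {
  as.integer(grepl(k, df1$Text, fixed = TRUE))
})

print(df1)
```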

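Answer 2: (score: 0)

Applying a dictionary to texts is a perfect task for the quanteda package. First, you define a dictionary using patterns (matching can be done with "glob" matching, fixed patterns, or regular expressions; see ?dfm and ?applyDictionary). Then you apply it to the texts by creating a document-feature matrix with the dictionary argument of dfm().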

> txt <- c("Canada has 1.6% more total area",
+          "USA has 0.7% more land",
+          "USA has 4 times more arable land in total",
+          "Canada has 27.5% more forested and wooded land",
+          "USA has 26.9 times more irrigated land")
> require(quanteda)
> myDictionary <- dictionary(list(flag_USA = "USA", flag_Canada = "Canada"))
> dfm(txt, dictionary = myDictionary)
Creating a dfm from a character vector ...
   ... lowercasing
   ... tokenizing
   ... indexing documents: 5 documents
   ... indexing features: 14 feature types
   ... applying a dictionary consisting of 2 keys
   ... created a 5 x 2 sparse dfm
   ... complete. 
Elapsed time: 0.014 seconds.
Document-feature matrix of: 5 documents, 2 features.
5 x 2 sparse Matrix of class "dfmSparse"
       features
docs    flag_USA flag_Canada
  text1        0           1
  text2        1           0
  text3        1           0
  text4        0           1
  text5        1           0