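Text mining using the grep function

Time: 2018-05-10 05:37:07

Tags: r twitter grep text-mining

I'm having trouble scoring my data. The dataset is shown below; text holds the tweets I want to run text mining and sentiment analysis on.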

text                                                 call    bills    location
-the bill was not generated                          0       bill     0
-tried to raise the complaint                        0       0        0
-the location update failed                          0       0        location
-the call drop has increased in my location          call    0        location
-nobody in the location received bill,so call ASAP   call    bill     location

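This is DUMMY DATA, where text is the column I'm trying to text-mine. I use the grepl function in R to create the category columns (e.g. bill, call, location): if a row's text contains "bill", the category name is written into the bill column, and likewise for all the other categories.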

vdftweet$app = ifelse(grepl('app',tolower(vdftweet$text)),'app',0)
table(vdftweet$app)
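
The same pattern extends to several categories at once. The following is only a minimal sketch: the data frame is a hypothetical stand-in for vdftweet built from the dummy rows above, and the categories vector is an assumed name, not something from the real data.

```r
# Hypothetical stand-in for vdftweet, built from the dummy rows above
vdftweet <- data.frame(
  text = c("-the bill was not generated",
           "-tried to raise the complaint",
           "-the location update failed",
           "-the call drop has increased in my location",
           "-nobody in the location received bill,so call ASAP"),
  stringsAsFactors = FALSE
)

# Assumed category list; each entry is used as a plain substring
categories <- c("call", "bill", "location")

# One column per category: the category name on a match, "0" otherwise
for (cat_name in categories) {
  hit <- grepl(cat_name, tolower(vdftweet$text), fixed = TRUE)
  vdftweet[[cat_name]] <- ifelse(hit, cat_name, "0")
}

table(vdftweet$call)
```

Writing "0" as a string keeps every column of type character, even when a category never matches.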

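Now, the part I can't figure out is this:

I want to create a new column "category_name" that gives, for each row, the name of the category it belongs to. If a tweet is tagged with more than 3 categories, it should be marked as "others"; otherwise it should get the category name.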

1 Answer:

Answer 0: (score: 1)

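There are two ways to do this with the tidyverse package. In the first, mutate adds the category names as columns to the text data.frame, similar to what you already have. gather then converts this to key-value format, where the categories are the values in the category_name column.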

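The alternative is to go directly to the key-value format, where the categories are the values in the category_name column. Rows that fall into multiple categories are repeated. If you don't need the first form, with categories as column names, the alternative is more flexible for adding new categories and requires less processing.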

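In both approaches, str_match is given a regular expression that matches a category in the text. The patterns here are trivial, but more complex patterns can be used if needed.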
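
As an illustration of what a more complex pattern might look like, here is a small sketch using word boundaries, so that "bill" does not match inside "billing" and "call" does not match inside "recall". The sample tweets and the variable names are made up for this example; str_detect is used only to show which rows match.

```r
library(stringr)

# Made-up sample tweets, not taken from the question's data
tweets <- c("the bill was not generated",
            "billing was delayed",
            "please recall the update",
            "two bills arrived late")

# Plain substring patterns also hit "billing" and "recall"
loose_bill <- str_detect(tweets, "bill")
loose_call <- str_detect(tweets, "call")

# Word-boundary patterns only match "bill"/"bills" and "call"/"calls" as whole words
bill_hits <- str_detect(tweets, "\\bbills?\\b")
call_hits <- str_detect(tweets, "\\bcalls?\\b")
```

Whether the stricter patterns are wanted depends on the data; for tweets with many abbreviations the plain substring match may actually be the better fit.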

The code is as follows:

library(tidyverse)
#
# read dummy data into data frame
#
   dummy_dat <- read.table(header = TRUE,stringsAsFactors = FALSE, 
                      strip.white=TRUE, sep="\n",
          text= "text
            -the bill was not generated
          -tried to raise the complaint
          -the location update failed
          -the call drop has increased in my location
          -nobody in the location received bill,so call ASAP")
#
#  form data frame with categories as columns
#
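   # str_match returns a matrix; [, 1] keeps the full match as a character vector.
   # Referring to text (not .$text) makes the match use the lower-cased column.
   dummy_cats <-  dummy_dat %>% mutate(text = tolower(text),
                               bill = str_match(text, pattern="bill")[, 1], 
                               call = str_match(text, pattern="call")[, 1], 
                               location = str_match(text, pattern="location")[, 1],
                               other = ifelse(is.na(bill) & is.na(call) &
                                              is.na(location), "other",NA))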
#
#  convert categories as columns to key-value format
#  with categories as values in category_name column
#

   dummy_cat_name <- dummy_cats %>% 
               gather(key = type, value=category_name, -text,na.rm = TRUE) %>%
               select(-type) 

#
#---------------------------------------------------------------------------
#
#  ALTERNATIVE:  go directly from text data to key-value format with categories
#  as values under category_name
#  Rows are repeated if they fall into multiple categories
#  Rows with no categories are put in category other
#
   dummy_dat <- dummy_dat %>% mutate(text=tolower(text))
   dummy_cat_name1 <- data.frame(text = NULL, category_name =NULL)
   for( category in c("bill", "call", "location")) {
      # str_match returns a matrix; [, 1] keeps the full match as a character vector
      temp <-  dummy_dat %>% mutate(category_name = str_match(text, pattern=category)[, 1]) %>% na.omit() 
      dummy_cat_name1 <- dummy_cat_name1 %>% bind_rows(temp) 
    }
    dummy_cat_name1 <- left_join(dummy_dat, dummy_cat_name1, by = "text") %>%
               mutate(category_name = ifelse(is.na(category_name), "other", category_name))

The result is

 dummy_cat_name1
                                            text      category_name
                            -the bill was not generated          bill
                          -tried to raise the complaint         other
                            -the location update failed      location
            -the call drop has increased in my location          call
            -the call drop has increased in my location      location
     -nobody in the location received bill,so call asap          bill
     -nobody in the location received bill,so call asap          call
     -nobody in the location received bill,so call asap      location
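
If a single category_name per tweet is needed, as the question asks, the key-value result can be collapsed back to one row per tweet. The following is a rough sketch under stated assumptions: the data frame is a hand-typed copy of the result above, and max_categories is a hypothetical threshold. The question says "more than 3", but it is set to 2 here only so that the dummy data actually triggers the "others" label.

```r
library(dplyr)

# Hand-typed copy of the key-value result shown above
dummy_cat_name1 <- data.frame(
  text = c("-the bill was not generated",
           "-tried to raise the complaint",
           "-the location update failed",
           "-the call drop has increased in my location",
           "-the call drop has increased in my location",
           "-nobody in the location received bill,so call asap",
           "-nobody in the location received bill,so call asap",
           "-nobody in the location received bill,so call asap"),
  category_name = c("bill", "other", "location", "call", "location",
                    "bill", "call", "location"),
  stringsAsFactors = FALSE
)

# Assumed threshold: tweets matching more categories than this become "others"
max_categories <- 2

one_per_tweet <- dummy_cat_name1 %>%
  group_by(text) %>%
  summarise(
    # count real categories; the "other" placeholder does not count
    n_categories = sum(category_name != "other"),
    matched = paste(category_name, collapse = ", "),
    .groups = "drop"
  ) %>%
  mutate(category_name = ifelse(n_categories > max_categories, "others", matched)) %>%
  select(text, category_name)
```

Note that group_by sorts the rows by text, so the tweets may come back in a different order than in the input.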