Error when removing regex characters, splitting text into paragraphs, and then applying ifelse in R

Time: 2018-06-19 08:50:32

Tags: r dplyr tidyr tidyverse tidytext

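I am struggling to remove regex characters, split text into paragraphs, and then apply ifelse to a data frame. I look forward to your help. Thank you.

I want to search for words in the first paragraph of each text in the data frame. I have a list of words that I want to search for. If a word occurs, the value should be 1, otherwise 0.

The table is below.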

data<-structure(list(ID = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2"), class = "factor"), 
    Text = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017  he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.", 
    "\\n\\t\\t\\t\\t\\t \\n \\n   The soccer world cup is entralling. \\nEveryone  acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
    ), class = "factor")), .Names = c("ID", "Text"), row.names = c(NA, 
-15L), class = "data.frame")

For each entry in the Text column, I am searching for the following words:

library(stringr)
library(stringi)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(dplyr)
words<-c("field", "ocean", "glamor showcases")

I tried the following:

Removing the unwanted regex characters.

When I try to remove "\t" and "\n", I get the following error:

data1<-data %>% mutate(Text=gsub("\\t",Text,""))
  

Warning message: In gsub("\\t", Text, "") : argument 'replacement' has length > 1 and only the first element will be used
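The warning comes from the argument order: gsub() expects pattern, replacement, x, so in the call above Text was passed as the replacement. Below is a minimal sketch of a corrected call. It assumes the tab and newline markers are stored as literal two-character sequences (a backslash followed by t or n), as the dput output suggests; the vector `text` is a hypothetical stand-in for data$Text.

```r
# Hypothetical stand-in for data$Text.
# In R source, "\\t" is a literal backslash followed by "t".
text <- c("\\n\\t\\tPublication Date\\n", "a\\tb")

# gsub(pattern, replacement, x): the replacement comes second, the text third.
# fixed = TRUE treats the pattern as a literal string rather than a regex.
no_tabs <- gsub("\\t", "", text, fixed = TRUE)

# Remove the literal "\n" and "\t" sequences in one pass with a regex.
# "\\\\" in R source is a regex-escaped single backslash.
cleaned <- gsub("\\\\[nt]", "", text)
print(cleaned)
```

Inside a dplyr pipeline the same call would go in mutate(), for example mutate(Text = gsub("\\\\[nt]", "", Text)).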

Splitting by paragraph

data1<-data %>% mutate(Text2=Text) %>% unnest_tokens("Text3",Text2,token="paragraphs")
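For comparison, the same kind of paragraph split can be sketched in base R. This is only an illustration under the assumption that paragraphs are separated by a blank line ("\n\n"); the data frame `docs` and the column name `paragraph` are hypothetical.

```r
# Hypothetical two-document example; a blank line ("\n\n") separates paragraphs.
docs <- data.frame(ID = c("1", "2"),
                   Text = c("First par.\n\nSecond par.", "Only one par."),
                   stringsAsFactors = FALSE)

# Split each text on the literal paragraph break.
pars <- strsplit(docs$Text, "\n\n", fixed = TRUE)

# One row per paragraph, with a running paragraph index within each document.
long <- data.frame(
  ID = rep(docs$ID, lengths(pars)),
  paragraph = sequence(lengths(pars)),
  Text3 = unlist(pars),
  stringsAsFactors = FALSE
)
print(long)
```

Filtering on paragraph == 1 then leaves only the first paragraph of each document.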

Assigning 1 if a word is present and 0 otherwise. The final table should look like this:

finaldata<-structure(list(ID = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2"), class = "factor"), 
    Text = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017  he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.", 
    "\\n\\t\\t\\t\\t\\t \\n \\n   The soccer world cup is entralling. \\nEveryone  acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
    ), class = "factor"), field = structure(c(2L, 3L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "0", "1"), class = "factor"), country = structure(c(3L, 2L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "0", "1"), class = "factor"), glamor.showcases = structure(c(2L, 
    3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "0", "1"), class = "factor")), .Names = c("ID", "Text", "field", 
"country", "glamor.showcases"), row.names = c(NA, -15L), class = "data.frame")

Any help would be greatly appreciated. Thank you.

I have looked at the following resources:

  1. Count word occurrences in R

  2. How to find that a word/words in a column is present in another column consisting a sentence [duplicate]

  3. Split by paragraph in R

  4. Split text file into paragraph files in R

1 answer:

Answer 0: (score: 1)

You can try this, assuming that a new paragraph in df$Text starts with \n\n:

#search df$Text to find if it contains strings present in 'words' vector in its first paragraph
words_df <- do.call(cbind, lapply(words, function(x) 
  as.numeric(grepl(x, gsub("\n\n.*$", "", df$Text), ignore.case = TRUE))))
colnames(words_df) <- words

#above outcome is combined with original dataframe to have the final result
final_df <- cbind(df, words_df)

which gives

> final_df[, -(1:2)]
  field country glamor showcases
1     0       1                0
2     1       0                1
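The one-liner above can also be unpacked into separate steps, which makes it easier to see where ifelse fits. This is a sketch on hypothetical data (`texts`, `first_par`, `flags`, and `result` are illustrative names), again assuming that "\n\n" marks a paragraph break.

```r
# Hypothetical texts; only the part before "\n\n" counts as the first paragraph.
texts <- c("Not just in one country.\n\nIt is in the ocean.",
           "Each other on the field. The glamor showcases.\n\nMore text.")
words <- c("field", "country", "ocean")

# Step 1: drop everything from the first paragraph break onward.
first_par <- sub("\n\n.*$", "", texts)

# Step 2: one 0/1 column per word, testing only the first paragraph.
flags <- sapply(words, function(w) {
  ifelse(grepl(w, first_par, ignore.case = TRUE), 1L, 0L)
})

# Step 3: attach the indicator columns to the original texts.
result <- cbind(data.frame(Text = texts, stringsAsFactors = FALSE), flags)
print(result[, words])
```

Here "ocean" is 0 for both rows because it appears only after the paragraph break in the first text.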


Sample data:

df <- structure(list(ID = structure(2:3, .Label = c("", "1", "2"), class = "factor"), 
    Text = structure(2:3, .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017  he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.", 
    "\\n\\t\\t\\t\\t\\t \\n \\n   The soccer world cup is entralling. \\nEveryone  acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
    ), class = "factor")), .Names = c("ID", "Text"), row.names = 1:2, class = "data.frame")

words<-c("field", "country", "glamor showcases")