Question

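My goal is to extract specific sections from a set of Word documents by keyword. I am having trouble parsing specific sections of text out of a larger dataset of text files. The dataset originally looked like this, with "title one" and "title two" marking the beginning and end of the text I am interested in, and "unimportant words" marking the part of the text file I am not interested in: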

**Text**           **Text File** 
title one           Text file 1
sentence one        Text file 1
sentence two        Text file 1
title two           Text file 1
unimportant words   Text file 1
title one           Text file 2
sentence one        Text file 2

I then converted the data to character using as.character and tidied it with unnest_tokens:

df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
tidy_df <- df %>% unnest_tokens(word, Text, token = "words")

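I now want to look only at the sentences in the dataset and exclude the unimportant words. Title one and title two are the same in every text file, but the sentences between them differ. I tried the code below, but it does not seem to work.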

filtered_resume <- lapply(tidy_resume, (tidy_resume %>% select(Name) %>% filter(title:two)))

Answer 1

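If you want a tidyverse option that involves very few lines of code, take a look at this. You can use case_when() and str_detect() to find the rows in your data frame that contain the important/unimportant signals.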
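
The flagging step described here could look roughly like the following. This is a minimal sketch, not the answer's original code: the column name `important` and the exact patterns "title one" and "title two" are assumptions based on the sample data in the question.

```r
library(dplyr)
library(stringr)

# Sample data as described in the question
df <- data.frame(
  Text = c("title one", "sentence one", "sentence two", "title two",
           "unimportant words", "title one", "sentence one"),
  File = c(rep("Text file 1", 5), rep("Text file 2", 2)),
  stringsAsFactors = FALSE
)

# Mark the start marker as TRUE and the end marker as FALSE.
# Rows that match neither pattern are left as NA.
# 'important' is a hypothetical column name chosen for this sketch.
flagged <- df %>%
  mutate(important = case_when(
    str_detect(Text, "title one") ~ TRUE,
    str_detect(Text, "title two") ~ FALSE
  ))
```

Only the two title rows carry a value at this stage; every other row is NA until the values are filled in.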


Now you can use fill() from tidyr to fill those values down.


Created on 2018-08-14 by the reprex package (v0.2.0).

At this point, you can filter() to keep only the text you want, and then use the functions from tidytext to do text mining on the important text that remains.
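
The fill and filter steps might be sketched as below. It starts from a data frame that already has a logical flag on the title rows; the column name `important`, the grouping by `File`, and the choice to drop the "title one" row itself are assumptions made for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)

# A data frame that already carries a flag on the title rows
# ('important' is a hypothetical column name; NA means "not yet decided")
flagged <- data.frame(
  Text = c("title one", "sentence one", "sentence two", "title two",
           "unimportant words", "title one", "sentence one"),
  File = c(rep("Text file 1", 5), rep("Text file 2", 2)),
  important = c(TRUE, NA, NA, FALSE, NA, TRUE, NA),
  stringsAsFactors = FALSE
)

important_text <- flagged %>%
  group_by(File) %>%                        # do not carry a flag across files
  fill(important, .direction = "down") %>%  # propagate the last seen flag
  ungroup() %>%
  # keep flagged rows, but drop the start marker itself
  filter(important, !str_detect(Text, "title one")) %>%
  select(-important)

# Tokenize only the text that was kept
tidy_words <- important_text %>%
  unnest_tokens(word, Text)
```

Grouping by `File` before filling is a precaution: without it, a file that does not begin with "title one" would inherit the flag from the end of the previous file.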

Answer 2

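I'm not familiar with the tidytext package, so here is an alternative base R solution. It uses this extended example data (the code to create it is at the bottom):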

> df
                Text        File
1          title one Text file 1
2       sentence one Text file 1
3       sentence two Text file 1
4          title two Text file 1
5  unimportant words Text file 1
6          title one Text file 2
7       sentence one Text file 2
8       sentence two Text file 2
9     sentence three Text file 2
10         title two Text file 2
11 unimportant words Text file 2

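Write a function that builds a separate column, based on the values in the Text column, indicating whether a given row should be kept or dropped. Details are in the comments: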

get_important_sentences <- function(df_) {
  # Create some variables for filtering. val starts at 0 so
  # that rows before the first 'title one' are not kept
  val = 0
  keep = c()

  # For every text row
  for (x in df_$Text) {
    # Multiply the current val by 2
    val = val * 2

    # If the current text includes "title",
    # set val to 1 for 'title one', and to 0
    # for 'title two'
    if (grepl("title", x)) {
      val = ifelse(grepl("one", x), 1, 0)
    }

    # append val to keep each time
    keep = c(keep, val)
  }

  # keep is now a numeric vector- add it to
  # the data frame
  df_$keep = keep

  # exclude any rows where 'keep' is 1 (for
  # 'title one') or 0 (for 'title two' or any
  # unimportant words). Also, drop the
  # 'keep' column
  return(df_[df_$keep > 1, c("Text", "File")])
}

Then you can call it on the whole data frame:

> get_important_sentences(df)
            Text        File
2   sentence one Text file 1
3   sentence two Text file 1
7   sentence one Text file 2
8   sentence two Text file 2
9 sentence three Text file 2

Or on a per-file basis, using lapply:

> lapply(split(df, df$File), get_important_sentences)
$`Text file 1`
          Text        File
2 sentence one Text file 1
3 sentence two Text file 1

$`Text file 2`
            Text        File
7   sentence one Text file 2
8   sentence two Text file 2
9 sentence three Text file 2
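
The same idea can also be written without a loop. The following is a minimal sketch of a different technique (running counts with cumsum), not the function above; the name `keep_between_titles` is hypothetical, and it assumes the markers appear as the exact strings "title one" and "title two".

```r
# Example data matching the extended sample shown in this answer
df <- data.frame(
  Text = c("title one", "sentence one", "sentence two", "title two",
           "unimportant words", "title one", "sentence one",
           "sentence two", "sentence three", "title two",
           "unimportant words"),
  File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a row is "inside" a section when more start
# markers than end markers have been seen so far
keep_between_titles <- function(df_) {
  starts <- cumsum(df_$Text == "title one")
  ends <- cumsum(df_$Text == "title two")
  inside <- starts > ends & df_$Text != "title one"
  df_[inside, c("Text", "File")]
}

res <- keep_between_titles(df)
per_file <- lapply(split(df, df$File), keep_between_titles)
```

Because the counts are cumulative, rows that come before the first "title one" are dropped as well.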

Data:

df <-
  data.frame(
    Text = c(
      "title one",
      "sentence one",
      "sentence two",
      "title two",
      "unimportant words",
      "title one",
      "sentence one",
      "sentence two",
      "sentence three",
      "title two",
      "unimportant words"
    ),
    File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
    stringsAsFactors = FALSE
  )

How do I parse specific parts of text?
