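My goal is to extract specific sections from a set of Word documents based on keywords. I'm having trouble parsing specific parts of the text out of a larger dataset of text files. The dataset originally looks like this, with "title one" and "title two" marking the beginning and end of the text I'm interested in, and "unimportant words" marking the part of the text file I'm not interested in: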
**Text**             **Text File**
title one            Text file 1
sentence one         Text file 1
sentence two         Text file 1
title two            Text file 1
unimportant words    Text file 1
title one            Text file 2
sentence one         Text file 2
I then converted the data to character with as.character and tidied it with unnest_tokens:
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
tidy_df <- df %>% unnest_tokens(word, Text, token = "words")
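I now want to look only at the sentences in the dataset and exclude the unimportant words. Title one and title two are the same in every text file, but the sentences between them differ. I tried the code below, but it doesn't seem to work.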
filtered_resume <- lapply(tidy_resume, (tidy_resume %>% select(Name) %>% filter(title:two)))
Answer 0 (score: 1)
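If you want a tidyverse option that involves very few lines of code, take a look at this. You can use case_when() and str_detect() to find the rows in your data frame that contain the signals for important/unimportant text.
Now you can use fill() from tidyr to fill those values down.
Created on 2018-08-14 by the reprex package (v0.2.0).
At this point, you can filter to keep only the text you want, and then use functions from tidytext to do text mining on the important text you have left.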
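A minimal sketch of the approach this answer describes, since the answer's own code is not reproduced here. The column name `important`, the `group_by(File)` step, and the final `filter()` call are assumptions made for illustration. The sample data mirrors the question's table.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Sample data mirroring the question's table
df <- data.frame(
  Text = c("title one", "sentence one", "sentence two", "title two",
           "unimportant words", "title one", "sentence one"),
  File = c(rep("Text file 1", 5), rep("Text file 2", 2)),
  stringsAsFactors = FALSE
)

flagged <- df %>%
  # Flag the marker rows: TRUE opens a section, FALSE closes it,
  # and every other row is left as NA for now
  mutate(important = case_when(
    str_detect(Text, "title one") ~ TRUE,
    str_detect(Text, "title two") ~ FALSE
  )) %>%
  # Assumption: fill within each file so flags never leak across files
  group_by(File) %>%
  fill(important, .direction = "down") %>%
  ungroup()

# Keep rows inside a section, dropping the title rows themselves
important_text <- flagged %>%
  filter(important, !str_detect(Text, "^title"))

print(important_text)
```

The filtering happens before any call to unnest_tokens() because the "title one" and "title two" markers are only recognizable while each row still holds a full line of text.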
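Answer 1 (score: 0)
I'm not familiar with the tidytext package, so here is an alternative base R solution. It uses this expanded example data (the code to create it is at the bottom):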
> df
Text File
1 title one Text file 1
2 sentence one Text file 1
3 sentence two Text file 1
4 title two Text file 1
5 unimportant words Text file 1
6 title one Text file 2
7 sentence one Text file 2
8 sentence two Text file 2
9 sentence three Text file 2
10 title two Text file 2
11 unimportant words Text file 2
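Write a function that builds a separate column, based on the values in the Text column, indicating whether a given row should be kept or dropped. Details are in the comments: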
get_important_sentences <- function(df_) {
  # Create some variables for filtering. val starts at 0 so that
  # any rows before the first 'title one' are dropped as well.
  val = 0
  keep = c()
  # For every text row
  for (x in df_$Text) {
    # Multiply the current val by 2
    val = val * 2
    # If the current text includes "title",
    # set val to 1 for 'title one', and to 0
    # for 'title two'
    if (grepl("title", x)) {
      val = ifelse(grepl("one", x), 1, 0)
    }
    # Append val to keep each time
    keep = c(keep, val)
  }
  # keep is now a numeric vector; add it to
  # the data frame
  df_$keep = keep
  # Exclude any rows where 'keep' is 1 (for
  # 'title one') or 0 (for 'title two' or any
  # unimportant words). Also, drop the 'keep'
  # column by selecting only Text and File.
  return(df_[df_$keep > 1, c("Text", "File")])
}
You can then call it on the whole data frame:
> get_important_sentences(df)
Text File
2 sentence one Text file 1
3 sentence two Text file 1
7 sentence one Text file 2
8 sentence two Text file 2
9 sentence three Text file 2
Or call it separately for each source file using lapply:
> lapply(split(df, df$File), get_important_sentences)
$`Text file 1`
Text File
2 sentence one Text file 1
3 sentence two Text file 1
$`Text file 2`
Text File
7 sentence one Text file 2
8 sentence two Text file 2
9 sentence three Text file 2
Data:
df <-
  data.frame(
    Text = c(
      "title one",
      "sentence one",
      "sentence two",
      "title two",
      "unimportant words",
      "title one",
      "sentence one",
      "sentence two",
      "sentence three",
      "title two",
      "unimportant words"
    ),
    File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
    stringsAsFactors = FALSE
  )
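The same row selection can also be sketched without an explicit loop, as an alternative to the function above rather than a replacement for it. The function name `keep_between_titles` is hypothetical. The sketch assumes that "title one" and "title two" strictly alternate, starting with "title one".

```r
# Same example data as above, written compactly
df <- data.frame(
  Text = c("title one", "sentence one", "sentence two", "title two",
           "unimportant words", "title one", "sentence one",
           "sentence two", "sentence three", "title two",
           "unimportant words"),
  File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows between 'title one' and 'title two'
keep_between_titles <- function(df_) {
  is_start <- grepl("title one", df_$Text, fixed = TRUE)
  is_end <- grepl("title two", df_$Text, fixed = TRUE)
  # +1 at each start marker and -1 at each end marker; the running
  # total is 1 from a 'title one' row up to the row before 'title two'
  inside <- cumsum(is_start - is_end) == 1
  # Drop the 'title one' rows themselves and keep the original columns
  df_[inside & !is_start, c("Text", "File")]
}

result <- keep_between_titles(df)
print(result)
```

Because the running total resets at every "title two", this works on the whole data frame at once, and it can equally be applied per file with lapply(split(df, df$File), keep_between_titles).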