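I need help extracting the sentences that contain a given keyword from a paragraph and removing unnecessary information.
Below is an example of the file I have.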
Heading Years Text
Head1 2015 <rrrt> I am a boy and I <rrr2> like a girl <t44> from my class. She is pretty. /rr /r /r I am cute.
Head2 2015 She is cute. She is beautiful.
Head3 2014 Hi, I am Jane. I play guitar. May is my friend.
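I want to extract the sentences that contain a given keyword (am). For each sentence, I also want to get the heading and the year, and to get rid of unnecessary information such as <***> and /r.
Below is the output I want to achieve using R:
Heading Years Text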
Head1 2015 I am a boy and I like a girl from my class.
Head1 2015 I am cute.
Head3 2014 Hi, I am Jane.
Thanks in advance.
Update:
Heading Text
Apple "Jane is pretty." Good afternoon
Orange Tom said she is pretty. Also she is kind hearted. Tom listened in class.
Pear Added Lim, He is a great guy...and clever. Mary turned her head away.
The output I would like to get is:
Heading Text
Apple "Jane is pretty."
Orange Tom said she is pretty. Also she is kind hearted.
Pear Added Lim, He is a great guy...and clever.
I want to capture what people said. Thanks.
Answer 0 (score: 2)
We can split the 'Text' column at the end of each sentence into a list, use grep to extract the sentences containing am, use stack to convert the list to a data.frame, and then merge it with the original dataset.
df2 <- stack(setNames(lapply(strsplit(df1$Text, '(?<=[.])(?=\\s*)\\s+',
perl=TRUE), grep, pattern='\\bam\\b', value=TRUE), df1$Heading))[2:1]
colnames(df2) <- colnames(df1)[c(1,3)]
res <- merge(df1[1:2], df2)
res
# Heading Years Text
#1 Head1 2015 I am a boy and I like a girl from my class.
#2 Head1 2015 I am cute.
#3 Head3 2014 Hi, I am Jane.
NOTE: If the 'Text' column is a factor, use as.character(df1$Text) in strsplit.
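The answer does not show how df1 was built. As a minimal sketch, assuming the question's sample (with the tags already removed) sits in a data frame named df1 with columns Heading, Years and Text, the intermediate result of the split and the whole-word filter can be inspected like this. The data construction and the names sentences and hits are assumptions for illustration, not the answerer's code.

```r
# Assumed reconstruction of the question's sample data (tags already removed).
df1 <- data.frame(
  Heading = c("Head1", "Head2", "Head3"),
  Years   = c(2015, 2015, 2014),
  Text    = c("I am a boy and I like a girl from my class. She is pretty. I am cute.",
              "She is cute. She is beautiful.",
              "Hi, I am Jane. I play guitar. May is my friend."),
  stringsAsFactors = FALSE  # keep Text as character so strsplit() accepts it
)

# Split on whitespace that follows a period; the lookbehind keeps the period.
sentences <- strsplit(df1$Text, "(?<=[.])\\s+", perl = TRUE)

# Keep only the sentences that contain "am" as a whole word.
hits <- lapply(sentences, grep, pattern = "\\bam\\b", value = TRUE)
names(hits) <- df1$Heading
hits
```

Headings with no matching sentence (Head2 here) end up with an empty character vector, which is why they drop out once the list is stacked and merged.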
For the new dataset, we can use gsub to remove the characters between < and > as well as /r, and then proceed as before.
v1 <- gsub('\\<[^>]+\\>\\s*|/r+\\s*', '', df1N$Text, perl=TRUE)
df2N <- stack(setNames(lapply(strsplit(v1, '(?<=[.])(?=\\s*)\\s+',
perl=TRUE), grep, pattern='\\bam\\b', value=TRUE), df1N$Heading))[2:1]
colnames(df2N) <- colnames(df1N)[c(1,3)]
res1 <- merge(df1N[1:2], df2N)
res1
# Heading Years Text
#1 Head1 2015 I am a boy and I like a girl from my class.
#2 Head1 2015 I am cute.
#3 Head3 2014 Hi, I am Jane.
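The cleaning, splitting and filtering steps can also be collected into one helper. The sketch below is only an illustration: the function name extract_sentences and its keyword argument are hypothetical, df1N is rebuilt from the question's sample on the assumption that it has columns Heading, Years and Text, and row indexing is swapped in for the stack and merge steps used above.

```r
# Hypothetical helper: returns one row per sentence containing `keyword`.
extract_sentences <- function(df, keyword) {
  # Remove <...> tags and /r, /rr markers together with trailing whitespace.
  cleaned <- gsub("<[^>]+>\\s*|/r+\\s*", "", as.character(df$Text), perl = TRUE)
  # Split on whitespace that follows a period, keeping the period.
  sentences <- strsplit(cleaned, "(?<=[.])\\s+", perl = TRUE)
  # Keep sentences that contain the keyword as a whole word.
  hits <- lapply(sentences, grep, pattern = paste0("\\b", keyword, "\\b"), value = TRUE)
  # Repeat each source row index once per matching sentence.
  idx <- rep(seq_along(hits), lengths(hits))
  out <- data.frame(df[idx, c("Heading", "Years")],
                    Text = as.character(unlist(hits)),
                    stringsAsFactors = FALSE)
  rownames(out) <- NULL
  out
}

# Assumed reconstruction of the question's tagged sample data.
df1N <- data.frame(
  Heading = c("Head1", "Head2", "Head3"),
  Years   = c(2015, 2015, 2014),
  Text    = c("<rrrt> I am a boy and I <rrr2> like a girl <t44> from my class. She is pretty. /rr /r /r I am cute.",
              "She is cute. She is beautiful.",
              "Hi, I am Jane. I play guitar. May is my friend."),
  stringsAsFactors = FALSE
)

res2 <- extract_sentences(df1N, "am")
res2
```

Indexing the original rows keeps the input order rather than sorting by Heading, which merge would do.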
Answer 1 (score: 2)
head <- c("Head1", "Head2", "Head3")
years <- c(2015, 2015, 2014)
Text <- c("I am a boy and I like a girl from my class. She is pretty. I am cute.","She is cute. She is beautiful.", "Hi, I am Jane. I play guitar. May is my friend.")
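df <- data.frame(head, years, Text)
#As strsplit doesn't work on factors, converting text to characters
df$Text <- as.character(df$Text)
#Split the text into sentences at each period
words <- unlist(strsplit(df$Text, "[.]"))
#Keep the sentences containing "am" as a whole word, without surrounding spaces
test <- trimws(words[grep("\\bam\\b", words)])
#For each sentence, find the first row of df whose text contains it
a <- integer(length(test))
for(i in seq_along(test)) {
a[i] <- grep(test[i], df$Text, fixed = TRUE)[1]
}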
newdf <- data.frame(df[a, 1:2], test)
newdf
#head years test
#1 Head1 2015 I am a boy and I like a girl from my class
#1.1 Head1 2015 I am cute
#3 Head3 2014 Hi, I am Jane