I have a dataset like this:
Topic Content
Happy Jane said. "I am a happy girl!" She walked away.Running towards May."Hi, she said... I am bored. 2.8% BORED." haha
Sad Today is gloomy. "She added," said Saddy.
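I want to extract the text inside each pair of quotation marks, together with the one sentence that comes immediately before the quote.
Then I want to join each quote with its preceding sentence, one pair per row.
This is the output I want to achieve: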
Topic Content
Happy Jane said. "I am a happy girl!"
Happy Running towards May."Hi, she said...I am bored. 2.8% BORED"
Sad Today is gloomy. "She added," said Saddy.
Here is my data (the `dput` output, assigned to `x`):
x <- structure(list(
  Topic = structure(1:2, .Label = c("Happy", "Sad"), class = "factor"),
  Content = structure(1:2, .Label = c(
    "Jane said. \"I am a happy girl!\" She walked away.Running towards May.\"Hi, she said... I am bored. 2.8% BORED.\" haha",
    "Today is gloomy. \"She added,\" said Saddy."), class = "factor")),
  .Names = c("Topic", "Content"), class = "data.frame", row.names = c(NA, -2L))
I tried the following, but it does not work properly. I cannot even extract the quotes:
df <- stack(setNames(lapply(gsub('[^"]+\\"([^)]+)\\".*', '\\1', x$Content, perl=TRUE), grep, pattern='\\"', value=TRUE), x$Topic))[2:1]
colnames(df) <- colnames(x)[c(1,2)]
res <- merge(x[1:1], df)
This is my current output:
[1] "Jane said. \"I am a happy girl!\".Running towards May.\"Hi, she said... I am bored. 2.8% BORED.\""
[2] "Today is gloomy. \"She added,\"."
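
For comparison, here is a minimal sketch of one possible direction, not a confirmed solution. It assumes that sentences end in `.`, `!` or `?`, that quotes use straight double quotes, and that every opening quote has a closing one. The names `pattern`, `hits` and `res` are only illustrative, and the data frame is rebuilt with plain character columns so the snippet runs on its own.

```r
# Sample data from the question, with character columns instead of factors.
x <- data.frame(
  Topic = c("Happy", "Sad"),
  Content = c(
    "Jane said. \"I am a happy girl!\" She walked away.Running towards May.\"Hi, she said... I am bored. 2.8% BORED.\" haha",
    "Today is gloomy. \"She added,\" said Saddy."
  ),
  stringsAsFactors = FALSE
)

# Assumed pattern: one sentence (text without ., !, ? or a quote, followed by
# a single terminator), optional whitespace, then a complete "..." quotation.
pattern <- '[^.!?"]*[.!?]\\s*"[^"]*"'

# gregexpr() finds every match per row; regmatches() returns them as a list
# with one character vector per row of x.
hits <- regmatches(x$Content, gregexpr(pattern, x$Content, perl = TRUE))

# Repeat each Topic once per match so that every sentence + quote pair
# ends up in its own row.
res <- data.frame(
  Topic = rep(x$Topic, lengths(hits)),
  Content = trimws(unlist(hits)),
  stringsAsFactors = FALSE
)

print(res)
```

On the sample data this gives three rows, two for `Happy` and one for `Sad`. Each match stops at the closing quotation mark, so the trailing `said Saddy.` shown in the desired output for `Sad` is not captured, and the second `Happy` row keeps the original spacing and the final period inside the quote.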