R: converting a list in a matrix to a data frame

Time: 2015-01-29 16:08:33

Tags: r list dataframe

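I'm struggling with R. I use the code below to extract quotations from text; on a large data set, a single entry can yield multiple results. I would like the output to be a character string in a data frame so that I can easily share it with others as a CSV.

Example data: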

normalCase <- 'He said, "I am a test," very quickly.'
endCase <- 'This is a long quote, which we said, "Would never happen."'
shortCase <- 'A "quote" yo';
beginningCase <- '"I said this," he said quickly';
multipleCase <- 'When asked, "No," said Sam "I do not like green eggs and ham."'
testdata = c(normalCase,endCase,shortCase,beginningCase,multipleCase)

I extract the quotations, plus a buffer of surrounding characters, with the following:

library(stringr)

# Extract each quotation plus up to 15 characters of context on either side
result <- function(testdata) {
  str_extract_all(testdata, '[^\"]{0,15}"[^\"]+"[^\"]{0,15}')
}
extract <- sapply(testdata, FUN = result)

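The extract is a list in a matrix. However, I would like the extract to be a character string that I can later merge into a data frame as a column. How do I convert it?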

1 Answer:

Answer 0: (score: 1)

Code

normalCase <- 'He said, "I am a test," very quickly.'
endCase <- 'This is a long quote, which we said, "Would never happen."'
shortCase <- 'A "quote" yo';
beginningCase <- '"I said this," he said quickly';
multipleCase <- 'When asked, "No," said Sam "I do not like green eggs and ham."'
testdata = c(normalCase,endCase,shortCase,beginningCase,multipleCase)

# extract quotations
gsub(pattern = "[^\"]*((?:\"[^\"]*\")|$)", replacement = "\\1 ", x = testdata)

Output

[1] "\"I am a test,\"  "                            
[2] "\"Would never happen.\" "                      
[3] "\"quote\"  "                                   
[4] "\"I said this,\"  "                            
[5] "\"No,\" \"I do not like green eggs and ham.\" "
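
Since the goal in the question is a data frame column that can be shared as a CSV, the result of the gsub() call can be stored directly, because it is already a plain character vector. The following is only a minimal sketch under some assumptions: the names quotes_df, text, quotes and out_path are made up for illustration, trimws() (base R 3.2.0 or later) is used to drop the padding spaces visible in the output above, and the file goes to a temporary path that you would replace with a real one.

```r
# Two of the example inputs from the question
testdata <- c(
  'He said, "I am a test," very quickly.',
  'When asked, "No," said Sam "I do not like green eggs and ham."'
)

# Keep only the quoted parts, then trim the spaces the replacement leaves behind
quotes <- trimws(gsub("[^\"]*((?:\"[^\"]*\")|$)", "\\1 ", testdata))

# One row per input string; the data frame and column names are hypothetical
quotes_df <- data.frame(text = testdata, quotes = quotes,
                        stringsAsFactors = FALSE)

# Written to a temporary file here; substitute a real path to share the CSV
out_path <- tempfile(fileext = ".csv")
write.csv(quotes_df, file = out_path, row.names = FALSE)
```

Setting stringsAsFactors = FALSE keeps the column as character rather than factor on older R versions, which is usually what you want before merging with other data.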

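Explanation

pattern = "[^\"]" matches any single character except a double quote
pattern = "[^\"]*" matches any character except a double quote, zero or more times
pattern = "\"[^\"]*\"" matches a double quote, then any character except a double quote zero or more times, then another double quote (i.e. a quotation)
pattern = "(?:\"[^\"]*\")" matches a quotation without capturing it
pattern = "((?:\"[^\"]*\")|$)" matches either a quotation or the end of the string and captures it. Note that this is the first group we capture
replacement = "\\1 " replaces each match with the first captured group followed by a space; the text matched by the leading "[^\"]*" lies outside that group, so it is dropped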
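
The quotation sub-pattern described above can also be used for extraction rather than replacement, which is closer to the list-based approach in the question. The sketch below is an alternative, not the answer's method: it uses base R's gregexpr() and regmatches() to obtain a list of matches (one character vector per input) and then collapses each vector into a single string with paste(). The "+" quantifier (at least one character between the quotes) follows the question's pattern, and the third input is a made-up example with no quotation.

```r
testdata <- c(
  'A "quote" yo',
  'When asked, "No," said Sam "I do not like green eggs and ham."',
  'No quotation here'   # hypothetical input without any quotes
)

# regmatches() returns a list: one character vector of matches per input
matches <- regmatches(testdata, gregexpr('"[^"]+"', testdata))

# Collapse each vector into one string, giving a plain character vector
# of the same length as testdata (inputs without matches become "")
quotes <- vapply(matches, paste, character(1), collapse = " ")
```

Because quotes has one element per input, it can be added as a column to an existing data frame without further reshaping. vapply() is used instead of sapply() so that the result is guaranteed to be a character vector even when some inputs contain no quotation.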