Question

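I'm struggling with a problem that I'm sure has a simple solution, but I haven't been able to find it. Thanks for your help.

I'm trying to split a string of text wherever an individual element of a vector occurs. For example: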

fruits<-c("APPLE","BANANA","ORANGE")
string<-("This is a list of fruits and their properties. 
         APPLE This is a red fruit, typically very SWEET! 
         BANANA This is a yellow fruit, also sweet! 
         ORANGE This is an orange fruit and also, yes, sweet")

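The output I'd like is a list/vector of 4 elements, each containing the piece of the string that comes before/after an occurrence of any element of 'fruits'. So, something like: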

c("This is a list of fruits and their properties",
"APPLE This is a red fruit, typically very SWEET!",
"BANANA This is a yellow fruit, also sweet!",
"ORANGE This is an orange fruit and also, yes, sweet")

I have tried

strsplit(string,split=fruits)

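among several other things, but with no success. What I'm actually trying to do is split a .pdf codebook, which I've converted to .txt, by a list of words (countries) that correspond to sections of the codebook.

Thanks in advance!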

Answer 1

The "I don't really want to think about regular expressions" way would be to do this:

strsplit(gsub(sprintf('(%s)', paste(fruits, collapse = "|")), 
              "MYSPLIT\\1", string), 
         "MYSPLIT", TRUE)[[1]]
# [1] "This is a list of fruits and their properties. \n         "  
# [2] "APPLE This is a red fruit, typically very SWEET! \n         "
# [3] "BANANA This is a yellow fruit, also sweet! \n         "      
# [4] "ORANGE This is an orange fruit and also, yes, sweet"

There, I'm basically matching APPLE, ORANGE, and BANANA and replacing them with MYSPLITAPPLE and so on, which gives me a new delimiter (MYSPLIT) on which to split the string.
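
To make the two steps easier to follow, here is a minimal sketch that runs them separately on a shortened sample string (the sample text and the variable names `pattern`, `marked`, and `pieces` are only for illustration). It also adds a `trimws()` call, which is not part of the answer above, to drop the trailing newline and spaces left on each piece.

```r
fruits <- c("APPLE", "BANANA", "ORANGE")
# Shortened, made-up sample text in the same shape as the question's string
string <- "Intro text. \n   APPLE red fruit \n   BANANA yellow fruit \n   ORANGE orange fruit"

# Step 1: build one alternation pattern with a capture group
pattern <- sprintf("(%s)", paste(fruits, collapse = "|"))

# Step 2: put a marker in front of every match; \\1 re-inserts the matched word
marked <- gsub(pattern, "MYSPLIT\\1", string)

# Step 3: split on the literal marker (fixed = TRUE, so no regex is involved)
pieces <- strsplit(marked, "MYSPLIT", fixed = TRUE)[[1]]

# Optional cleanup (an addition): strip leading/trailing whitespace
pieces <- trimws(pieces)
print(pieces)
```

The marker should be a string that cannot occur in the text itself; "MYSPLIT" is just a convenient choice here.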

Answer 2

You can use regex lookarounds:

 strsplit(string, sprintf('\\s+(?=%s)',
            paste(fruits, collapse='|')), perl=TRUE)[[1]]
 #[1] "This is a list of fruits and their properties."     
 #[2] "APPLE This is a red fruit, typically very SWEET!"   
 #[3] "BANANA This is a yellow fruit, also sweet!"         
 #[4] "ORANGE This is an orange fruit and also, yes, sweet"
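
The lookahead `(?=...)` checks that one of the words follows without consuming it, so only the whitespace in front of the word is removed by the split. Since the real use case is a list of country names, which may contain characters such as `.` or `(` that are special in a regex, here is a hedged sketch of a small wrapper that quotes each keyword with PCRE's `\Q...\E`. The function name `split_before_keywords` and the sample data are hypothetical, and the sketch assumes every keyword is preceded by whitespace and does not itself contain `\E`.

```r
# Hypothetical helper: split `text` at the whitespace before any keyword
split_before_keywords <- function(text, keywords) {
  # \Q...\E makes PCRE treat each keyword as literal text
  quoted <- paste0("\\Q", keywords, "\\E")
  # Match whitespace only when it is followed by one of the keywords
  pattern <- paste0("\\s+(?=", paste(quoted, collapse = "|"), ")")
  strsplit(text, pattern, perl = TRUE)[[1]]
}

# Made-up sample data with regex metacharacters in the keywords
countries <- c("U.S.A.", "CONGO (DRC)")
codebook <- "Codebook intro. \n   U.S.A. first section \n   CONGO (DRC) second section"

sections <- split_before_keywords(codebook, countries)
print(sections)
```

Without the quoting, a keyword like "CONGO (DRC)" would be read as a pattern containing a capture group and would not match the literal parentheses in the text.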

Split a string at each element of a specified vector