Splitting and extracting strings in R

Time: 2015-05-07 09:40:56

Tags: r string substring extract substr

Rules

{Denny Frying Pan} => {Denny C-Size Batteries}

{Denny Scented Tissue} => {Denny Paper Plates}

{Blue Label Fancy Canned Clams} => {Blue Label Canned Tuna in Water}

{Denny Plastic Forks} => {Golden Frozen Peas}

{Denny Frying Pan} => {Denny D-Size Batteries}

{Denny Plastic Forks} => {Faux Products Apricot Shampoo}

{Golden Frozen Peas} => {Denny Plastic Forks}

{Faux Products Apricot Shampoo} => {Denny Plastic Forks}

{Blue Label Canned Tuna in Water} => {Blue Label Fancy Canned Clams}

{Blue Label Canned String Beans} => {Faux Products Buffered Aspirin}

{Denny D-Size Batteries} => {Denny Frying Pan}

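I have a single-column data frame like the one shown above. I want to split each rule into an LHS and an RHS.

The LHS should contain the characters between the {} that come before the =>. Similarly, the RHS should contain the characters between the next {} that come after the =>.

How can I do this in R?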

3 answers:

Answer 0 (score: 1)

RULES <- c("{Denny Frying Pan} => {Denny C-Size Batteries}",
           "{Denny Scented Tissue} => {Denny Paper Plates}",
           "{Blue Label Fancy Canned Clams} => {Blue Label Canned Tuna in Water}",
           "{Denny Plastic Forks} => {Golden Frozen Peas}",
           "{Denny Frying Pan} => {Denny D-Size Batteries}",
           "{Denny Plastic Forks} => {Faux Products Apricot Shampoo}",
           "{Golden Frozen Peas} => {Denny Plastic Forks}",
           "{Faux Products Apricot Shampoo} => {Denny Plastic Forks}",
           "{Blue Label Canned Tuna in Water} => {Blue Label Fancy Canned Clams}",
           "{Blue Label Canned String Beans} => {Faux Products Buffered Aspirin}",
           "{Denny D-Size Batteries} => {Denny Frying Pan}")

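# Split each rule on the literal separator "} => {" and bind the pieces into two columns
df <- as.data.frame(do.call(rbind, strsplit(RULES, "} => {", fixed = TRUE)))
# Strip the leading "{" from the LHS and the trailing "}" from the RHS
df[, 1] <- gsub("{", "", df[, 1], fixed = TRUE)
df[, 2] <- gsub("}", "", df[, 2], fixed = TRUE)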

df
                                V1                              V2
1                 Denny Frying Pan          Denny C-Size Batteries
2             Denny Scented Tissue              Denny Paper Plates
3    Blue Label Fancy Canned Clams Blue Label Canned Tuna in Water
4              Denny Plastic Forks              Golden Frozen Peas
5                 Denny Frying Pan          Denny D-Size Batteries
6              Denny Plastic Forks   Faux Products Apricot Shampoo
7               Golden Frozen Peas             Denny Plastic Forks
8    Faux Products Apricot Shampoo             Denny Plastic Forks
9  Blue Label Canned Tuna in Water   Blue Label Fancy Canned Clams
10  Blue Label Canned String Beans  Faux Products Buffered Aspirin
11          Denny D-Size Batteries                Denny Frying Pan
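
Since the question starts from a single-column data frame rather than a character vector, the same split can be sketched against a column. This is only a minimal illustration: the data frame name rules_df and the column name Rules are assumptions, not taken from the question's actual data.

```r
# Hypothetical single-column data frame; the names "rules_df" and "Rules" are assumptions.
rules_df <- data.frame(
  Rules = c("{Denny Frying Pan} => {Denny C-Size Batteries}",
            "{Golden Frozen Peas} => {Denny Plastic Forks}"),
  stringsAsFactors = FALSE
)

# Split on the literal separator; as.character() guards against a factor column.
parts <- do.call(rbind, strsplit(as.character(rules_df$Rules), "} => {", fixed = TRUE))

# Remove the remaining outer braces and store the two sides as new columns.
rules_df$LHS <- sub("{", "", parts[, 1], fixed = TRUE)
rules_df$RHS <- sub("}", "", parts[, 2], fixed = TRUE)

rules_df[, c("LHS", "RHS")]
```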

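Answer 1 (score: 0)

You can try either of the following approaches. Both assume that you start with a character vector named "rules". If "rules" is already a column in a data.frame, a slight modification is needed.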

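library(splitstackshape)
library(data.table)  # provides data.table(); attached explicitly in case splitstackshape does not attach it
library(dplyr)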

data.table(rules = gsub("[{}]", "", gsub("=>", "\t", rules))) %>%
  cSplit("rules", "\t")
#                             rules_1                         rules_2
#  1:                Denny Frying Pan          Denny C-Size Batteries
#  2:            Denny Scented Tissue              Denny Paper Plates
#  3:   Blue Label Fancy Canned Clams Blue Label Canned Tuna in Water
#  4:             Denny Plastic Forks              Golden Frozen Peas
#  5:                Denny Frying Pan          Denny D-Size Batteries
#  6:             Denny Plastic Forks   Faux Products Apricot Shampoo
#  7:              Golden Frozen Peas             Denny Plastic Forks
#  8:   Faux Products Apricot Shampoo             Denny Plastic Forks
#  9: Blue Label Canned Tuna in Water   Blue Label Fancy Canned Clams
# 10:  Blue Label Canned String Beans  Faux Products Buffered Aspirin
# 11:          Denny D-Size Batteries                Denny Frying Pan

library(dplyr)
library(tidyr)

data.frame(rules) %>%
  mutate(rules = gsub("\\s+=>\\s+", "=>", rules)) %>%
  mutate(rules = gsub("[{}]", "", rules)) %>%
  separate(rules, into = c("V1", "V2"), sep = "=>")
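
To see what each step of this pipeline does, the same two gsub() patterns can be traced on a single rule in base R. This is just a sketch for illustration; the variable names rule, step1, step2 and parts are made up here, and strsplit() stands in for separate().

```r
rule <- "{Denny Frying Pan} => {Denny C-Size Batteries}"

# Step 1: drop the whitespace around the arrow so it becomes a clean delimiter.
step1 <- gsub("\\s+=>\\s+", "=>", rule)
# "{Denny Frying Pan}=>{Denny C-Size Batteries}"

# Step 2: remove every curly brace.
step2 <- gsub("[{}]", "", step1)
# "Denny Frying Pan=>Denny C-Size Batteries"

# Step 3: split on the arrow (strsplit() used here in place of tidyr::separate()).
parts <- strsplit(step2, "=>", fixed = TRUE)[[1]]
# "Denny Frying Pan"  "Denny C-Size Batteries"
```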

Answer 2 (score: 0)

Here is an approach using the qdapRegex package, which I maintain:

RULES <- c("{Denny Frying Pan} => {Denny C-Size Batteries}",
           "{Denny Scented Tissue} => {Denny Paper Plates}",
           "{Blue Label Fancy Canned Clams} => {Blue Label Canned Tuna in Water}",
           "{Denny Plastic Forks} => {Golden Frozen Peas}",
           "{Denny Frying Pan} => {Denny D-Size Batteries}",
           "{Denny Plastic Forks} => {Faux Products Apricot Shampoo}",
           "{Golden Frozen Peas} => {Denny Plastic Forks}",
           "{Faux Products Apricot Shampoo} => {Denny Plastic Forks}",
           "{Blue Label Canned Tuna in Water} => {Blue Label Fancy Canned Clams}",
           "{Blue Label Canned String Beans} => {Faux Products Buffered Aspirin}",
           "{Denny D-Size Batteries} => {Denny Frying Pan}")

library(qdapRegex)
setNames(do.call(rbind.data.frame, rm_curly(RULES, extract=TRUE)), c("LHS", "RHS"))

##                                LHS                             RHS
## 1                 Denny Frying Pan          Denny C-Size Batteries
## 2             Denny Scented Tissue              Denny Paper Plates
## 3    Blue Label Fancy Canned Clams Blue Label Canned Tuna in Water
## 4              Denny Plastic Forks              Golden Frozen Peas
## 5                 Denny Frying Pan          Denny D-Size Batteries
## 6              Denny Plastic Forks   Faux Products Apricot Shampoo
## 7               Golden Frozen Peas             Denny Plastic Forks
## 8    Faux Products Apricot Shampoo             Denny Plastic Forks
## 9  Blue Label Canned Tuna in Water   Blue Label Fancy Canned Clams
## 10  Blue Label Canned String Beans  Faux Products Buffered Aspirin
## 11          Denny D-Size Batteries                Denny Frying Pan

We extract the content between the curly braces, then use do.call + rbind.data.frame to coerce the result to a data.frame.
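
The same extract-then-bind idea can be sketched in base R without qdapRegex. This is not how rm_curly is implemented; it swaps in regmatches() and gregexpr(), and the names rules_subset, matches, contents and res are hypothetical.

```r
# Two of the rules from above, under a hypothetical name.
rules_subset <- c("{Denny Frying Pan} => {Denny C-Size Batteries}",
                  "{Golden Frozen Peas} => {Denny Plastic Forks}")

# Find every "{...}" group; the result is a list with one character vector per rule.
matches <- regmatches(rules_subset, gregexpr("[{][^}]*[}]", rules_subset))

# Drop the surrounding braces from each match.
contents <- lapply(matches, function(m) gsub("[{}]", "", m))

# Row-bind the per-rule vectors into a matrix, convert it, and name the columns.
res <- setNames(
  as.data.frame(do.call(rbind, contents), stringsAsFactors = FALSE),
  c("LHS", "RHS")
)
res
```

This sketch assumes every rule contains exactly two brace groups; rules with a different number of groups would need extra handling before the rbind step.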