I have a character vector that looks like this:
string <- c("A,1,some text,200", "B,2,some other text,300", "A,3,yet another one,100")
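So each element of the vector is itself comma-separated. Now I want to extract only the fields at a particular position, say everything before the first comma, or the field right after the second comma.
The following code does what I want: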
sapply(strsplit(string, ","), function(x){return(x[[1]])})
# [1] "A" "B" "A"
sapply(strsplit(string, ","), function(x){return(x[[3]])})
# [1] "some text" "some other text" "yet another one"
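However, this code seems rather complicated to me, given how simple the problem is. Is there a more concise way to get what I want?
Answer 0: (score: 7)
1) data.frame: Convert the vector to a data frame; it is then easy to select a column or a subset of columns: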
DF <- read.table(text = string, sep = ",", as.is = TRUE)
DF[[1]]
## [1] "A" "B" "A"
DF[[3]]
## [1] "some text" "some other text" "yet another one"
DF[-1]
## V2 V3 V4
## 1 1 some text 200
## 2 2 some other text 300
## 3 3 yet another one 100
DF[2:3]
## V2 V3
## 1 1 some text
## 2 2 some other text
## 3 3 yet another one
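One caveat with the read.table approach, sketched here under the assumption that a field may contain an apostrophe or a "#": by default read.table treats ' and " as quote characters, treats # as the start of a comment, and guesses column types. The variant below switches those off and keeps every column as character. The sample vector is made up for illustration and is not the data from the question.

```r
# Made-up input: the third field of the second element contains ' and #.
string2 <- c("A,1,some text,200", "B,2,it's #2,300")

# quote = ""        : do not treat ' or " as quoting characters
# comment.char = "" : do not treat # as the start of a comment
# colClasses        : keep every column as character instead of guessing types
DF2 <- read.table(text = string2, sep = ",", quote = "",
                  comment.char = "", colClasses = "character")

DF2[[3]]
## [1] "some text" "it's #2"
```

Whether the extra arguments are needed depends on the data; for the vector in the question the plain call above already works.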
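2) data.table::transpose: The data.table package has a function that transposes a list, so if stringt is the transposed list, then stringt[[3]] is a character vector holding, for example, the third fields, in the same way as in (1). Even more compact is data.table's tstrsplit, mentioned by @Henrik below, or fread from the same package, mentioned by @akrun below.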
library(data.table)
stringt <- transpose(strsplit(string, ","))
# or
stringt <- tstrsplit(string, ",")
stringt[[1]]
## [1] "A" "B" "A"
stringt[[3]]
## [1] "some text" "some other text" "yet another one"
stringt[-1]
## [[1]]
## [1] "1" "2" "3"
##
## [[2]]
## [1] "some text" "some other text" "yet another one"
##
## [[3]]
## [1] "200" "300" "100"
stringt[2:3]
## [[1]]
## [1] "1" "2" "3"
##
## [[2]]
## [1] "some text" "some other text" "yet another one"
purrr also has a transpose function, but
library(purrr)
transpose(strsplit(string, ","))
produces a list of lists rather than a list of character vectors.
Answer 1: (score: 6)
One option is to use word from stringr with the sep argument:
library(stringr)
word(string, 1, sep = ",")
#[1] "A" "B" "A"
word(string, 3, sep = ",")
#[1] "some text" "some other text" "yet another one"
Since word performs worst of all the options (see the benchmark below), here is another option that uses a regular expression in base R.
# Get 1st element
sub("(?:[^,]*,){0}([^,]*).*", "\\1", string)
#[1] "A" "B" "A"
# Get 3rd element
sub("(?:[^,]*,){2}([^,]*).*", "\\1", string)
#[1] "some text" "some other text" "yet another one"
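There are two groups in the pattern. The first matches any run of characters that are not a comma, followed by a comma, repeated n times; the second matches the next run of characters that are not a comma. The first group is non-capturing ((?:), while the second is captured and returned. Also note that the number in the curly braces ({}) must be one less than the position of the word we want, so {0} returns the first word and {2} returns the third word.
Benchmark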
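The "one less than the position" rule can be wrapped in a small helper so the pattern does not have to be edited by hand. This is only a sketch: nth_field is a hypothetical name, not something from the answers, it assumes the separator is a comma, and it anchors the pattern with ^ and passes perl = TRUE, which the original calls do not.

```r
# Hypothetical helper: return the n-th comma-separated field of each string.
nth_field <- function(x, n) {
  stopifnot(length(n) == 1, n >= 1)
  # Skip (n - 1) "field," chunks, capture the next field, drop the rest.
  pattern <- sprintf("^(?:[^,]*,){%d}([^,]*).*", as.integer(n) - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

string <- c("A,1,some text,200", "B,2,some other text,300", "A,3,yet another one,100")
nth_field(string, 1)
#[1] "A" "B" "A"
nth_field(string, 3)
#[1] "some text" "some other text" "yet another one"
```

A string with fewer than n fields does not match the pattern, so sub returns it unchanged; that case would need separate handling.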
string <- c("A,1,some text,200","B,2,some other text,300","A,3,yet another one,100")
string <- rep(string, 1e5)
library(microbenchmark)
microbenchmark(
tmfmnk_sapply = sapply(strsplit(string, ","), function(x) x[1]),
tmfmnk_tstrsplit = tstrsplit(string, ",")[[1]],
avid_useR_sapply = sapply(strsplit(string, ","), '[', 1),
avid_useR_str_split = str_split(string, ",", simplify = TRUE)[,1],
Ronak_Shah_word = word(string, 1, sep = ","),
Ronak_Shah_sub = sub("(?:[^,],){0}([^,]*).*", "\\1",string),
G_Grothendieck ={DF <- read.table(text = string, sep = ",",as.is = TRUE);DF[[1]]},
times = 5
)
#Unit: milliseconds
# expr min lq mean median uq max neval
# tmfmnk_sapply 1629.69 1641.61 2128.14 1834.99 1893.43 3640.96 5
# tmfmnk_tstrsplit 1269.94 1283.79 1286.29 1286.68 1290.76 1300.30 5
# avid_useR_sapply 1445.40 1447.64 1555.76 1498.14 1609.52 1778.13 5
#avid_useR_str_split 324.68 332.28 332.30 333.97 334.01 336.54 5
# Ronak_Shah_word 6571.29 6810.92 6956.20 6930.86 7217.26 7250.69 5
# Ronak_Shah_sub 349.76 354.77 356.91 358.91 359.17 361.94 5
# G_Grothendieck 354.93 358.24 364.43 362.24 367.79 378.94 5
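I did not include Christoph's solution because it is not clear to me how it would work for a variable n, for example the 3rd position, the 4th position, and so on.
Answer 2: (score: 5)
We can simplify the OP's code to: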
sapply(strsplit(string, ","), '[', 1)
# [1] "A" "B" "A"
sapply(strsplit(string, ","), '[', 3)
# [1] "some text" "some other text" "yet another one"
In addition, with stringr::str_split and simplify = TRUE we can index the column directly, because the output is a matrix:
library(stringr)
str_split(string, ",", simplify = TRUE)[,1]
# [1] "A" "B" "A"
str_split(string, ",", simplify = TRUE)[,3]
# [1] "some text" "some other text" "yet another one"
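As a possible variation on the sapply call above (a sketch, not part of the original answer), base R's vapply does the same extraction but declares the expected result type, so an unexpected shape raises an error instead of silently returning a list:

```r
string <- c("A,1,some text,200", "B,2,some other text,300", "A,3,yet another one,100")

# fixed = TRUE treats "," as a literal string rather than a regular expression.
parts <- strsplit(string, ",", fixed = TRUE)

# character(1) states that every element must yield exactly one string.
third <- vapply(parts, `[`, character(1), 3)
third
# [1] "some text" "some other text" "yet another one"
```

For an element with fewer than three fields, `[` returns NA, so the result keeps the same length as the input.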
Answer 3: (score: 3)
A slightly different version with sapply():
sapply(strsplit(string, ","), function(x) x[1])
# [1] "A" "B" "A"
sapply(strsplit(string, ","), function(x) x[3])
# [1] "some text" "some other text" "yet another one"
Alternatively, you can use tstrsplit from data.table:
tstrsplit(string, ",")[[1]]
# [1] "A" "B" "A"
Benchmark of the different solutions:
library(microbenchmark)
microbenchmark(
tmfmnk_sapply = sapply(strsplit(string, ","), function(x) x[1]),
tmfmnk_tstrsplit = tstrsplit(string, ",")[[1]],
avid_useR_sapply = sapply(strsplit(string, ","), '[', 1),
avid_useR_str_split = str_split(string, ",", simplify = TRUE)[,1],
Ronak_Shah = word(string, 1, sep = ","),
times = 5
)
expr min lq mean median uq max neval cld
tmfmnk_sapply 34.543 36.395 45.8782 47.150 48.540 62.763 5 a
tmfmnk_tstrsplit 33.072 33.554 39.1166 35.012 36.116 57.829 5 a
avid_useR_sapply 39.612 45.292 61.1936 46.730 47.398 126.936 5 a
avid_useR_str_split 27.313 34.095 49.3412 43.834 43.977 97.487 5 a
Ronak_Shah 146.875 147.277 199.4978 162.995 218.322 322.020 5 b
Benchmark on the replicated string vector:
string <- rep(string, 1e5)
microbenchmark(
tmfmnk_sapply = sapply(strsplit(string, ","), function(x) x[1]),
tmfmnk_tstrsplit = tstrsplit(string, ",")[[1]],
avid_useR_sapply = sapply(strsplit(string, ","), '[', 1),
avid_useR_str_split = str_split(string, ",", simplify = TRUE)[,1],
Ronak_Shah = word(string, 1, sep = ","),
Christoph = regmatches(string, regexpr("^[^,]", string)),
times = 5
)
expr min lq mean median uq max neval
tmfmnk_sapply 1529.8955 1608.2909 1926.7776 1820.0443 2105.9736 2569.6836 5
tmfmnk_tstrsplit 1277.8712 1281.0371 1482.4520 1314.0074 1599.7686 1939.5757 5
avid_useR_sapply 1428.7175 1470.9002 1487.5425 1483.1127 1521.3735 1533.6087 5
avid_useR_str_split 306.2633 316.7539 360.8785 334.8516 335.5375 510.9863 5
Ronak_Shah 5541.6199 5657.3593 5955.9653 6068.1067 6166.7249 6346.0157 5
Christoph 231.0496 244.1049 383.9702 246.0421 273.2877 925.3667 5
Answer 4: (score: 2)
This can be done in base R with regexpr:
regmatches(string, regexpr("^[^,]*", string))
# [1] "A" "B" "A"
regmatches(string, regexpr("[^,]*$", string))
# [1] "200" "300" "100"
regmatches(string, regexpr("[^,]*,[^,]*$", string))
# [1] "some text,200" "some other text,300" "yet another one,100"
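One way the regmatches idea might extend to an arbitrary position (a sketch, not the answerer's code) is to collect every run of non-comma characters with gregexpr and then pick the n-th match from each element:

```r
string <- c("A,1,some text,200", "B,2,some other text,300", "A,3,yet another one,100")

# gregexpr finds all matches per string; regmatches returns them as a list
# with one character vector per input element.
fields <- regmatches(string, gregexpr("[^,]+", string))

# Pick the 3rd match from each element.
third_match <- vapply(fields, `[`, character(1), 3)
third_match
# [1] "some text" "some other text" "yet another one"
```

Because the pattern [^,]+ requires at least one character, empty fields (as in "A,,x") are skipped and later positions shift by one, so this is only safe when no field is empty.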