How to change and/or check for specific strings that contain spaces and colons

Asked: 2017-09-19 21:44:51

Tags: r regex

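I am trying to load the text files in a folder (more than 1000 of them). I can read them into one big list. Now I want to check whether a specific column name exists, and I do the following: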

sapply(my_list, function(x) all(c("Transmittance: F112: Light, Sample ") %in% names(x)))

There are many other columns with similar names, but I specifically want the columns whose names start with Transmittance: F*

Is there any way to do this? In the end I want to be able to extract these columns together with some other columns.

Here is a small part of one file:

    ldf<- list(structure(list(`Transmitance Ratio: (F648, Light) / (F648, Heavy)` = c(NA, 
100, 0.768, NA, 0.676, NA, NA, 0.538, 0.482), `Transmitance  Ratio (log2): (F648, Light) / (F648, Heavy)` = c(NA, 
6.64, -0.38, NA, -0.56, NA, NA, -0.89, -1.05), `Transmitance s (Scaled): F648: Light, Sample` = c(NA, 
200, 86.9, NA, 80.7, NA, NA, 69.9, 65), `Transmitance s (Scaled): F648: Heavy, Sample` = c(NA, 
NA, 113.1, NA, 119.3, NA, NA, 130.1, 135), `Transmitance s (Normalized): F648: Light, Sample` = c(NA, 
2e+05, 6.46e+08, NA, 2720000, NA, NA, 25800000, 5380000), `Transmitance s (Normalized): F648: Heavy, Sample` = c(NA, 
NA, 8.42e+08, NA, 4030000, NA, NA, 4.8e+07, 11200000), `Transmitance : F648: Light, Sample` = c(NA, 
2e+05, 6.46e+08, NA, 2720000, NA, NA, 25800000, 5380000), `Transmitance : F648: Heavy, Sample` = c(NA, 
NA, 3.47e+08, NA, 1660000, NA, NA, 19700000, 4600000), `Transmitance s Count: F648: Light, Sample` = c(NA, 
1L, 44L, NA, 4L, NA, NA, 4L, 2L), `Transmitance s Count: F648: Heavy, Sample` = c(NA, 
NA, 44L, NA, 3L, NA, NA, 3L, 2L)), .Names = c("Transmitance Ratio: (F648, Light) / (F648, Heavy)", 
"Transmitance  Ratio (log2): (F648, Light) / (F648, Heavy)", 
"Transmitance s (Scaled): F648: Light, Sample", "Transmitance s (Scaled): F648: Heavy, Sample", 
"Transmitance s (Normalized): F648: Light, Sample", "Transmitance s (Normalized): F648: Heavy, Sample", 
"Transmitance : F648: Light, Sample", "Transmitance : F648: Heavy, Sample", 
"Transmitance s Count: F648: Light, Sample", "Transmitance s Count: F648: Heavy, Sample"
), row.names = c(NA, -9L), class = c("data.table", "data.frame"
)))

I am only interested in the columns whose names start with Transmitance : F, whatever comes after that.

3 answers:

Answer 0: (score: 2)

You can try this:

lapply(ldf, function(x) grep("^Transmitance : F.+", names(x), value = TRUE))

# [[1]]
# [1] "Transmitance : F648: Light, Sample" "Transmitance : F648: Heavy, Sample"
# 
# [[2]]
# [1] "Transmitance : F648: Light, Sample1" "Transmitance : F648: Heavy, Sample1"
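
If you only need a yes/no answer per file (the "check" part of the question), the same pattern can be used with grepl. This is a minimal sketch, not the OP's data: the list `demo`, its column names, and the names `has_col` and `demo_with_col` are invented for illustration.

```r
# Toy stand-in for the list of files; the column names are invented.
demo <- list(
  a = data.frame(`Transmitance : F648: Light, Sample` = 1:2,
                 `Transmitance s Count: F648: Light, Sample` = 3:4,
                 check.names = FALSE),
  b = data.frame(`Some other column` = 1:2, check.names = FALSE)
)

# TRUE for every list element that has at least one matching column name.
has_col <- sapply(demo, function(x) any(grepl("^Transmitance : F", names(x))))
has_col

# Keep only the elements that contain such a column.
demo_with_col <- demo[has_col]
names(demo_with_col)
```

Using any() here means "at least one column matches"; swap it for sum() if you would rather count the matching columns per file.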

To actually extract the columns, rather than just their names:

library(dplyr)

lapply(ldf, function(x) select(x, starts_with("Transmitance : F")))

# [[1]]
#   Transmitance : F648: Light, Sample Transmitance : F648: Heavy, Sample
# 1                                 NA                                 NA
# 2                           2.00e+05                                 NA
# 3                           6.46e+08                           3.47e+08
# 4                                 NA                                 NA
# 5                           2.72e+06                           1.66e+06
# 6                                 NA                                 NA
# 7                                 NA                                 NA
# 8                           2.58e+07                           1.97e+07
# 9                           5.38e+06                           4.60e+06
# 
# [[2]]
#   Transmitance : F648: Light, Sample1 Transmitance : F648: Heavy, Sample1
# 1                                  NA                                  NA
# 2                            2.00e+05                                  NA
# 3                            6.46e+08                            3.47e+08
# 4                                  NA                                  NA
# 5                            2.72e+06                            1.66e+06
# 6                                  NA                                  NA
# 7                                  NA                                  NA
# 8                            2.58e+07                            1.97e+07
# 9                            5.38e+06                            4.60e+06

If you want to combine all the extracted columns into a single data frame, you can use map_dfc from purrr:

library(purrr)
map_dfc(ldf, function(x) select(x, starts_with("Transmitance : F")))

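map_dfc applies the function to each element of the supplied list and column-binds the outputs of all elements into a single data frame, much like cbind.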
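
For anyone who prefers not to load purrr, the same two steps (apply, then bind columns) can be sketched in base R. This is only an approximation of what map_dfc does, not its implementation, and it assumes every element has the same number of rows. The list `demo` and its column names are invented for illustration.

```r
# Toy list of data frames with equal row counts; names are invented.
demo <- list(
  data.frame(`Transmitance : F648: Light, Sample` = c(1, 2),
             `Other` = c(9, 9), check.names = FALSE),
  data.frame(`Transmitance : F648: Light, Sample1` = c(3, 4),
             `Other1` = c(8, 8), check.names = FALSE)
)

# Step 1: apply the column selection to every element of the list.
picked <- lapply(demo, function(x) {
  x[grepl("^Transmitance : F", names(x))]
})

# Step 2: bind the selected columns side by side into one data frame.
combined <- do.call(cbind, picked)
names(combined)
dim(combined)
```

The column names need to differ between elements (as they do after the paste0 step in the demo data), otherwise the combined data frame ends up with duplicated names.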

Data: the OP's ldf, modified for a better demonstration:

ldf[[2]] = ldf[[1]]
names(ldf[[2]]) = paste0(names(ldf[[1]]), 1)

Edit

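Following the OP's additional requirement in the comments to also extract the "Transmitance Ratio" columns, just change the regular expression passed to grep: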

lapply(ldf, function(x) grep("^Transmitance (: F|Ratio).+", names(x), value = TRUE))

starts_with inside select does not take a regular expression, so use matches instead:

library(dplyr)
lapply(ldf, function(x) select(x, matches("^Transmitance (: F|Ratio).+")))

library(purrr)
map_dfc(ldf, function(x) select(x, matches("^Transmitance (: F|Ratio).+")))
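
The difference between a literal prefix and a regular expression can be seen without dplyr. In the sketch below, base R's startsWith() is used only as an analogy for a literal prefix test (it is not the same function as dplyr's starts_with, and unlike it is case-sensitive). The vector `nms` is a hand-picked subset of the OP's column names.

```r
nms <- c("Transmitance Ratio: (F648, Light) / (F648, Heavy)",
         "Transmitance : F648: Light, Sample",
         "Transmitance s Count: F648: Light, Sample")

# Literal prefix test: the prefix is compared character by character,
# so "(", "|" and ")" are not treated as regex operators.
literal <- startsWith(nms, "Transmitance (: F|Ratio)")
literal

# Regex test: the alternation (: F|Ratio) is interpreted as "either branch".
regex <- grepl("^Transmitance (: F|Ratio).+", nms)
regex
```

Note that the pattern expects exactly one space after "Transmitance", so a name written with two spaces before "Ratio" would not match.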

Answer 1: (score: 1)

This matches the pattern at the start of each string and returns the full string:

lapply(ldf, function(x) grep(names(x), pattern = "^Transmitance : F", value = TRUE))

[[1]]
[1] "Transmitance : F648: Light, Sample" "Transmitance : F648: Heavy, Sample"

To extract these columns, use grepl and subsetting:

lapply(seq_along(ldf), function(x) ldf[[x]][grepl(names(ldf[[x]]), pattern = "^Transmitance : F")])

[[1]]
  Transmitance : F648: Light, Sample Transmitance : F648: Heavy, Sample
1                                 NA                                 NA
2                           2.00e+05                                 NA
3                           6.46e+08                           3.47e+08
4                                 NA                                 NA
5                           2.72e+06                           1.66e+06
6                                 NA                                 NA
7                                 NA                                 NA
8                           2.58e+07                           1.97e+07
9                           5.38e+06                           4.60e+06

Answer 2: (score: 0)

This should work:

    lapply(ldf, function(x) grep("Transmitance : F", names(x), value = TRUE))
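
One thing to be aware of: without the ^ anchor the pattern matches anywhere in the name, not just at the start. For the OP's sample data that should make no difference, but it would for a name like the invented second entry below. A small sketch with made-up names:

```r
nms <- c("Transmitance : F648: Light, Sample",
         "Scaled Transmitance : F648: Light, Sample")  # invented name

# Unanchored: matches the text anywhere in the string.
unanchored <- grep("Transmitance : F", nms, value = TRUE)
unanchored

# Anchored with ^: matches only names that begin with the text.
anchored <- grep("^Transmitance : F", nms, value = TRUE)
anchored
```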