Question

I have a simple data frame:

df <- data.frame(test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt"), value = c(0.51, 0.52, 0.56))

            test value
1 test_A_1_1.txt  0.51
2 test_A_2_1.txt  0.52
3 test_A_3_1.txt  0.56

Expected output

I want to copy the numbers at the end of the strings in column 1 and place them in columns 3 and 4, respectively, like this:

            test value new new
1 test_A_1_1.txt  0.51   1   1
2 test_A_2_1.txt  0.52   2   1
3 test_A_3_1.txt  0.56   3   1

Attempt

With the following code, I can extract the numbers from a single string:

library(stringr)
as.numeric(str_extract_all("test_A_3_1.txt", "[0-9]+")[[1]])[1] # Extracts the first number
as.numeric(str_extract_all("test_A_3_1.txt", "[0-9]+")[[1]])[2] # Extracts the second number

I would like to apply this code to every value in the first column:

library(tidyverse)
df %>% mutate(new = as.numeric(str_extract_all(df$test, "[0-9]+")[[1]])[1])

However, this produces a column new that contains only the number 1. What am I doing wrong?

Answer 1

Why not a base R solution?

df$new <- as.numeric(gsub("[^[:digit:]]+", "", df$test))

df
#          test value new
#1 test_A_1.txt  0.51   1
#2 test_A_2.txt  0.52   2
#3 test_A_3.txt  0.56   3
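
The output above corresponds to names with a single digit group, such as test_A_1.txt. Because the pattern deletes every non-digit character, a name with two digit groups collapses into one number. A minimal sketch of that behavior, using the file names from the question's data frame (the variable names files, glued and first are only for illustration):

```r
# File names with two digit groups, as in the question's data frame
files <- c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt")

# Removing all non-digits glues the groups together: "1_1" becomes "11"
glued <- as.numeric(gsub("[^[:digit:]]+", "", files))
print(glued)

# Capturing only the first digit group avoids that
first <- as.numeric(sub("^[^[:digit:]]*([[:digit:]]+).*$", "\\1", files))
print(first)
```

If only the first number is wanted, the second form may be the safer choice when names can contain several numbers.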

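Edit.

Taking the example from user @camille's answer, where the strings may contain different numbers of digit groups, here is a solution that uses the stringr package.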

df1 <- data.frame(test = c("test_A_1.txt", "test_A_2.txt", "test_A_3.txt"), value = c(0.51, 0.52, 0.56))
df2 <- data.frame(test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt"), value = c(0.51, 0.52, 0.56))
df3 <- data.frame(test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt", "test_A_4_2_1.txt"), value = c(0.51, 0.52, 0.56, 2))

num2cols <- function(DF, col = "test"){
  s <- stringr::str_extract_all(DF[[col]], "[[:digit:]]+")
  Max <- max(sapply(s, length))
  new <- do.call(rbind, lapply(s, function(x){
    as.numeric(c(x, rep(NA, Max - length(x))))
  }))
  names_new <- paste0("new", seq.int(ncol(new)))
  setNames(cbind(DF, new), c(names(DF), names_new))
}

num2cols(df1)
num2cols(df2)
num2cols(df3)

Answer 2

We can use parse_number from readr

library(dplyr)
library(purrr)
library(stringr)
df %>%
    mutate(new = readr::parse_number(as.character(test)))

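Regarding the OP's issue: the code selects only the first element ([[1]]) of the list returned by str_extract_all. It is better to use str_extract instead, because we only need the first instance of one or more digits (\\d+) from each string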

df %>%
    mutate(new = as.numeric(str_extract(test, "[0-9]+")))
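
To see why the original attempt fills the column with 1, it helps to look at the shape of the result. The sketch below uses base R's gregexpr and regmatches as a stand-in for str_extract_all (an assumption, made so that it runs without extra packages); both return a list with one element per input string. The variable names are illustrative.

```r
files <- c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt")

# One list element per string, each holding all digit groups of that string
all_nums <- regmatches(files, gregexpr("[0-9]+", files))
str(all_nums)

# [[1]] keeps only the matches of the FIRST string, and [1] its first number,
# so a single value is recycled down the whole column
recycled <- as.numeric(all_nums[[1]])[1]
print(recycled)

# Taking the first match of EVERY string gives one value per row
per_row <- as.numeric(sapply(all_nums, `[`, 1))
print(per_row)
```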

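If we need the output from str_extract_all (just in case), unlist the list to a vector and then apply as.numeric to that vector

df %>% mutate(new = as.numeric(unlist(str_extract_all(test, "[0-9]+"))))

If there are multiple instances, keep the result as a list after converting it to numeric, by looping over the list elements with map

df %>% mutate(new = map(str_extract_all(test, "[0-9]+"), as.numeric))

Note: the str_extract-based solution was first posted here.

In base R, we can use regexpr

df$new <- as.numeric(regmatches(df$test, regexpr("\\d+", df$test)))

Update

In the updated example, if we need both instances of numbers, the first can be extracted with str_extract, and the last one (stri_extract_last from stringi can be used as well) by providing a regex lookaround that checks for digits followed by a . and 'txt'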
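
The code for this step is not shown above. A minimal sketch of the idea, written in base R so that it runs without extra packages; the column names new1 and new2 and the exact lookahead pattern are assumptions, not the answer's original code:

```r
df <- data.frame(
  test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt"),
  value = c(0.51, 0.52, 0.56),
  stringsAsFactors = FALSE
)

# First run of digits in each string
df$new1 <- as.numeric(regmatches(df$test, regexpr("\\d+", df$test)))

# Digits immediately followed by ".txt"; the lookahead (?=...) needs perl = TRUE
df$new2 <- as.numeric(
  regmatches(df$test, regexpr("\\d+(?=\\.txt)", df$test, perl = TRUE))
)

df
```

With stringr, the same two patterns can be passed to str_extract, which also supports lookarounds.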

Answer 3

Slightly modifying your existing code:

df %>% 
  mutate(new = as.integer(str_extract(test, "[0-9]+")))

Or simply

df$new <- as.integer(str_extract(df$test, "[0-9]+"))

Answer 4

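Since, as you said, the file names may contain more than one number, I suggest an approach that is more verbose but scales beyond one or two numbers. That way you don't hard-code columns such as new1 and new2. To illustrate this, I added a third number to one of the file names.

The initial problem you ran into is that str_extract_all returns a list, and you then need to extract items from that list. You can unnest the list to get one row per number, add a key that numbers the matches within each file name in order, and then spread to a wide shape so that each number gets its own column, with NA where the file name has no number in that position.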

library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame(test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt", "test_A_4_2_1.txt"), value = c(0.51, 0.52, 0.56, 2))

df %>%
  mutate(nums = str_extract_all(test, "\\d+")) %>% 
  unnest(nums) %>%
  group_by(test) %>%
  mutate(key = row_number()) %>%
  spread(key, value = nums, sep = "")
#> # A tibble: 4 x 5
#> # Groups:   test [4]
#>   test             value key1  key2  key3 
#>   <fct>            <dbl> <chr> <chr> <chr>
#> 1 test_A_1_1.txt    0.51 1     1     <NA> 
#> 2 test_A_2_1.txt    0.52 2     1     <NA> 
#> 3 test_A_3_1.txt    0.56 3     1     <NA> 
#> 4 test_A_4_2_1.txt  2    4     2     1

Answer 5

Given that the strings are fixed width, you can:

df$new <- substr(df$test, 8, 8) %>% as.integer

I recommend as.integer rather than as.numeric, because you are working with integers rather than floating-point numbers.
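
Two things may be worth checking before relying on this: a fixed position stops working as soon as a number grows to two digits, and the two conversion functions return different storage types. A small sketch (the two-digit file name is a made-up example, and the variable names are illustrative):

```r
# Position 8 holds the first number only while every name has the same layout
substr("test_A_3_1.txt", 8, 8)
substr("test_A_10_1.txt", 8, 8)  # only the leading digit of 10 is picked up

# as.integer returns integer storage, as.numeric returns double
int_type <- typeof(as.integer("3"))
dbl_type <- typeof(as.numeric("3"))
print(c(int_type, dbl_type))
```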

Answer 6

We can also use sub or stringi::stri_extract_last_regex:

sapply(df1, function(x) sub('.*(\\d{1}).*', '\\1', x))

Or

sapply(df1, function(x) stringi::stri_extract_last_regex(x, "\\d{1}"))
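
Both calls return the last single digit of each string. In the sub pattern the leading .* is greedy, so the capture group ends up on the final digit. A minimal sketch on a plain character vector (the names, including the two-digit case, are made up for illustration):

```r
files <- c("test_A_3_1.txt", "test_A_4_2_1.txt", "test_A_12.txt")

# Greedy .* consumes as much as possible, leaving the last digit for the group
last_digit <- sub(".*(\\d{1}).*", "\\1", files)
print(last_digit)

# Capturing \\d+ still yields one digit, because the greedy prefix
# gives back only what the group needs in order to match
still_one <- sub(".*(\\d+).*", "\\1", files)
print(still_one)
```

So for a name like test_A_12.txt this approach picks up 2 rather than 12, which may or may not be what is wanted.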

How do I extract numbers from strings in a data frame in R and place them in a new column?
