Extract strings from a file name and create new columns with mutate

Time: 2018-04-24 15:28:37

Tags: r dplyr stringr mutate

I have a data.frame with two columns. The second column contains file names.

df <- data.frame(
    paragraph = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
    filename = "./data/RevCon_2015_C1_Austria_05_06.txt",
    stringsAsFactors = FALSE)

How can I extract certain strings from the second column (using stringr) and add them (using dplyr::mutate) as additional variables (conference, year, country, etc.)?

2 answers:

Answer 0 (score: 2)

We can use tidyr::separate as follows:

library(tidyverse);
df %>%
    mutate(tmp = gsub("(\\./data/|\\.txt)", "", filename)) %>%
    separate(
        tmp,
        into = c("conference", "year", "ignored", "country", "month", "day")) %>%
    mutate(date = paste(day, month, year, sep = "/")) %>%
    select(-ignored, -month, -day)
#          paragraph                                filename conference year
#1 Lorem ipsum [...] ./data/RevCon_2015_C1_Austria_05_06.txt     RevCon 2015
#  country        date
#1 Austria  06/05/2015

Note that this assumes filename follows the pattern ./data/{conference}_{year}_{ignored}_{country}_{month}_{day}.txt

Sample data
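One way to check that assumption before splitting is to test every path against the pattern first. This is only a sketch in base R: matches_pattern is a hypothetical helper, the second file name is made up for illustration, and the four-digit year and two-digit month/day are my reading of the sample file name rather than something the question states.

```r
# Hypothetical helper: TRUE when a path looks like
# ./data/{conference}_{year}_{ignored}_{country}_{month}_{day}.txt
# Assumption: year is four digits, month and day are two digits each.
matches_pattern <- function(path) {
  grepl("^\\./data/[^_/]+_\\d{4}_[^_]+_[^_]+_\\d{2}_\\d{2}\\.txt$",
        path, perl = TRUE)
}

files <- c(
  "./data/RevCon_2015_C1_Austria_05_06.txt",  # from the question
  "./data/RevCon_2015_Austria.txt"            # made-up, too few fields
)

ok <- matches_pattern(files)
print(ok)
```

Rows where the check is FALSE would otherwise reach separate with too few or too many pieces, which by default only produces a warning and fills with NA or drops the extras.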

df  <- data.frame(
    paragraph = "Lorem ipsum [...]",
    filename = "./data/RevCon_2015_C1_Austria_05_06.txt",
    stringsAsFactors = FALSE)

Answer 1 (score: 0)

Here are two different approaches using separate and extract from tidyr. The first uses separate:

library(dplyr)
library(tidyr)

df %>%
  mutate(filename2 = gsub("^(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$", 
                          "\\1_\\2_\\3_\\5.\\4.\\2", basename(filename))) %>%
  separate(filename2, c("conference", "year", "country", "date"), sep = "_")

The second uses extract:

df %>%
  extract(filename, c("conference", "year", "country", "day", "month"),
          "^.+/(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$",
          remove = FALSE) %>%
  unite(date, month, day, year, sep = ".", remove = FALSE) %>%
  select(paragraph, filename, conference, year, country, date)

Result:

                                                                   paragraph
1 Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
                                 filename conference year country       date
1 ./data/RevCon_2015_C1_Austria_05_06.txt     RevCon 2015 Austria 06.05.2015

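Notes:

  1. The first approach uses gsub to match each "column" we want with a capture group and reorders the groups as needed. Note that _ is inserted between the groups to delimit the columns
    • I used the basename function to extract everything after the last /
    • separate then splits the elements into actual columns, with _ as the separator
  2. The second approach uses essentially the same regular expression (prefixed with ^.+/ to skip the directory, since basename is not used), but instead of rearranging, extract treats each capture group as a separate column
    • unite binds month, day, and year together without removing the original columns
    • Finally, select drops day and month and rearranges the column order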
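
Since the question asked for stringr together with mutate, and neither approach above actually calls stringr, here is a minimal sketch of that variant. It reuses the regular expression from the first approach with stringr::str_match, which returns a character matrix whose first column is the full match and whose remaining columns are the capture groups. The names pattern, parts, and result are mine, and the sample data is the shortened version from the other answer.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  paragraph = "Lorem ipsum [...]",
  filename = "./data/RevCon_2015_C1_Austria_05_06.txt",
  stringsAsFactors = FALSE)

# Same capture groups as in the gsub approach, applied to the base name.
pattern <- "^(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$"

# Column 1 is the full match; columns 2 to 6 are the five capture groups.
parts <- str_match(basename(df$filename), pattern)

result <- df %>%
  mutate(
    conference = parts[, 2],
    year       = parts[, 3],
    country    = parts[, 4],
    # Same ordering as "\\5.\\4.\\2" in the gsub replacement string.
    date       = paste(parts[, 6], parts[, 5], parts[, 3], sep = "."))

print(result)
```

File names that do not match the pattern give NA in every column of parts, so a bad row shows up as missing values rather than as a wrongly split string.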