I have a data.frame with two columns. The second column contains file names.
df <- data.frame(paragraph = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
                 filename = "./data/RevCon_2015_C1_Austria_05_06.txt", stringsAsFactors = FALSE)
How can I extract certain strings from the second column (using stringr) and add them (using dplyr::mutate) as additional variables (conference, year, country, etc.)?
Answer 0 (score: 2)
We can use tidyr::separate to do the following:
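library(tidyverse)

df %>%
  # strip the leading "./data/" and the trailing ".txt"
  mutate(tmp = gsub("(\\./data/|\\.txt)", "", filename)) %>%
  # split on runs of non-alphanumeric characters (separate's default), here "_"
  separate(
    tmp,
    into = c("conference", "year", "ignored", "country", "month", "day")) %>%
  mutate(date = paste(day, month, year, sep = "/")) %>%
  select(-ignored, -month, -day)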
# paragraph filename conference year
#1 Lorem ipsum [...] ./data/RevCon_2015_C1_Austria_05_06.txt RevCon 2015
# country date
#1 Austria 06/05/2015
Note that this assumes filename follows the pattern ./data/{conference}_{year}_{ignored}_{country}_{month}_{day}.txt
Sample data:
df <- data.frame(
  paragraph = "Lorem ipsum [...]",
  filename = "./data/RevCon_2015_C1_Austria_05_06.txt",
  stringsAsFactors = FALSE)
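Since the separate() approach relies on every file name following the assumed layout, it may be worth checking the names before splitting. The following is a minimal base R sketch, not part of the original answer. The helper name matches_layout, the second file name, and the assumptions that the year has four digits and that every field is purely alphanumeric are illustrative only.

```r
# Hypothetical helper: TRUE where a file name fits the assumed layout
# {conference}_{year}_{ignored}_{country}_{month}_{day}.txt
# Assumption: 4-digit year, 2-digit month/day, alphanumeric fields only.
matches_layout <- function(path) {
  pattern <- "^[[:alnum:]]+_[0-9]{4}_[[:alnum:]]+_[[:alnum:]]+_[0-9]{2}_[0-9]{2}\\.txt$"
  grepl(pattern, basename(path))
}

files <- c(
  "./data/RevCon_2015_C1_Austria_05_06.txt",  # fits the layout
  "./data/RevCon_2015_Austria.txt"            # made-up name with too few fields
)

ok <- matches_layout(files)
print(ok)

# Count the underscore-separated fields per name; the answer expects six
stems <- sub("\\.txt$", "", basename(files))
n_fields <- lengths(strsplit(stems, "_", fixed = TRUE))
print(n_fields)
```

Rows where ok is FALSE could be filtered out or inspected by hand before calling separate(), which would otherwise fill missing pieces with NA and emit a warning.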
Answer 1 (score: 0)
Here are two different approaches, using separate and extract from tidyr:
library(dplyr)
library(tidyr)
df %>%
  # rebuild the base name as conference_year_country_date
  mutate(filename2 = gsub("^(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$",
                          "\\1_\\2_\\3_\\5.\\4.\\2", basename(filename))) %>%
  separate(filename2, c("conference", "year", "country", "date"), sep = "_")
Or with extract:
df %>%
  # one output column per capture group; remove = FALSE keeps filename
  extract(filename, c("conference", "year", "country", "day", "month"),
          "^.+/(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$",
          remove = FALSE) %>%
  unite(date, month, day, year, sep = ".", remove = FALSE) %>%
  select(paragraph, filename, conference, year, country, date)
Result:
paragraph
1 Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
filename conference year country date
1 ./data/RevCon_2015_C1_Austria_05_06.txt RevCon 2015 Austria 06.05.2015
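Notes:
- gsub matches each "column" we want with a capture group and reorders the groups as needed. Note that _ is added between them to mark the column boundaries.
- The basename function extracts the part of the path after the last /.
- separate splits the elements into actual columns, with _ as the separator.
- extract does not rearrange anything; it treats each capture group as a separate column.
- unite binds month, day and year together without removing the original columns.
- select drops day and month and rearranges the column order.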
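The notes above can be traced step by step in base R. This is only a sketch under stated assumptions: it reuses the regular expressions from the answer but passes perl = TRUE (which the answer's own calls do not), so that \w, \d and the lazy .+? follow PCRE semantics. The variable names are illustrative.

```r
filename <- "./data/RevCon_2015_C1_Austria_05_06.txt"

# basename() keeps only what follows the last "/"
stem <- basename(filename)
print(stem)

# gsub() route: capture the wanted fields, reorder them, and join with "_"
# (perl = TRUE is an assumption of this sketch, not part of the answer)
pattern <- "^(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$"
rearranged <- gsub(pattern, "\\1_\\2_\\3_\\5.\\4.\\2", stem, perl = TRUE)
print(rearranged)

# Splitting on "_" mirrors what separate(..., sep = "_") does for each row
parts <- strsplit(rearranged, "_", fixed = TRUE)[[1]]
print(parts)

# extract() route: every capture group becomes its own value, no reordering
pattern_full <- "^.+/(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$"
groups <- regmatches(filename, regexec(pattern_full, filename, perl = TRUE))[[1]][-1]
print(groups)
```

The first route yields four pieces because the date is assembled inside the replacement string, while the second yields five pieces that unite later combines into the date.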