I am trying to extract the required words using pattern matching.
Below is sample data from the object table:
+-----------+-------------------------------------------------------------------------------------------------+
| Unique_Id | Text                                                                                            |
+-----------+-------------------------------------------------------------------------------------------------+
| Ax23z12   | Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134 |
+-----------+-------------------------------------------------------------------------------------------------+
Using the following code
regmatches(table[1,2],gregexpr("2000-\\d{4}",table[1,2]))
I was able to extract the output as
[[1]]
[1] "2000-0511" "2000-0511"
However, the output I am looking for is as follows:
+-----------+---------------------------------------------------------------------------+-----------+-----------+
| Unique_Id | Text                                                                      | Column1   | Column2   |
+-----------+---------------------------------------------------------------------------+-----------+-----------+
| Ax23z12   | Tool generated code 2015-8134 upon further validation, the tool confirmed | 2015-8134 | 2015-8134 |
|           | the code as 2015-8134                                                     |           |           |
+-----------+---------------------------------------------------------------------------+-----------+-----------+
The data in the Text column can contain this number several times (up to 7), so I am looking for a dynamic solution.
Thanks a lot.
Answer 0 (score: 3)
Here is one approach. I used the following sample data, called foo.
# id text
# <int> <chr>
#1 1 Here is my code, 2015-8134. Here is your code, 2015-1111.
#2 2 His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666
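I first used stri_extract_all_regex() to extract the numbers from text. With simplify = TRUE this returns a matrix, so I converted it to a data frame. Then I used bind_cols() to combine it with the original data set. The last step was to fix the column names: I used gsub() to replace X with Column in the column names.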
library(dplyr)
library(stringi)
out <- stri_extract_all_regex(str = foo$text, pattern = "\\d+-\\d+", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  bind_cols(foo, .)
names(out) <- names(out) %>%
  gsub(pattern = "X", replacement = "Column")
# id text Column1 Column2 Column3
# <int> <chr> <chr> <chr> <chr>
#1 1 Here is my code, 2015-8134. Here is your code, 2015-1111. 2015-8134 2015-1111
#2 2 His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666 2016-8888 2016-7777 2016-6666
DATA
foo <- structure(list(id = 1:2, text = c("Here is my code, 2015-8134. Here is your code, 2015-1111.",
"His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666"
)), .Names = c("id", "text"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L))
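Answer 1 (score: 2)
Using stringr and data.table:
1) Use str_match_all to extract all matching patterns;
2) Use transpose to convert the extracted patterns into columns;
3) Build a new data frame by combining the extracted columns with the original columns.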
library(stringr)
library(data.table)
lst = transpose(str_match_all(df$Text, "2015-\\d{4}"))
data.frame(df, setNames(lst, paste0("Column", seq_along(lst))))
# Unique_Id Text Column1 Column2
#1 Ax23z12 Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134 2015-8134 2015-8134
#2 By56m22 Tool generated code 2015-8134 upon further validation 2015-8134 <NA>
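The same three steps can also be sketched in base R, without extra packages. This is only an illustrative variant, not the code from the answer above: the two-row df is hypothetical sample data modelled on the output shown, and the names matches, n_max, padded and out are made up for the example. It assumes at least one row contains a match.

```r
# Hypothetical sample data mirroring the output shown above
df <- data.frame(
  Unique_Id = c("Ax23z12", "By56m22"),
  Text = c(
    "Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134",
    "Tool generated code 2015-8134 upon further validation"
  ),
  stringsAsFactors = FALSE
)

# Step 1: extract all matches per row (a list of character vectors)
matches <- regmatches(df$Text, gregexpr("2015-\\d{4}", df$Text))

# Step 2: pad every vector with NA up to the longest one and stack them as rows
# (indexing past the end of a vector yields NA)
n_max <- max(lengths(matches))
padded <- do.call(rbind, lapply(matches, `[`, seq_len(n_max)))
colnames(padded) <- paste0("Column", seq_len(n_max))

# Step 3: combine the new columns with the original data
out <- cbind(df, padded, stringsAsFactors = FALSE)
print(out)
```

Because the number of columns is taken from the longest match list, the same code should handle rows with up to 7 (or more) occurrences.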
Answer 2 (score: 0)
Something like this might work for you
df[apply(df, 1, function(x) any(grepl("2000-\\d{4}", x))), ]
See this reproducible example
iris[apply(iris, 1, function(x) any(grepl("set", x))), ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# etc
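Note that this approach keeps the rows in which any cell matches the pattern; it does not split the matches into separate columns. A minimal sketch of that behaviour, using hypothetical two-row data (the df, hits and kept names are made up for illustration):

```r
# Hypothetical data: one row contains a matching code, the other does not
df <- data.frame(
  Unique_Id = c("Ax23z12", "Cz11k09"),
  Text = c("Tool generated code 2000-0511", "No code in this row"),
  stringsAsFactors = FALSE
)

# TRUE for each row where at least one cell matches the pattern
hits <- apply(df, 1, function(x) any(grepl("2000-\\d{4}", x)))

# Row filter: same columns as df, only the matching rows are kept
kept <- df[hits, ]
print(kept)
```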