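The input data frame has three id columns and a raw_text column (named text in the example below). u_id identifies the user, doc_id identifies a document belonging to that user, and sent_id identifies a sentence within that document.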
Suppose we need to extract all dates and their positions from raw_text. Example data:
df <- data.frame(u_id=c(1,1,1,1,1,2,2,2),
doc_id=c(1,1,1,2,2,1,1,2),
sent_id=c(1,2,3,1,2,1,2,1),
text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
"pertinent results: 2105-4-16 05:02pm gap-14
2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
"method exists and the former because calls to the corresponding",
"admission date: 2001-4-19 discharge date: 2002-5-23 service:",
"pertinent results: 2105-4-16 05:02pm gap-14
2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
"method exists and the former because calls to the corresponding",
"method exists and the former because calls to the corresponding",
"method exists and the former because calls to the corresponding"))
The date extraction comes out in data.frame format. Is there a way to get the string locations into a data.frame as well, matched to their ids and strings? My approach so far:
#define a regex for date
date<-"([0-9]{2,4})[- . /]([0-9]{1,4})[- . /]([0-9]{2,4})"
#libraries (unnest() comes from tidyr)
library(dplyr)
library(stringr)
library(tidyr)
#extract dates
df_i<-df %>%
mutate(i=str_extract_all(text,date)) %>%
mutate(date=lapply(i, function(x) if(identical(x, character(0))) NA_character_ else x)) %>%
unnest(date)
#extract date locations
df_ii<-str_locate_all(df$text,date)
n<-max(sapply(df_ii, nrow))
date_loc<-as.data.frame(do.call(rbind, lapply(df_ii, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x))))))
Answer 0 (score: 4)
You can do this:
regex = "\\b[0-9]+[-][0-9]+[-][0-9]+\\b"
df_i = str_extract_all(df$text, regex)
df_ii = str_locate_all(df$text, regex)
output1 = Map(function(x, y, z){
if(length(y) == 0){
y = NA
}
if(nrow(z) == 0){
z = rbind(z, list(start = NA, end = NA))
}
data.frame(id = x, date = y, z)
}, df$id, df_i, df_ii) %>%
do.call(rbind,.) %>%
merge(df, .)
Or, sticking with pipe-only syntax:
regex = "[0-9]+[-][0-9]+[-][0-9]+"
output1 = df %>%
{list(.$id, str_extract_all(.$text, regex),
str_locate_all(.$text, regex))} %>%
{Map(function(x, y, z){
if(length(y) == 0){
y = NA
}
if(nrow(z) == 0){
z = rbind(z, list(start = NA, end = NA))
}
data.frame(id = x, date = y, z)
}, .[[1]], .[[2]], .[[3]])} %>%
do.call(rbind, .) %>%
merge(df, .)
Result:
id
1 1
2 1
3 2
4 2
5 2
6 3
text
1 admission date: 2001-4-19 discharge date: 2002-5-23 service:
2 admission date: 2001-4-19 discharge date: 2002-5-23 service:
3 pertinent results: 2105-4-16 05:02pm gap-14 \n 2105-4-16 04:23pm rdw-13.1 2105-4-16 .
4 pertinent results: 2105-4-16 05:02pm gap-14 \n 2105-4-16 04:23pm rdw-13.1 2105-4-16 .
5 pertinent results: 2105-4-16 05:02pm gap-14 \n 2105-4-16 04:23pm rdw-13.1 2105-4-16 .
6 method exists and the former because calls to the corresponding
date start end
1 2001-4-19 17 25
2 2002-5-23 43 51
3 2105-4-16 20 28
4 2105-4-16 77 85
5 2105-4-16 104 112
6 <NA> NA NA
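The code above keys on a single id column, whereas the data frame posted in the question has u_id, doc_id and sent_id. The following is a minimal sketch of the same idea carried over to those three columns. It assumes a shortened stand-in for the question's df, and the names pieces and result are illustrative only, not part of the original answer.

```r
library(stringr)

# Shortened stand-in for the question's data frame (assumption: same column names).
df <- data.frame(
  u_id    = c(1, 1, 2),
  doc_id  = c(1, 1, 1),
  sent_id = c(1, 2, 1),
  text    = c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
              "pertinent results: 2105-4-16 05:02pm gap-14",
              "method exists and the former because calls to the corresponding"),
  stringsAsFactors = FALSE
)

regex <- "\\b[0-9]+[-][0-9]+[-][0-9]+\\b"
dates <- str_extract_all(df$text, regex)  # list of character vectors
locs  <- str_locate_all(df$text, regex)   # list of start/end matrices

# One small data frame per input row; rows without a match are padded with NA.
pieces <- Map(function(i, d, l) {
  if (length(d) == 0) {
    d <- NA_character_
    l <- matrix(NA_integer_, nrow = 1, ncol = 2,
                dimnames = list(NULL, c("start", "end")))
  }
  # Repeat the id columns once per date found in this row.
  ids <- df[rep(i, length(d)), c("u_id", "doc_id", "sent_id")]
  data.frame(ids, date = d, l, stringsAsFactors = FALSE)
}, seq_len(nrow(df)), dates, locs)

result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

Indexing by row number rather than merging on id avoids relying on any single column being unique.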
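Notes:
Your regex has stray spaces inside [- . /], so a space is also accepted as a separator. date<-"([0-9]{2,4})[-./]([0-9]{1,4})[-./]([0-9]{2,4})" should do.
mutate lets you use variables you have just created within the same call, so there is no need for two separate mutate calls to build df_i.
The {} around list() and Map() are needed to override the %>% default of passing the output of the previous step as the first argument of the next function. For example: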
df %>%
list(.$id, str_extract_all(.$text, regex),
str_locate_all(.$text, regex))
becomes:
list(df, df$id, str_extract_all(df$text, regex),
str_locate_all(df$text, regex))
which is not what we want.
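The difference can be checked on a toy object. This is a small sketch using magrittr's %>% directly; the list x and the names without_braces and with_braces are made up for illustration.

```r
library(magrittr)

x <- list(a = 1, b = 2)

# The dot only appears inside a nested call (.$a), so the pipe still
# passes x as the first argument: this is list(x, x$a).
without_braces <- x %>% list(.$a)

# Inside braces the expression is evaluated as written: list(x$a).
with_braces <- x %>% { list(.$a) }

length(without_braces)  # 2
length(with_braces)     # 1
```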
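Edit:
The OP updated df to include rows whose text contains no dates. That broke my original solution, because some elements of the lists returned by str_extract_all and str_locate_all then have length 0 and zero rows respectively. I fixed this by adding two if statements: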
if(length(y) == 0){
y = NA
}
if(nrow(z) == 0){
z = rbind(z, list(start = NA, end = NA))
}
This sets date to NA and adds a row of NAs for start and end in rows that have no date, which gives id one row to bind to in the data.frame step.
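To see why the padding is needed, here is a short sketch of the no-match case in isolation. It builds the NA row with matrix() instead of the rbind() call used above, which is a substitution made here for clarity; the names no_date, unpadded and padded are illustrative.

```r
library(stringr)

regex <- "[0-9]+[-][0-9]+[-][0-9]+"
no_date <- "method exists and the former because calls to the corresponding"

y <- str_extract_all(no_date, regex)[[1]]  # character(0)
z <- str_locate_all(no_date, regex)[[1]]   # 0-row matrix with columns start, end

# Without padding, id has one row while date/start/end have none,
# so data.frame() raises an error.
unpadded <- try(data.frame(id = 1, date = y, z), silent = TRUE)

# Pad both pieces to exactly one row of NA.
if (length(y) == 0) {
  y <- NA_character_
}
if (nrow(z) == 0) {
  z <- matrix(NA_integer_, nrow = 1, ncol = 2,
              dimnames = list(NULL, c("start", "end")))
}
padded <- data.frame(id = 1, date = y, z)
padded
```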