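I am still fairly new to R and have gotten used to working with data inside a single data frame. For my current requirement, though, I am facing the following problem:
I also have another data frame, DD2.df, which contains the following data:
I want to add a column called "Delivered qty" to DD1.df and fill it with the number of delivered orders, calculated from the second data frame.
Note that the "order.Description" column in the first data frame is unstructured text: it can be blank, and it contains free-form text with order numbers embedded in it.
Can anyone help me with this? Thanks in advance!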
Answer 0 (score: 3)
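Here you go. We use str_extract_all from the stringr package to extract all orders, where an order is defined as the string ORD- followed by 5 digits. Note that if valid orders can follow other patterns, you will need to modify the second argument of str_extract_all. separate_rows from the tidyr package then puts each order on its own row. Finally, we count the total number of orders and the number of delivered orders.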
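As a quick illustration of the extraction step on its own, here is a minimal sketch. The vector desc is made up for this example, and it assumes that a blank description is an empty string rather than NA.

```r
library(stringr)

# Hypothetical descriptions: no order, two orders, and a blank entry
desc <- c("No order was placed",
          "ORD-12252 and ORD-78564 was the order placed",
          "")

# Returns a list with one character vector of matches per input string;
# strings without any match give character(0)
found <- str_extract_all(desc, "ORD-\\d{5}")

# Number of order numbers found in each description
lengths(found)
```

Because every input keeps its own list element, rows without any order number are not dropped at this stage.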
# Example data: free-text order descriptions and an order status lookup
df1 <- data.frame(
  Country = c("France", "England", "India", "America", "England"),
  City = c("Paris", "London", "Mumbai", "Los Angeles", "London"),
  Order_Desc = c("No order was placed", "ORD-34212 was the order placed",
                 "ORD-12252 and ORD-78564 was the order placed",
                 "The orders placed before 2017 was ORD-56438, ORD-13198 and ORD-12258",
                 "The order was ORD-34567"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  OrderNo = c("ORD-34212", "ORD-12252", "ORD-78564", "ORD-56438",
              "ORD-13198", "ORD-12258", "ORD-34567"),
  Status = c("Delivered", "Not delivered", "Not delivered",
             "Delivered", "Not delivered", "Delivered", "Delivered"),
  stringsAsFactors = FALSE
)
library(stringr)
library(dplyr)
library(tidyr)
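# Collect every order number per Country/City, then look up each order's status
df1g <- df1 %>%
  group_by(Country, City) %>%
  mutate(
    # All ORD-xxxxx matches in the group, glued into one "|"-separated string
    orders = paste(str_extract_all(Order_Desc, "ORD-\\d{5}", simplify = TRUE),
                   collapse = "|")
  ) %>%
  distinct(Country, City, orders) %>%
  # One row per order number
  separate_rows(orders, sep = "[|]") %>%
  # Empty or unknown order strings end up with Status = NA
  left_join(df2, by = c("orders" = "OrderNo"))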
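df1s <- df1g %>%
  group_by(Country, City) %>%
  summarise(
    # Orders that were found in df2
    total_orders = sum(!is.na(Status)),
    # na.rm = TRUE so that groups without any order give 0 instead of NA
    delivered_orders = sum(Status == "Delivered", na.rm = TRUE)
  )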
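The summary above has one row per Country and City, whereas the question asks for a new column on the first data frame. One possible way to get that, sketched here with a reduced copy of the example data, is to count the delivered orders row by row. The column name Delivered_qty is hypothetical, and the sketch assumes that an order number repeated within one description should only be counted once.

```r
library(stringr)

# Reduced example data; column names follow the answer above
df1 <- data.frame(
  City = c("Paris", "Mumbai", "Los Angeles"),
  Order_Desc = c("No order was placed",
                 "ORD-12252 and ORD-78564 was the order placed",
                 "Orders were ORD-56438, ORD-13198 and ORD-12258"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  OrderNo = c("ORD-12252", "ORD-78564", "ORD-56438", "ORD-13198", "ORD-12258"),
  Status = c("Not delivered", "Not delivered", "Delivered",
             "Not delivered", "Delivered"),
  stringsAsFactors = FALSE
)

# Order numbers that df2 marks as delivered
delivered <- df2$OrderNo[df2$Status == "Delivered"]

# One character vector of order numbers per row of df1
order_list <- str_extract_all(df1$Order_Desc, "ORD-\\d{5}")

# Count how many distinct orders in each row are in the delivered set
df1$Delivered_qty <- vapply(
  order_list,
  function(x) sum(unique(x) %in% delivered),
  integer(1)
)

df1
```

This keeps df1 at its original number of rows, so repeated Country and City combinations are not merged the way they are in the grouped summary.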