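Getting complex data from one data frame into another

Time: 2018-07-11 08:51:19

Tags: r

I'm still fairly new to R and have only gotten used to working with data inside a single data frame. With my current requirement, though, I'm facing the following problem:

  1. I have a data frame DD1.df, which contains the following data: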

First Dataframe

I also have another data frame, DD2.df, which contains the following data:

Second Dataframe

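I want to add a column called "Delivered Quantity" to DD1.df and fill it with the number of delivered orders, calculated from the second data frame.

Note that the "order.Description" column in the first data frame is unstructured text: it can be empty, and where it is not, the order numbers are embedded in free-form text.

Can anyone help me? Thanks in advance!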

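1 Answer:

Answer 0: (score: 3)

Here you go. We use str_extract_all from the stringr package to extract all orders, where an order is defined as the string ORD- followed by 5 digits. Note that if a valid order needs to be defined by a different pattern, you will have to change the second argument of str_extract_all. separate_rows from the tidyr package is used to split multiple orders into their own rows. Finally, we count the total number of orders and the number of delivered orders.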
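Before the full pipeline, here is a minimal sketch of the extraction and splitting steps on their own. The two sample strings and the names desc, found, toy and long are made up for illustration and are not part of the answer; the regular expression is the same one used in the answer's code.

```r
library(stringr)
library(tidyr)

# Hypothetical sample descriptions (illustration only)
desc <- c("No order was placed",
          "ORD-12252 and ORD-78564 was the order placed")

# Without simplify = TRUE the result is a list with one element per input
# string; a string with no match yields an empty character vector.
found <- str_extract_all(desc, "ORD-\\d{5}")
lengths(found)

# Collapse each element into one "|"-separated string, then give every
# order number its own row.
toy <- data.frame(
  id = seq_along(desc),
  orders = vapply(found, paste, character(1), collapse = "|"),
  stringsAsFactors = FALSE
)
long <- separate_rows(toy, orders, sep = "[|]")
long
```

The separator is written as "[|]" because separate_rows treats sep as a regular expression, and a bare "|" would mean alternation.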

df1 <- data.frame(
  Country = c("France", "England", "India", "America", "England"),
  City = c("Paris", "London", "Mumbai", "Los Angeles", "London"),
  Order_Desc = c("No order was placed", "ORD-34212 was the order placed",
                 "ORD-12252 and ORD-78564 was the order placed",
                 paste("The orders placed before 2017 was ORD-56438, ORD-13198",
                       "and ORD-12258"),
                 "The order was ORD-34567"),
  stringsAsFactors = FALSE
  )
df2 <- data.frame(
  OrderNo = c("ORD-34212", "ORD-12252", "ORD-78564", "ORD-56438",
              "ORD-13198", "ORD-12258", "ORD-34567"),
  Status = c("Delivered", "Not delivered", "Not delivered",
             "Delivered", "Not delivered", "Delivered", "Delivered"),
  stringsAsFactors = FALSE
)

library(stringr)
library(dplyr)
library(tidyr)
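df1g <- df1 %>%
  group_by(Country, City) %>%
  mutate(
    # Pull every order number out of the free text and join all of a
    # group's order numbers into one "|"-separated string
    orders = paste(str_extract_all(Order_Desc, "ORD-\\d{5}", simplify = TRUE),
                   collapse = "|")
  ) %>%
  distinct(Country, City, orders) %>%
  # One row per order number
  separate_rows(orders, sep = "[|]") %>%
  # Look up the status of each order
  left_join(df2, by = c("orders" = "OrderNo"))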
df1s <- df1g %>%
  group_by(Country, City) %>%
  summarise(
    total_orders = sum(!is.na(Status)),
    # na.rm = TRUE so that groups without any order get 0 instead of NA
    delivered_orders = sum(Status == "Delivered", na.rm = TRUE)
  )
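
The pipeline above produces one summary row per Country and City. Since the question asks for a new column on the first data frame itself, a row-by-row variant may be closer to what is wanted. The following is only a sketch under that assumption: the column name Delivered_Qty and the helper names delivered and found are hypothetical, and the data are the same sample values used above.

```r
library(stringr)

df1 <- data.frame(
  Country = c("France", "England", "India", "America", "England"),
  City = c("Paris", "London", "Mumbai", "Los Angeles", "London"),
  Order_Desc = c("No order was placed",
                 "ORD-34212 was the order placed",
                 "ORD-12252 and ORD-78564 was the order placed",
                 "The orders placed before 2017 was ORD-56438, ORD-13198 and ORD-12258",
                 "The order was ORD-34567"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  OrderNo = c("ORD-34212", "ORD-12252", "ORD-78564", "ORD-56438",
              "ORD-13198", "ORD-12258", "ORD-34567"),
  Status = c("Delivered", "Not delivered", "Not delivered",
             "Delivered", "Not delivered", "Delivered", "Delivered"),
  stringsAsFactors = FALSE
)

# Order numbers that are marked as delivered in the second data frame
delivered <- df2$OrderNo[df2$Status == "Delivered"]

# One character vector of order numbers per row of df1
found <- str_extract_all(df1$Order_Desc, "ORD-\\d{5}")

# Count, for each row, how many of its order numbers were delivered.
# Rows without any order number get 0.
df1$Delivered_Qty <- vapply(found, function(x) sum(x %in% delivered),
                            integer(1))

df1$Delivered_Qty
# 0 1 0 2 1
```

This keeps both London rows separate instead of merging them, which matters if DD1.df can hold several rows for the same city.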