Simplifying multiple row-wise calculations

Time: 2017-01-22 08:49:48

Tags: r dplyr

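I have a data frame containing basic descriptive statistics (sum, average, max, and count) of the quantity and bookings recorded for a particular product over a period. The product may be either new or old.

The notation I use for the columns in the example is as follows:

Notation: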

N and O = New and Old, respectively
P1 and P2 = Period 1 and Period 2, respectively
P12 = Period 1 + Period 2 = total period
B and Q = Bookings and Quantity, respectively
S, A, and M = Sum, Average, and Max, respectively
n = count

So N_P2_B_A means the Average (A) of Bookings (B) for New products (N) sold in Period 2 (P2). Similarly, P12_Q_A means the Average (A) Quantity (Q) sold over the total period (P12).
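To make the notation concrete, here is a small base R sketch that expands an abbreviated column name into words. The `decode` lookup and the `describe_column` helper are hypothetical names made up for illustration; they are not part of my pipeline.

```r
# Hypothetical lookup table built from the notation list above.
decode <- c(N = "New", O = "Old",
            P1 = "Period 1", P2 = "Period 2", P12 = "Total period",
            B = "Bookings", Q = "Quantity",
            S = "Sum", A = "Average", M = "Max", n = "count")

# Split a column name on "_" and translate each piece.
describe_column <- function(name) {
  parts <- strsplit(name, "_", fixed = TRUE)[[1]]
  paste(decode[parts], collapse = " / ")
}

describe_column("N_P2_B_A")  # "New / Period 2 / Bookings / Average"
describe_column("P12_Q_A")   # "Total period / Quantity / Average"
```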

I did this to keep the column names short and to reduce the burden of remembering them. Autocomplete does not work for column names in RStudio.

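What I want to do: The goal of my code is to compute Comparison_p1_p2, which adds up the outputs of the following four logical tests. Each test is TRUE if the answer is yes and FALSE if not.

a) Is the bookings average in P2 > the bookings average in P1 + P2 (i.e. the total period)?

b) Is the quantity average in P2 > the quantity average in P1 + P2 (i.e. the total period)?

c) Is the max bookings in P2 > the bookings average in P1 + P2 (i.e. the total period)?

d) Is the max quantity in P2 > the quantity average in P1 + P2 (i.e. the total period)?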
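As a worked example of these four tests, the base R sketch below uses the Texan Incorp. rows from the input data (old products only). The total-period average is not a plain mean of the two period averages; each period average is weighted by its count. The variable names here are illustrative only.

```r
# Texan Incorp. values copied from the input data (old products only).
b_avg <- c(P1 = 17169.9, P2 = 29204.4314285714)  # bookings average per period
q_avg <- c(P1 = 25.1,    P2 = 34.4285714285714)  # quantity average per period
n     <- c(P1 = 10,      P2 = 35)                # record count per period
b_max_p2 <- 341293.55                            # max bookings in P2
q_max_p2 <- 617                                  # max quantity in P2

# Total-period averages: weight each period average by its count.
p12_b_a <- sum(b_avg * n) / sum(n)
p12_q_a <- sum(q_avg * n) / sum(n)

# The four tests a) to d); logicals count as 1/0 when summed.
tests <- c(B_A_c = b_avg[["P2"]] > p12_b_a,
           Q_A_c = q_avg[["P2"]] > p12_q_a,
           B_M_c = b_max_p2 > p12_b_a,
           Q_M_c = q_max_p2 > p12_q_a)
score <- sum(tests)
score  # 4, matching Comparison_p1_p2 for this company in the sample output
```

The same pooled average can also be obtained with `weighted.mean(b_avg, n)`.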

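Similarly, I compare the new and old products sold in P2. I call this column Comparison_p2new_P2old. It adds up the outputs of the following:

a) Is the bookings average from new products sold in P2 > the bookings average from old products sold in P2?

b) Is the quantity average from new products sold in P2 > the quantity average from old products sold in P2?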
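One detail worth spelling out: when a company sold no new products in P2, the new-product value is NA, so the comparison is NA as well. A minimal base R sketch of that behavior, using the Texan Incorp. old-product average from the input data (variable names are illustrative):

```r
n_p2_b_a <- NA_real_          # no new products sold in P2
o_p2_b_a <- 29204.4314285714  # old products, bookings average in P2

cmp <- n_p2_b_a > o_p2_b_a    # NA: a comparison with NA is unknown

strict  <- sum(cmp, cmp)                # NA: sum() propagates NA by default
lenient <- sum(cmp, cmp, na.rm = TRUE)  # 0: NAs are dropped before summing
```

This is why Comparison_p2new_P2old is NA for Texan Incorp. in the sample output, while Comparison_p1_p2 (summed with na.rm = TRUE) never is.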

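Where I need your help: I need help simplifying this code. Although my code works, I am not sure how to simplify it using vectorized operations. I recently switched from C++/Java, so working with R's vectorization is really hard for me.

Input file: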

dput(Master_Final)
structure(list(Country_SL6 = c("United States", "United States", 
"United States", "United States", "United Kingdom", "United Kingdom"
), Company.Name = c("Mass Incorp.", "Mass Incorp.", "Mass Incorp.", 
"Mass Incorp.", "Texan Incorp.", "Texan Incorp."), Family_Type = c("N", 
"N", "O", "O", "O", "O"), Ship_Period = c("P1", "P2", "P1", "P2", 
"P1", "P2"), Q_S = c(1, 15, 4633, 57317, 251, 1205), B_S = c(1157, 
26958.4, 3736290.43, 6144393.02, 171699, 1022155.1), Q_A = c(1, 
2.14285714285714, 71.2769230769231, 707.617283950617, 25.1, 34.4285714285714
), B_A = c(1157, 3851.2, 57481.3912307692, 75856.7039506173, 
17169.9, 29204.4314285714), Q_M = c(1, 8, 1940, 11000, 234, 617
), B_M = c(1157, 17980, 1270354, 1463415.25, 128258, 341293.55
), n = c(1, 7, 65, 81, 10, 35)), .Names = c("Country_SL6", "Company.Name", 
"Family_Type", "Ship_Period", "Q_S", "B_S", "Q_A", "B_A", "Q_M", 
"B_M", "n"), row.names = c(NA, 6L), class = "data.frame")

Sample output: Alternatively, you can run my code below to generate the sample output.

dput(Top_Act_Final)
structure(list(Country_SL6 = c("United Kingdom", "United States"
), Company.Name = c("Texan Incorp.", "Mass Incorp."), Comparison_p1_p2 = c(4L, 
4L), Comparison_p2new_P2old = c(NA, 0L)), class = c("rowwise_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -2L), .Names = c("Country_SL6", 
"Company.Name", "Comparison_p1_p2", "Comparison_p2new_P2old"))

Code:

library(tidyr)  # gather(), unite(), spread()
library(dplyr)  # %>%, rowwise(), mutate(), select()

Top_Act_Final<- Master_Final %>%
  #Spread out the columns so that we can do row-wise operations
  gather(key = Column, value = value,Q_S:n) %>%
  unite(P_Column, Ship_Period,Column,sep="_") %>%
  unite(F_P_Column, Family_Type,P_Column,sep="_") %>%
  spread(F_P_Column, value) %>%
  dplyr::rowwise(.)%>%
  #Notation:
  #N vs. O = New or Old
  #P1 vs. P2 = Period 1 vs. Period2
  #B vs. Q = Bookings vs. Quantity
  #S vs. A = Sum vs. Average

  dplyr::mutate(P12_B_S = sum(N_P1_B_S,O_P1_B_S,N_P2_B_S,O_P2_B_S,na.rm=TRUE),
                P12_Q_S = sum(N_P1_Q_S,O_P1_Q_S,N_P2_Q_S,O_P2_Q_S,na.rm=TRUE),
                P12_B_A = sum(N_P1_B_A*N_P1_n,  O_P1_B_A*O_P1_n,  N_P2_B_A*N_P2_n,  O_P2_B_A*O_P2_n,  na.rm = TRUE)/sum(N_P1_n,O_P1_n,N_P2_n,O_P2_n,na.rm=TRUE),
                P12_Q_A = sum(N_P1_Q_A*N_P1_n,  O_P1_Q_A*O_P1_n,  N_P2_Q_A*N_P2_n,  O_P2_Q_A*O_P2_n,  na.rm = TRUE)/sum(N_P1_n,O_P1_n,N_P2_n,O_P2_n,na.rm=TRUE),

                P2_B_A = sum(N_P2_B_A*N_P2_n,  O_P2_B_A*O_P2_n,  na.rm = TRUE)/sum(N_P2_n, O_P2_n,na.rm=TRUE),
                P2_Q_A = sum( N_P2_Q_A*N_P2_n,  O_P2_Q_A*O_P2_n,  na.rm = TRUE)/sum(N_P2_n, O_P2_n,na.rm=TRUE),
                P1_B_A = sum(N_P1_B_A*N_P1_n,  O_P1_B_A*O_P1_n,  na.rm = TRUE)/sum(N_P1_n, O_P1_n,na.rm=TRUE),
                P1_Q_A = sum( N_P1_Q_A*N_P1_n,  O_P1_Q_A*O_P1_n,  na.rm = TRUE)/sum(N_P1_n, O_P1_n,na.rm=TRUE),
                P1_B_S = sum(N_P1_B_S,  O_P1_B_S,  na.rm = TRUE),                  
                P1_Q_S = sum(N_P1_Q_S,  O_P1_Q_S,  na.rm = TRUE),                  

                P2_B_S = sum(N_P2_B_S,  O_P2_B_S,  na.rm = TRUE),                  
                P2_Q_S = sum(N_P2_Q_S,  O_P2_Q_S,  na.rm = TRUE),      

                P2_B_M = max(N_P2_B_M,O_P2_B_M, na.rm = TRUE),
                P2_Q_M = max(N_P2_Q_M,O_P2_Q_M, na.rm = TRUE),
                P1_n = sum(N_P1_n,O_P1_n,na.rm = TRUE),
                P2_n = sum(N_P2_n,O_P2_n,na.rm = TRUE))

#replace NaN or -Inf with NA (is.nan() alone would miss -Inf, so test with is.finite())
Top_Act_Final[!is.finite(Top_Act_Final$P2_B_A),"P2_B_A"]<-NA
Top_Act_Final[!is.finite(Top_Act_Final$P2_Q_A),"P2_Q_A"]<-NA
Top_Act_Final[!is.finite(Top_Act_Final$P2_B_M),"P2_B_M"]<-NA
Top_Act_Final[!is.finite(Top_Act_Final$P2_Q_M),"P2_Q_M"]<-NA

Top_Act_Final<-     Top_Act_Final %>%

  dplyr::mutate(P12_B_A =sum(P1_B_A*P1_n,P2_B_A*P2_n,na.rm=TRUE)/(P1_n+P2_n),
                P12_Q_A =sum(P1_Q_A*P1_n,P2_Q_A*P2_n,na.rm=TRUE)/(P1_n+P2_n),
                P12_B_S =sum(P1_B_S,P2_B_S,na.rm=TRUE),
                P12_Q_S =sum(P2_Q_S,P1_Q_S,na.rm=TRUE)) %>%

  dplyr::mutate(B_A_c = P2_B_A>P12_B_A,
                Q_A_c = P2_Q_A>P12_Q_A,
                B_M_c = P2_B_M>P12_B_A,
                Q_M_c = P2_Q_M>P12_Q_A,
                N_O_A_c = N_P2_B_A>O_P2_B_A,
                N_O_S_c = N_P2_B_S>O_P2_B_S) %>%

  dplyr::mutate(Comparison_p1_p2 = sum(B_A_c, Q_A_c, B_M_c, Q_M_c, na.rm = TRUE),
                Comparison_p2new_P2old = sum(N_O_A_c,N_O_S_c)) %>%

  dplyr::select(Country_SL6, Company.Name, Comparison_p1_p2, Comparison_p2new_P2old)
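For context on the "replace NaN or -Inf with NA" step in the code above, here is a small base R sketch of where those two values come from when every input is NA. The variable names are illustrative and not part of the pipeline.

```r
x <- c(NA_real_, NA_real_)  # e.g. a company with no P2 rows at all

# sum() of nothing is 0, so the weighted average becomes 0 / 0 = NaN.
avg <- sum(x, na.rm = TRUE) / sum(x, na.rm = TRUE)

# max() of nothing is -Inf (R also emits a warning, silenced here).
mx <- suppressWarnings(max(x, na.rm = TRUE))

is.nan(avg)    # TRUE
is.nan(mx)     # FALSE: -Inf is not NaN
is.finite(mx)  # FALSE

# !is.finite() catches NaN, Inf, -Inf, and NA in one test.
cleaned <- c(avg, mx, 12.5)
cleaned[!is.finite(cleaned)] <- NA
```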

Sorry for the long post, but I wanted to be as detailed as possible. Is there any way we can shorten my code?
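One possible direction, offered as a sketch under stated assumptions rather than a verified replacement: keep the data in its long form and let group_by() plus summarise() do the work, instead of spreading to wide columns and computing row by row. It assumes that B_S equals B_A * n and Q_S equals Q_A * n (true for the sample data), that every company has at least one P2 row, and that each company has at most one row per Family_Type and Ship_Period. The `pick` helper and the `Top_Act_Sketch` name are hypothetical. The second new-versus-old test mirrors the posted code (bookings sum).

```r
library(dplyr)

# Rows copied from dput(Master_Final); only the columns used below are kept.
Master_Final <- data.frame(
  Country_SL6  = c(rep("United States", 4), rep("United Kingdom", 2)),
  Company.Name = c(rep("Mass Incorp.", 4), rep("Texan Incorp.", 2)),
  Family_Type  = c("N", "N", "O", "O", "O", "O"),
  Ship_Period  = c("P1", "P2", "P1", "P2", "P1", "P2"),
  Q_S = c(1, 15, 4633, 57317, 251, 1205),
  B_S = c(1157, 26958.4, 3736290.43, 6144393.02, 171699, 1022155.1),
  B_A = c(1157, 3851.2, 57481.3912307692, 75856.7039506173,
          17169.9, 29204.4314285714),
  Q_M = c(1, 8, 1940, 11000, 234, 617),
  B_M = c(1157, 17980, 1270354, 1463415.25, 128258, 341293.55),
  n   = c(1, 7, 65, 81, 10, 35),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first value of x where keep is TRUE, or NA if none.
pick <- function(x, keep) if (any(keep)) x[keep][1] else NA_real_

Top_Act_Sketch <- Master_Final %>%
  mutate(p2 = Ship_Period == "P2", is_new = Family_Type == "N") %>%
  group_by(Country_SL6, Company.Name) %>%
  summarise(
    # Pooled averages: total sum over total count (assumes B_S == B_A * n).
    P12_B_A = sum(B_S) / sum(n),
    P12_Q_A = sum(Q_S) / sum(n),
    P2_B_A  = sum(B_S[p2]) / sum(n[p2]),
    P2_Q_A  = sum(Q_S[p2]) / sum(n[p2]),
    P2_B_M  = max(B_M[p2]),
    P2_Q_M  = max(Q_M[p2]),
    # New vs. old in P2; NA when either side is missing.
    N_O_A_c = pick(B_A, is_new & p2) > pick(B_A, !is_new & p2),
    N_O_S_c = pick(B_S, is_new & p2) > pick(B_S, !is_new & p2),
    .groups = "drop"
  ) %>%
  mutate(
    # Logicals add up as 1/0.
    Comparison_p1_p2 = (P2_B_A > P12_B_A) + (P2_Q_A > P12_Q_A) +
                       (P2_B_M > P12_B_A) + (P2_Q_M > P12_Q_A),
    Comparison_p2new_P2old = N_O_A_c + N_O_S_c
  ) %>%
  select(Country_SL6, Company.Name, Comparison_p1_p2, Comparison_p2new_P2old)
```

On the sample data this gives the same two columns as the sample output. Unlike my original code, an NA in any of the four P1/P2 tests would make Comparison_p1_p2 NA here, because `+` does not drop NAs the way sum(..., na.rm = TRUE) does.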

I wrote this code last week, and when I reviewed it today (to reuse it), I had a hard time following it. I would really appreciate your help.

0 answers:

No answers yet