Fill in a value whenever a character value appears in a data.table object

Time: 2017-12-27 21:41:17

Tags: r data.table

I have a data.table object, and what I basically want to do is update ID_Type whenever a particular BUYER/SELLER character value appears. As an example, here is the data.table:

ID_Type    |   BUYER   |    SELLER
------------------------------------------------
   1       |           |    Joe
   0       |   Peter   |              
   1       |   Peter   |               
   1       |   Sam     |   
   1       |   Peter   |            
   0       |           |    Mark     
   1       |   Tai     |             
   1       |   Tai     |              
   1       |           |    Mark  

The dput output is as follows:

structure(list(ID_Type = c("1", "0", "1", "1", "1", "0", "1", 
"1", "1"), BUYER = c(" ", "Peter", "Peter", "Sam", "Peter", " ", 
"Tai", "Tai", " "), SELLER = c("Joe", " ", " ", " ", " ", "Mark", 
" ", " ", "Mark")), .Names = c("ID_Type", "BUYER", "SELLER"), row.names = c(NA, -9L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000000009c60788>)

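Now, when a row has ID_Type 0 for a particular BUYER or SELLER, I want to make sure that every instance of that particular BUYER or SELLER in subsequent rows of the data table has ID_Type 0. For example, Peter has ID_Type 0 as a BUYER in row 2, so whenever Peter appears later in the BUYER column, I want to change each of Peter's ID_Type values to 0. The same goes for Mark in the SELLER column.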

Basically, the new data table I want should look like this:

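ID_Type    |   BUYER   |    SELLER
------------------------------------------------
   1       |           |    Joe
   0       |   Peter   |
   0       |   Peter   |
   1       |   Sam     |
   0       |   Peter   |
   0       |           |    Mark
   1       |   Tai     |
   1       |   Tai     |
   0       |           |    Mark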

3 answers:

Answer 0: (score: 4)

How about this:

library(data.table)

aaa <- structure(list(ID_Type = c("1", "0", "1", "1", "1", "0", "1", "1", "1"), 
                      BUYER = c(" ", "Peter", "Peter", "Sam", "Peter", " ", "Tai", "Tai", " "), 
                      SELLER = c("Joe", " ", " ", " ", " ", "Mark", " ", " ", "Mark")), 
                 .Names = c("ID_Type", "BUYER", "SELLER"), 
                 row.names = c(NA, -9L), class = c("data.table", "data.frame"))



aaa[BUYER != " ", ID_Type := ID_Type[1], by = BUYER]
aaa[SELLER != " ", ID_Type := ID_Type[1], by = SELLER]
aaa
    #    ID_Type BUYER SELLER
    # 1:       1          Joe
    # 2:       0 Peter       
    # 3:       0 Peter       
    # 4:       1   Sam       
    # 5:       0 Peter       
    # 6:       0         Mark
    # 7:       1   Tai       
    # 8:       1   Tai       
    # 9:       0         Mark
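
As a minimal sketch of what the grouped assignment above does, the toy table below (hypothetical data, not from the question) shows `id[1]` being recycled across each group. Note that the expression copies whatever the first value in the group happens to be, whether or not that value is 0.

```r
library(data.table)

# Hypothetical toy data: group "A" starts with "1", group "B" starts with "0"
toy <- data.table(id  = c("1", "0", "1", "0", "1"),
                  who = c("A", "B", "B", "A", "B"))

# Within each group of `who`, overwrite every id with the group's first id
toy[, id := id[1], by = who]

# Group "B" is now all "0"; group "A" is all "1", including the row that was "0"
print(toy)
```

On the question's sample data each name's first row already carries the value to propagate, which is why the approach reproduces the requested table there.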

Answer 1: (score: 1)

I would write a small helper function. I would also replace your blank strings " " with real missing values:

dd[BUYER == " ", BUYER := NA]
dd[SELLER == " ", SELLER := NA]

foo = function(x) {
  if (any(x == 0)) return(rep("0", length(x)))
  return(x)
}
dd[!is.na(BUYER), ID_Type := foo(ID_Type), by = BUYER]
dd[!is.na(SELLER), ID_Type := foo(ID_Type), by = SELLER]
dd
#    ID_Type BUYER SELLER
# 1:       1    NA    Joe
# 2:       0 Peter     NA
# 3:       0 Peter     NA
# 4:       1   Sam     NA
# 5:       0 Peter     NA
# 6:       0    NA   Mark
# 7:       1   Tai     NA
# 8:       1   Tai     NA
# 9:       0    NA   Mark
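
The snippet above does not show how `dd` is created. The following self-contained sketch assumes that `dd` holds the question's sample data (that is an assumption, the answer does not say so) and runs the same steps end to end.

```r
library(data.table)

# Assumption: dd is the sample data from the question
dd <- data.table(
  ID_Type = c("1", "0", "1", "1", "1", "0", "1", "1", "1"),
  BUYER   = c(" ", "Peter", "Peter", "Sam", "Peter", " ", "Tai", "Tai", " "),
  SELLER  = c("Joe", " ", " ", " ", " ", "Mark", " ", " ", "Mark")
)

# Replace the blank placeholder strings with real missing values
dd[BUYER == " ", BUYER := NA]
dd[SELLER == " ", SELLER := NA]

# Helper: if any value in the group is 0, set the whole group to "0"
foo <- function(x) {
  if (any(x == 0)) return(rep("0", length(x)))
  x
}

dd[!is.na(BUYER), ID_Type := foo(ID_Type), by = BUYER]
dd[!is.na(SELLER), ID_Type := foo(ID_Type), by = SELLER]
print(dd)
```

Because `foo()` rewrites the entire group, rows that precede the first 0 of a name would also be set to "0". The sample data has no such rows, so the difference does not show up here.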

Answer 2: (score: 0)

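Although GL_Li's answer, which the OP accepted, apparently returns the expected result for the given sample dataset, I doubt that it correctly implements the OP's requirements.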

The OP requested (emphasis mine):

  

  when ID_Type is 0 in a row for a particular BUYER or SELLER, every instance of that particular BUYER or SELLER in the data table has ID_Type 0 in the subsequent rows

If the OP's intention is to be reflected strictly according to the above specification, GL_Li's answer fails on 3 points:

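  1. It changes ID_Type in all rows, although the OP specified that it should change only in subsequent rows.
  2. If the first value of ID_Type in a group is not 0, subsequent occurrences of 0 are ignored.
  3. It also changes values other than 0 (assuming that ID_Type is either 0 or 1 and never any other value).

I have added a few rows to the sample dataset to demonstrate the effect: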

    DT2
    
        ID_Type BUYER SELLER
     1:       1          Joe
     2:       0 Peter       
     3:       1 Peter       
     4:       1   Sam       
     5:       1 Peter       
     6:       0         Mark
     7:       1   Tai       
     8:       1   Tai       
     9:       1         Mark
    10:       0   Tai       
    11:       1   Tai       
    12:       2   Sam       
    13:       3   Tom       
    14:       2   Tom
    

Applying GL_Li's answer to DT2

    DT2[BUYER != "", ID_Type := ID_Type[1], by = BUYER]
    DT2[SELLER != "", ID_Type := ID_Type[1], by = SELLER]
    DT2
    

returns

        ID_Type BUYER SELLER
     1:       1          Joe
     2:       0 Peter       
     3:       0 Peter       
     4:       1   Sam       
     5:       0 Peter       
     6:       0         Mark
     7:       1   Tai       
     8:       1   Tai       
     9:       0         Mark
    10:       1   Tai       
    11:       1   Tai       
    12:       1   Sam       
    13:       3   Tom       
    14:       3   Tom
    

Rows 10, 11, 12, and 14 violate the specification, IMHO.

Alternative solution

    DT2[, cnt := cumsum(ID_Type == "0"), by = .(BUYER, SELLER)][
      cnt > 0L, ID_Type := "0"][, cnt := NULL]
    DT2
    

returns

        ID_Type BUYER SELLER
     1:       1          Joe
     2:       0 Peter       
     3:       0 Peter       
     4:       1   Sam       
     5:       0 Peter       
     6:       0         Mark
     7:       1   Tai       
     8:       1   Tai       
     9:       0         Mark
    10:       0   Tai       
    11:       0   Tai       
    12:       2   Sam       
    13:       3   Tom       
    14:       2   Tom
    

This works according to the specification, as it changes ID_Type only in rows that follow an occurrence of 0.
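
To illustrate why the running count only touches later rows, here is a small sketch on a plain vector. The vector `id` is hypothetical and stands for the ID_Type values of a single name, in row order.

```r
# Hypothetical ID_Type values for one name, in row order
id <- c("1", "0", "1", "2", "0", "1")

# Running count of zeros seen so far: 0 before the first "0", positive from then on
cnt <- cumsum(id == "0")
print(cnt)

# Only positions at or after the first "0" are overwritten
id[cnt > 0L] <- "0"
print(id)
```

The first element keeps its original value because no 0 has been seen at that point.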

Note that the above solution is based on the implicit assumption that a name appears in only one of the two columns, BUYER or SELLER, and never in both.
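
If that assumption did not hold, one possible variation would be to group by a single combined key instead of by both columns. The sketch below is not part of the original answer: the table `DT3` and the helper column `party` are hypothetical, and it assumes that every row names exactly one party and that blanks are stored as empty strings.

```r
library(data.table)

# Hypothetical data: "Ann" appears first as SELLER with ID_Type 0, later as BUYER
DT3 <- data.table(
  ID_Type = c("1",   "0",   "1",   "1"),
  BUYER   = c("",    "",    "Ann", "Bob"),
  SELLER  = c("Bob", "Ann", "",    "")
)

# Assumption: exactly one of BUYER / SELLER is non-empty in each row
DT3[, party := ifelse(BUYER != "", BUYER, SELLER)]

# Same running-count idea as above, but grouped by the combined name
DT3[, cnt := cumsum(ID_Type == "0"), by = party][
  cnt > 0L, ID_Type := "0"][, c("cnt", "party") := NULL]
print(DT3)
```

Grouping by `.(BUYER, SELLER)` would leave row 3 unchanged here, because Ann as BUYER and Ann as SELLER fall into different groups.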

Enhanced sample dataset

    library(data.table)
    DT2 <- fread(
      "ID_Type    |   BUYER   |    SELLER
      1       |           |    Joe
      0       |   Peter   |              
      1       |   Peter   |               
      1       |   Sam     |   
      1       |   Peter   |            
      0       |           |    Mark     
      1       |   Tai     |             
      1       |   Tai     |              
      1       |           |    Mark  
      0       |   Tai     |             
      1       |   Tai     |              
      2       |   Sam     |
      3       |   Tom     |
      2       |   Tom     |", 
      sep = "|", colClasses = c(ID_Type = "character"))