An efficient way to work with a huge dataset in R

Time: 2018-04-02 11:21:16

Tags: r dataframe dplyr dataset data.table

I really need to speed up some R code. I have a large dataset from a particular sport. Each row in the data frame represents some type of action in a game. For each game (game_id) we have two teams (team_id) taking part. time_ref in the data frame orders each game's actions chronologically. action_id is the type of action in the game. player_off is set to TRUE or FALSE and is linked to action_id=3: action_id=3 represents a player receiving a card, and player_off is TRUE/FALSE according to whether the player was sent off when they got that card. Example data.frame:

> df

game_id team_id action_id   player_off  time_ref
100     10         1             NA       1000
100     10         1             NA       1001
100     10         1             NA       1002
100     11         1             NA       1003
100     11         2             NA       1004
100     11         1             NA       1005
100     10         3             1        1006
100     11         1             NA       1007
100     10         1             NA       1008
100     10         1             NA       1009
101     12         3             0        1000
101     12         1             NA       1001
101     12         1             NA       1002
101     13         2             NA       1003
101     13         3             1        1004
101     12         1             NA       1005
101     13         1             NA       1006
101     13         1             NA       1007
101     12         1             NA       1008
101     12         1             NA       1009
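
For anyone who wants to follow along, here is a sketch that rebuilds this example data frame in R (the values are copied from the table above):

df <- data.frame(
  game_id    = rep(c(100L, 101L), each = 10),
  team_id    = c(10, 10, 10, 11, 11, 11, 10, 11, 10, 10,
                 12, 12, 12, 13, 13, 12, 13, 13, 12, 12),
  action_id  = c(1, 1, 1, 1, 2, 1, 3, 1, 1, 1,
                 3, 1, 1, 2, 3, 1, 1, 1, 1, 1),
  player_off = c(NA, NA, NA, NA, NA, NA, 1, NA, NA, NA,
                 0, NA, NA, NA, 1, NA, NA, NA, NA, NA),
  time_ref   = rep(1000:1009, times = 2)  # actions in chronological order
)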

What I need is another column in the data frame that tells me TRUE or FALSE whether both teams had an equal or unequal number of players on the pitch at the moment each action (row) took place.

So game_id=100 had a player sent off (action_id=3 & player_off=1) by team_id=10 at time_ref=1006. So we know both teams had an equal number of players on the pitch up to that point, but an unequal number for the rest of the game (time_ref>1006). The same thing happened in game_id=101 at time_ref=1004.

> df

game_id team_id action_id   player_off  time_ref  is_even
100     10         1             NA       1000       1
100     10         1             NA       1001       1
100     10         1             NA       1002       1
100     11         1             NA       1003       1
100     11         2             NA       1004       1
100     11         1             NA       1005       1
100     10         3             1        1006       1
100     11         1             NA       1007       0
100     10         1             NA       1008       0
100     10         1             NA       1009       0
101     12         3             0        1000       1
101     12         1             NA       1001       1
101     12         1             NA       1002       1
101     13         2             NA       1003       1
101     13         3             1        1004       1
101     12         1             NA       1005       0
101     13         1             NA       1006       0
101     13         1             NA       1007       0
101     12         1             NA       1008       0
101     12         1             NA       1009       0

This is an example of the data frame; I'd like to add an extra column like this to my whole dataset.

So you can see that in game_id=100 a player was sent off at time_ref=1006, so all rows up to that point are marked is_even=1 and everything after is marked 0 (uneven). The same applies to game_id=101 at time_ref=1004.

What is the most efficient way to achieve this extra column? Preferably without using for loops.

3 Answers:

Answer 0 (score: 5):

For some vector

x = c(0, NA, NA, NA, 1, NA, NA, NA)

write a function that standardizes the data (0 or 1 players lost), calculates the cumulative number of players lost, and compares it with zero:

fun0 = function(x)  {
    x[is.na(x)] = 0
    cumsum(x) == 0
}
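
A quick sanity check on the vector above (expected output shown as comments): the result is TRUE while no player has been lost, and FALSE from the sending-off onwards.

fun0(x)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE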

For multiple groups, use ave() with a grouping variable:
x = c(x, rev(x))
grp = rep(1:2, each = length(x) / 2)
ave(x, grp, FUN = fun0)
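
The expected result (shown as a comment; each half of the vector is its own group) restarts the cumulative count at the group boundary:

## [1] 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0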

For the data in the question, try

df$is_even = ave(df$player_off, df$game_id, FUN = fun0)

Semantically, the problem seems more complicated than fun0() implies: in particular, if each team loses a player, they are even again, as @SunLisa says. If so, clean the data

df$player_off[is.na(df$player_off)] = 0

and change fun0(), e.g.,

fun1 <- function(x, team) {
    is_team_1 <- team == head(team, 1) # is 'team' the first team?
    x1 <- x & is_team_1                # lost player & team 1
    x2 <- x & !is_team_1               # lost player & team 2
    cumsum(x1) == cumsum(x2)           # same total number of players?
}
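
A quick check of fun1() with hypothetical vectors: team "a" loses a player at the third action and team "b" at the sixth, so the sides are even again at the end.

x    <- c(0, 0, 1, 0, 0, 1)             # hypothetical sendings-off
team <- c("a", "b", "a", "b", "a", "b") # hypothetical team labels
fun1(x, team)
## [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE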

(Coercing the logical return value to an integer doesn't seem like a good idea.) This can be applied by group with
df$is_even = ave(seq_len(nrow(df)), df$game_id, FUN = function(i) {
    fun1(df$player_off[i], df$team_id[i])
})

or with

split(df$is_even, df$game_id) <-
    Map(fun1,
        split(df$player_off, df$game_id),
        split(df$team_id, df$game_id)
    )

The implementation of ave() is instructive to look at; the important line is

split(x, g) <- lapply(split(x, g), FUN)

The right-hand side splits x by group g, then applies FUN() to each group. The left-hand side split<-() is a tricky operation that uses the group indexes to update the original vector x.
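
A tiny illustration of the split<-() idiom with hypothetical values, doubling each group's values in place:

x <- c(1, 2, 3, 4)
g <- c("a", "b", "a", "b")
split(x, g) <- lapply(split(x, g), function(v) v * 2)
x
## [1] 2 4 6 8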

Comments

The original question asks for 'no for loops', but lapply() (inside ave()) and Map() are exactly that; ave() is relatively efficient because it adopts a split-apply-combine strategy, rather than what the OP probably implemented, which was likely to iterate through games, subset the data frame, and then update the data.frame for each game. The subsetting would have duplicated subsets of the entire data set, and in particular the update would have copied at least the entire result column on each assignment; this copying would slow execution down a lot. The OP may also have been struggling with fun0(); it would help to clarify the question, especially the title, to identify that as the problem.

There are faster ways, particularly using the data.table package, but the principle is the same: identify a function that operates on a vector the way you'd like, and apply it by group.
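
For instance, a minimal data.table sketch of that principle (an illustration, not this answer's code), reusing fun1() from above and assuming the NA cleanup shown earlier:

library(data.table)

dt <- as.data.table(df)
dt[is.na(player_off), player_off := 0]   # same cleanup as above
dt[, is_even := as.integer(fun1(player_off, team_id)), by = game_id]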

Another fully vectorized solution follows this suggestion to calculate a cumulative sum by group. For fun0(), standardize x to the number of players leaving the game at a particular time point, without NAs:

x[is.na(x)] = 0

For the equivalent of fun0(), calculate the cumulative sum of players leaving the game, irrespective of group:

cs = cumsum(x)

and correct this for the group that the cumulative sum applies to:
in_game = cs - (grp - 1)

and set this to TRUE when 0 players have left the game:

is_even = (in_game == 0)

This relies on grp being an index running from 1 to the number of groups; for the data here, that could be grp = match(df$game_id, unique(df$game_id)). A similar solution exists for fun1().
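
Putting those pieces together, a minimal end-to-end sketch applied to df (assuming rows are already ordered by game_id and time_ref; the group-start offset is computed generally here, which for this data reduces to the grp - 1 correction, and it reproduces fun0()'s per-group result):

x <- df$player_off
x[is.na(x)] <- 0
cs <- cumsum(x)
grp <- match(df$game_id, unique(df$game_id))
start <- cs[!duplicated(grp)] - x[!duplicated(grp)]  # cumulative sum before each group begins
in_game <- cs - start[grp]
df$is_even <- as.integer(in_game == 0)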

Answer 1 (score: 2):

Here's a dplyr + tidyr solution to the problem, with a summary of what it does:

  1. Munge the data by converting all NAs in player_off to 0 for easier summing, and assign the smaller team_id (assuming only 2 per game) to team1 and the other to team2
  2. "Tag" the player_off values by spreading them into the team1/team2 columns, filling the invalid combinations in the data with 0 (e.g., in game_id = 100 there is no row for team_id = 11 at time_ref = 1000)
  3. Take the cumulative sums of the lagged team1 and team2 vectors (with NAs filled as 0, of course)
  4. Compare the two cumulative sums to produce is_even, per the code below:

    require(dplyr)
    require(tidyr)
    
    df %>%
      group_by(game_id) %>%
      mutate(
        player_off = player_off %>% replace(list = is.na(.), values = 0),
        team_num = if_else(team_id == min(team_id), "team1", "team2")
      ) %>%
      spread(key = team_num, value = player_off, fill = 0) %>%
      arrange(game_id, time_ref) %>%
      mutate(
        team1_cum = cumsum(lag(team1, default = 0)),
        team2_cum = cumsum(lag(team2, default = 0)),
        is_even = as.integer(team1_cum == team2_cum)
      ) %>%
      ungroup() %>%
      select(-team1, -team2, -team1_cum, -team2_cum)
    
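One side note on the reshaping step: spread() has since been superseded in tidyr by pivot_wider(). A sketch of the equivalent step (same column names as above) might look like:

    df %>%
      group_by(game_id) %>%
      mutate(
        player_off = replace(player_off, is.na(player_off), 0),
        team_num = if_else(team_id == min(team_id), "team1", "team2")
      ) %>%
      pivot_wider(names_from = team_num, values_from = player_off, values_fill = 0)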

Answer 2 (score: 2):

Here's my idea:

data.table works nicely here, especially on large datasets, and it's faster. We just need to group by game, cumsum the two teams' sendings-off, and check whether the counts are the same.

First I want to say:

(Martin Morgan has since fixed the problem; his updated answer no longer has this error.)

I don't think @Martin Morgan's answer was correct. Let's imagine a case:

when team 1 has a player sent off and team 2 has another player sent off afterwards, the two teams should then be even again, but @Martin Morgan's output would be FALSE.

I'll make an example with this dataset, where player_off in record 19 has been modified to 1. That means that in game 101, after team 13 had a player off at 1004, team 12 also had a player off at 1008, so at 1008 and 1009 the two teams should be even.

> dt.1
   game_id team_id action_id player_off time_ref
1      100      10         1         NA     1000
2      100      10         1         NA     1001
3      100      10         1         NA     1002
4      100      11         1         NA     1003
5      100      11         2         NA     1004
6      100      11         1         NA     1005
7      100      10         3          1     1006
8      100      11         1         NA     1007
9      100      10         1         NA     1008
10     100      10         1         NA     1009
11     101      12         3          0     1000
12     101      12         1         NA     1001
13     101      12         1         NA     1002
14     101      13         2         NA     1003
15     101      13         3          1     1004
16     101      12         1         NA     1005
17     101      13         1         NA     1006
18     101      13         1         NA     1007
19     101      12         1          1     1008
20     101      12         1         NA     1009

But @Martin Morgan's function produces this output:

> dt.1$is_even = ave(df$player_off, df$game_id, FUN = fun)
> dt.1
   game_id team_id action_id player_off time_ref is_even
1      100      10         1         NA     1000       1
2      100      10         1         NA     1001       1
3      100      10         1         NA     1002       1
4      100      11         1         NA     1003       1
5      100      11         2         NA     1004       1
6      100      11         1         NA     1005       1
7      100      10         3          1     1006       1
8      100      11         1         NA     1007       0
9      100      10         1         NA     1008       0
10     100      10         1         NA     1009       0
11     101      12         3          0     1000       1
12     101      12         1         NA     1001       1
13     101      12         1         NA     1002       1
14     101      13         2         NA     1003       1
15     101      13         3          1     1004       1
16     101      12         1         NA     1005       0
17     101      13         1         NA     1006       0
18     101      13         1         NA     1007       0
19     101      12         1          1     1008       0
20     101      12         1         NA     1009       0

Notice how lines 19 and 20 have is_even=0 when the teams are back to even. That is not what the OP wants.

My code doesn't handle NAs, so I'll first convert the NAs to 0:

> dt.1 <- as.data.table(dt.1)
> dt.1[is.na(dt.1)] <- 0

My code generates the correct output at times 1008 and 1009, where team 12 and team 13 each have one player off and the two teams are even:

> dt.1[,.(action_id,team2_off=(team_id==max(team_id))*player_off,team1_off=(team_id==min(team_id))*player_off,team_id,time_ref,player_off),by=game_id][order(game_id,time_ref)][,.(team_id,time_ref,action_id,player_off,even=as.numeric(cumsum(team2_off)==cumsum(team1_off))),by=game_id]
    game_id team_id time_ref action_id player_off even
 1:     100      10     1000         1          0    1
 2:     100      10     1001         1          0    1
 3:     100      10     1002         1          0    1
 4:     100      11     1003         1          0    1
 5:     100      11     1004         2          0    1
 6:     100      11     1005         1          0    1
 7:     100      10     1006         3          1    0
 8:     100      11     1007         1          0    0
 9:     100      10     1008         1          0    0
10:     100      10     1009         1          0    0
11:     101      12     1000         3          0    1
12:     101      12     1001         1          0    1
13:     101      12     1002         1          0    1
14:     101      13     1003         2          0    1
15:     101      13     1004         3          1    0
16:     101      12     1005         1          0    0
17:     101      13     1006         1          0    0
18:     101      13     1007         1          0    0
19:     101      12     1008         1          1    1
20:     101      12     1009         1          0    1

More readably formatted, the chain is:

dt.1[, .(
    action_id,
    team2_off = (team_id == max(team_id)) * player_off,
    team1_off = (team_id == min(team_id)) * player_off,
    team_id,
    time_ref,
    player_off
), by = game_id][order(game_id, time_ref)][, .(
    team_id,
    time_ref,
    action_id,
    player_off,
    even = as.numeric(cumsum(team2_off) == cumsum(team1_off))
), by = game_id]

I understand this is messy-looking data.table code; let me explain it step by step.

First, we take the data.table dt.1, group by game_id, and perform this calculation:

team2_off = (team_id == max(team_id)) * player_off,
team1_off = (team_id == min(team_id)) * player_off

data.table has some problems doing two groupings at the same time (grouping by both game_id and team_id), but it handles logical expressions inside each group well. This way, we can effectively get team1_off and team2_off by multiplying the logical output of team_id == max/min(team_id) with player_off. When both are 1, the output is 1, which means one player was sent off on the selected team.

Now we have a data table:

> dt.1[,.(action_id,team2_off=(team_id==max(team_id))*player_off,team1_off=(team_id==min(team_id))*player_off,team_id,time_ref,player_off),by=game_id]
    game_id action_id team2_off team1_off team_id time_ref player_off
 1:     100         1         0         0      10     1000          0
 2:     100         1         0         0      10     1001          0
 3:     100         1         0         0      10     1002          0
 4:     100         1         0         0      11     1003          0
 5:     100         2         0         0      11     1004          0
 6:     100         1         0         0      11     1005          0
 7:     100         3         0         1      10     1006          1
 8:     100         1         0         0      11     1007          0
 9:     100         1         0         0      10     1008          0
10:     100         1         0         0      10     1009          0
11:     101         3         0         0      12     1000          0
12:     101         1         0         0      12     1001          0
13:     101         1         0         0      12     1002          0
14:     101         2         0         0      13     1003          0
15:     101         3         1         0      13     1004          1
16:     101         1         0         0      12     1005          0
17:     101         1         0         0      13     1006          0
18:     101         1         0         0      13     1007          0
19:     101         1         0         1      12     1008          1
20:     101         1         0         0      12     1009          0

Now we no longer need to group by two things (game_id and team_id): we can group by game_id alone, order by game_id and time_ref, and compare cumsum(team1_off) == cumsum(team2_off), so the results come out in the correct order.
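
As an aside, a sketch of the same logic written with setorder() and update-by-reference (:=) instead of chained [order(...)] subsetting, assuming the NAs have already been zeroed as above:

setorder(dt.1, game_id, time_ref)   # order once, in place
dt.1[, `:=`(
    team1_off = (team_id == min(team_id)) * player_off,
    team2_off = (team_id == max(team_id)) * player_off
), by = game_id]
dt.1[, even := as.numeric(cumsum(team1_off) == cumsum(team2_off)), by = game_id]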

I understand that 0 may have a different meaning than NA in this scenario. If you really care that much, just create a dummy of player_off and compute team1_off = (team_id == min(team_id)) * dummy and team2_off = (team_id == max(team_id)) * dummy:

> dt$dummy <- dt$player_off
> dt$dummy[is.na(dt$dummy)] <- 0
> dt <- as.data.table(dt)
> dt[, .(
+     action_id,
+     team2_off = (team_id == max(team_id)) * dummy,
+     team1_off = (team_id == min(team_id)) * dummy,
+     team_id,
+     time_ref,
+     player_off
+ ), by = game_id][order(game_id, time_ref)][, .(team_id,
+     time_ref,
+     action_id,
+     player_off,
+     even = as.numeric(cumsum(team2_off) == cumsum(team1_off))), by = game_id]
    game_id team_id time_ref action_id player_off even
 1:     100      10     1000         1         NA    1
 2:     100      10     1001         1         NA    1
 3:     100      10     1002         1         NA    1
 4:     100      11     1003         1         NA    1
 5:     100      11     1004         2         NA    1
 6:     100      11     1005         1         NA    1
 7:     100      10     1006         3          1    0
 8:     100      11     1007         1         NA    0
 9:     100      10     1008         1         NA    0
10:     100      10     1009         1         NA    0
11:     101      12     1000         3          0    1
12:     101      12     1001         1         NA    1
13:     101      12     1002         1         NA    1
14:     101      13     1003         2         NA    1
15:     101      13     1004         3          1    0
16:     101      12     1005         1         NA    0
17:     101      13     1006         1         NA    0
18:     101      13     1007         1         NA    0
19:     101      12     1008         1         NA    0
20:     101      12     1009         1         NA    0


I thought your question was very interesting and I was committed to solving it with data.table. It took me several hours, and I almost gave up on data.table, thinking it couldn't handle two groupings at once. I eventually solved it with logical multiplication.

I had a lot of fun.
