Tidying a data frame and filling in variables using dplyr::lag

Time: 2018-12-30 19:18:55

Tags: r dplyr data-cleaning

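I am trying to clean my data so that each row directly below a row containing "gamecentre-playbyplay-event" is labelled as a goal, each row directly below a "goal" row is labelled as a primary assist, and each row directly below a "primary assist" row is labelled as a secondary assist.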

The data looks like this:

mydata

# A tibble: 15 x 1
   value                                                                                 
   <chr>                                                                                 
 1 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-bat gamecentre-playby"   
 2 "<a href=\"/players/14695\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
 3 "<a href=\"/players/16639\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
 4 "<a href=\"/players/17027\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
 5 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby"   
 6 "<a href=\"/players/17453\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
 7 "<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
 8 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby"   
 9 "<a href=\"/players/18061\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
10 "<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
11 "<a href=\"/players/17522\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
12 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby"   
13 "<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
14 "<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
15 "<a href=\"/players/14757\" class=\"gamecentre__link gamecentre__link--goal\" data-re"

There are a few problems here, though.

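  1. I need to set up the conditions so that the rows are labelled correctly.
  2. If there is no "secondary assist" row, that value should be NA.
  3. If there is no "primary assist" row, that value should also be NA.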

To do this I tried using dplyr::lag(), but it gets confusing in the cases where there is no primary or secondary assist and I want NA instead.

Here is the basis of what I have so far:

goals <- mydata %>%
  filter(dplyr::lag(str_detect(value, "gamecentre-playbyplay-event team-border"), 1))

goals

# A tibble: 4 x 1
  value                                                                                                                                
  <chr>                                                                                                                                
1 "<a href=\"/players/14695\" class=\"gamecentre__link gamecentre__link--goal\" data-re
2 "<a href=\"/players/17453\" class=\"gamecentre__link gamecentre__link--goal\" data-re
3 "<a href=\"/players/18061\" class=\"gamecentre__link gamecentre__link--goal\" data-re
4 "<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re
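
To make it clearer why the filter above returns exactly the row after each event row, here is a minimal sketch on a made-up logical vector (the `is_event` values are hypothetical, not taken from `mydata`). It shows that `dplyr::lag()` shifts the flags down by one position, and that `filter()` drops both FALSE and NA rows.

```r
library(dplyr)

# Hypothetical flags: TRUE marks an event row, FALSE a player link row.
is_event <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# lag() shifts the vector down one position and pads the front with NA,
# giving NA TRUE FALSE FALSE TRUE.
shifted <- dplyr::lag(is_event, 1)

# filter() keeps only rows where the condition is TRUE,
# so the NA row and the FALSE rows are dropped (rows 2 and 5 remain).
kept <- tibble(row = seq_along(is_event)) %>%
  filter(shifted)
```

Because the shift is by a fixed offset, this only identifies the first row after each event; it says nothing about whether the rows further down still belong to the same event.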

This is what I would like my data to look like at the end of all this. I think using dplyr::lag() is the way to go, but I am not sure.

# A tibble: 4 x 3
  goal                                     primary_assist                                secondary_assist                              
  <chr>                                    <chr>                                         <chr>                                         
1 "<a href=\"/players/14695\" class=\"gam~ "<a href=\"/players/16639\" class=\"gamecent~ "<a href=\"/players/17027\" class=\"gamecentr~
2 "<a href=\"/players/17453\" class=\"gam~ "<a href=\"/players/14639\" class=\"gamecent~ NA                                            
3 "<a href=\"/players/18061\" class=\"gam~ "<a href=\"/players/14752\" class=\"gamecent~ "<a href=\"/players/17522\" class=\"gamecentr~
4 "<a href=\"/players/14752\" class=\"gam~ "<a href=\"/players/14639\" class=\"gamecent~ "<a href=\"/players/14757\" class=\"gamecentr~
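
For reference, one way to reach this shape with window functions alone might look like the sketch below. It uses `dplyr::lead()` rather than `lag()`, looking ahead from each event row, and it runs on a shortened stand-in called `toy` (hypothetical data, not the real `mydata`). The helper columns `e1`, `e2`, `e3` are illustrative names. It assumes that every event row is followed by at most three link rows.

```r
library(dplyr)
library(stringr)

# Hypothetical, shortened stand-in for `mydata`.
toy <- tibble(value = c(
  "<div class=\"gamecentre-playbyplay-event team-border--a",
  "<a href=\"/players/1\"",
  "<a href=\"/players/2\"",
  "<a href=\"/players/3\"",
  "<div class=\"gamecentre-playbyplay-event team-border--b",
  "<a href=\"/players/4\"",
  "<a href=\"/players/5\""
))

wide <- toy %>%
  mutate(
    event = str_detect(value, "gamecentre-playbyplay-event"),
    # Is the row 1, 2 or 3 positions below an event row?
    # Positions past the end of the data are treated as events.
    e1 = lead(event, 1, default = TRUE),
    e2 = lead(event, 2, default = TRUE),
    e3 = lead(event, 3, default = TRUE),
    # A slot is NA as soon as a new event starts at or before it.
    goal = if_else(e1, NA_character_, lead(value, 1)),
    primary_assist = if_else(e1 | e2, NA_character_, lead(value, 2)),
    secondary_assist = if_else(e1 | e2 | e3, NA_character_, lead(value, 3))
  ) %>%
  # Keep one row per event and only the three result columns.
  filter(event) %>%
  select(goal, primary_assist, secondary_assist)
```

In this toy input the second event has only two link rows, so its `secondary_assist` comes out as NA.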

Any ideas?

dput:

    mydata <- structure(list(value = c("<div class=\"gamecentre-playbyplay-event team-border--lhjmq-bat gamecentre-playby", 
"<a href=\"/players/14695\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/16639\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/17027\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby", 
"<a href=\"/players/17453\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby", 
"<a href=\"/players/18061\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/17522\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby", 
"<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re", 
"<a href=\"/players/14757\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
)), .Names = "value", class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -15L))

1 answer:

Answer 0 (score: 4)

One option is to create a grouping variable and then use spread to reshape from long to wide

library(tidyverse)
mydata %>%
   #create a group based on the occurrence of 'playby'
   group_by(grp = cumsum(str_detect(value, 'playby'))) %>% 
   # drop the first row of each group, i.e. the row that contains 'playby'
   filter(row_number() > 1) %>% 
   # create a new category column
   mutate(categ = c("goal", "primary_assist", "secondary_assist")[row_number()]) %>%
   # spread from long to wide
   spread(categ, value) %>% 
   # remove the grouping column as part of clean up
   ungroup %>% 
   select(-grp)
# A tibble: 4 x 3
#  goal                                   primary_assist                              secondary_assist                           
#  <chr>                                  <chr>                                       <chr>                                      
#1 "<a href=\"/players/14695\" class=\"g… "<a href=\"/players/16639\" class=\"gamece… "<a href=\"/players/17027\" class=\"gamece…
#2 "<a href=\"/players/17453\" class=\"g… "<a href=\"/players/14639\" class=\"gamece… <NA>                                       
#3 "<a href=\"/players/18061\" class=\"g… "<a href=\"/players/14752\" class=\"gamece… "<a href=\"/players/17522\" class=\"gamece…
#4 "<a href=\"/players/14752\" class=\"g… "<a href=\"/players/14639\" class=\"gamece… "<a href=\"/players/14757\" class=\"gamece…
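
Since tidyr 1.0.0, `spread()` is documented as superseded in favour of `pivot_wider()`. The same grouping idea could be sketched with `pivot_wider()` as follows. This is a variant, not the code from the answer. It runs on a shortened stand-in called `toy` (hypothetical data) and assumes each event is followed by at most three link rows.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical, shortened stand-in for `mydata`.
toy <- tibble(value = c(
  "<div class=\"gamecentre-playbyplay-event team-border--a",
  "<a href=\"/players/1\"",
  "<a href=\"/players/2\"",
  "<a href=\"/players/3\"",
  "<div class=\"gamecentre-playbyplay-event team-border--b",
  "<a href=\"/players/4\"",
  "<a href=\"/players/5\""
))

wide <- toy %>%
  # cumsum() over the event flag gives every event and its link rows one id
  mutate(grp = cumsum(str_detect(value, "gamecentre-playbyplay-event"))) %>%
  group_by(grp) %>%
  # drop the event row itself, which is always first in its group
  filter(row_number() > 1) %>%
  # label the remaining rows by their position within the group
  mutate(categ = c("goal", "primary_assist", "secondary_assist")[row_number()]) %>%
  ungroup() %>%
  # one row per group; labels missing from a group are filled with NA
  pivot_wider(names_from = categ, values_from = value) %>%
  select(-grp)
```

If an event had more than three link rows, the extra rows would get an NA label, so that case would need handling before reshaping.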