Identify a pattern and convert it into a new column

Time: 2018-05-18 12:26:04

Tags: r dataframe dplyr pattern-matching bigdata

I am working on a project with a large number of tables stored in HTML. While scraping them, I have run into the following problem.

Some of the tables that I am scraping look like this

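I have to add an argument to this code, read_html(link) %>% html_nodes(node) %>% html_table(fill = T, header = T, dec = ","), to handle the rows with merged cells ("chicken" and "chicken without bones"). When I import the data, the DF looks like this: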

   df <- data.frame(year = c("chicken",2000,2001,2002,"chicken without bones",2003,2004,2005, "chicken without bones and feet", 2006, 2007, 2008), 
                 weight = c("chicken",5,6,4,"chicken without bones",2,1,3,"chicken without bones and feet", 1, 1.5, 2)
                 )

That is the table the import gives me. I am trying to find a way to make my table look like this instead:

df2 <- data.frame(year = c(2000,2001,2002, 2003, 2004, 2005,2006,2007, 2008), number = c(5,6,4,2,1,3,1,1.5, 2), 
                 new_variable = c("chicken","chicken","chicken","chicken without bones","chicken without bones",
                                  "chicken without bones","chicken without bones and feet","chicken without bones and feet","chicken without bones and feet" )
                 )

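I have been struggling with R and still cannot work out how to do this across the 1.028.974 tables I am scraping. Note: there is no fixed pattern for where this happens in the tables, so I need code that identifies the filled (merged-cell) rows, takes their value as a character string, and uses it as the value of a new column until the next filled row occurs.

Thanks for your attention!!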

1 Answer:

Answer 0: (score: 0)

You can try this -

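library(dplyr)
library(zoo)   # for na.locf()

df %>%
  # convert factors to character so the labels can be copied as text
  mutate_if(is.factor, as.character) %>%
  # keep the label on header rows (year contains a non-digit), NA elsewhere,
  # then carry the last label forward; na.rm = FALSE keeps a leading NA
  # instead of dropping it, so the column length always matches
  mutate(new_variable = ifelse(grepl("\\D+", year), year, NA),
         new_variable = na.locf(new_variable, na.rm = FALSE)) %>%
  # drop the header rows themselves
  filter(!grepl("\\D+", year))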

The output is:

  year weight                   new_variable
1 2000      5                        chicken
2 2001      6                        chicken
3 2002      4                        chicken
4 2003      2          chicken without bones
5 2004      1          chicken without bones
6 2005      3          chicken without bones
7 2006      1 chicken without bones and feet
8 2007    1.5 chicken without bones and feet
9 2008      2 chicken without bones and feet
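
The same idea can also be sketched in base R, without dplyr or zoo, which may be convenient when the step has to run over a very large number of tables. This is only a minimal sketch and not part of the answer above: the function name label_groups, its arguments, and the toy data frame are hypothetical, and it assumes that a header row is any row whose year value contains a non-digit character.

```r
# Hypothetical helper: label each data row with the most recent header row,
# then drop the header rows. Assumes headers are rows whose key column
# contains at least one non-digit character.
label_groups <- function(df, key = "year", new_col = "new_variable") {
  df[] <- lapply(df, as.character)        # factors (or numbers) to character
  is_header <- grepl("\\D", df[[key]])    # TRUE on header rows
  group_id <- cumsum(is_header)           # 0 before the first header, then 1, 2, ...
  labels <- df[[key]][is_header]          # header texts in order of appearance
  # group 0 (rows before any header) gets NA; group k gets the k-th label
  df[[new_col]] <- c(NA_character_, labels)[group_id + 1]
  out <- df[!is_header, , drop = FALSE]   # keep only the data rows
  rownames(out) <- NULL
  out
}

# Toy input, shaped like the scraped table in the question
toy <- data.frame(
  year   = c("chicken", 2000, 2001, "chicken without bones", 2003),
  weight = c("chicken", 5, 6, "chicken without bones", 2),
  stringsAsFactors = FALSE
)
res <- label_groups(toy)
print(res)
```

Because cumsum() numbers the header rows as they appear, each data row picks up the label of the nearest header above it, which is the "carry forward until the next filled row" behaviour asked for in the question.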

Sample data:

df <- structure(list(year = structure(c(10L, 1L, 2L, 3L, 11L, 4L, 5L, 
6L, 12L, 7L, 8L, 9L), .Label = c("2000", "2001", "2002", "2003", 
"2004", "2005", "2006", "2007", "2008", "chicken", "chicken without bones", 
"chicken without bones and feet"), class = "factor"), weight = structure(c(8L, 
6L, 7L, 5L, 9L, 3L, 1L, 4L, 10L, 1L, 2L, 3L), .Label = c("1", 
"1.5", "2", "3", "4", "5", "6", "chicken", "chicken without bones", 
"chicken without bones and feet"), class = "factor")), class = "data.frame", row.names = c(NA, 
-12L))

#                             year                         weight
#1                         chicken                        chicken
#2                            2000                              5
#3                            2001                              6
#4                            2002                              4
#5           chicken without bones          chicken without bones
#6                            2003                              2
#7                            2004                              1
#8                            2005                              3
#9  chicken without bones and feet chicken without bones and feet
#10                           2006                              1
#11                           2007                            1.5
#12                           2008                              2
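
Since both columns in the sample data are factors, one detail may be worth keeping in mind if the cleaned year and weight values are later needed as numbers: calling as.numeric() directly on a factor returns its internal level codes, not the printed values. The short sketch below uses a made-up factor f purely for illustration.

```r
# Made-up factor for illustration; levels are sorted as "1.5", "2000", "2001"
f <- factor(c("2000", "2001", "1.5"))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # go through character to get the values

print(codes)
print(values)
```

Converting to character first, as the answer does with mutate_if(is.factor, as.character), avoids this pitfall.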