How to clean this data frame

Asked: 2016-12-02 12:20:26

Tags: r data-cleaning

My data contains variables stored in the following format:

             V2                             V3
1 Price :  33,990          Size : 16, 17 & 18.5"
2 Price :  30,830      Size : 13, 16, 18 & 19.5"
3 Price :  48,560             Sizes : 21 & 21.5"
4 Price :  33,790 Size : 17.5, 18.5, 19.5 & 21.5
5 Price :  37,990       Size : 17.5, 18.5 & 19.5
6 Price :  43,690      Size : 17.5, 18.5 & 19.5"

The variables I need are Price, Size, and so on. What is the cleanest way in R to convert the raw data into the following format?

            Price        Size
1          33,990       16, 17 & 18.5"
2          30,830       13, 16, 18 & 19.5"
3          48,560       21 & 21.5"
4          33,790       17.5, 18.5, 19.5 & 21.5
5          37,990       17.5, 18.5 & 19.5
6          43,690       17.5, 18.5 & 19.5"

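Also, in the third row the variable name is misspelled as Sizes instead of Size. How can I handle this, given that other variables have the same kind of error?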

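Edit: I cannot use a column-specific strategy (for example, gsub() on one column), because the variables within a given column are not consistent. Specifically,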

                                           V20
1                        Grips : Bontrager SSR
2                  Headset : 1-1/8" threadless
3                                             
4          Brakeset : Tektro alloy linear-pull
5            Brakeset : HL 280 mechanical disc
6 Brakeset : Tektro M290 hydraulic disc brakes

Column V20 holds 3 unique variables (Grips, Headset and Brakeset) plus blanks. The tidy data frame should look like this:

           Grips        Headset              Brakeset
1   Bontrager SSR       NA                   NA
2              NA       1-1/8" threadless    NA
3              NA       NA                   NA
4              NA       NA                   Tektro alloy linear-pull
5              NA       NA                   HL 280 mechanical disc
6              NA       NA                   Tektro M290 hydraulic disc brakes

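This is an oversimplification, because I am assuming that Brakeset has no value for the first 3 rows. That may or may not be true, since the value could be stored in a different column. If a particular row has no value for a given variable, NA should be used. I hope the question is clear.

1 answer:

Answer 0 (score: 3)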

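library(tidyr)
# Split each "name : value" cell at the colon and keep the value part (y).
# The raw columns are V2 and V3, as shown in the question.
# convert = TRUE runs type.convert() on the new columns; a value such as
# "33,990" stays character because of the comma.
df$Price <- separate(df, V2, into = c("x", "y"), sep = "\\s*:\\s*", convert = TRUE)[["y"]]
df$Size  <- separate(df, V3, into = c("x", "y"), sep = "\\s*:\\s*")[["y"]]
# Alternative with gsub():
# whatever appears before ":" (Size or Sizes) is dropped, so the misspelling does not matter
df$Price <- trimws(gsub(".*:", "", df$V2)) # this should work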
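
The gsub() idea above can be extended to keep the name in front of the colon, repair known misspellings, and turn the price into a number. The following is only a sketch in base R: the vectors v2 and v3 are hypothetical samples copied from the question, the lookup table fixes is an assumption, and the comma in the price is assumed to be a thousands separator.

```r
# Hypothetical sample cells copied from the question's V2 and V3 columns
v2 <- c("Price :  33,990", "Price :  30,830", "Price :  48,560")
v3 <- c("Size : 16, 17 & 18.5\"", "Size : 13, 16, 18 & 19.5\"", "Sizes : 21 & 21.5\"")

# Name is everything before the first colon, value is everything after it
keys   <- trimws(sub(":.*$", "", v3))
values <- trimws(sub("^[^:]*:", "", v3))

# Map known misspellings to the intended name (assumed lookup table)
fixes <- c(Sizes = "Size")
hit <- keys %in% names(fixes)
keys[hit] <- unname(fixes[keys[hit]])

# Drop the assumed thousands separator before converting the price to numeric
price <- as.numeric(gsub(",", "", trimws(sub("^[^:]*:", "", v2))))
```

Keeping the misspellings in one named vector means further cases can be added in a single place once they are spotted in the data.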
# I'm using the below data to explain. This is obtained after using separate() once.
df1
          x                                  y
1    Grips                       Bontrager SSR
2    Grips                       Bontrager SSR
3  Headset                    1-1/8 threadless
4 Brakeset   Tektro M290 hydraulic disc brakes

# need to add a unique key to the data
> df1[["id"]] <- 1:nrow(df1)
> df1
          x                                  y id
1    Grips                       Bontrager SSR  1
2    Grips                       Bontrager SSR  2
3  Headset                    1-1/8 threadless  3
4 Brakeset   Tektro M290 hydraulic disc brakes  4

# using spread() from tidyr package (superseded by pivot_wider() since tidyr 1.0.0)
> spread(df1, x, y)
  id                          Brakeset          Grips           Headset 
1  1                               <NA>  Bontrager SSR              <NA>
2  2                               <NA>  Bontrager SSR              <NA>
3  3                               <NA>           <NA>  1-1/8 threadless
4  4  Tektro M290 hydraulic disc brakes           <NA>              <NA>
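
Because a single column can hold several different variables, the same key/value idea can be applied to every column at once instead of one column at a time. The sketch below uses pivot_longer() and pivot_wider() from tidyr. The three-row data frame is a hypothetical sample in the question's layout, and the Sizes to Size fix is the only misspelling assumed. If one row carried the same variable name in two columns, pivot_wider() would return list-columns with a warning, so that case would need extra handling.

```r
library(tidyr)

# Hypothetical three-row sample in the question's "name : value" layout
df <- data.frame(
  V2  = c("Price :  33,990", "Price :  30,830", "Price :  48,560"),
  V3  = c("Size : 16, 17 & 18.5\"", "Size : 13, 16, 18 & 19.5\"", "Sizes : 21 & 21.5\""),
  V20 = c("Grips : Bontrager SSR", "Headset : 1-1/8\" threadless", ""),
  stringsAsFactors = FALSE
)
df$id <- seq_len(nrow(df))  # unique key per original row

# 1. Stack every column so each cell becomes its own row
long <- pivot_longer(df, cols = -id, names_to = "column", values_to = "cell")

# 2. Keep only cells that contain a colon (this drops blanks), then split them
long <- long[grepl(":", long$cell, fixed = TRUE), ]
long$key   <- trimws(sub(":.*$", "", long$cell))
long$value <- trimws(sub("^[^:]*:", "", long$cell))

# 3. Repair the known misspelling before spreading
long$key[long$key == "Sizes"] <- "Size"

# 4. One column per variable name; missing combinations become NA
wide <- pivot_wider(long[, c("id", "key", "value")],
                    names_from = key, values_from = value)

# 5. Restore rows whose cells were all blank, if any
wide <- merge(data.frame(id = df$id), wide, by = "id", all.x = TRUE)
```

The column a value came from is discarded on purpose, so a Brakeset stored in V20 for one row and in another column for a different row would still end up in the same Brakeset column.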