Need help with data cleaning in R

Time: 2015-12-18 19:26:29

Tags: r data-cleaning

"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,

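I need to reformat it as follows, with one row per user and category, and a rank of NA for every category the user did not list.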

id gender age category            rank
1 Male    22  movies               1
1 Male    22  music                2
1 Male    22  travel               3
1 Male    22  cloths               4
1 Male    22  grocery              5
1 Male    22  books                NA
1 Male    22  rent                 NA
1 Male    22  fuel                 NA
1 Male    22  utility              NA
1 Male    22  online-shopping      NA

My attempt so far:

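library(reshape2)  # provides melt()
library(sqldf)     # provides sqldf()

mini <- read.csv("coding/mini.csv", header=FALSE)
mini_clean <- mini[-1,]  # drop the header row that was read in as data
df_mini <- melt(mini_clean, id.vars=c("V1","V2","V3"))
sqldf('select * from df_mini order by "V1"')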

Now I would like to know the best way to fill in all the missing categories for each user. Any help with this is appreciated.

1 answer:

Answer 0 (score: 1)

Consider using the base function reshape, since this is a routine example of reshaping/pivoting a dataset from wide to long:

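# df is assumed to hold the CSV read with its header row (id, gender, age, category1..category10)
reshapedf <- reshape(df, varying = c(4:13),          # columns category1..category10
                     v.names = c("category"),
                     timevar=c("rank"), 
                     times = c(1:10),
                     idvar = c("id", "gender", "age"), 
                     new.row.names = 1:1000,
                     direction = "long")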

# ORDER RESULTING DATA FRAME
reshapedf <- reshapedf[with(reshapedf , order(id, gender, age)), ]
# RESET ROW NAMES
row.names(reshapedf) <- 1:nrow(reshapedf)

Output

        id      gender      age     rank    category
1       1       Male        22      1       movies
2       1       Male        22      2       music
3       1       Male        22      3       travel
4       1       Male        22      4       cloths
5       1       Male        22      5       grocery
6       1       Male        22      6       NA
7       1       Male        22      7       NA
8       1       Male        22      8       NA
9       1       Male        22      9       NA
10      1       Male        22      10      NA
...
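
The output above still has ranks 6 to 10 with an NA category, whereas the question asks for every category to appear for every user with an NA rank where it was not chosen. A minimal base R sketch of one way to get there follows. It is an illustration under assumptions, not the answerer's method: the sample data is read inline (with the spelling "utiliy" kept as in the question), empty fields are treated as NA, and the names csv_text, long, grid and result are made up for this example.

```r
# Sample data from the question, read inline so the sketch is self-contained
csv_text <- '"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,'

# Assumption: empty fields mean "no category", so read them as NA
df <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

# Wide to long: one row per user and rank, then drop the unused ranks
long <- reshape(df,
                varying   = paste0("category", 1:10),
                v.names   = "category",
                timevar   = "rank",
                times     = 1:10,
                idvar     = "id",
                direction = "long")
long <- long[!is.na(long$category), ]

# Every user paired with every category seen anywhere in the data
# (merge with by = NULL returns the Cartesian product)
all_categories <- data.frame(category = sort(unique(long$category)),
                             stringsAsFactors = FALSE)
grid <- merge(df[, c("id", "gender", "age")], all_categories, by = NULL)

# Left join the observed ranks; categories a user never listed get rank NA
result <- merge(grid, long[, c("id", "category", "rank")],
                by = c("id", "category"), all.x = TRUE)

# Ranked categories first within each user, NA ranks last
result <- result[order(result$id, result$rank, result$category),
                 c("id", "gender", "age", "category", "rank")]
row.names(result) <- NULL

head(result, 10)
```

The cross join plus left join keeps the logic explicit: the set of categories is derived from the data itself, so a category that no user selected will not appear unless it is added to all_categories by hand.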