Cleaning up a data.table

Time: 2018-11-06 22:16:09

Tags: r dplyr data.table

I have a data.table:

    head(LocalCodes, n = 20)
        Local                                                        Codes
     1: Crane, Indiana                                               0189
     2: Rutland, Vermont                                             0401
     3: NA                                                           5003
     4: Naval Air Station Patuxent River, Maryland                   5001
     5: Williamsburg, Virginia                                       7408
     6: District of Columbia, District of Columbia                   0132
     7: Newport, Rhode Island                                        1702
     8: NA                                                           1805
     9: NA                                                           5306
    10: Washington DC, District of Columbia / Kansas City, Missouri  2210
    11: Kansas City, Missouri                                        0503
    12: Arlington, Virginia                                          0501
    13: Phoenix, Arizona                                             0301
    14: Washington DC, District of Columbia                          0132
    15: NA                                                           5001
    16: Collbran, Colorado                                           0303
    17: Washington DC, District of Columbia / Norfolk, Virginia      1102
    18: Minot, North Dakota                                          1802
    19: Washington DC, District of Columbia                          2005
    20: Pine Knot, Kentucky                                          4749

I am trying to split Local on " / " and keep the same Codes in a new data.table, using:

    Good <- LocalCodes[ , list( LocalCodes$Local <- unlist( strsplit( LocalCodes$Local , " / " ) ) , by=LocalCodes$Codes)]

I keep getting this error message:

    Error in strsplit(LocalCodes$Local, " / ") : non-character argument

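I did try adding as.character(LocalCodes$Local) to Good to get rid of the error, but then the data.table no longer works properly. It splits Local, but the Codes no longer line up with it, because Local is now a character vector.

Is there a way to split Local while keeping each resulting Local on the correct Codes?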

Example:

        Local                                Codes
     8: NA                                   1805
     9: NA                                   5306
    10: Kansas City, Missouri                2210
    11: Washington DC, District of Columbia  2210
    12: Kansas City, Missouri                0503
    13: Arlington, Virginia                  0501
    14: Phoenix, Arizona                     0301
    15: Washington DC, District of Columbia  0132
    16: NA                                   5001
    17: Collbran, Colorado                   0303
    18: Norfolk, Virginia                    1102
    19: Washington DC, District of Columbia  1102

Using: plyr, dplyr, data.table

Edit: here is the dput output:

    dput(head(LocalCodes, n = 20))
    structure(list(Local = list("Crane, Indiana", "Rutland, Vermont",
        "NA", "Naval Air Station Patuxent River, Maryland",
        "Williamsburg, Virginia", "District of Columbia, District of Columbia",
        "Newport, Rhode Island", "NA", "NA",
        "Washington DC, District of Columbia / Kansas City, Missouri",
        "Kansas City, Missouri", "Arlington, Virginia", "Phoenix, Arizona",
        "Washington DC, District of Columbia", "NA", "Collbran, Colorado",
        "Washington DC, District of Columbia / Norfolk, Virginia",
        "Minot, North Dakota", "Washington DC, District of Columbia",
        "Pine Knot, Kentucky"), Codes = list("0189", "0401", "5003",
        "5001", "7408", "0132", "1702", "1805", "5306", "2210", "0503",
        "0501", "0301", "0132", "5001", "0303", "1102", "1802", "2005",
        "4749")), class = c("data.table", "data.frame"), row.names = c(NA,
        -20L))

1 answer:

Answer 0: (score: 1)

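My original answer did not handle more than one item containing "/". I had a strategy for dealing with that variant of the data.table object, but in the process I found that your structure is unfortunately non-standard. Note that the dput output begins with:

    structure(list(Local = list("Crane, Indiana",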

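A typical data.table is not a list of lists. That structure is notorious for breaking data.frame operations, and it apparently breaks data.table operations as well. The following fixes your data object so that it looks like an "ordinary" data.table: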

LocalCodes[ , names(LocalCodes) := lapply(LocalCodes,unlist)]
#> dput(LocalCodes)
# structure(list(Local = c("Crane, Indiana", ...
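To make the diagnosis concrete, here is a minimal sketch of how one might confirm the list columns and see the effect of the fix. It assumes the data.table package is installed; `dt` is a hypothetical three-row stand-in for LocalCodes, and `was_list` and `err` are names introduced only for this illustration.

```r
library(data.table)

# Hypothetical stand-in for LocalCodes: both columns are built as list
# columns to mirror the dput output in the question.
dt <- data.table(
  Local = list("Crane, Indiana", "NA",
               "Washington DC, District of Columbia / Norfolk, Virginia"),
  Codes = list("0189", "5003", "1102")
)

# Check which columns are lists rather than atomic vectors.
was_list <- vapply(dt, is.list, logical(1))

# strsplit() only accepts character input, so a list column triggers
# the error from the question. Capture the message instead of stopping.
err <- tryCatch(strsplit(dt$Local, " / "),
                error = function(e) conditionMessage(e))

# Same fix as above: flatten every column in place.
dt[, names(dt) := lapply(dt, unlist)]

# After flattening, the columns are plain character vectors and
# strsplit() works on them.
pieces <- strsplit(dt$Local, " / ", fixed = TRUE)
```

Note that `unlist` only keeps rows aligned when every list element has length one, which is the case in the dput output shown in the question.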

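Now it is no longer a list of lists. So try handling the rows whose strings contain a slash separately from the rows that do not, and then bind the two parts together: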

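 # Rows containing "/" are split and each code is repeated per piece
 # (each=2 assumes exactly two places per row); the remaining rows are
 # appended unchanged. Splitting on "/" alone leaves the surrounding spaces.
 rbind( LocalCodes[grepl("/",Local) ,
            data.table(Local=unlist( strsplit(Local, split="/")),
                       Codes=rep(Codes,each=2))],
        LocalCodes[!grepl("/",Local)] )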
                                         Local Codes
 1:       Washington DC, District of Columbia   2210
 2:                      Kansas City, Missouri  2210
 3:       Washington DC, District of Columbia   1102
 4:                          Norfolk, Virginia  1102
 5:                             Crane, Indiana  0189
 6:                           Rutland, Vermont  0401
 7:                                         NA  5003
 8: Naval Air Station Patuxent River, Maryland  5001
 9:                     Williamsburg, Virginia  7408
10: District of Columbia, District of Columbia  0132
11:                      Newport, Rhode Island  1702
snipped-----
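The call above relies on every slash-containing row holding exactly two places, and it moves the split rows to the top. As a hedged alternative sketch (not the method used in this answer), the number of pieces per row can be taken from `lengths()` so that any number of " / " separators is handled and the original row order is kept. It assumes the columns have already been flattened to character vectors; `dt`, `pieces`, `long` and `long_by` are hypothetical names.

```r
library(data.table)

# Hypothetical stand-in for the flattened LocalCodes.
dt <- data.table(
  Local = c("Crane, Indiana",
            "Washington DC, District of Columbia / Kansas City, Missouri",
            "NA",
            "Washington DC, District of Columbia / Norfolk, Virginia"),
  Codes = c("0189", "2210", "5003", "1102")
)

# One character vector of places per original row.
pieces <- strsplit(dt$Local, " / ", fixed = TRUE)

# Repeat each code once per piece so the two columns stay aligned,
# and trim any stray whitespace around the places.
long <- data.table(
  Local = trimws(unlist(pieces)),
  Codes = rep(dt$Codes, times = lengths(pieces))
)

# Grouped variant, closer to what the question attempted. Rows that
# share a code end up in the same group.
long_by <- dt[, .(Local = unlist(strsplit(Local, " / ", fixed = TRUE))),
              by = Codes]
```

With the full data from the question, the grouped variant places rows with a repeated code (for example 0132 and 5001) next to each other, so the `lengths()` version is the one to use if the original order matters.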