Split a character column using common prefixes

Time: 2014-05-08 02:25:49

Tags: r

I have a data frame like this:
 ddf <- data.frame(
  X = c("Fruit.Apple", "Fruit.Pear", 
        "Car.Mazda", "Car.Toyota", 
        "North.American.City.Chicago", "North.American.City.Ottawa",
        "North.American.City.Toronto", "North.American.City.Los.Angeles",
        "Unique.Snowflake"),
  Y = runif(9)  # doesn't matter
  )

                                X         Y
1                     Fruit.Apple 0.2655087
2                      Fruit.Pear 0.3721239
3                       Car.Mazda 0.5728534
4                      Car.Toyota 0.9082078
5     North.American.City.Chicago 0.2016819
6      North.American.City.Ottawa 0.8983897
7     North.American.City.Toronto 0.9446753
8 North.American.City.Los.Angeles 0.6607978
9                Unique.Snowflake 0.6291140

I would like:

                  X.1              X.2         Y
1               Fruit            Apple 0.2655087
2               Fruit             Pear 0.3721239
3                 Car            Mazda 0.5728534
4                 Car           Toyota 0.9082078
5 North.American.City          Chicago 0.2016819
6 North.American.City           Ottawa 0.8983897
7 North.American.City          Toronto 0.9446753
8 North.American.City      Los.Angeles 0.6607978
9                <NA> Unique.Snowflake 0.6291140

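I'm not entirely convinced that my problem is solvable, but there does seem to be a pattern. I am stuck on the hard cases. If the prefixes were easy to separate this would be simple, but as the North.American.City example shows, sometimes the prefix itself contains the separator. It would also be relatively simple if the suffix never contained the separator, but .Los.Angeles should not be split. I also only want true prefixes to appear in X.1, as I demonstrate with Unique.Snowflake. The only idea I have is to create a new column holding all of the text before the last ., using gsub("(.*)\\..*$", "\\1", ...), plus some nested for loops to determine which of those are prefixes, but there must be a better way.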
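As a rough illustration of that last idea (a sketch added for clarity; the vector `x` and the names `candidate` and `shared` are assumptions, not from the original post), the pattern keeps everything before the last period, and a candidate only counts as a real prefix when more than one value shares it:

```r
# Illustrative subset of the data; `x`, `candidate` and `shared` are made-up names.
x <- c("Fruit.Apple", "Fruit.Pear",
       "North.American.City.Chicago", "North.American.City.Ottawa",
       "North.American.City.Los.Angeles", "Unique.Snowflake")

# The greedy (.*) captures everything before the LAST period.
candidate <- sub("(.*)\\..*$", "\\1", x)
candidate

# A candidate is treated as a shared prefix only if it occurs more than once.
shared <- candidate %in% candidate[duplicated(candidate)]
data.frame(x, candidate, shared)
```

Here the Los.Angeles row yields the candidate North.American.City.Los, which nothing else shares, so looking only at the text before the last period is not enough: shorter candidates have to be tried as well.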

2 answers:

Answer 0 (score: 2)

OK. This is a lot of messy code, but it gets the job done. I'm sure others can come up with a more elegant solution.

#sample vector for splitting
X = c("Fruit.Apple", "Fruit.Pear", 
    "Car.Mazda", "Car.Toyota", 
    "North.American.City.Chicago", "North.American.City.Ottawa", 
    "North.American.City.Toronto", "North.American.City.Los.Angeles",
    "Unique.Snowflake"
)

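#split on "." and prepare candidates: every (prefix, suffix) pair per item
parts <- strsplit(X, ".", fixed = TRUE)
scores <- lapply(parts, function(p) {
    lp <- length(p)
    list(
        # prefixes: "", then the first 1, 2, ..., lp pieces
        c("", sapply(seq_along(p), function(x) paste(p[1:x], collapse = "."))),
        # matching suffixes: the pieces from x onward, then ""
        c(sapply(seq_along(p), function(x) paste(p[x:lp], collapse = ".")), ""),
        # depth: number of pieces in the prefix, plus one
        seq_len(lp + 1)
    )
})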

#now combine the candidates of all items into one data frame
#(note: this data frame masks the name of base::options())
options <- do.call(rbind, lapply(seq_along(scores), function(i)
    data.frame(
        item = i,
        prefix = scores[[i]][[1]],
        suffix = scores[[i]][[2]],
        depth = scores[[i]][[3]])
))
#now add the freq score: how often each prefix occurs across all items
options$freq <- ave(rep.int(1, nrow(options)), options$prefix, FUN = length)

#finally, for each item select the longest prefix that occurs >1 times
best <- do.call(rbind, by(options, options$item, function(x) {
    x[order(x$freq <= 1, -x$depth), ][1, ]
}))
best[, 2:3]

This results in:

               prefix           suffix
1               Fruit            Apple
2               Fruit             Pear
3                 Car            Mazda
4                 Car           Toyota
5 North.American.City          Chicago
6 North.American.City           Ottawa
7 North.American.City          Toronto
8 North.American.City      Los.Angeles
9                     Unique.Snowflake
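For comparison, the same rule (keep the longest prefix that more than one value shares) can be sketched more compactly. This is not the answer's code: `split_common_prefix` is a hypothetical helper written for illustration. It assumes the full string should never count as its own prefix, and it returns NA instead of an empty string when no shared prefix exists, to match the output requested in the question.

```r
# Sketch only: `split_common_prefix` is a hypothetical helper, not from any package.
split_common_prefix <- function(x, sep = ".") {
  x <- as.character(x)
  parts <- strsplit(x, sep, fixed = TRUE)

  # All proper prefixes of each value (the full string itself is excluded).
  proper_prefixes <- lapply(parts, function(p) {
    n <- max(length(p) - 1L, 0L)
    vapply(seq_len(n), function(k) paste(p[1:k], collapse = sep), character(1))
  })

  # A prefix is "shared" when it is a proper prefix of more than one value.
  counts <- table(unlist(proper_prefixes))
  shared <- names(counts)[as.vector(counts > 1)]

  # Per value, keep the longest shared prefix, or NA if there is none.
  prefix <- vapply(proper_prefixes, function(cand) {
    hit <- cand[cand %in% shared]
    if (length(hit)) hit[length(hit)] else NA_character_
  }, character(1))

  # The suffix is what remains after the prefix and one separator.
  suffix <- x
  ok <- !is.na(prefix)
  suffix[ok] <- substring(x[ok], nchar(prefix[ok]) + nchar(sep) + 1L)

  data.frame(X.1 = prefix, X.2 = suffix, stringsAsFactors = FALSE)
}

X <- c("Fruit.Apple", "Fruit.Pear",
       "Car.Mazda", "Car.Toyota",
       "North.American.City.Chicago", "North.American.City.Ottawa",
       "North.American.City.Toronto", "North.American.City.Los.Angeles",
       "Unique.Snowflake")
res <- split_common_prefix(X)
res
```

With the question's data frame, something like cbind(res, Y = ddf$Y) should then give the X.1 / X.2 / Y layout that was asked for.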

Answer 1 (score: 0)

 ddf2 <- gsub("\\.([[:alpha:]]*)$", " \\1", ddf$X)
 data.frame( read.table(text=ddf2), Y=ddf$Y)
#------------
                   V1      V2         Y
1               Fruit   Apple 0.8097010
2               Fruit    Pear 0.6934737
3                 Car   Mazda 0.2143207
4                 Car  Toyota 0.5036963
5 North.American.City Chicago 0.7364826
6 North.American.City  Ottawa 0.8603377
7 North.American.City Toronto 0.4751705

(Your example construction code does not work with "Snowflake". Can you describe why we should not split at the last period?)