I have a data frame like this:
ddf <- data.frame(
  X = c("Fruit.Apple", "Fruit.Pear",
        "Car.Mazda", "Car.Toyota",
        "North.American.City.Chicago", "North.American.City.Ottawa",
        "North.American.City.Toronto", "North.American.City.Los.Angeles",
        "Unique.Snowflake"),
  Y = runif(9)  # doesn't matter
)
                                X         Y
1                     Fruit.Apple 0.2655087
2                      Fruit.Pear 0.3721239
3                       Car.Mazda 0.5728534
4                      Car.Toyota 0.9082078
5     North.American.City.Chicago 0.2016819
6      North.American.City.Ottawa 0.8983897
7     North.American.City.Toronto 0.9446753
8 North.American.City.Los.Angeles 0.6607978
9                Unique.Snowflake 0.6291140
I would like:
                  X.1              X.2         Y
1               Fruit            Apple 0.2655087
2               Fruit             Pear 0.3721239
3                 Car            Mazda 0.5728534
4                 Car           Toyota 0.9082078
5 North.American.City          Chicago 0.2016819
6 North.American.City           Ottawa 0.8983897
7 North.American.City          Toronto 0.9446753
8 North.American.City      Los.Angeles 0.6607978
9                <NA> Unique.Snowflake 0.6291140
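I am not entirely convinced my problem is solvable, but there does seem to be a pattern. What I am stuck on are the hard cases. If the prefix were always easy to separate this would be simple, but as the North.American.City example shows, a prefix sometimes contains the delimiter itself. It would also be relatively simple if suffixes never contained a ".", but Los.Angeles should not be split. I also want only genuine prefixes to appear in X.1, as I demonstrate with Unique.Snowflake. My only idea so far is to use gsub("(.*)\\..*$", "\\1", ...) to create new columns holding all the text between each "." and then use some nested for loops to work out which pieces are prefixes, but there must be a better way.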
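To make the difficulty concrete, here is a small sketch of what that gsub() call returns on three of the sample values. It only strips everything after the last ".", so it cuts Los.Angeles in half and treats Unique as if it were a prefix. The variable names x and stripped are just for illustration.

# Illustration only: strip everything from the last "." onward
x <- c("Fruit.Apple", "North.American.City.Los.Angeles", "Unique.Snowflake")
stripped <- gsub("(.*)\\..*$", "\\1", x)
print(stripped)
# [1] "Fruit"                   "North.American.City.Los" "Unique"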
Answer 0 (score: 2)
OK. This is a lot of messy code, but it gets the job done. I'm sure others can come up with a more elegant solution.
# sample vector for splitting
X <- c("Fruit.Apple", "Fruit.Pear",
       "Car.Mazda", "Car.Toyota",
       "North.American.City.Chicago", "North.American.City.Ottawa",
       "North.American.City.Toronto", "North.American.City.Los.Angeles",
       "Unique.Snowflake")

# split on "." and, for each item, list every candidate prefix/suffix split
# (seq_along is used instead of seq.int so that strings without a "." also work)
parts <- strsplit(X, ".", fixed = TRUE)
scores <- lapply(parts, function(p) {
  lp <- length(p)
  list(
    c("", sapply(seq_along(p), function(x) paste(p[1:x], collapse = "."))),   # prefixes, shortest first
    c(sapply(seq_along(p), function(x) paste(p[x:lp], collapse = ".")), ""),  # matching suffixes
    seq_len(lp + 1)                                                           # depth of each split
  )
})

# now combine the candidates for all items into one data frame
# (called `candidates` rather than `options` to avoid masking base::options)
candidates <- do.call(rbind, lapply(seq_along(scores), function(i)
  data.frame(
    item   = i,
    prefix = scores[[i]][[1]],
    suffix = scores[[i]][[2]],
    depth  = scores[[i]][[3]])
))

# now add the freq score: how often each candidate prefix occurs across all items
candidates$freq <- ave(rep.int(1, nrow(candidates)), candidates$prefix, FUN = length)

# finally, for each item select the longest prefix that occurs more than once
best <- do.call(rbind, by(candidates, candidates$item, function(x) {
  x[order(x$freq <= 1, -x$depth), ][1, ]
}))
best[, 2:3]
This results in:
               prefix           suffix
1               Fruit            Apple
2               Fruit             Pear
3                 Car            Mazda
4                 Car           Toyota
5 North.American.City          Chicago
6 North.American.City           Ottawa
7 North.American.City          Toronto
8 North.American.City      Los.Angeles
9                     Unique.Snowflake
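To get from this result to the X.1 / X.2 / Y layout asked for in the question, the empty prefix still has to become NA and Y has to be attached. A more compact sketch of the same idea (keep the longest prefix shared by more than one item) could look like the following. The function name split_common_prefix, its sep argument and the set.seed() call are assumptions made for illustration and are not part of the answer above. Unlike the code above, it only considers proper prefixes, so a string that appears twice is not treated as its own prefix.

# Hypothetical helper: for each string, pick the longest "."-delimited
# prefix that is shared with at least one other string; NA if there is none.
split_common_prefix <- function(x, sep = ".") {
  parts <- strsplit(x, sep, fixed = TRUE)
  # all proper prefixes of each string, e.g. "A.B.C" -> "A", "A.B"
  proper_prefixes <- lapply(parts, function(p) {
    n <- length(p)
    if (n < 2) return(character(0))
    vapply(seq_len(n - 1), function(k) paste(p[1:k], collapse = sep), character(1))
  })
  # a prefix counts as shared when more than one string has it
  counts <- table(unlist(proper_prefixes))
  shared <- names(counts)[counts > 1]
  out <- mapply(function(p, pre) {
    keep <- which(pre %in% shared)
    if (length(keep) == 0) {
      c(NA_character_, paste(p, collapse = sep))   # no shared prefix
    } else {
      k <- max(keep)                               # longest shared prefix
      c(pre[k], paste(p[(k + 1):length(p)], collapse = sep))
    }
  }, parts, proper_prefixes)
  data.frame(X.1 = out[1, ], X.2 = out[2, ], stringsAsFactors = FALSE)
}

set.seed(1)  # arbitrary seed, only so that Y is repeatable
ddf <- data.frame(
  X = c("Fruit.Apple", "Fruit.Pear", "Car.Mazda", "Car.Toyota",
        "North.American.City.Chicago", "North.American.City.Ottawa",
        "North.American.City.Toronto", "North.American.City.Los.Angeles",
        "Unique.Snowflake"),
  Y = runif(9),
  stringsAsFactors = FALSE
)
result <- cbind(split_common_prefix(ddf$X), Y = ddf$Y)
print(result)

Counting prefixes with table() replaces the freq column built with ave() above; either should work, this version just avoids building the intermediate data frame of candidates.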
Answer 1 (score: 0)
ddf2 <- gsub("\\.([[:alpha:]]*)$", " \\1", ddf$X)
data.frame( read.table(text=ddf2), Y=ddf$Y)
#------------
                   V1      V2         Y
1               Fruit   Apple 0.8097010
2               Fruit    Pear 0.6934737
3                 Car   Mazda 0.2143207
4                 Car  Toyota 0.5036963
5 North.American.City Chicago 0.7364826
6 North.American.City  Ottawa 0.8603377
7 North.American.City Toronto 0.4751705
(Your example construction code does not work with the "snowflake" row. Can you describe why we should not split at the last period?)