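I have two data frames. The first (`dt`) contains only `chr` columns, and the second (`TargetWord`) is a dictionary that also contains `chr` values. I used `pmatch` to search `dt` for the words available in `TargetWord` and to return each word's position in `TargetWord`. This works fine when the data frame is small. The problem starts when the data frame is large: only the first column gets word positions, and the remaining columns come back as `NA`.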
## Data Table
word_1 <- c("conflict","", "resolved", "", "", "")
word_2 <- c("", "one", "tricky", "one", "", "one")
word_3 <- c("thanks","", "", "comments", "par","")
word_4 <- c("thanks","", "", "comments", "par","")
word_5 <- c("", "one", "tricky", "one", "", "one")
dt <- data.frame(word_1, word_2, word_3,word_4, word_5, stringsAsFactors = FALSE)
## Targeted Words
TargetWord <- data.frame(cbind(c("conflict", "thanks", "tricky", "one", "two", "three")))
## convert into matrix (needed)
dt <- as.matrix(dt)
TargetWord <- as.matrix(TargetWord)
result <- `dim<-`(pmatch(dt, TargetWord, duplicates.ok=TRUE), dim(dt))
print(result)
This returns:
     [,1] [,2] [,3] [,4] [,5]
[1,]    1   NA    2    2   NA
[2,]   NA    4   NA   NA    4
[3,]   NA    3   NA   NA    3
[4,]   NA    4   NA   NA    4
[5,]   NA   NA   NA   NA   NA
[6,]   NA    4   NA   NA    4
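Because `result` holds positions in `TargetWord`, the matched words themselves can be recovered by indexing the dictionary with those positions. The following is a minimal sketch that rebuilds the small example so it runs on its own; `matched_words` is a name introduced here only for illustration, and `TargetWord` is kept as a plain character vector for simplicity.

```r
# Rebuild the small example from the question as a character matrix.
dt <- cbind(
  word_1 = c("conflict", "", "resolved", "", "", ""),
  word_2 = c("", "one", "tricky", "one", "", "one"),
  word_3 = c("thanks", "", "", "comments", "par", ""),
  word_4 = c("thanks", "", "", "comments", "par", ""),
  word_5 = c("", "one", "tricky", "one", "", "one")
)
TargetWord <- c("conflict", "thanks", "tricky", "one", "two", "three")

# Position of each cell of dt in TargetWord (NA where nothing matches).
result <- `dim<-`(pmatch(dt, TargetWord, duplicates.ok = TRUE), dim(dt))

# Index the dictionary by position; NA positions stay NA.
# matched_words is a hypothetical name, not part of the original code.
matched_words <- `dim<-`(TargetWord[as.vector(result)], dim(dt))
colnames(matched_words) <- colnames(dt)
print(matched_words)
```

Cells that had no match, including the empty strings, remain `NA` in `matched_words`.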
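Now, after reading in two `.csv` files, I get matches only for the first column, whereas I want them for all columns, as in the result above. Below, `dt1` is a 79 × 50 data frame and `word_dict` is a 13901 × 1 data frame.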
#################### on big data #####################################
dt1 <- read.csv("C:/Users/Wonderland/Downloads/string_feature.csv", stringsAsFactors = FALSE)
word_dict <- read.csv("C:/Users/Wonderland/Downloads/word_dict.csv", stringsAsFactors = FALSE)
dt1 <- as.matrix(dt1)
word_dict <- as.matrix(word_dict)
result <- `dim<-`(pmatch(dt1, word_dict, duplicates.ok=TRUE), dim(dt1))
print(result)
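The CSV files are not shown, so the cause cannot be determined from the question alone. One way to narrow it down is to count the matches per column and then retry after normalising whitespace and case. The sketch below uses invented toy data in place of the real files, and the helper `clean` is hypothetical. It also shows that `pmatch` returns `NA` for an ambiguous prefix, which becomes more likely with a large dictionary.

```r
# Toy stand-ins for the CSV contents (assumed, not the real data).
dt1 <- cbind(
  a = c("conflict", "one"),
  b = c(" thanks", "Tricky "),  # stray spaces and a capital letter
  c = c("tw", "t")              # "tw" is a unique prefix, "t" is ambiguous
)
word_dict <- c("conflict", "thanks", "tricky", "one", "two", "three")

# Count non-NA matches per column to see which columns fail.
raw <- `dim<-`(pmatch(dt1, word_dict, duplicates.ok = TRUE), dim(dt1))
print(colSums(!is.na(raw)))

# Hypothetical helper: strip surrounding whitespace and lower-case.
clean <- function(x) tolower(trimws(x))

# Match again on the normalised strings.
fixed <- `dim<-`(pmatch(clean(dt1), clean(word_dict), duplicates.ok = TRUE),
                 dim(dt1))
print(fixed)

# match() accepts exact matches only, so prefixes such as "tw" give NA.
exact <- `dim<-`(match(clean(dt1), clean(word_dict)), dim(dt1))
print(exact)
```

If the per-column counts rise after cleaning, the mismatch is in the data rather than in `pmatch`. If whole-word matching is what is wanted, `match()` avoids the partial-matching rules altogether.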
Answer 0 (score: 0)
Try `apply`:
apply(dt, 2, function(x) pmatch(x, TargetWord, duplicates.ok = TRUE))
As you can see, the result is the same, but this approach may also work for a huge data frame:
     word_1 word_2 word_3 word_4 word_5
[1,]      1     NA      2      2     NA
[2,]     NA      4     NA     NA      4
[3,]     NA      3     NA     NA      3
[4,]     NA      4     NA     NA      4
[5,]     NA     NA     NA     NA     NA
[6,]     NA      4     NA     NA      4
I also tried it on a larger input:
word_1 <- rep(c("conflict","", "resolved", "", "", ""),1000)
word_2 <- rep(c("", "one", "tricky", "one", "", "one"),1000)
word_3 <- rep(c("thanks","", "", "comments", "par",""),1000)
word_4 <- rep(c("thanks","", "", "comments", "par",""),1000)
word_5 <- rep(c("", "one", "tricky", "one", "", "one"),1000)
With exactly the same code, it worked.
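The scaled-up check can be written as one self-contained script. This is a sketch under the assumption that the five replicated vectors are combined into a matrix the same way as in the question; `dt_big`, `by_column` and `all_at_once` are names introduced here for illustration.

```r
# Replicate the six-row example 1000 times, as in the lines above.
dt_big <- cbind(
  word_1 = rep(c("conflict", "", "resolved", "", "", ""), 1000),
  word_2 = rep(c("", "one", "tricky", "one", "", "one"), 1000),
  word_3 = rep(c("thanks", "", "", "comments", "par", ""), 1000),
  word_4 = rep(c("thanks", "", "", "comments", "par", ""), 1000),
  word_5 = rep(c("", "one", "tricky", "one", "", "one"), 1000)
)
TargetWord <- c("conflict", "thanks", "tricky", "one", "two", "three")

# Column-by-column matching, as suggested in this answer.
by_column <- apply(dt_big, 2, function(x) {
  pmatch(x, TargetWord, duplicates.ok = TRUE)
})

# Whole-matrix matching, as in the question.
all_at_once <- `dim<-`(pmatch(dt_big, TargetWord, duplicates.ok = TRUE),
                       dim(dt_big))

# apply() keeps the column names, so compare the values only.
stopifnot(isTRUE(all.equal(by_column, all_at_once, check.attributes = FALSE)))

# Number of matched cells per column.
print(colSums(!is.na(by_column)))
```

On this synthetic input both approaches agree, which suggests the size of the data frame alone is not what causes the `NA` columns.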