Operating on char vectors inside a data.table object in R

Asked: 2015-11-18 16:40:55

Tags: r string data.table strsplit

I am still fairly new to using data.table and do not yet understand all of its subtleties. I have looked through the documentation and other examples on SO but could not find what I want, so please help!

I have a data.table that is essentially a char vector (each entry is a sentence):

DT=c("I love you","she loves me")
DT=as.data.table(DT)
colnames(DT) <- "text"
setkey(DT,text)

# > DT
#            text
# 1:   I love you
# 2: she loves me

What I would like to do is perform some basic string operations inside the DT object. For example, add a new column holding a char vector in which each entry is one WORD from the string in the "text" column.

So I would like a new column charvec:

> DT[1]$charvec
[1] "I"    "love" "you"

Of course, I want to do it the data.table way, super fast, because I need to do this kind of thing on files of more than 1 GB, with more complicated and computationally heavy functions. So no APPLY, LAPPLY or MAPPLY.

The closest I have come to what I want is the following:

myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]
# > DU2
#            text      charvec
# 1:   I love you   I,love,you
# 2: she loves me she,loves,me
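As a side note (my own sketch, not taken from the question): since strsplit() already returns a list with one character vector per sentence, the same list column can be created directly with `:=`, without grouping by text:

```r
library(data.table)

DT <- data.table(text = c("I love you", "she loves me"))

# strsplit() returns a list (one character vector per input string),
# so it can be assigned directly as a list column:
DT[, charvec := strsplit(text, " ")]

# Each element of the list column is a plain character vector:
DT[1, charvec[[1]]]
# [1] "I"    "love" "you"
```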

For example, to create a function that removes the first word of each sentence, I did this:

myfun2 <- function(l){l[[1]][-1]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]
# > DV2
#            text  charvec
# 1:   I love you love,you
# 2: she loves me loves,me

The problem is that in the charvec column I have a list rather than a vector...

> str(DU2[1]$charvec)
# List of 1
# $ : chr [1:3] "I" "love" "you"
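As an aside (my own sketch, not part of the question): for operations like dropping the first word, a regular expression on the original string avoids the split/list issue entirely:

```r
library(data.table)

DT <- data.table(text = c("I love you", "she loves me"))

# Drop the leading word (non-space run plus following whitespace)
# without splitting the sentence at all:
DT[, rest := sub("^\\S+\\s+", "", text)]

DT$rest
# [1] "love you" "loves me"
```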

1) How can I do what I want? Other kinds of functions I want to use include subsetting the char vector, applying some hash to it, etc.

2) By the way, can I obtain DU2 or DV2 in one line instead of two?

3) I do not really understand the data.table syntax. Why does the column V1 disappear when the command list() is used inside [..]?

4) On another thread I read a bit about the function cSplit. Is it any good? Is it a function that works on data.table objects?

Thanks a lot

UPDATE

Thanks @Ananda Mahto. Maybe I should make my final goal clearer. I have a huge file of 10,000,000 sentences stored as strings. As a first step of this project, I want to hash the first 5 words of each sentence. The 10,000,000 sentences do not even fit in my memory, so I first split them into 10 files of 1,000,000 sentences each, roughly 10 files of 1 GB. The following code takes a few minutes on my laptop for just one file.

library(data.table); library(digest);
num_row=1000000
DT <- fread("sentences.txt",nrows=num_row,header=FALSE,sep="\t",colClasses="character")
DT=as.data.table(DT)
colnames(DT) <- "text"
setkey(DT,text)
rawdata <- DT

hash2 <- function(word){ #using library(digest)
        as.numeric(paste("0x",digest(word,algo="murmur32"),sep=""))
}
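As an aside (a sketch under my own assumptions about the file layout): rather than pre-splitting the data into ten physical files, fread's skip and nrows arguments can read the one big file in chunks:

```r
library(data.table)

num_row <- 1000000L

# Hypothetical chunked read of the single big file; the file name
# "sentences.txt" and the 10-chunk count are assumptions carried over
# from the text above.
for (chunk in 0:9) {
  DT <- fread("sentences.txt", skip = chunk * num_row, nrows = num_row,
              header = FALSE, sep = "\t", colClasses = "character")
  setnames(DT, "text")
  # ... hash the first five words of each sentence in this chunk ...
}
```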

Then,

print(system.time({ 

        colnames(rawdata) <- "sentence"
        rawdata <- lapply(rawdata,strsplit," ")

        sentences_begin <- lapply(rawdata$sentence,function(x){x[2:6]})
        hash_list <- sapply(sentences_begin,hash2)
        # remove(rawdata)
})) ## end of print system.time for loading the data

I know I am pushing R to its limits, but I am trying to find a faster implementation and I was thinking of data.table features... hence all my questions.

Here is an implementation that does not involve lapply, but it is actually slower!

print(system.time({
myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]

myfun2 <- function(l){l[[1]][2:6]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]

rebuildsentence <- function(S){
        paste(S,collapse=" ") }

myfun3 <- function(l){hash2(rebuildsentence(l[[1]]))}

DW1 <- DV2[,myfun3(charvec),by=text]

})) #end of system.time

In this data.table implementation there is no lapply, so I hoped the hashing would be faster. But because every column holds a list instead of a char vector, this may slow the whole thing down significantly(?).

Using the first code above (with lapply/sapply) took more than 1 hour on my laptop. I hope to speed this up with a more efficient data structure. People using Python, Java, etc. do similar jobs in seconds.

Of course, another route would be to look for a faster hash function, but I assume the one in the digest package is already optimized.

1 answer:

Answer 0 (score: 3)

I am not sure what your goal is, but you can try cSplit_l from my "splitstackshape" package to get a list column:

library(splitstackshape)
DU <- cSplit_l(DT, "DT", " ")

Then you can write a function like the following to remove values from the list column:

RemovePos <- function(inList, pos = 1) {
  lapply(inList, function(x) x[-c(pos[pos <= length(x)])])
}

Example usage:

DU[, list(RemovePos(DT_list, 1)), by = DT]
#              DT       V1
# 1:   I love you love,you
# 2: she loves me loves,me
DU[, list(RemovePos(DT_list, 2)), by = DT]
#              DT     V1
# 1:   I love you  I,you
# 2: she loves me she,me
DU[, list(RemovePos(DT_list, c(1, 2))), by = DT]
#              DT  V1
# 1:   I love you you
# 2: she loves me  me

Update

Given your aversion to "lapply", perhaps you could try the following:

## make a copy of your "text" column
DT[, vals := text]

## Use `cSplit` to create a "long" dataset. 
## Add a column to indicate the word's position in the text.
DTL <- cSplit(DT, "vals", " ", "long")[, ind := sequence(.N), by = text][]
DTL
#            text  vals ind
# 1:   I love you     I   1
# 2:   I love you  love   2
# 3:   I love you   you   3
# 4: she loves me   she   1
# 5: she loves me loves   2
# 6: she loves me    me   3

## Now, you can extract values easily
DTL[ind == 1]
#            text vals ind
# 1:   I love you    I   1
# 2: she loves me  she   1
DTL[ind %in% c(1, 3)]
#            text vals ind
# 1:   I love you    I   1
# 2:   I love you  you   3
# 3: she loves me  she   1
# 4: she loves me   me   3
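Building on the long format (my own extension, not part of the answer): the first five words of each sentence could also be re-assembled per group, which is the piece the question ultimately wants to hash:

```r
library(data.table)
library(splitstackshape)

DT <- data.table(text = c("I love you", "she loves me"))
DT[, vals := text]
DTL <- cSplit(DT, "vals", " ", "long")[, ind := sequence(.N), by = text][]

# Paste back the first five words of every sentence; with these toy
# three-word sentences this just reproduces the full text:
DTL[ind <= 5, paste(vals, collapse = " "), by = text]
```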

Update 2

I do not know what kind of timings you will get, but as I mentioned in a comment, you could try a regular expression so that you do not have to split and then paste the strings back together.

Here is a sample...

Set up some data:

library(data.table)
DT <- data.table(
  text = c("This is a sentence with a lot of words.",
           "This is a sentence with some more words.",
           "Words and words and even some more words.",
           "But, I don't know how you want to deal with punctuation...",
           "Just one more sentence, for easy multiplication.")
)

DT2 <- rbindlist(replicate(10000/nrow(DT), DT, FALSE))
DT3 <- rbindlist(replicate(1000000/nrow(DT), DT, FALSE))

Test the gsub pattern for extracting the first 5 words from each sentence....

## Regex to extract first five words -- this should work....
patt <- "^((?:\\S+\\s+){4}\\S+).*"

## Check out some of the timings
system.time(temp <- DT2[, gsub(patt, "\\1", text)])
#    user  system elapsed 
#    0.03    0.00    0.03 
system.time(temp2 <- DT3[, gsub(patt, "\\1", text)])
#    user  system elapsed 
#       3       0       3 
head(temp)
# [1] "This is a sentence with"     "This is a sentence with"     "Words and words and even"   
# [4] "But, I don't know how"       "Just one more sentence, for" "This is a sentence with" 
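One edge case worth noting (my own observation, not from the answer): when a sentence has fewer than five words the pattern does not match, and gsub() simply returns the string unchanged:

```r
patt <- "^((?:\\S+\\s+){4}\\S+).*"

# No match for a sentence with fewer than five words,
# so the input comes back untouched:
gsub(patt, "\\1", "too short")
# [1] "too short"

# Five or more words: the capture group keeps exactly the first five:
gsub(patt, "\\1", "a b c d e f g")
# [1] "a b c d e"
```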

A guess at what you want to do...

## I'm assuming you want something like this....
## Takes about a minute on my system. 
## ... but note the system time for the creation of "temp2" (without digest)
## Not sure if I interpreted your hash requirement correctly....
system.time(out <- DT3[
  , firstFive := gsub(patt, "\\1", text)][
  , firstFiveHash := hash2(firstFive), by = 1:nrow(DT3)][])
#    user  system elapsed 
#   62.14    0.05   62.20 

head(out)
#                                                          text                   firstFive firstFiveHash
# 1:                    This is a sentence with a lot of words.     This is a sentence with    4179639471
# 2:                   This is a sentence with some more words.     This is a sentence with    4179639471
# 3:                  Words and words and even some more words.    Words and words and even    2556713080
# 4: But, I don't know how you want to deal with punctuation...       But, I don't know how    3765680401
# 5:           Just one more sentence, for easy multiplication. Just one more sentence, for     298317689
# 6:                    This is a sentence with a lot of words.     This is a sentence with    4179639471