I have a large data frame of authors and their corresponding texts (about 450,000 records). From the data frame I extracted two vectors, one for the authors and one for the texts, for example:
author <- c("Sallust",
"Tacitus",
"Justin",
"Cato the Elder",
"Claudius",
"Quintus Fabius Pictor",
"Justin",
"Claudius",
"Cato the Elder",
"Tacitus",
"Sallust")
text <- c("Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet",
"Lorem ipsum dolor sit amet")
My goal is to split the dataset into subsets small enough for text mining, while still keeping all records with the same author in the same chunk.
I noticed that extracting the vectors author and text from the original data frame is fast, but combining the extracted vectors into a new data frame is very slow, so I think I should avoid creating a data frame that holds all the records.
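For context, a minimal sketch of the two operations being compared, assuming the original data frame is called df and has character columns author and text (these names are placeholders, they are not given above):
# Fast: pulling the columns out of the data frame as plain character vectors
author <- as.character(df$author)
text <- as.character(df$text)
# Slow on ~450,000 records, and what I want to avoid: recombining the
# extracted vectors into a new data frame
# combined <- data.frame(author = author, text = text, stringsAsFactors = FALSE)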
The "smart" solution would probably be to:
1. sort author (to make sure records with the same author are contiguous);
2. sort text by author;
3. find the splitAt positions at which author and text should be split;
4. split author and text at those positions.
In the code below, assuming my approach makes sense, I got the first 3 steps:
# Sort vectors
order <- order(author)
author <- author[order]
text <- text[order]
same_author <- duplicated(author)
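For the sample vectors above this gives, after sorting, a FALSE in same_author exactly where a new author begins, which are the only positions where it is safe to split:
# author (sorted): Cato the Elder, Cato the Elder, Claudius, Claudius, Justin,
#   Justin, Quintus Fabius Pictor, Sallust, Sallust, Tacitus, Tacitus
# same_author:     FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE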
But I don't know how to go on from here. It should probably be something like this:
# Index for splitting
max_length <- 2
num_records <- length(author)
num_chunks <- as.integer((num_records %/% max_length)) - 1
# Initialise vector (not sure it needs value 2 to indicate first index where to split)
splitAt <- 1
for (n in 1:num_chunks){
index <- n * max_length + 1
while (same_author[index]!=FALSE) {
splitAt <- append(splitAt, index)
index <- index + 1
}
}
Answer 0: (score: 0)
I found this solution (the key algorithm comes from here).
# Sort vectors
order <- order(author)
author <- author[order]
text <- text[order]
same_author <- duplicated(author)
# Index for splitting
len_chunks <- 2
num_records <- length(author)
num_chunks <- as.integer((num_records %/% len_chunks)) - 1
# Initialise the vector of split positions
splitAt_index <- numeric()
index <- len_chunks
# For each chunk, advance the candidate split point until it lands on a record
# that starts a new author, so that no author is split across two chunks
for (n in 1:num_chunks){
while (same_author[index]!=FALSE) {
index <- index + 1
}
splitAt_index <- append(splitAt_index, index)
index <- index + len_chunks
}
# Function to split vector based on position indexes from https://stackoverflow.com/a/16358095/1707938
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))
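# For example, splitAt(c("a","a","b","b","c"), c(3, 5)) returns
# list(c("a","a"), c("b","b"), "c"): each position listed in pos starts a new
# group, because cumsum(seq_along(x) %in% pos) increments the group id there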
author_list <- splitAt(author, splitAt_index)
text_list <- splitAt(text, splitAt_index)
# Bind each author/text chunk into a matrix named corpus_1, corpus_2, ...
for (i in 1:length(author_list)) {
m <- cbind(author_list[[i]],text_list[[i]])
assign(paste("corpus_",i , sep=""), m)
}
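Tracing this on the 11 sample records above with len_chunks <- 2 (my reading of the code, not output from the original post), it should produce five matrices:
# corpus_1: both "Cato the Elder" records
# corpus_2: both "Claudius" records
# corpus_3: both "Justin" records
# corpus_4: "Quintus Fabius Pictor" plus both "Sallust" records (the chunk is
#           stretched so that no author is split across chunks)
# corpus_5: both "Tacitus" records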
It seems quite fast. On a MacBook Pro 2.4 GHz with 4 GB of RAM, with 5 character vectors of length 448,634:
user system elapsed
13.248 0.174 13.662
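For reference, a timing like the one above can be reproduced by wrapping the chunking code in system.time(), a sketch assuming the author and text vectors are already in memory:
system.time({
  # sorting, the split-index loop, splitAt() and the corpus_ assignments
})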