This is related to this question:
For convenience, I'll repeat the same information here:
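I want to convert each string (of length 3) into a vector of length 15 by representing each amino acid (letter) with 5 numbers, as shown in the table below.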
aminoacid <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
aminoacid1 <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
aminoacid2 <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
df <- expand.grid(aminoacid, aminoacid1, aminoacid2)
df <- transform(df, newname = paste(df$Var1, df$Var2, df$Var3, sep=""))
ptuples <- df[,4]
The approach proposed in that post is:
key <-
read.table(
text = " pah pss ms cc ec
A -0.59145974 -1.30209266 -0.7330651 1.5703918 -0.14550842
C -1.34267179 0.46542300 -0.8620345 -1.0200786 -0.25516894
D 1.05015062 0.30242411 -3.6559147 -0.2590236 -3.24176791
E 1.35733226 -1.45275578 1.4766610 0.1129444 -0.83715681
F -1.00610084 -0.59046634 1.8909687 -0.3966186 0.41194139
G -0.38387987 1.65201497 1.3301017 1.0449765 2.06385566
H 0.33616543 -0.41662780 -1.6733690 -1.4738898 -0.07772917
I -1.23936304 -0.54652238 2.1314349 0.3931618 0.81630366
K 1.83146558 -0.56109831 0.5332237 -0.2771101 1.64762794
L -1.01895162 -0.98693471 -1.5046185 1.2658296 -0.91181195
M -0.66312569 -1.52353917 2.2194787 -1.0047207 1.21181214
N 0.94535614 0.82846219 1.2991286 -0.1688162 0.93339498
P 0.18862522 2.08084151 -1.6283286 0.4207004 -1.39177378
Q 0.93056541 -0.17926549 -3.0048731 -0.5025910 -1.85303476
R 1.53754853 -0.05472897 1.5021086 0.4403185 2.89744417
S -0.22788299 1.39869991 -4.7596375 0.6701745 -2.64747356
T -0.03181782 0.32571153 2.2134612 0.9078985 1.31337035
V -1.33661279 -0.27854634 -0.5440132 1.2419935 -1.26225362
W -0.59533918 0.00907760 0.6719274 -2.1275244 -0.18358096
Y 0.25999617 0.82992312 3.0973596 -0.8380164 1.51150958"
)
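However, with a character vector of length 10^9 this is very inefficient and computationally expensive. How can this be done efficiently? I am considering the hashmap package, but I'm not sure how to go about it. I would still like the output as a data frame, as in the solution proposed above.
Thanks!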
Answer 0 (score: 1)
The following approaches appear to be faster than the current one.
1) This approach uses only a single loop and splits 'ptuples' with strsplit
t(sapply(strsplit(as.character(ptuples), ""), function(x) c(t(key[x,]))))
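To see what approach 1 does to each tuple, here is a minimal sketch on toy data. The names toy_key, toy_tuples and toy_out, and the values in the toy key, are made up for illustration and are not from the original post.

```r
# Toy key: 3 letters, 2 properties each (made-up values, illustration only)
toy_key <- data.frame(
  p1 = c(1, 2, 3),
  p2 = c(10, 20, 30),
  row.names = c("A", "C", "D")
)
toy_tuples <- c("ACD", "DDA")

# Split every tuple into its letters: a list of character vectors of length 3
letters_list <- strsplit(toy_tuples, "")

# For each tuple, look up its 3 rows in the key and flatten them row by row;
# t() is needed because c() flattens a matrix column by column
toy_out <- t(sapply(letters_list, function(x) c(t(toy_key[x, ]))))
toy_out
```

Each row of toy_out holds the properties of the first letter, then the second, then the third, so with the real key (5 properties) every tuple becomes a vector of length 15.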
2) We paste everything into a single string, split it once, subset key by the resulting letters, and then cbind every third row
m1 <- key[strsplit(paste(ptuples, collapse=""), "")[[1]],]
output3 <- cbind(m1[c(TRUE, FALSE, FALSE),], m1[c(FALSE, TRUE, FALSE),],
m1[c(FALSE, FALSE, TRUE),])
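The trick in approach 2 is that a logical index shorter than the number of rows is recycled, so c(TRUE, FALSE, FALSE) selects rows 1, 4, 7, and so on. A small sketch on toy data follows; toy_key, toy_tuples and the values are hypothetical, and the key is stored as a plain numeric matrix here rather than a data frame, which is an assumption of this sketch.

```r
# Toy key as a numeric matrix (made-up values, illustration only)
toy_key <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2,
                  dimnames = list(c("A", "C", "D"), c("p1", "p2")))
toy_tuples <- c("ACD", "DDA")

# One long vector of letters: "A" "C" "D" "D" "D" "A"
all_letters <- strsplit(paste(toy_tuples, collapse = ""), "")[[1]]

# One row of properties per letter, in input order
m <- toy_key[all_letters, ]

# The logical index is recycled along the rows:
# first letters are rows 1, 4, ...; second letters rows 2, 5, ...; and so on
first  <- m[c(TRUE, FALSE, FALSE), , drop = FALSE]
second <- m[c(FALSE, TRUE, FALSE), , drop = FALSE]
third  <- m[c(FALSE, FALSE, TRUE), , drop = FALSE]

toy_out <- cbind(first, second, third)
rownames(toy_out) <- toy_tuples
toy_out
```

The result has one row per tuple and the same values as the per-tuple loop, but the key is indexed only once for the whole input.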
On the dataset provided by the OP, the system.time results are
system.time({
output <- t(sapply(as.character(ptuples),
function(x) sapply(1:3, function(i) key[substr(x,i,i),])))
})
# user system elapsed
# 3.13 0.00 3.28
system.time({
output2 <- t(sapply(strsplit(as.character(ptuples), ""), function(x) c(t(key[x,]))))
})
# user system elapsed
# 1.50 0.01 1.52
system.time({
m1 <- key[strsplit(paste(ptuples, collapse=""), "")[[1]],]
output3 <- cbind(m1[c(TRUE, FALSE, FALSE),], m1[c(FALSE, TRUE, FALSE),],
m1[c(FALSE, FALSE, TRUE),])
})
# user system elapsed
# 0.01 0.00 0.02
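For a very long input, one further idea (not timed here, so treat it as a sketch rather than a measured result) is to exploit the fact that with 20 letters there are at most 20^3 = 8000 distinct tuples: encode each distinct tuple once, then expand with match and integer indexing. The helper encode, the toy key and the sample data below are all hypothetical. Note also that a 10^9 by 15 table of doubles needs roughly 120 GB, so memory is likely to be the real limit at that size.

```r
# Toy key as a numeric matrix (made-up values, illustration only)
toy_key <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2,
                  dimnames = list(c("A", "C", "D"), c("p1", "p2")))

# A longer input with many repeated tuples (hypothetical data)
set.seed(1)
toy_tuples <- sample(c("ACD", "DDA", "CCC"), 1000, replace = TRUE)

# Hypothetical helper: the paste / split / cbind idea wrapped in a function
encode <- function(tuples, key) {
  letters_vec <- strsplit(paste(tuples, collapse = ""), "")[[1]]
  m <- key[letters_vec, , drop = FALSE]
  cbind(m[c(TRUE, FALSE, FALSE), , drop = FALSE],
        m[c(FALSE, TRUE, FALSE), , drop = FALSE],
        m[c(FALSE, FALSE, TRUE), , drop = FALSE])
}

# Encode each distinct tuple only once
uniq <- unique(toy_tuples)
lookup <- encode(uniq, toy_key)

# Expand back to the full input by integer row indexing
toy_out <- lookup[match(toy_tuples, uniq), , drop = FALSE]
dim(toy_out)
```

The string splitting then runs over the distinct tuples only, and the remaining work on the full vector is a single match plus a row lookup.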