Efficient way to replace strings with numeric values based on a data frame "dictionary"

Asked: 2017-11-10 16:05:43

Tags: r

This is related to the following question:

How to convert a string of text into a vector based on given values numeric values to replace each letter with

For convenience, I will repeat the same information here:

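I want to convert each string (of length 3) into a vector of length 15 by representing each amino acid (letter) with 5 numbers, taken from the table `key` given further below. The strings themselves, `ptuples`, are generated as follows: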

aminoacid <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
aminoacid1 <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
aminoacid2 <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
df <- expand.grid(aminoacid, aminoacid1, aminoacid2)
df <- transform(df, newname = paste(df$Var1, df$Var2, df$Var3, sep=""))
ptuples <- df[,4]

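The approach proposed in that post looks up each letter in the following table, `key` (one row per amino acid, five numeric columns):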

key <-
  read.table(
    text = "          pah         pss         ms         cc          ec
    A -0.59145974 -1.30209266 -0.7330651  1.5703918 -0.14550842
    C -1.34267179  0.46542300 -0.8620345 -1.0200786 -0.25516894
    D  1.05015062  0.30242411 -3.6559147 -0.2590236 -3.24176791
    E  1.35733226 -1.45275578  1.4766610  0.1129444 -0.83715681
    F -1.00610084 -0.59046634  1.8909687 -0.3966186  0.41194139
    G -0.38387987  1.65201497  1.3301017  1.0449765  2.06385566
    H  0.33616543 -0.41662780 -1.6733690 -1.4738898 -0.07772917
    I -1.23936304 -0.54652238  2.1314349  0.3931618  0.81630366
    K  1.83146558 -0.56109831  0.5332237 -0.2771101  1.64762794
    L -1.01895162 -0.98693471 -1.5046185  1.2658296 -0.91181195
    M -0.66312569 -1.52353917  2.2194787 -1.0047207  1.21181214
    N  0.94535614  0.82846219  1.2991286 -0.1688162  0.93339498
    P  0.18862522  2.08084151 -1.6283286  0.4207004 -1.39177378
    Q  0.93056541 -0.17926549 -3.0048731 -0.5025910 -1.85303476
    R  1.53754853 -0.05472897  1.5021086  0.4403185  2.89744417
    S -0.22788299  1.39869991 -4.7596375  0.6701745 -2.64747356
    T -0.03181782  0.32571153  2.2134612  0.9078985  1.31337035
    V -1.33661279 -0.27854634 -0.5440132  1.2419935 -1.26225362
    W -0.59533918  0.00907760  0.6719274 -2.1275244 -0.18358096
    Y  0.25999617  0.82992312  3.0973596 -0.8380164  1.51150958"
  )

However, this is very inefficient and computationally expensive when applied to a character vector of length 10^9.

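How can I do this efficiently? I was thinking of using the hashmap package, but I am not sure how to go about it. I would still like the output as a data frame, as in the solution proposed above.

Thanks!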

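1 Answer:

Answer 0: (score: 1)

The following approaches appear to be faster than the current one.

1) This approach uses only a single loop and splits 'ptuples' with strsplit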

output2 <- t(sapply(strsplit(as.character(ptuples), ""), function(x) c(t(key[x,]))))
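To make the `c(t(key[x,]))` step less opaque, here is a minimal sketch on toy data. The small `key` (3 letters, 2 properties) and the two tuples are assumptions made up for illustration; they stand in for the 20 x 5 table and the 8000 tuples from the question.

```r
# Toy stand-in for `key` (assumption: 3 letters x 2 properties, not the real table)
key <- data.frame(pah = c(1, 2, 3), pss = c(10, 20, 30),
                  row.names = c("A", "C", "D"))
ptuples <- c("ACD", "DDA")

# Split one tuple into single letters and pull the matching rows of `key`
letters1 <- strsplit(ptuples[1], "")[[1]]   # "A" "C" "D"
rows <- key[letters1, ]                      # one row per letter

# t() puts each letter in its own column; c() then flattens column by column,
# giving letter 1's values, then letter 2's, then letter 3's
flat <- c(t(rows))

# The same lookup for every tuple; t() makes one result row per tuple
output2 <- t(sapply(strsplit(ptuples, ""), function(x) c(t(key[x, ]))))
```

With the real `key`, each tuple yields 3 x 5 = 15 numbers, so `output2` has one row per tuple and 15 columns.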

2) We paste everything into a single string, split it once, then subset the result and cbind the pieces

m1 <- key[strsplit(paste(ptuples, collapse=""), "")[[1]],]

output3 <- cbind(m1[c(TRUE, FALSE, FALSE),], m1[c(FALSE, TRUE, FALSE),],
                m1[c(FALSE, FALSE, TRUE),])
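The subsetting step relies on R recycling the logical index: `c(TRUE, FALSE, FALSE)` selects rows 1, 4, 7, ... of `m1`, which are the first letters of each tuple, and so on for the second and third letters. The sketch below shows this on toy data and also tries a plain-matrix variant; the toy `key`, the two tuples and the name `output4` are assumptions for illustration, and the matrix variant is a possible alternative rather than part of the original answer.

```r
# Toy stand-in for `key` (assumption: 3 letters x 2 properties, not the real table)
key <- data.frame(pah = c(1, 2, 3), pss = c(10, 20, 30),
                  row.names = c("A", "C", "D"))
ptuples <- c("ACD", "DDA")

# One split for the whole input: "A" "C" "D" "D" "D" "A"
chars <- strsplit(paste(ptuples, collapse = ""), "")[[1]]
m1 <- key[chars, ]                      # one row per letter, in input order

# Recycled logical indices pick every third row, starting at row 1, 2 and 3
first  <- m1[c(TRUE, FALSE, FALSE), ]   # first letter of each tuple
second <- m1[c(FALSE, TRUE, FALSE), ]   # second letter of each tuple
third  <- m1[c(FALSE, FALSE, TRUE), ]   # third letter of each tuple
output3 <- cbind(first, second, third)

# Possible variant on a numeric matrix: index once, then reshape so that
# each result row holds the values of three consecutive letters
km <- as.matrix(key)
output4 <- matrix(t(km[chars, , drop = FALSE]),
                  ncol = 3 * ncol(km), byrow = TRUE)
```

Indexing a matrix skips the unique row names that a data frame has to generate for repeated letters, so the variant may be cheaper on long inputs; `as.data.frame(output4)` converts the result back if a data frame is required.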

Benchmarks

Timings with system.time on the dataset provided by the OP:

# baseline: per-character lookup with substr
system.time({
  output <- t(sapply(as.character(ptuples),
                     function(x) sapply(1:3, function(i) key[substr(x, i, i), ])))
})
#   user  system elapsed 
#   3.13    0.00    3.28 

# approach 1: split each tuple with strsplit
system.time({
  output2 <- t(sapply(strsplit(as.character(ptuples), ""), function(x) c(t(key[x, ]))))
})
#   user  system elapsed 
#   1.50    0.01    1.52 

# approach 2: paste, split once, subset and cbind
system.time({
  m1 <- key[strsplit(paste(ptuples, collapse = ""), "")[[1]], ]

  output3 <- cbind(m1[c(TRUE, FALSE, FALSE), ], m1[c(FALSE, TRUE, FALSE), ],
                   m1[c(FALSE, FALSE, TRUE), ])
})
#   user  system elapsed 
#   0.01    0.00    0.02
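For much longer inputs, one option worth considering is to encode only the distinct tuples and then expand with `match()`, since 20 letters give at most 20^3 = 8000 different tuples. The sketch below is an assumption-laden illustration, not part of the original answer: `encode_tuples` is a hypothetical helper, the toy `key` stands in for the real table, and all tuples are assumed to have the same length.

```r
# Toy stand-in for `key` (assumption: 3 letters x 2 properties, not the real table)
key <- data.frame(pah = c(1, 2, 3), pss = c(10, 20, 30),
                  row.names = c("A", "C", "D"))

# Hypothetical helper: encode each distinct tuple once, then expand to the input
encode_tuples <- function(x, key) {
  km <- as.matrix(key)
  u <- unique(x)                                   # distinct tuples only
  k <- nchar(u[1])                                 # assumes equal-length tuples
  chars <- strsplit(paste(u, collapse = ""), "")[[1]]
  # One row per distinct tuple, k * ncol(key) columns
  lookup <- matrix(t(km[chars, , drop = FALSE]),
                   ncol = k * ncol(km), byrow = TRUE)
  # match() maps every input element to the row of its distinct tuple
  lookup[match(x, u), , drop = FALSE]
}

x <- c("ACD", "DDA", "ACD")
out <- encode_tuples(x, key)
```

`match()` is hash-based internally, so this sketch needs no extra package. Note that with the real `key`, a dense numeric result with 10^9 rows and 15 columns takes about 120 GB, so an input of that size would probably have to be processed in chunks.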