In R, how to evaluate two vectors based on other vectors

Time: 2017-01-23 23:38:03

Tags: r vector replication matching

I am fairly new to R, so any help is appreciated.

A sample dataset is below; an image of the dataset was also attached.

a           B           C           D
12.97221,   64.78909    1           2
69.64817,   321.90037   2           28
318.87946,  259.29946   3           5
326.17622,  94.7089     9           8
137.54006,  325.34917   5           88
258.06002,  94.77531    6           63
258.92824,  322.20164   7           64
98.57514,   12.96828    8           34
98.46303,   139.27264   9           21
317.22764,  261.25563   10          97

My goal: I need to

1) look at value in column A
2) find the nearest/closest number in column B
3) test to see if the value in column B has already been selected
4) if the value in column B has already been selected, then ignore and choose the next closest value.
5) once a new, non-duplicated, value is chosen from column B, then
6) Test to see if the value in column C that is on the same row as the value of interest in column A is not the same as the value in column D on the same row as the nearest chosen value in column B
7) if the values in column C and D are NOT the same, then 
8) return the value from column B into a new column
9) if the value in column C and D are the same, then repeat steps 4-7 until a) a new, non-duplicated value is chosen, and b) the value in C and D are not equal. 

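Here is my code so far. It solves the problem of finding the nearest number "without replacement", but it does not account for matching values in columns C and D before a value in column B is selected. It was developed by "Chase", from here: How to get the closest element in a vector for every element in another vector without duplicates?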

foo <- function(a,b) {

  out <- cbind(a, bval = NA)

  for (i in seq_along(a)) {
    #which value of B is closest?
    whichB <- which.min(abs(b - a[i]))
    #Assign that value to the bval column
    out[i, "bval"] <- b[whichB]
    #Remove that value of B from being chosen again
    b <- b[-whichB]
  }

  return(out)
}
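
To make the gap concrete, here is a minimal sketch that runs the quoted function on the ten sample rows and checks the C/D rule for the second row. The vector names `a`, `B`, `C`, `D` and the helper variables `picked` and `picked_row` are only illustrative, and the lookup with `match()` assumes the values in B are unique, as they are in the sample.

```r
# foo() as quoted above, with its closing brace
foo <- function(a, b) {
  out <- cbind(a, bval = NA)
  for (i in seq_along(a)) {
    whichB <- which.min(abs(b - a[i]))  # position of the closest remaining b
    out[i, "bval"] <- b[whichB]
    b <- b[-whichB]                     # remove it so it cannot be chosen again
  }
  out
}

# The ten sample rows from the question
a <- c(12.97221, 69.64817, 318.87946, 326.17622, 137.54006,
       258.06002, 258.92824, 98.57514, 98.46303, 317.22764)
B <- c(64.78909, 321.90037, 259.29946, 94.7089, 325.34917,
       94.77531, 322.20164, 12.96828, 139.27264, 261.25563)
C <- c(1, 2, 3, 9, 5, 6, 7, 8, 9, 10)
D <- c(2, 28, 5, 8, 88, 63, 64, 34, 21, 97)

res <- foo(a, B)
picked <- unname(res[2, "bval"])  # value chosen for a[2] = 69.64817
picked_row <- match(picked, B)    # row of that value in the original B
C[2] == D[picked_row]             # TRUE means the C/D rule is violated
```

On the sample data the second row picks 64.78909, whose D value equals the C value of that row, which is exactly the case the function does not handle.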

Hopefully this (below) is a better description and example of my problem.

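See the adjusted table, which shows my problem better. Take the value 12.97221 in column A, evaluate column B, and select the value 12.96828. Then look up the value in column C that corresponds to 12.97221, which is 1, and the value in column D that corresponds to 12.96828 (D = 34). Since 12.96828 in column B has not been selected yet and the values in columns C and D do not match, I want it to return 12.96828 in column E.

Next it would take the second value in column A, 69.64817, and compare it with the values in column B. It should select 64.78909 and check whether that value has already been selected. It then compares the value in column C for 69.64817 (C = 2) with the value in column D for 64.78909 (D = 2). Although this is the first time 64.78909 has been selected, the values in C and D are the same, so I need to select the next closest value from column B, 94.7089, and check whether it has already been selected; it has not. Then compare the value in column C for the value in column A (C = 2) with the value in column D for 94.7089 (D = 8). Since 94.7089 has not been selected yet and the values in column C (C = 2) and column D (D = 8) differ, 94.7089 is returned in column E.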

Again, thanks in advance; I hope I have described my problem adequately.

The data, 98 rows:

     a         b  c  d
1   12.97221 297.91173  1  1
2   69.64817 298.19087  2  2
3  318.87946 169.03864  3  3
4  326.17622 169.32014  4  4
5  137.54006 336.65953  5  5
6  258.06002  94.70890  6  6
7  258.92824  94.77531  7  7
8   98.57514 290.19832  8  8
9   98.46303 290.40790  9  9
10 317.22764 154.38380 10 10
11 316.64421 148.78655 11 11
12 310.73702 153.32877 12 12
13 237.32708 107.83971 13 13
14 250.65386 108.05706 14 14
15 337.09543 180.63118 15 15
16 337.03365 181.02949 16 16
17 301.22772 185.20628 17 17
18 332.93530 185.97922 18 18
19 340.84127 220.40438 19 19
20 357.42706 220.83922 20 20
21 244.89806  83.18630 21 21
22 244.84391  83.28693 22 22
23  97.16921 338.39649 23 23
24 114.62798 338.43398 24 24
25 178.90640  53.22144 25 25
26 175.59257  57.77149 26 26
27 173.32116  60.62938 27 27
28 172.20906  61.93639 28 28
29 246.51226 150.04782 29 29
30 258.00836 150.65750 30 30
31 259.85790 156.03397 31 31
32 326.10208 230.30117 32 32
33 324.96532 230.59314 33 33
34 319.40851 233.05470 34 34
35 146.11989  10.86714 35 35
36 144.63489  12.96828 36 36
37 139.89335  18.90677 37 37
38 119.96566  18.75278 38 38
39 109.18017  28.03931 39 39
40 108.24683  28.87934 40 40
41 302.29211 230.30386 41 41
42 297.28305 233.96142 42 42
43 244.72843  77.53609 43 43
44 244.55468  77.62372 44 44
45 243.47944  78.07812 45 45
46 181.89548  55.90604 46 46
47 180.80139  55.99444 47 47
48 150.37128  59.83512 48 48
49  51.28074 279.08373 49 49
50  50.95031 279.21971 50 50
51  50.57658 279.37713 51 51
52  48.12937 281.07891 52 52
53 154.16485  22.38683 53 53
54 153.48482  22.52214 54 54
55 145.03992  27.13075 55 55
56 108.21414  31.28673 56 56
57 270.96258 182.05611 57 57
58 269.78887 149.38115 58 58
59 256.37371 154.75579 59 59
60 153.74159  25.74645 60 60
61 151.10381  21.27617 61 61
62  97.67447  25.97402 62 62
63  60.73636 259.29946 63 63
64  11.86492 261.25563 64 64
65 287.19987 262.01448 65 65
66 312.08016 234.55050 66 66
67 315.96324 234.79214 67 67
68 323.03643 235.31352 68 68
69  32.71810 333.35849 69 69
70  59.63687 337.21593 70 70
71 276.34373 115.55930 71 71
72 276.31857 115.67837 72 72
73 275.19374 119.76535 73 73
74  97.94697 288.88226 74 74
75  97.60657 289.19108 75 75
76  97.53337 289.26658 76 76
77 173.02153  84.88042 77 77
78 171.27572  86.35787 78 78
79 169.44530  87.38803 79 79
80  87.67228 297.48545 80 80
81  87.54748 297.88451 81 81
82  86.59445 301.10765 82 82
83 332.49688 185.82157 83 83
84 331.19924 186.74459 84 84
85 222.30368  63.98160 85 85
86 221.44599  64.24739 86 86
87 219.66419  64.78909 87 87
88 229.48482 139.27264 88 88
89 228.76817 109.94767 89 89
90 214.77135 105.61337 90 90
91 208.44254 107.75702 91 91
92 224.10799  84.52048 92 92
93 222.94849  87.27893 93 93
94 222.54903  88.00606 94 94
95 222.13538  88.80756 95 95
96 110.52286 321.90037 96 96
97 109.56354 322.20164 97 97
98  75.80737 325.34917 98 98

1 answer:

Answer 0: (score: 1)

So here is your answer; the explanation is embedded in the answer. (I have removed the commas from your dataset.)

setwd("~/Desktop/")
df <- read.table("trial.txt",header=T,sep="\t")
names(df) <- c("a","B","C","D")
df_backup <- df
df$newcol <- NA

used <- c()
for (i in seq(1,length(df$a),1)){
  print("######## Separator ########")
  print(paste("searching right match that fits criteria for ",df$a[i],"in column 'a'",sep=""))
  valueA <- df[i,1]
  orderx <- order(abs(df$B-valueA))

  index=1
  while (is.na(df$newcol[i])) {
    j=orderx[index]
    if (df$B[j] %in% used){
      print(paste("passing ",df$B[j], "as it has already been used",sep=""))
      index=index+1
      next
    } else {
      indexB <- j
      valueB <- df$B[indexB]
      print(paste("trying ",valueB,sep=""))

      if (df$C[i] != df$D[indexB]) {
        df$newcol[i] <- df$B[indexB]
        print(paste("using ",valueB,sep=""))
        used <- c(used,df$B[indexB])
      } else {
        df$newcol[i] <- NA
        print(paste("cant use ",valueB,"as the column C (related to index in A) and D (related to index in B) values are matching",sep=""))
      }

      index=index+1
    }
  }
}
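
The same greedy search can also be written as a self-contained function. This is only a sketch, not the code that produced the output shown here: `match_nearest` and its argument names are hypothetical, it tracks used positions of B rather than used values (so repeated values in B would be treated as separate candidates), and it leaves `NA` when no candidate satisfies both rules.

```r
# Hypothetical helper: greedy nearest match without replacement that skips
# candidates whose D value equals the C value of the current row.
match_nearest <- function(a_vals, b_vals, c_vals, d_vals) {
  result <- rep(NA_real_, length(a_vals))
  taken <- rep(FALSE, length(b_vals))   # which positions of b_vals are used
  for (i in seq_along(a_vals)) {
    # candidate positions in b_vals, nearest to a_vals[i] first
    for (j in order(abs(b_vals - a_vals[i]))) {
      if (!taken[j] && c_vals[i] != d_vals[j]) {
        result[i] <- b_vals[j]
        taken[j] <- TRUE
        break
      }
    }
    # if no candidate qualified, result[i] stays NA
  }
  result
}

# The ten sample rows from the question
a <- c(12.97221, 69.64817, 318.87946, 326.17622, 137.54006,
       258.06002, 258.92824, 98.57514, 98.46303, 317.22764)
B <- c(64.78909, 321.90037, 259.29946, 94.7089, 325.34917,
       94.77531, 322.20164, 12.96828, 139.27264, 261.25563)
C <- c(1, 2, 3, 9, 5, 6, 7, 8, 9, 10)
D <- c(2, 28, 5, 8, 88, 63, 64, 34, 21, 97)

newcol <- match_nearest(a, B, C, D)
print(newcol)
```

Like the loop above, this processes the rows of column a in order, so the result depends on row order and is not guaranteed to minimise the total distance.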

The output looks like this:

[1] "######## Separator ########"
[1] "searching right match that fits criteria for 12.97221in column 'a'"
[1] "trying 12.96828"
[1] "using 12.96828"
[1] "######## Separator ########"
[1] "searching right match that fits criteria for 69.64817in column 'a'"
[1] "trying 64.78909"
[1] "cant use 64.78909as the column C (related to index in A) and D (related to index in B) values are matching"
[1] "trying 94.7089"
[1] "using 94.7089"
[1] "######## Separator ########"
[1] "searching right match that fits criteria for 318.87946in column 'a'"
[1] "trying 321.90037"
[1] "using 321.90037"
[1] "######## Separator ########"
[1] "searching right match that fits criteria for 326.17622in column 'a'"
[1] "trying 325.34917"
[1] "using 325.34917"
[1] "######## Separator ########"
[1] "searching right match that fits criteria for 137.54006in column 'a'"
[1] "trying 139.27264"
[1] "using 139.27264"
[1] "######## Separator ########"
[1] "searching right match that fits criteria for 258.06002in column 'a'"
[1] "trying 259.29946"
[1] "using 259.29946"
[1] "######## Separator ########"
[1] "searching right match that fits criteria for 258.92824in column 'a'"
[1] "passing 259.29946as it has already been used"
[1] "trying 261.25563"
[1] "using 261.25563"
[1] "######## Separator ########"
[1] "searching right match that fits criteria for 98.57514in column 'a'"
[1] "trying 94.77531"
[1] "using 94.77531"
[1] "######## Separator ########"
[1] "searching right match that fits criteria for 98.46303in column 'a'"
[1] "passing 94.77531as it has already been used"
[1] "passing 94.7089as it has already been used"
[1] "trying 64.78909"
[1] "using 64.78909"
[1] "######## Separator ########"
[1] "searching right match that fits criteria for 317.22764in column 'a'"
[1] "passing 321.90037as it has already been used"
[1] "trying 322.20164"
[1] "using 322.20164"

The final table is as follows:

1   12.97221    64.78909    1   2   12.96828
2   69.64817    321.90037   2   28  94.7089
3   318.87946   259.29946   3   5   321.90037
4   326.17622   94.7089 9   8   325.34917
5   137.54006   325.34917   5   88  139.27264
6   258.06002   94.77531    6   63  259.29946
7   258.92824   322.20164   7   64  261.25563
8   98.57514    12.96828    8   34  94.77531
9   98.46303    139.27264   9   21  64.78909
10  317.22764   261.25563   10  97  322.20164
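
As a quick sanity check, the two rules from the question can be tested against this table. The sketch below retypes the table by hand; the names `final`, `no_repeats`, `chosen_row` and `cd_differ` are illustrative, and the lookup with `match()` assumes the values in B are unique, which holds for these ten rows.

```r
# The final table, retyped by hand
final <- data.frame(
  a = c(12.97221, 69.64817, 318.87946, 326.17622, 137.54006,
        258.06002, 258.92824, 98.57514, 98.46303, 317.22764),
  B = c(64.78909, 321.90037, 259.29946, 94.7089, 325.34917,
        94.77531, 322.20164, 12.96828, 139.27264, 261.25563),
  C = c(1, 2, 3, 9, 5, 6, 7, 8, 9, 10),
  D = c(2, 28, 5, 8, 88, 63, 64, 34, 21, 97),
  newcol = c(12.96828, 94.7089, 321.90037, 325.34917, 139.27264,
             259.29946, 261.25563, 94.77531, 64.78909, 322.20164)
)

# Rule 1: no value of B is used more than once
no_repeats <- anyDuplicated(final$newcol) == 0

# Rule 2: C on each row differs from D on the row that holds the chosen B value
chosen_row <- match(final$newcol, final$B)
cd_differ <- all(final$C != final$D[chosen_row])

print(c(no_repeats = no_repeats, cd_differ = cd_differ))
```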