Question

I want to know how often the elements of one column appear in another column

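Let's say I have two columns, each holding a series of times (in milliseconds). One column represents the gold standard (method 2) and the other a novel method (method 1). I want to create a function that can read any csv with 2 columns (method 1 and method 2) and count how often the times in method 1 occur in method 2. Also, because we are dealing with milliseconds, I want to allow a small tolerance: if a value is 0.005 ms, I want a tolerance of +/- 0.002, so that anything in the range 0.003 to 0.007 counts as a match.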

My end goal is to output a csv with a third column added, showing how often each method1 time occurs in method2.

I tried the following code:

df<-read.csv("/Users/user/Desktop/R_workingdir/test__test.csv")

method1<-df$method1
method2<-df$method2

method1<-toString(method1)
method2<-toString(method2)

summarise(group_by(df,method1,method2),count =n())

...but it only counts frequencies within the same column:

Please find attached the image of the expected input, output and what I am currently getting

P.S. I am new to RStudio, so a short explanation of the code would be great to help me understand it.

Answer 1

df = read.delim("./temp.tsv") #you seem to have a tab separated file, at least according to your screen shot
tolerance = 0.002 #set the tolerance
counts = sapply(
  df$Method2,
  #input values for the comparisons (this will substitute the 'x' in the function below)
  FUN = function(x) {
    #we define a comparison function on the fly
    sum(df$Method1 >= x - tolerance &
          df$Method1 <= x + tolerance) #sum the times a value is true, i.e. falls into the specified range
  }
)

output_df = cbind(df, counts) #that just binds the columns together into one data frame

write.csv(output_df, "output.csv") #note that this writes a real csv file, not a tab separated one like your input seems to be

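I was not sure whether your input contains plain numbers or numbers with a dot behind them. For the 3rd value of Method2, this yields (and should yield) Count = 0, because there is no value between 0.012 and 0.016 in Method1.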
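
Since the question asks for a function that works on any csv, the same counting logic could be wrapped up roughly like this. This is only a sketch: the names count_within_tolerance and add_counts_to_csv are made up for illustration, the column names Method1 and Method2 are assumed to match the screenshot, the file is assumed to be comma separated (swap read.csv for read.delim if it is tab separated), and the sample numbers are invented.

```r
# Hypothetical helper (not part of the answer above): for each value in
# `reference`, count how many values in `candidates` lie within +/- tolerance.
count_within_tolerance <- function(reference, candidates, tolerance = 0.002) {
  # drop missing values so they do not turn the counts into NA
  candidates <- candidates[!is.na(candidates)]
  vapply(
    reference,
    # abs(difference) <= tolerance is TRUE for a match; sum() counts the TRUEs
    function(x) sum(abs(candidates - x) <= tolerance),
    integer(1)
  )
}

# Hypothetical wrapper: read a file, add the counts column, write a csv.
# Assumes the columns are called Method1 and Method2.
add_counts_to_csv <- function(in_path, out_path, tolerance = 0.002) {
  df <- read.csv(in_path)
  df$counts <- count_within_tolerance(df$Method2, df$Method1, tolerance)
  write.csv(df, out_path, row.names = FALSE)
  invisible(df)
}

# Invented sample data, just to show the behaviour
method1 <- c(0.005, 0.010, 0.0105, 0.020)
method2 <- c(0.0055, 0.011, 0.014, 0.030)
count_within_tolerance(method2, method1)
# 1 2 0 0
```

The first argument is the column you want one count per row for (here Method2, as in the sapply call above), so swap the two arguments if you want the counts per Method1 value instead. Values sitting exactly on the edge of the tolerance may be counted or not depending on floating point rounding, so it can be worth making the tolerance slightly larger if edge cases matter.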
