Question

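I would appreciate some help with the following problem:

I have several huge log files (each > 1,000,000 entries) that contain some lines (rows) I am particularly interested in. I want to create a subset containing only these rows, but I want to write the results into a matrix that holds the information for multiple log files / participants. So I wrote a small piece of code that creates the subset and then ran it in a loop, so that it works not just for one log file but for all of them.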

  Result <- subset(df, df$columnOfInterest== "interestingCondition1" | df$columnOfInterest== "interestingCondition2" | df$columnOfInterest== "interestingCondition3")$columnOfInterest
  View(Result)

1   interestingCondition1
2   interestingCondition1
3   interestingCondition2
4   interestingCondition1
5   interestingCondition1
6   interestingCondition3
7   interestingCondition2
8   interestingCondition1
9   interestingCondition1
10  interestingCondition1

Embedded in a loop:

WrongResult <- matrix(data=NA,nrow=TrialNumber, ncol=length(ListOfFiles))
vpncount <- 1
for (v in ListOfFiles){

  df<- read.delim(v, header = TRUE, sep='\t')
  WrongResult[,vpncount] <- subset(df, df$columnOfInterest== "interestingCondition1" | df$columnOfInterest== "interestingCondition2" | df$columnOfInterest== "interestingCondition3")$columnOfInterest

vpncount <- vpncount+1

}

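When I run the code on a single log file I get the result I want, but when it runs through the loop it creates a matrix of the right size that is filled with "random" numbers instead of the conditions I subset for.

Does anyone know why this happens and how to fix it? Any help is welcome!

Edit:

I tried to create a sample data frame. The first lines of code (including the variable Result) work just as I want: they filter my data frame on the rows of columnOfInterest and put them into a new matrix. However, if I try to run this in a loop over multiple data frames, I keep running into errors: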

df <- data.frame(
  X = sample(1:10),
  columnOfInterest= sample(c("interestingCondition1", "interestingCondition2", "interestingCondition3", "NotinterestingCondition1"), 10, replace = TRUE)
)

View(df)

Result <- subset(df, df$columnOfInterest== "interestingCondition1" | df$columnOfInterest== "interestingCondition2" | df$columnOfInterest== "interestingCondition3")$columnOfInterest
View(Result)

WrongResult <- matrix(data=NA,nrow=280, ncol=20)
vpncount <- 1
for (v in 1:20){

  df<- read.delim(v, header = TRUE, sep='\t')
  WrongResult[,vpncount] <- subset(df, df$columnOfInterest== "interestingCondition1" | df$columnOfInterest== "interestingCondition2" | df$columnOfInterest== "interestingCondition3")$columnOfInterest

  vpncount <- vpncount+1

}

View(WrongResult)

Answer 1

I don't remember how to do this with a data.frame, so I'll use data.table instead. You may have to install the data.table package if you don't have it: install.packages("data.table")

library(data.table)
dt <- data.table(df)

Then you can rewrite your code as follows:

subset..table <- function(dt){
    dt[columnOfInterest %in% c("interestingCondition1",
                               "interestingCondition2",
                               "interestingCondition3"),columnOfInterest]
}


myfun <- function(x){
    ## x: character string giving a file name
    ## Purpose: read the file and subset it

    dt <- fread(x, header=TRUE, sep="\t")
    subset..table(dt)

}

res..list <- lapply(ListOfFiles, myfun)

Edit

For example, using your sample data:

df <- data.frame(
  X = sample(1:10),
  columnOfInterest= sample(c("interestingCondition1", "interestingCondition2", "interestingCondition3", "NotinterestingCondition1"), 10, replace = TRUE)
)
dt <- data.table(df)
subset..table(dt)

yields

#[1] "interestingCondition2" "interestingCondition3" "interestingCondition1"
#[4] "interestingCondition2" "interestingCondition1" "interestingCondition2"
#[7] "interestingCondition3" "interestingCondition1" "interestingCondition3"

If you are happy with the function subset..table, you only need to use the function myfun to get what you want. The function fread automatically gives you a data.table.
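If a matrix is still wanted, the list returned by lapply() can be padded with NA to a common length. The following is only a base R sketch with made-up values: res_list stands in for res..list, and n_max and padded are illustrative names.

```r
# Stand-in for res..list: one character vector per log file,
# with different numbers of selected rows (values are made up).
res_list <- list(
  file1 = c("interestingCondition1", "interestingCondition2", "interestingCondition1"),
  file2 = c("interestingCondition3"),
  file3 = c("interestingCondition2", "interestingCondition1")
)

# Indexing a vector beyond its length yields NA, so every column
# is padded up to the length of the longest result.
n_max  <- max(lengths(res_list))
padded <- sapply(res_list, function(v) v[seq_len(n_max)])

dim(padded)  # one row per selected line, one column per file
```

The padding makes explicit that the files contributed different numbers of rows, which a pre-allocated matrix of fixed size would hide.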

Answer 2

In tidyverse land, when you are dealing with a single data frame, you want to filter() and then select() the original data and, for convenience, mutate() in an identifier such as the file name. A good way to filter when there are several possible values is to use %in%. So

library(tidyverse)
process_1_df <- function(df, id, condition)
    select(df, columnOfInterest) %>%                # only interesting column
        filter(columnOfInterest %in% condition) %>% # specific rows
        mutate(id = id)                             # add identifier

id is an identifier: if the data.frame comes from a file 'foo.txt', then use "foo.txt" as the id. The original question tried to represent the data from multiple files as a matrix, but that assumes the same number of interesting rows is selected from every file. The strategy here is to create a data frame that holds the file each interesting condition came from together with the value of the interesting condition. This data frame is useful when processing multiple files...

This works for the sample data set:

> condition <- paste0("interestingCondition", 1:3)
> process_1_df(df, "id", condition)
       columnOfInterest id
1 interestingCondition2 id
2 interestingCondition2 id
3 interestingCondition3 id
4 interestingCondition1 id
5 interestingCondition3 id
6 interestingCondition1 id

You can extend this to process a file

process_1_file <- function(file_name, condition)
    read_csv(file_name) %>%    # better: input only columnOfInterest
        process_1_df(file_name, condition)

As @DJJ suggests, a data.table implementation of process_1_file() could be very compact and efficient:

fread(file_name)[columnOfInterest %in% condition, columnOfInterest]

To process multiple files, use the purrr package.
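A minimal sketch of the multi-file step, not necessarily the answerer's exact code. It assumes purrr's map_dfr() to row-bind the per-file results, uses read.delim() because the question's logs are tab-delimited, and writes three small demo files to a temporary directory; files and long_result are illustrative names.

```r
library(dplyr)
library(purrr)

# Same idea as process_1_df() in the answer: keep the column,
# keep the interesting rows, tag each row with its source.
process_1_df <- function(df, id, condition) {
  df %>%
    select(columnOfInterest) %>%
    filter(columnOfInterest %in% condition) %>%
    mutate(id = id)
}

# Assumption: tab-delimited input, as in the question.
process_1_file <- function(file_name, condition) {
  read.delim(file_name, header = TRUE, sep = "\t", stringsAsFactors = FALSE) %>%
    process_1_df(file_name, condition)
}

# Demo data only: three small made-up log files.
set.seed(1)
files <- file.path(tempdir(), paste0("log", 1:3, ".txt"))
for (f in files) {
  demo <- data.frame(
    X = sample(1:10),
    columnOfInterest = sample(
      c(paste0("interestingCondition", 1:3), "NotinterestingCondition1"),
      10, replace = TRUE
    )
  )
  write.table(demo, f, row.names = FALSE, sep = "\t")
}

condition <- paste0("interestingCondition", 1:3)

# Apply process_1_file() to every file and stack the results.
long_result <- map_dfr(files, process_1_file, condition = condition)
```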

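The end result is a single data frame with one column of interesting conditions and another column indicating which log file each interesting condition came from. This "long" format data frame can now be processed or summarized as needed.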

Answer 3

Does anyone know why this happens?

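Your loop... doesn't work. The reasons are a bit complicated, but I made a working example in base R using simple loops (no *apply functions) that I hope you can follow and that hopefully represents your problem well enough.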
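One likely cause, assuming R read columnOfInterest as a factor (the default for read.delim() before R 4.0.0): assigning a factor into a matrix keeps only its integer level codes, which look like "random" numbers. A small sketch with made-up values (coi, m and m_chr are illustrative names):

```r
# Assumption: the column arrived as a factor, as read.delim() produced
# by default before R 4.0.0 (stringsAsFactors = TRUE).
coi <- factor(c("ic2", "ic1", "ic3", "ic1"))

# Assigning the factor into a matrix drops its levels and keeps
# only the underlying integer codes.
m <- matrix(NA, nrow = 4, ncol = 1)
m[, 1] <- coi
m[, 1]

# Converting to character first keeps the labels.
m_chr <- matrix(NA_character_, nrow = 4, ncol = 1)
m_chr[, 1] <- as.character(coi)
m_chr[, 1]
```

Separately, the assignment fails with a length error whenever a file yields a number of rows that does not match nrow of the pre-allocated matrix.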

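Learn to walk before you run. Learn basic loops before learning how to do things more concisely with apply(), lapply() and so on. Learn standard evaluation (i.e., programming in the R language itself) before delving into non-standard evaluation (purrr, data.table, tidyverse, etc.).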

First we'll create a few data frames and write them to file.

owd <- getwd()
dir.create("sotest")
setwd("sotest")

set.seed(1)

flist <- c("dtf1.txt", "dtf2.txt", "dtf3.txt")

for (i in seq_along(flist)) {
    dtf <- data.frame(
      X=sample(1:10),
      coi=sample(c("ic1", "ic2", "ic3", "nic1"), 10, replace=TRUE)
    )
    write.table(dtf, flist[i], row.names=FALSE, sep="\t")
}

After running this you should have a folder called "sotest" containing three tab-delimited text files.

Then we'll get a list of the available files and loop over them.
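The loop over the files can be sketched as follows. This is a minimal base R sketch, not necessarily the answerer's exact code: so that it runs on its own, it rebuilds the three demo files in a temporary directory instead of "sotest", and demo_dir, wanted and res are illustrative names.

```r
# Recreate the three demo files in a temporary directory
# (assumption: same layout as the files written above).
demo_dir <- file.path(tempdir(), "sotest_demo")
dir.create(demo_dir, showWarnings = FALSE)
set.seed(1)
for (f in c("dtf1.txt", "dtf2.txt", "dtf3.txt")) {
  dtf <- data.frame(
    X = sample(1:10),
    coi = sample(c("ic1", "ic2", "ic3", "nic1"), 10, replace = TRUE)
  )
  write.table(dtf, file.path(demo_dir, f), row.names = FALSE, sep = "\t")
}

# Get the list of available files.
flist <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

wanted <- c("ic1", "ic2", "ic3")

# One list element per file; elements may differ in length.
res <- vector("list", length(flist))
names(res) <- basename(flist)

for (i in seq_along(flist)) {
  # stringsAsFactors = FALSE keeps coi as character, not integer codes
  dtf <- read.delim(flist[i], header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)
  res[[i]] <- dtf$coi[dtf$coi %in% wanted]
}

lengths(res)  # number of selected rows per file
```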

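I store the output as a list rather than a matrix, because the objects produced in each iteration of the loop will not all have the same length.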

How to set up a loop over a sapply function to create subsets for multiple participants?
