Selecting CSV files and reading them in pairs

Date: 2019-09-11 06:03:09

Tags: r

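I am comparing CSV files two at a time. Each file name ends in a number, for example cars_file2.csv, Lorries_file3.csv, computers_file4.csv, phones_file5.csv. I have about 70 files per folder, and the comparison goes like this: compare cars_file2.csv with Lorries_file3.csv, then compare Lorries_file3.csv with computers_file4.csv, and so on, following the pattern 2,3,3,4,4,5. Is there a smart way to handle this, rather than manually going back and changing the file names the way I read them below? Could I use the trailing number of each CSV to read them in the right order? Please note: all files share the same suffix _file.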

library(daff)

setwd("path")

# Load csvs to compare into data frames
x_original <- read.csv("cars_file2.csv", strip.white=TRUE, stringsAsFactors = FALSE)

x_changed <- read.csv("Lorries_file3.csv", strip.white=TRUE, stringsAsFactors = FALSE)

render(diff_data(x_original,x_changed ,ignore_whitespace=TRUE,count_like_a_spreadsheet = FALSE))

My goal is to compare each consecutive pair of CSV files and record which fields were added, removed, and modified.

3 answers:

Answer 0: (score: 1)

You may want to load all files at once and then run the comparisons over the complete list of files. This might help:

# your path
path <- "insert your path"

# get folders in this path
dir_data <- as.list(list.dirs(path))

# get all filenames
dir_data <- lapply(dir_data,function(x){

  # list of folders
  files <- list.files(x)
  files <- paste(x,files,sep="/")

  # only .csv files
  files <- files[substring(files,nchar(files)-3,nchar(files)) %in% ".csv"]

  # remove possible errors
  files <- files[!is.na(files)]

  # save if there are files
  if(length(files) >= 1){
    return(files)  
  }
})

# delete NULL-values
# (base R equivalent of compact(), so no extra package is needed)
dir_data <- Filter(Negate(is.null), dir_data)

# make it a named vector
dir_data <- unique(unlist(dir_data))
names(dir_data) <- sub(pattern = "(.*)\\..*$", replacement = "\\1", basename(dir_data))
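# use the trailing number of each base name as the name (also works for multi-digit numbers)
names(dir_data) <- suppressWarnings(as.numeric(sub("^.*?(\\d+)$", "\\1", names(dir_data), perl = TRUE)))

# remove files whose name does not end in a number
dir_data <- dir_data[!is.na(names(dir_data))]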

# make it a list again
dir_data <- as.list(dir_data)

# load data
data_upload <- lapply(dir_data,function(x){
  if(file.exists(x)){
    data <- read.csv(x,header=T,sep=";")
  }else{
    data <- "file not found"
  }
  return(data)
})

# setup for comparison
diffs <- lapply(as.character(sort(as.numeric(names(data_upload)))),function(x){

  # check if the second dataset exists
  if(as.character(as.numeric(x)+1) %in% names(data_upload)){

    # first dataset
    print(data_upload[[x]])

    # second dataset
    print(data_upload[[as.character(as.numeric(x)+1)]])

    # do your operations here
    comparison <- render(diff_data(data_upload[[x]],
                     data_upload[[as.character(as.numeric(x)+1)]],
                     ignore_whitespace=T,count_like_a_spreadsheet = F))
    numbers <- c(x, as.numeric(x)+1)

    # save both the comparison data and the numbers of the datasets
    return(list(comparison,numbers))

  }
})

# you can find the differences here
diffs

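This script loads all CSV files in a folder and its subfolders and puts them into a list indexed by their number. It works as long as no two files share the same number. If you do have duplicate numbers, you will have to adjust the part that names the vector so that the files can later be indexed by their full names.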
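One way to check for the duplicate-number case is to group the full paths by their trailing number before loading anything. This is only a minimal sketch in base R: the file names are hypothetical, and it assumes every base name ends in digits once the .csv extension is removed.

```r
# Hypothetical file names; two of them share the trailing number 3.
files <- c("a/cars_file2.csv", "a/Lorries_file3.csv",
           "b/vans_file3.csv", "a/computers_file4.csv")

# Trailing number of each base name (extension removed first).
stems <- sub("\\.csv$", "", basename(files))
nums  <- as.integer(sub("^.*?(\\d+)$", "\\1", stems, perl = TRUE))

# Group the full paths by number; a group with more than one path is a duplicate.
by_number <- split(files, nums)
duplicated_numbers <- names(by_number)[lengths(by_number) > 1]
```

Here by_number keeps the full path of every file, so a duplicated number can still be resolved by hand instead of one file silently overwriting the other.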

Answer 1: (score: 1)

A simple for loop with paste0 will read the pairs:

for (i in 1:70) { # assuming the last pair is cars_file70.csv and Lorries_file71.csv
  x_original <- read.csv(paste0("cars_file",i,".csv"), strip.white=TRUE, stringsAsFactors = FALSE)
  x_changed <- read.csv(paste0("Lorries_file",i+1,".csv"), strip.white=TRUE, stringsAsFactors = FALSE)
  render(diff_data(x_original,x_changed ,ignore_whitespace=TRUE,count_like_a_spreadsheet = FALSE))
}
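The loop above assumes fixed prefixes, while the file names in the question change prefix from one file to the next. A possible variation is to build the pairs from whatever names are actually present, ordered by their trailing number. The helper pair_by_suffix below is hypothetical (it is not part of daff or base R) and assumes every name ends in digits before the .csv extension.

```r
# Hypothetical helper: order files by trailing number and pair each file with the next one.
pair_by_suffix <- function(files) {
  stems <- sub("\\.csv$", "", basename(files))
  nums  <- as.integer(sub("^.*?(\\d+)$", "\\1", stems, perl = TRUE))
  files <- files[order(nums)]
  n <- length(files)
  if (n < 2) {
    # Fewer than two files: there is nothing to pair.
    return(data.frame(original = character(0), changed = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(original = files[-n], changed = files[-1], stringsAsFactors = FALSE)
}

pairs <- pair_by_suffix(c("phones_file5.csv", "cars_file2.csv",
                          "computers_file4.csv", "Lorries_file3.csv"))
```

Each row of pairs then holds one (original, changed) combination that can be passed to read.csv and diff_data in the same way as in the loop.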

Answer 2: (score: 1)

For simplicity, I used two .csv files.

csv_1

1,2,4

csv_2

1,8,10

Load all .csv files from the folder:

files <- dir("Your folder path", pattern = '\\.csv', full.names = TRUE)
tables <- lapply(files, read.csv)

#create empty list to store comparison output
diff <- c()

Loop over all loaded files and compare each one with the next:

for (pos in seq_along(tables)) {
  if (pos != length(tables)) { # skip the last file: it has no successor
    # save comparison output
    diff[[pos]] <- diff_data(as.data.frame(tables[pos]), as.data.frame(tables[pos + 1]), ignore_whitespace=TRUE,count_like_a_spreadsheet = FALSE)
  }
}

Printing diff shows the comparison output:
[[1]]
Daff Comparison: ‘as.data.frame(tables[pos])’ vs. ‘as.data.frame(tables[pos + 1])’ 
      +++ +++ --- ---
@@ X1 X8  X10 X2  X4
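If only the added and removed fields need to be recorded, the column names can also be compared directly with setdiff. This is a rough base-R sketch, not something daff provides: column_changes is a hypothetical helper, the column names are chosen to mirror the output above, and modified cell values are not detected.

```r
# Hypothetical helper: column names added or removed between two data frames.
column_changes <- function(original, changed) {
  list(
    added   = setdiff(names(changed), names(original)),
    removed = setdiff(names(original), names(changed))
  )
}

# Illustrative data frames whose column names mirror the output above.
csv_1 <- data.frame(X1 = 1, X2 = 2, X4 = 4)
csv_2 <- data.frame(X1 = 1, X8 = 8, X10 = 10)

changes <- column_changes(csv_1, csv_2)
```

For these two frames, changes$added holds X8 and X10 and changes$removed holds X2 and X4, matching the +++ and --- markers in the daff output.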