Question

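I need to join a large number of data files in the following way.

Step 1.

The data needs to be joined within 'locations', which are annoyingly split into two parts (Part A and Part B). The files are stored in a single folder and carry random, meaningless location ID numbers, for example: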

PartA_location012843.csv
PartB_location012843.csv
PartA_location465475.csv
PartB_location465475.csv
...

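Is there a way to loop through all the files and join Parts A and B for each location, without having to specify the location ID numbers manually? The join itself would be a simple dplyr call along the lines of left_join(PartA_locationX, PartB_locationX, by='common_field'). I'm guessing the output would be a series of data frame objects in the R workspace, one per location: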

location012843
location465475
...

Step 2.

Next, all locations need to be appended with rbind into one data frame, keeping location_id, so that:

     location_id fieldA fieldB common_field
1 location012843      x      y            c
2 location012843      x      y            c
...

Answer 1

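You haven't provided many details, so this code assumes that each location has two CSVs and that no CSV files are missing. It also assumes that all location codes are six digits. It creates a single data.frame covering all locations, as you specified at the end of Step 2, and skips creating a separate data frame for each location as mentioned in Step 1. If you need those, you can filter them out afterwards.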

library(dplyr)
library(stringr)

# Create list of CSV files to pull in
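# full.names = TRUE keeps the folder in each path so read.csv() can find the file
a_files <- list.files("your_folder", pattern = "PartA", full.names = TRUE)
b_files <- list.files("your_folder", pattern = "PartB", full.names = TRUE)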

# Create df for final output
final_df <- data.frame()

for (i in seq_along(a_files)) {

  # Extract location name from each file name
  loc_a <- str_extract(basename(a_files[i]), "location[0-9]{6}")
  loc_b <- str_extract(basename(b_files[i]), "location[0-9]{6}")

  # Read in CSVs and store location as variable
  parta <- read.csv(a_files[i]) %>%
    mutate(location_id = loc_a)
  partb <- read.csv(b_files[i]) %>%
    mutate(location_id = loc_b)

  # Join on common field and location
  # If the locations are off between parta and partb, no rows match and
  # the PartB columns come back as NA
  final_df <- left_join(parta, partb, by= c('common_field', 'location_id')) %>%
    bind_rows(final_df)

}
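
If you also want the per-location data frames from Step 1, one possible variation is to pair the files by the location ID taken from the file name rather than by position in the file list. The following is only a sketch under stated assumptions: the temporary folder, the demo data, the helper join_location, and the names by_location and locations are hypothetical and used for illustration; it also assumes every PartA file has a matching PartB file and that the file names follow the PartA_/PartB_ pattern shown in the question.

```r
library(dplyr)

# Hypothetical demo data: two locations, each split into PartA and PartB CSVs
folder <- file.path(tempdir(), "join_demo")
dir.create(folder, showWarnings = FALSE)
for (id in c("012843", "465475")) {
  write.csv(data.frame(common_field = c("c1", "c2"), fieldA = c("x1", "x2")),
            file.path(folder, paste0("PartA_location", id, ".csv")),
            row.names = FALSE)
  write.csv(data.frame(common_field = c("c1", "c2"), fieldB = c("y1", "y2")),
            file.path(folder, paste0("PartB_location", id, ".csv")),
            row.names = FALSE)
}

# Derive the location IDs from the PartA file names instead of typing them in
a_files <- list.files(folder, pattern = "^PartA_location.*\\.csv$")
locations <- sub("^PartA_(location.*)\\.csv$", "\\1", a_files)

# Join PartA to PartB for one location, looking up the PartB file by name
# (read.csv() stops with an error if the PartB file does not exist)
join_location <- function(loc) {
  parta <- read.csv(file.path(folder, paste0("PartA_", loc, ".csv")))
  partb <- read.csv(file.path(folder, paste0("PartB_", loc, ".csv")))
  left_join(parta, partb, by = "common_field")
}

# Step 1: a named list holding one joined data frame per location
by_location <- lapply(setNames(locations, locations), join_location)

# Step 2: stack the list, turning the list names into a location_id column
final_df <- bind_rows(by_location, .id = "location_id")
```

With this layout, by_location[["location012843"]] gives the joined data for a single location, and final_df holds all locations with location_id as the first column. Keeping the per-location results in a named list avoids creating many separate objects in the workspace.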

Automate file joining/merging in R
