Question

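I have to download many files in .gz format (each file is about 40 MB, 40k lines). The files contain data from several countries, and I only want to select the data for France (code fr), keeping a limited number of columns.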

I am trying to automate this process, but I am having trouble with the unpacking step.

The data is on a web page, and I am interested in all the files in the folder.

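What I tried:

Create a temporary file.
Download the archive to the temporary file.
Unzip, read, and select the rows.
Save as a new file, then repeat for the next file.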

I would like to ask whether this approach is correct. (The code below runs inside a for loop.)

temp <- tempfile()   
temp1 <- "C:/Users/tdo/Desktop/data/test.txt"

download.file("https://dumps.wikimedia.org/other/pageviews/2018/2018-06/pageviews-20180601-000000.gz", temp) # example

unzip(files = temp,exdir =  temp1)
data <- read.table(..)
data[data$name == 'fr', ]
write.table(...)

This is how I build the list of links:

library(rvest)  # read_html, html_nodes, html_attr
library(dplyr)  # tibble, filter, mutate, %>%

dumpList <- read_html("https://dumps.wikimedia.org/other/pageviews/2018/2018-04/")

links <- tibble(filename = html_attr(html_nodes(dumpList, "a"), "href")) %>%
  filter(grepl(x = filename, "pageviews")) %>% # data by project
  mutate(link = paste0("https://dumps.wikimedia.org/other/pageviews/2018/2018-04/", filename))

Answer 1

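Why not read the compressed files directly? If all you want to do is subset/filter the data and store the result as a new local file, I don't see the need to decompress the archive locally.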
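To illustrate that point, here is a minimal sketch that needs only base R. It writes a tiny gzip-compressed sample and reads it back through a gzfile() connection, with no unzip step. The sample rows, the space-separated four-field layout without a header, and the column names (project, page, views, bytes) are assumptions made for the illustration, not something taken from the real dumps.

```r
# Made-up rows in an assumed pageviews-like layout (no header line)
sample_lines <- c(
  "en Main_Page 120 0",
  "fr Accueil 42 0",
  "fr Paris 7 0",
  "de Berlin 5 0"
)

# Write the sample as a gzip-compressed text file
gz_path <- tempfile(fileext = ".gz")
con <- gzfile(gz_path, open = "wt")
writeLines(sample_lines, con)
close(con)

# Read the compressed file directly through a gzfile() connection
pageviews <- read.table(
  gzfile(gz_path),
  header = FALSE,
  sep = " ",
  quote = "",
  comment.char = "",
  col.names = c("project", "page", "views", "bytes"),  # assumed names
  stringsAsFactors = FALSE
)

# Keep only the rows for project "fr" and a limited set of columns
fr_rows <- pageviews[pageviews$project == "fr", c("page", "views")]
print(fr_rows)
```

The same idea applies to a downloaded dump: pass its local path to gzfile() instead of the temporary sample.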

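I suggest using readr::read_table2 to read the gzipped file directly (in readr 2.0 and later, read_table2() is deprecated in favor of read_table()).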

Here is a minimal example:

# List of files to download
# url is the link, target the local filename
lst.files <- list(
    list(
        url = "https://dumps.wikimedia.org/other/pageviews/2018/2018-06/pageviews-20180601-000000.gz",
        target = "pageviews-20180601-000000.gz"))

# Download gzipped files (only if file does not exist)
lapply(lst.files, function(x)
    if (!file.exists(x$target)) download.file(x$url, x$target))

# Open files
library(readr)
lst <- lapply(lst.files, function(x) {
    df <- read_table2(x$target)
    # Filter/subset entries
    # Write to file with write_delim
})
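The filter and write steps left as comments in the example could be filled in along the following lines. This is only a sketch: it uses base R rather than readr so that it runs without extra packages, the helper names filter_pageviews and process_dump are hypothetical, and the column layout (project, page, views, bytes, space-separated, no header) is an assumption about the dump format.

```r
# Hypothetical helper: keep one project's rows from a gzipped dump and
# write them to a new gzipped file. Column names are assumptions.
filter_pageviews <- function(src, dest, project = "fr") {
  df <- read.table(gzfile(src), header = FALSE, sep = " ", quote = "",
                   comment.char = "", stringsAsFactors = FALSE,
                   col.names = c("project", "page", "views", "bytes"))
  keep <- df[df$project == project, c("project", "page", "views")]
  out <- gzfile(dest, open = "wt")
  on.exit(close(out))
  write.table(keep, out, sep = " ", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(nrow(keep))  # number of rows kept
}

# Hypothetical wrapper: download one dump (only if missing), then filter it.
# Defined here but not called, so the sketch runs without network access.
process_dump <- function(url, dir = tempdir()) {
  src <- file.path(dir, basename(url))
  if (!file.exists(src)) download.file(url, src, mode = "wb")
  dest <- file.path(dir, sub("\\.gz$", "-fr.gz", basename(url)))
  filter_pageviews(src, dest)
  dest
}

# Offline check with made-up rows
src <- tempfile(fileext = ".gz")
con <- gzfile(src, open = "wt")
writeLines(c("en Main_Page 120 0", "fr Accueil 42 0", "fr Paris 7 0"), con)
close(con)

dest <- tempfile(fileext = ".gz")
n_kept <- filter_pageviews(src, dest)
result <- readLines(gzfile(dest))
print(result)
```

With a vector of URLs such as the link column built in the question, something like lapply(links$link, process_dump) would process the whole folder one file at a time, so only one dump needs to be held in memory at once.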

How to download multiple gzip files?
