Question

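I have several tsv files in a folder, totalling more than 50 GB. To reduce memory use when loading these files into R, I want to extract only the first 3 columns of each file.

How can I extract the columns from all the files at once in the terminal? I am running Ubuntu 16.04.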

Answer 1

The following should work:

#!/bin/bash
for f in /path/to/*.tsv
do
    # For each file, write the first three tab-separated fields to a temporary file, then replace the original with it:
    cut -f1-3 < "$f" > "$f.tmp"
    rm "$f"
    mv "$f.tmp" "$f"
done

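(See this webpage for more information on iterating over files in bash.)

M. Becerra's answer contains a one-liner that achieves the same thing with the find command. So unless you want to do additional processing on each file (for example, collecting some statistics while iterating over the files), my own answer may be considered more complicated than necessary.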

Edit: The script above overwrites the actual files. The cut line writes its output to the original file name with .tmp appended; the following two lines delete the original file and rename the new file to the original file name.
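A slightly more defensive variant of the same loop can be sketched as follows. It is only an illustration run on throwaway sample files (the scratch directory and the file names one.tsv and two.tsv are made up for the demo); the difference is that the original is replaced only when cut exits successfully.

```shell
#!/bin/bash
# Hypothetical sample data in a scratch directory (not part of the original answer).
workdir=$(mktemp -d)
printf 'a\tb\tc\td\n1\t2\t3\t4\n' > "$workdir/one.tsv"
printf 'w\tx\ty\tz\n5\t6\t7\t8\n' > "$workdir/two.tsv"

for f in "$workdir"/*.tsv; do
    # Write the first three tab-separated fields to a temporary file and
    # move it over the original only if cut succeeded.
    cut -f1-3 -- "$f" > "$f.tmp" && mv -- "$f.tmp" "$f"
done

cat "$workdir/one.tsv"
```

If cut fails for a file, that file is left untouched and a leftover .tmp file marks where the problem occurred.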

Answer 2

This looks like a perfect use case for the cut utility.

You can use it as follows:

cut -d$'\t' -f 1-3 folder/*

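-d specifies the field delimiter (in this case a tab, which is also cut's default), -f specifies the fields to extract, and folder/* is a glob specifying all the files to parse.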
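Run on a glob, cut prints the selected columns of every file to standard output as a single stream. If one trimmed file per input is wanted instead, a loop along these lines is one option. This is a sketch under assumptions: the scratch directories src and out and the sample files are invented for the demo, and the tab is built with printf so the snippet also works in shells without bash-specific quoting.

```shell
# Hypothetical scratch directories standing in for folder/ and an output location.
src=$(mktemp -d)
out=$(mktemp -d)
printf 'id\tname\tscore\tnotes\n1\tann\t9\tok\n' > "$src/a.tsv"
printf 'id\tname\tscore\tnotes\n2\tbob\t7\tlate\n' > "$src/b.tsv"

# A literal tab character, built portably.
tab=$(printf '\t')

for f in "$src"/*.tsv; do
    # Keep fields 1-3 and write them to a file of the same name in $out,
    # leaving the original untouched.
    cut -d "$tab" -f 1-3 -- "$f" > "$out/$(basename -- "$f")"
done

cat "$out/a.tsv"
```

Writing to a separate directory keeps the originals intact, at the cost of some extra disk space.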

Answer 3

Do it directly in R. This saves time, disk space, and code:

fread("foo.tsv", sep = "\t", select=c("f1", "f2", "f3"))

Answer 4

You can do it like this:

find ./ -type f -name "*.tsv" -exec awk -F'\t' -v OFS='\t' '{ print $1,$2,$3 }' {} \;

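You can run it from the directory that contains the files, or give an absolute path instead.

If you want to save the result to a file, you can redirect the output of awk: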

find ./ -type f -name "*.tsv" -exec awk -F'\t' -v OFS='\t' '{ print $1,$2,$3 }' {} \; >> someOtherFile
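Redirecting as above collects the columns of every file into one output file. To get one output per input file instead, find can hand the file names to a small inline shell script. The following is a sketch on invented sample files; the ".cut" suffix is a hypothetical naming choice, picked so that the results do not match *.tsv and are not processed again on a second run.

```shell
# Hypothetical sample tree; x.tsv has a space inside its first field.
src=$(mktemp -d)
mkdir "$src/sub"
printf 'a b\tc\td\te\n' > "$src/x.tsv"
printf '1\t2\t3\t4\n' > "$src/sub/y.tsv"

# For each *.tsv file found, write its first three tab-separated columns
# to a sibling file with a ".cut" suffix.
find "$src" -type f -name '*.tsv' -exec sh -c '
    for f do
        awk -F"\t" -v OFS="\t" "{ print \$1, \$2, \$3 }" "$f" > "$f.cut"
    done
' sh {} +

cat "$src/x.tsv.cut"
```

Setting both the input and output separators to a tab keeps fields that contain spaces, such as "a b" here, in one piece.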

Answer 5

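There are several ways to do this directly in R, depending on which packages are installed. All of these methods keep memory usage to a minimum.

Using the base (default) packages (creates a data.frame):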

> df1 = read.table(pipe("cut -f 1-3 *.tsv"), sep="\t", header=FALSE, quote="")

Using the tidyverse/readr package (creates a tibble):

> df2 = read_tsv(pipe("cut -f 1-3 *.tsv"))

Using data.table (creates a data.table or, optionally, a data.frame):

> df3 = fread("cut -f 1-3 *.tsv")

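Each of these techniques invokes a Unix shell command and reads the command's output, which minimizes memory usage. Arbitrary shell pipelines can be used, so other commands can be combined. For example, to get a random set of 10,000 rows: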

> df4 = fread("cut -f 1,3 *.tsv | shuf -n 10000")

Each of these methods has a full set of options for customizing the input process.
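One caveat with shell commands that read several files at once: if every tsv file starts with a header row, cut -f 1-3 *.tsv passes each file's header through into the combined stream. A possible workaround, sketched here on invented sample files and assuming all files share the same header, is to let awk drop the first line of every file except the first:

```shell
# Hypothetical sample files that share one header row.
src=$(mktemp -d)
printf 'id\tname\tscore\textra\n1\tann\t9\tx\n' > "$src/a.tsv"
printf 'id\tname\tscore\textra\n2\tbob\t7\ty\n' > "$src/b.tsv"

# FNR restarts at 1 for each input file while NR keeps counting, so the
# first rule skips the header of every file except the first one.
combined=$(awk -F'\t' -v OFS='\t' \
    'FNR == 1 && NR != 1 { next } { print $1, $2, $3 }' "$src"/*.tsv)

printf '%s\n' "$combined"
```

The same awk command could be used in place of the cut command inside pipe() or fread() in the R examples above.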
