Question

I have a file with many lines (about 40 million) that I am trying to split up for some downstream processing. The file looks like this:

a
b
c
d
e

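I want to break up the file by adding the string '>n' on a new line after every 1M lines. For these purposes, a 2-line example will do. I want my final output to be: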

a
b
>1
c
d 
>2
e

I'm pretty sure sed can do this, but I can't manage to work out the incrementing part.

Answer 1

@Stephen: try:

awk -v num=2 'FNR % num == 0 {print $0 ORS ">"++q ;next} 1'  Input_file

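In the same way, you can supply your own line count above and the markers will be printed in the output accordingly. I used FNR to get the line number in case several Input_files are passed: FNR is reset each time the next file starts, so counting begins again from the top for that Input_file (NR does not do this).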
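To make the difference between NR and FNR concrete, here is a minimal sketch that prints both counters for two small files. The file names one.txt and two.txt and the temporary directory are made up for this illustration.

```shell
# Hypothetical scratch files used only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' a b c > "$tmpdir/one.txt"
printf '%s\n' d e   > "$tmpdir/two.txt"

# Print NR (running total across all files) and FNR (count within the
# current file) in front of every line.
fnr_demo=$(awk '{ print NR, FNR, $0 }' "$tmpdir/one.txt" "$tmpdir/two.txt")
printf '%s\n' "$fnr_demo"
# 1 1 a
# 2 2 b
# 3 3 c
# 4 1 d
# 5 2 e

rm -r "$tmpdir"
```

NR keeps counting across the file boundary, while FNR starts again at 1 for the second file.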

EDIT: Adding a full explanation of the code now.

awk -v num=2           #### Setting a variable named num to value 2 here.
'FNR % num == 0        #### If FNR % num == 0 is TRUE, the following action is performed. FNR is awk's built-in variable holding the line number within the current file; the only difference from NR is that FNR is RESET whenever a new Input_file is read. Since awk can read multiple Input_files,
                            FNR can be more helpful than NR in that case.
{print $0 ORS ">"++q ; #### Print the current line (only when the condition above is TRUE, of course), then ORS (the output record separator, a newline by default), then ">" and the variable q, whose value is incremented each time this block runs.
next}                  #### next skips all further statements for this line once the condition has been met, so the line is not printed a second time by the rule below.
1                      #### awk works on a condition-then-action pattern; 1 makes the condition TRUE and no action is given, so the default action runs and prints the entire line.
'  Input_file          #### mentioning the Input_file here.
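As a quick check of the explanation above, the one-liner can be run on the five-line sample from the question. This is only a sketch: the input is fed through a pipe instead of Input_file, and the variable names marked and marked_even are made up here.

```shell
# Five lines, marker after every 2nd line (num=2 as in the answer).
marked=$(printf '%s\n' a b c d e |
  awk -v num=2 'FNR % num == 0 {print $0 ORS ">"++q ;next} 1')
printf '%s\n' "$marked"

# Four lines: the line count is an exact multiple of num, so a marker
# also follows the very last line.
marked_even=$(printf '%s\n' a b c d |
  awk -v num=2 'FNR % num == 0 {print $0 ORS ">"++q ;next} 1')
printf '%s\n' "$marked_even"
```

The first run reproduces the output requested in the question. The second shows that a trailing marker appears when the number of lines is an exact multiple of num, which may or may not be wanted downstream.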

Answer 2

awk is the better option.

This one inserts the line you want (the marker is printed before each block, so the output starts with >1):

awk 'BEGIN{i=0}; {if ((NR-1) % 1000000 == 0) {i++; print ">" i}}; {print}' your_file > another_file

This one splits the file "your_file" directly into files named "your_file1", "your_file2", and so on:

awk 'BEGIN{i=0}; {if ((NR-1) % 1000000 == 0) {i++}} {print > "your_file" i}' your_file
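A variant of the splitting command, sketched under the assumption that it is worth closing each chunk before the next one is opened (some awk implementations limit the number of simultaneously open files). The chunk variable, the value 2 standing in for 1000000, and the temporary directory are all made up for this illustration.

```shell
# Hypothetical scratch directory and five-line sample input.
workdir=$(mktemp -d)
printf '%s\n' a b c d e > "$workdir/your_file"

awk -v chunk=2 '
  (NR - 1) % chunk == 0 {          # first line of a new chunk
      if (out != "") close(out)    # finish the previous chunk file
      i++
      out = FILENAME i             # your_file1, your_file2, ...
  }
  { print > out }
' "$workdir/your_file"

first_chunk=$(cat "$workdir/your_file1")
last_chunk=$(cat "$workdir/your_file3")
printf 'first:\n%s\nlast:\n%s\n' "$first_chunk" "$last_chunk"
rm -r "$workdir"
```

With five input lines and chunk=2, three files are written; the last one holds only the leftover line.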

Answer 3

This might work for you (GNU sed):

 seq -f'>%g' 1000000 | sed '0~1000000R /dev/stdin' file

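This uses seq to generate as many file dividers as you think necessary, and then inserts them into the input file using the modulo address first~step (here 0~1000000, i.e. every millionth line).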
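A small-scale sketch of the same idea, assuming GNU sed (the first~step address and the R command are GNU extensions). The temporary sample file and the variable sed_out are made up here, and 0~2 plays the role of 0~1000000.

```shell
# Hypothetical five-line sample standing in for the real input file.
sample=$(mktemp)
printf '%s\n' a b c d e > "$sample"

# seq supplies the marker lines on stdin; after every 2nd input line,
# R reads the next marker from /dev/stdin and appends it to the output.
sed_out=$(seq -f'>%g' 3 | sed '0~2R /dev/stdin' "$sample")
printf '%s\n' "$sed_out"
rm -f "$sample"
```

Only as many markers as there are matching lines get used; the third one generated here is simply never read. If a very large number of markers were ever needed, a format such as '>%.0f' might be safer than '>%g', since %g switches to exponent notation for large values.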

Another way, entirely in sed but not recommended, is:

sed -r '0~1000000!b;p;x;s/^9*$/0&/;:a;s/9(x*)$/x\1/;ta;s/$/#0123456789/;s/(.)(x*)#.*\1(.).*/\3\2/;s/x/0/g;h;s/^/>/' file

This uses the same modulo address, keeps a counter in the hold space, and increments it before inserting it into the output.

However, since the real purpose of this exercise is to split a large file into smaller ones, why not use split?

split -a1 --numeric-suffixes=1 -l 1000000  file '>'

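This splits the file into files named >1 .. >n, each holding one million lines. Note that -a1 limits the suffix to a single digit, so at most nine files (>1 to >9) can be created this way; a larger input needs a longer suffix such as -a2, which gives names like >01.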
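A small-scale sketch of the split approach with a two-character suffix, assuming GNU split (the --numeric-suffixes=FROM form is a GNU option). The part_ prefix, the temporary directory, and the chunk size of 2 are made up for this illustration.

```shell
# Hypothetical scratch directory and five-line sample input.
splitdir=$(mktemp -d)
printf '%s\n' a b c d e > "$splitdir/file"

# Two lines per chunk, numeric suffixes of width 2 starting at 01.
split -a 2 --numeric-suffixes=1 -l 2 "$splitdir/file" "$splitdir/part_"

names=$(cd "$splitdir" && ls part_*)
printf '%s\n' "$names"
rm -r "$splitdir"
```

This should list part_01, part_02 and part_03. A width of 2 leaves room for up to 99 chunks, enough for a 40-million-line file cut into one-million-line pieces.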

Answer 4

I don't think sed can do all of this by itself because (AFAIK) it can't handle variables, but awk can. You can use the following script:

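BEGIN {
    id = 1                  # markers are numbered from 1
}

{
    print $0                # always print the input line itself
    if (NR % nth == 0) {    # after every nth line...
        print ">" id        # ...emit the marker on its own line
        id++
    }
}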

Then run it like this:

awk -v nth=<your N value> -f /script/name /input/file > /new/file

Answer 5

I would do this with a simple shell script (upline.sh):

EVERYLINE=2

LINECOUNT=0
COUNTER=1

#read file line by line (IFS= and -r keep whitespace and backslashes intact)
while IFS= read -r LINE; do

    #print current line
    printf '%s\n' "$LINE"

    #increment linecounter
    ((LINECOUNT++))

    #check if we have to insert an additional line
    if [ "$LINECOUNT" -eq "$EVERYLINE" ]; then
        #print additional line
        echo ">$COUNTER"

        #increment counter for additional line
        ((COUNTER++))

        #reset linecounter
        LINECOUNT=0
    fi
done

Start it with:

bash upline.sh < yourdatafile.txt

The variable "EVERYLINE" controls after how many lines the additional line is inserted. You can also use

EVERYLINE=$1

to pass the "split number" as an argument.
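A minimal sketch of the parameterised version, written as a bash function so it can be tried without creating a file. The function name upline, the default of 2 when no argument is given, and the variable result are assumptions made for this illustration.

```shell
# Hypothetical function form of upline.sh; the first argument is the
# number of lines between markers (defaults to 2 if omitted).
upline() {
    local every=${1:-2} linecount=0 counter=1 line
    # The extra test keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
        linecount=$((linecount + 1))
        if [ "$linecount" -eq "$every" ]; then
            printf '>%s\n' "$counter"
            counter=$((counter + 1))
            linecount=0
        fi
    done
}

result=$(printf '%s\n' a b c d e | upline 2)
printf '%s\n' "$result"
```

A shell read loop handles one line at a time and is likely to be far slower than awk or sed on a 40-million-line file, so this form is probably best kept for small inputs.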

Linux: add a new line with a character and an incrementing number to a file
