Question

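First time posting, so please be kind. I am reading the file "bar" one line at a time and using sed to replace every other line in "foo" (starting with the first line) with the line read from "bar". The code below works, but it is slow when "foo" is 48,890 lines and "bar" is ~24,445 lines (exactly half of foo's length).

Does anyone have a suggestion for how to speed this up?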

x=1
while read i;do
  sed -i "$x s/^.*$/$i/" foo
  x=$[$x +2]
done < bar

Answer 1

With paste and awk:

Interleaving:

paste -d '\n' bar <(awk 'NR%2==0' foo)

Or, without process substitution:

awk 'NR%2==0' foo | paste -d '\n' bar -

To replace foo:

paste -d '\n' bar <(awk 'NR%2==0' foo) > tmp && mv tmp foo

or

awk 'NR%2==0' foo | paste -d '\n' bar - > tmp && mv tmp foo
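To see what the pipeline produces, here is a minimal sketch on two tiny sample files. The file contents and the temporary directory are made up for illustration.

```shell
#!/bin/bash
# Work in a throwaway directory so real files named foo/bar are not touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' foo1 foo2 foo3 foo4 > foo
printf '%s\n' bar1 bar2 > bar

# awk keeps the even-numbered lines of foo; paste then alternates one line
# from bar with one line from awk's output.
result=$(awk 'NR%2==0' foo | paste -d '\n' bar -)
printf '%s\n' "$result"
```

This prints bar1, foo2, bar2, foo4, one per line: the odd lines of foo have been replaced by the lines of bar.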

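I ran some benchmarks (execution time only, ignoring memory requirements).

Creating the input files (about ten times the size of those in the question):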

$ dd if=/dev/urandom count=500000 | tr -cd [:alpha:] | fold -w 100 |
> sed 's/^/foo /' > foo
$ dd if=/dev/urandom count=250000 | tr -cd [:alpha:] | fold -w 100 |
> sed 's/^/bar /' > bar
$ wc -l foo bar
  539994 foo
  270126 bar
  810120 total

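I used time to measure execution time. The output of every solution was redirected to a new file. Results are in seconds, averaged over five runs each: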

codeforester            9.878
codeforester, mapfile   8.072
Fred                   17.332
Charles Duffy          "Argument list too long"
Claude                 27.448
Barmar                  0.298
Benjamin W.             0.176

Charles Duffy's solution also fails with input 10% of the size used here.

Answer 2

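Here is an awk solution. It reads all of bar into an array. Then, as it reads foo, it prints either the current line or the next element of the array, depending on whether the line number is even or odd.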

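awk 'BEGIN {index1 = 1}
     # first file (bar): store every line
     FNR == NR {file1[NR] = $0; next}
     # odd line of foo (FNR counts per file): print the next bar line
     FNR % 2 == 1 { print file1[index1++]; next }
     { print }' bar foo > newfoo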
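A quick check on sample data (the file contents are invented). The parity test in this sketch uses FNR, the per-file line number, so that the result does not depend on whether bar has an odd or even number of lines; NR keeps counting across both files.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# bar deliberately has an odd number of lines.
printf '%s\n' bar1 bar2 bar3 > bar
printf '%s\n' foo1 foo2 foo3 foo4 foo5 foo6 > foo

result=$(awk 'FNR == NR { file1[NR] = $0; next }        # first file: store bar
              FNR % 2 == 1 { print file1[++idx]; next } # odd foo line: next bar line
              { print }' bar foo)
printf '%s\n' "$result"
```

The output alternates bar1, foo2, bar2, foo4, bar3, foo6.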

Answer 3

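I think the slowness of the current solution comes from the large number of forks needed to run sed, and from the heavy I/O caused by rewriting the file over and over. Here is a pure Bash solution with zero forks: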

#!/bin/bash

# read "bar" file into an array - this should take less memory than "foo"
while IFS= read -r line; do
  bar_array+=("$line")
done < bar


# traverse "foo" file and replace odd lines with the lines from "bar"
# we don't need to read the whole file into memory
i=0
max_bar="${#bar_array[@]}"
while IFS= read -r line; do
  #
  # foo line i (0-based) maps to bar line i / 2; we look at bar_array
  # only when that index is within the limits of that file
  #
  p="$line"
  if ((i % 2 == 0 && i / 2 < max_bar)); then
    p=${bar_array[i / 2]}
  fi
  printf "%s\n" "$p"
  ((i++))
done < foo

Sample run:

Contents of bar:

Contents of foo:

Output:

With Bash 4 and later, the read loop

while IFS= read -r line; do
  bar_array+=("$line")
done < bar

can also be written as:

mapfile -t bar_array < bar
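As a sanity check, the following sketch fills one array with the loop and another with mapfile, then compares them. The sample file and the array names loop_array and map_array are invented for this example.

```shell
#!/bin/bash
workdir=$(mktemp -d)
printf '%s\n' 'first line' '  indented line' 'last line' > "$workdir/bar"

# Loop version; IFS= keeps leading and trailing whitespace intact.
loop_array=()
while IFS= read -r line; do
  loop_array+=("$line")
done < "$workdir/bar"

# Bash 4+ builtin; -t strips the trailing newline from each element.
mapfile -t map_array < "$workdir/bar"

# Compare the two arrays element by element.
same=yes
[[ ${#loop_array[@]} -eq ${#map_array[@]} ]] || same=no
for i in "${!map_array[@]}"; do
  [[ ${loop_array[i]} == "${map_array[i]}" ]] || same=no
done
echo "elements: ${#map_array[@]}, identical: $same"
```

This prints "elements: 3, identical: yes".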

Answer 4

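The other answers suggest approaches based on storing a whole file in an array. Depending on file size, that will hit practical limits at some point.

Another approach is simply to read from both files, one line at a time, opening them on separate file descriptors.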

#!/bin/bash

exec 3< foo
exec 4< bar

eof_bar=0
eof_foo=0

while [[ $eof_bar = 0 ]]
do
   # Foo line we keep
   IFS= read -r -u 3 foo_line || eof_foo=$?
   [[ "$eof_foo" = 0 ]] || [[ -n "$foo_line" ]] || break
   printf "%s\n" "$foo_line"
   # Bar line we will replace with
   IFS= read -r -u 4 bar_line || eof_bar=$?
   [[ "$eof_bar" = 0 ]] || [[ -n "$bar_line" ]] || break
   # Foo line we skip (line from bar was present)
   IFS= read -r -u 3 foo_line || eof_foo=$?
   [[ "$eof_foo" = 0 ]] || [[ -n "$foo_line" ]] || break
   # Actual replacement (both files had required lines)
   printf "%s\n" "$bar_line"
done

# Cat the rest of the lines from foo (if any), if bar did not
# have enough lines compared to foo
cat <&3

# Close file descriptors
exec 3>&-
exec 4>&-

The code reads two lines from foo for each line of bar, and simply skips the second line read from foo on each iteration.

Doing it this way uses very little memory, so it can handle files of arbitrary size.

Answer 5

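awk seems to be the best choice here: it does not spawn a subshell for every line read, and it handles all the files in a single process with very little modification or complexity.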

# One-liner for batch or command line
awk 'FNR==NR{b[NR]=$0;next}{if(FNR%2==1)$0=b[(FNR+1)/2];print}' bar foo

The same code, commented for easier understanding:

awk '# when reading first file (bar)
     FNR == NR {
        # load line content into an array
        bar[NR] = $0
        # cycle to next line (do not go further in the code for this input line)
        next
        }

     # every line from other files (only foo here)
     {
        # every odd line, replace content with corresponding array content
        # FNR is the line number within foo; it is odd here, so
        # (FNR + 1) / 2 is half the line number, rounded up
        if (FNR % 2 == 1) $0 = bar[(FNR + 1) / 2]

        # print the line (modified or not)
        print
     }
    ' bar foo

Answer 6

Run all the sed commands in a single invocation, and rewrite foo only once instead of once per line of bar.

x=1
sed_exprs=( )
while IFS= read -r i; do
  sed_exprs+=( -e "$x s/^.*$/$i/" )
  x=$(( x + 2 ))
done < bar

sed "${sed_exprs[@]}" -i foo
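Two caveats when building the expressions from bar: a line containing /, & or \ changes the meaning of the s command, and a very long list of -e arguments can exceed the system's argument-size limit. One possible workaround, sketched here as an assumption and not as part of the answer above, is to escape each line and write the commands to a script file that sed reads with -f. The file name sed_script and the sample contents are invented.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' foo1 foo2 foo3 foo4 > foo
printf '%s\n' 'a/b & c' 'back\slash' > bar

# Step 1: escape \, / and & in every line of bar so sed treats them literally.
# Step 2: turn line N of bar into the command "<2N-1>s/.*/<escaped line>/".
sed 's|[\\/&]|\\&|g' bar |
  awk '{ printf "%ds/.*/%s/\n", 2 * NR - 1, $0 }' > sed_script

# One sed process applies the whole script; nothing is passed as -e arguments.
result=$(sed -f sed_script foo)
printf '%s\n' "$result"
```

Here the output goes to standard output; with GNU sed, adding -i would edit foo in place as in the answer.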

Answer 7

Here is a streaming solution that works in a small, constant amount of memory, in case you have huge files on a machine with very little RAM.

#!/bin/bash

# duplicate lines in bar to standard output
paste -d '\n' bar bar |

# pair line-by-line foo with lines from previous command
paste -d '|' foo - |

# now the stream is like:
#  foo line 1|bar line 1
#  foo line 2|bar line 1
#  foo line 3|bar line 2
#  foo line 4|bar line 2
#  foo line 5|bar line 3
#  ...
{
  # set field separator to correspond with previous paste delimiter
  IFS='|'
  # read pairs of lines, discarding the second
  while read -r foo bar && read -r junk
  do
    # print the odd lines from foo
    printf "%s\n" "$foo"
    # interleaved with the lines from bar
    printf "%s\n" "$bar"
  done
}

You must choose a delimiter (here |) that does not occur in foo. Tested with:

paste (GNU coreutils) 8.26
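One way to guard against a bad delimiter, sketched here as an assumption, is to test the input files for it before running the pipeline. The helper name pick_delim and its list of candidate characters are hypothetical.

```shell
#!/bin/bash
# pick_delim FILE...: print the first candidate character that occurs in
# none of the given files; return 1 if every candidate is in use.
pick_delim() {
  local d
  for d in '|' '@' '#' '~'; do
    # -F: match a fixed string, -q: only report through the exit status
    if ! grep -qF -e "$d" -- "$@"; then
      printf '%s\n' "$d"
      return 0
    fi
  done
  return 1
}

workdir=$(mktemp -d)
printf '%s\n' 'a|b' 'plain' > "$workdir/foo"
printf '%s\n' 'x' 'y' > "$workdir/bar"

delim=$(pick_delim "$workdir/foo" "$workdir/bar")
echo "using delimiter: $delim"
```

With these sample files the first candidate is rejected because foo contains it, so the sketch prints "using delimiter: @". The chosen character would then be used for both paste -d and IFS.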

Answer 8

This is a heavily modified version of my first answer, which I am posting separately in light of the benchmarks that were posted.

#!/bin/bash
exec 3< foo
exec 4< bar
eof=0
IFS=
n=$'\n'
while :
do
   readarray -n 2 -u 3 fl && read -r -u 4 bl || break
   echo "${fl[1]}$bl"
done
# Add remaining data
[[ -n ${fl[1]} ]] || echo "$fl"
[[ -n $bl ]] || echo "$bl"
# Cat the rest of the lines from foo (if any), if bar did not
# have enough lines compared to foo
cat <&3
# Close file descriptors
exec 3>&-
exec 4>&-

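It turns out that my "hand-optimized" solution is simpler and more readable than my first version, which shows that thinking about speed sometimes leads to simplification, and that is always good.

On my machine, my first answer takes about the same time as reported in the benchmark, and this new answer finishes in under 7 seconds, which is quite fast, though of course nothing like the {{{{ 1}} solution.

Edit

I replaced the two reads from "foo" with a single readarray, which cut the running time (on my machine) from about 9 seconds to under 7, more than I expected. This makes me think that a significant improvement could be had by reading both files into arrays (but not the whole files, to avoid the risk of hitting memory limits), obviously at the cost of extra code complexity.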
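The chunked idea could look roughly like the sketch below. It is untimed and rests on assumptions: the function name interleave_chunked and the chunk size are hypothetical, and it follows the question's rule (odd-numbered lines of foo replaced by successive lines of bar).

```shell
#!/bin/bash
# interleave_chunked FOO BAR CHUNK: print FOO with its odd-numbered lines
# replaced by successive lines of BAR, holding at most CHUNK lines of BAR
# and 2*CHUNK lines of FOO in memory at a time.
interleave_chunked() {
  local foo_file=$1 bar_file=$2 chunk=$3 i
  local -a fl bl
  exec 3< "$foo_file" 4< "$bar_file"
  while :; do
    readarray -t -n "$chunk" -u 4 bl
    readarray -t -n "$((2 * chunk))" -u 3 fl
    (( ${#fl[@]} > 0 )) || break        # foo is exhausted
    for ((i = 0; i < ${#fl[@]}; i++)); do
      if (( i % 2 == 0 && i / 2 < ${#bl[@]} )); then
        printf '%s\n' "${bl[i / 2]}"    # odd foo line: take the bar line
      else
        printf '%s\n' "${fl[i]}"        # even foo line, or bar ran out
      fi
    done
  done
  exec 3<&- 4<&-
}

workdir=$(mktemp -d)
printf '%s\n' foo1 foo2 foo3 foo4 foo5 foo6 > "$workdir/foo"
printf '%s\n' bar1 bar2 bar3 > "$workdir/bar"

result=$(interleave_chunked "$workdir/foo" "$workdir/bar" 2)
printf '%s\n' "$result"
```

With a chunk size of 2 the sample run takes two passes through the loop and prints bar1, foo2, bar2, foo4, bar3, foo6. A realistic chunk size would be much larger; finding a good value would need measuring.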

Speeding up sed replacement with strings read from a file

8 answers: