Question

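I have a multi-line file in which each line contains a comma. I want to remove from each line every character that appears after the comma, including the comma itself (that is, those characters are stripped from the text before the comma, and the comma and everything after it are dropped). I have a bash script that does this, but it isn't fast enough.

Input: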

hello world, def

Output:

hllo worl

My slow script:

#!/bin/bash

while read line; do
    values="${line#*, }"
    phrase="${line%, *}"
    echo "${phrase//[$values]}"
done < "$1"

I'd like to improve the performance. Any suggestions?

Answer 1

Using Perl:

$ perl -F',' -lane '$F[0] =~ s/[$F[1]]//g; print $F[0]' file
hlloworl

If you don't want the spaces after the comma to count as characters to remove:

$ perl -F',\s*' -lane '$F[0] =~ s/[$F[1]]//g; print $F[0]' file
hllo worl

Perl is good at text manipulation like this, so I'd expect this to be fast.

Answer 2

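Getting rid of the while loop should give your code a boost. Most programs take a file as input and do the reading for you.

You could replace your script with the following and report the timings: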

cut -d"," -f1 < file

You could try awk, changing the field separator to ,:

awk 'BEGIN {FS=","}; {print $1}' file

You could also try sed (with a modification suggested by @Qualia):

sed -r -i "s/,.*//g" file

Note that the -i flag edits your file in place. If that is not what you want:

sed -r "s/,.*//g" file
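For reference, here is a small sketch that runs the three commands above on the sample line from the question, fed through stdin rather than from a file. The variable name sample is just for illustration, and the sed call drops -r because this pattern needs no extended syntax. All three keep the text before the first comma unchanged; they do not strip individual characters from it the way the question's script does.

```shell
# Sample line from the question (assumption: fed via stdin instead of a file).
sample='hello world, def'

# cut: keep only the first comma-delimited field.
printf '%s\n' "$sample" | cut -d"," -f1                        # prints: hello world

# awk: same idea, with the field separator set to a comma.
printf '%s\n' "$sample" | awk 'BEGIN {FS=","}; {print $1}'     # prints: hello world

# sed: delete from the first comma to the end of the line.
printf '%s\n' "$sample" | sed "s/,.*//"                        # prints: hello world
```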

Answer 3

AWK solution (edited, taking inspiration from @glenn jackman's perl solution):

awk -F", " '{ gsub("["$2"]",""); print $1 }' "$1"

For this kind of line processing, a compiled solution is often the better choice. I'll use Haskell for its expressiveness:

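-- answer.hs
import Data.Char (isSpace)  -- only needed for the whitespace-tolerant variant

-- Read stdin, transform every line, write stdout.
main = interact (unlines . map perLine . lines)

-- Split each line at the first comma, then filter the left part.
perLine = strSetDiff . break (== ',')

-- Drop from s every character that occurs in sub (the text after ", ").
strSetDiff (s, ',':' ':sub) = filter (`notElem` sub) s
-- No ", " found: return the part before any comma unchanged.
strSetDiff (s, _) = s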

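Compile it with the command ghc -O2 answer.hs.

This breaks each line at the ',' into two lists, s and sub, removes the ", " from the front of sub, and then filters s to drop the characters that are elements of sub. If there is no comma, the result is the whole line.

This assumes that a space always follows the ,. If it doesn't, remove the ' ': from the pattern and replace `notElem` sub with `notElem` (dropWhile isSpace sub).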
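The same break-and-filter idea can also be sketched in awk without building a regex from the data, which sidesteps characters such as ], ^ or - being special inside a bracket expression. This is only a rough sketch, not something I have timed: the function name strip_listed_chars is made up for illustration, and it assumes any run of spaces or tabs after the comma should be ignored.

```shell
# strip_listed_chars [FILE...]
# Hypothetical helper: for each line, remove from the text before the first
# comma every character that occurs after the comma. Reads stdin if no FILE.
strip_listed_chars() {
    awk '{
        i = index($0, ",")
        if (i == 0) { print; next }        # no comma: keep the whole line
        keep = substr($0, 1, i - 1)        # text before the comma
        drop = substr($0, i + 1)           # text after the comma
        sub(/^[ \t]+/, "", drop)           # ignore whitespace right after the comma
        out = ""
        for (j = 1; j <= length(keep); j++) {
            c = substr(keep, j, 1)
            if (index(drop, c) == 0)       # keep c only if it is not listed
                out = out c
        }
        print out
    }' "$@"
}
```

Usage would look like strip_listed_chars infile > outfile. It uses plain string lookups (index) instead of gsub, so a line without a comma, or one with nothing after it, is passed through rather than producing an empty bracket expression.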

Time taken on an 80000-line file made of 10 lines repeated 8000 times:

$ time ./answer <infile >outfile
0.38s user 0.00s system 99% cpu 0.386 total

$ time [glenn jackman's perl]
0.68s user 0.00s system 99% cpu 0.691 total

$ time awk -F", " '{ gsub("["$2"]",""); print $1 }' infile > outfile
0.85s user 0.04s system 99% cpu 0.897 total

$ time ./ElBarajas.sh infile > outfile
2.77s user 0.32s system 99% cpu 3.105 total
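A test file of the shape used for these timings (10 lines repeated 8000 times) could be rebuilt roughly as follows. This is a sketch under assumptions: make_infile and sample10.txt are hypothetical names, and the actual 10 lines used above are not given, so you would have to supply your own.

```shell
# make_infile SEED_FILE COUNT
# Hypothetical helper: print the lines of SEED_FILE COUNT times in a row.
make_infile() {
    awk -v n="$2" '
        { lines[NR] = $0 }                 # remember every seed line
        END {
            for (r = 0; r < n; r++)        # repeat the whole block n times
                for (i = 1; i <= NR; i++)
                    print lines[i]
        }
    ' "$1"
}

# Example (assumes sample10.txt holds your own 10 sample lines):
#   make_infile sample10.txt 8000 > infile
```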

Personally, I'm willing to concede defeat: the perl solution seems like the best one to me.

Performance issue with while and read
