Question

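I'm new to shell scripting and need guidance on a typical requirement. I have two files (1. a master file and 2. a pattern file). The master file contains many fields separated by |, and only the 10th and 15th fields need to be updated based on the pattern file.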

Master file:

H|20170101

123|field2|field3|...|field10|field11...|field15|....|field150

...

...

T|1000000

Pattern file:

Europe|EU

Australia|AU

China|CN

For example,

123|1|2|3|...|9|nice weather in europe today|11|.....

The line above needs to be changed to:

123|1|2|3|...|9|nice weather in EU today|11|.....

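I started with a simple sed command that replaces values in the master file using the values from the pattern file, but it is incomplete: I don't know how to handle a huge master file, and the command also replaces matches outside the specific fields.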

while read line
do
value1=$(echo $line | awk -F"|" '{print $1}')
value2=$(echo $line | awk -F"|" '{print $2}')
sed -i 's/ '${value1}' /'${value2}'/g' master.txt
done < pattern.txt

The script above is already very slow on a 10 MB file, and my master file is considerably larger (100 MB).

Please help.

Answer 1

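The script is probably slow because of the number of child processes you are creating. Also, you are reading the larger file (master.txt) many times, once per pattern, rather than the smaller one.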

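Note that the -i option of sed is non-standard.

You can get rid of the calls to the awk language interpreter and the sed editor by using bash alone. The script below applies the replacements to the whole line, so it does not restrict the edit to particular fields: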

# Read patterns into an associative array
# Requires Bash 4 or later
declare -A patterns

while IFS='|' read -r key value
do
    patterns[$key]="$value"
done < pattern.txt

# Set the option for case insensitive patterns
shopt -s nocasematch

# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line
do
    # Iterate through the patterns array
    for key in "${!patterns[@]}"
    do
        line="${line//$key/${patterns[$key]}}"
    done

    printf '%s\n' "$line"
done < master.txt
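
To restrict the edit to particular fields, one option is to split each record on | and run the same substitution only on the chosen elements. The following is a minimal sketch, not a tuned solution. It assumes Bash 4 or later and a bash version in which nocasematch also applies to the ${var//pattern/replacement} expansion. The name replace_fields is a hypothetical helper introduced here for illustration, and a pure-bash loop is unlikely to be fast on a 100 MB file.

```shell
#!/usr/bin/env bash
# Sketch: apply the replacements to fields 10 and 15 only.
# replace_fields is a hypothetical helper name (an assumption, not from the answer).

replace_fields() {
    local pattern_file=$1 master_file=$2
    local -A patterns
    local -a fields
    local key value line i joined

    # Load "search|replacement" pairs from the pattern file
    while IFS='|' read -r key value
    do
        [[ -z $key ]] && continue
        patterns[$key]=$value
    done < "$pattern_file"

    # Case-insensitive matching for the ${var//pattern/replacement} expansion
    # (this setting stays active in the calling shell)
    shopt -s nocasematch

    while IFS= read -r line
    do
        # Split on '|'; the extra '|' keeps a trailing empty field from being dropped
        IFS='|' read -r -a fields <<< "$line|"

        for i in 9 14               # zero-based indexes of fields 10 and 15
        do
            # Short records such as the H|... header and T|... trailer are skipped
            (( i < ${#fields[@]} )) || continue
            for key in "${!patterns[@]}"
            do
                fields[i]=${fields[i]//$key/${patterns[$key]}}
            done
        done

        # Join the fields with '|' again and strip the final delimiter
        printf -v joined '%s|' "${fields[@]}"
        printf '%s\n' "${joined%|}"
    done < "$master_file"
}
```

It could be called as replace_fields pattern.txt master.txt > master.new (the output file name is a placeholder). Like the script above, it replaces substrings rather than whole words, so european would also be affected by a europe pattern.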

Answer 2

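This is an alternative proposal using sed, based on the fact that sed can read its commands from a file.

First, I create a sed command file from the contents of the pattern file: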

$ cat file1
europe|EU
australia|AU
china|CN

$ while IFS="|" read -r a b;do 
> echo -e "s/((.[^|]*.){9})(.+)\<$a\>([^|]+)(.*)/\1\3$b\4\5/g";
> echo -e "s/((.[^|]*.){14})(.+)\<$a\>([^|]+)(.*)/\1\3$b\4\5/g";
> done<file1 >file11

$ cat file11
s/((.[^|]*.){9})(.+)\<europe\>([^|]+)(.*)/\1\3EU\4\5/g
s/((.[^|]*.){14})(.+)\<europe\>([^|]+)(.*)/\1\3EU\4\5/g
s/((.[^|]*.){9})(.+)\<australia\>([^|]+)(.*)/\1\3AU\4\5/g
s/((.[^|]*.){14})(.+)\<australia\>([^|]+)(.*)/\1\3AU\4\5/g
s/((.[^|]*.){9})(.+)\<china\>([^|]+)(.*)/\1\3CN\4\5/g
s/((.[^|]*.){14})(.+)\<china\>([^|]+)(.*)/\1\3CN\4\5/g

Then the only thing left to do is to call sed and feed it the command file file11 created above.

$ cat file2
1|2|3|4|5|europe|7|8|9|nice weather in europe today|11|12|europe|14|nice weather in europe today|16
1|2|3|4|5|europe|7|8|9|nice european weather today|11|12|europe|14|nice european weather today|16
1|2|3|4|5|europe|7|8|9|nice weather in china today|11|12|china|14|nice weather in china today|16
1|2|3|4|5|europe|7|8|9|nice weather in china today|11|12|china|14|best of chinas today|16
1|2|3|4|5|europe|7|8|9|nice weather in australia today|11|12|australia|14|nice weather in australia today|16

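I have filled file2 with various test values to make sure that the sed regular expressions replace only the 10th and 15th fields, and only on an exact word match (that is, the word europe is replaced by EU, but the word european is not replaced).

These results seem quite good. I hope this sed solution will be fast on your large file.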
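
The sed invocation itself is not shown in this answer. As a minimal sketch of the underlying mechanism only (not of the field-aware expressions above), the standard -f option makes sed read its commands from a file, so the master file is scanned once no matter how many patterns there are. The name build_sed_script and the file names are hypothetical, and the sketch assumes that keys and values contain no /, &, backslash, or regular-expression metacharacters.

```shell
#!/usr/bin/env bash
# Sketch: turn a "search|replacement" pattern file into a sed script,
# then apply every substitution in a single pass with sed -f.
# build_sed_script is a hypothetical helper name (an assumption).
# These are plain whole-line substitutions, not restricted to any field.

build_sed_script() {
    local key value
    while IFS='|' read -r key value
    do
        [ -z "$key" ] && continue
        # One substitute command per pattern line
        printf 's/%s/%s/g\n' "$key" "$value"
    done < "$1"
}

# Usage (file names are placeholders):
#   build_sed_script pattern.txt > commands.sed
#   sed -f commands.sed master.txt > master.new
```

Writing the result to a new file instead of editing in place also sidesteps the non-standard -i option.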

Answer 3

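This is a shot in the dark, since your sample data doesn't even have 10 fields and I didn't have time to create a test set. Hopefully it works, using awk. Next time, please take the trouble to create a working data set (enough fields, Europe =/= europe, etc.). Like I said, untested: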

$ awk '
BEGIN { FS=OFS="|" }                      # delimiters
NR==FNR { a[tolower($1)]=$2; next }       # read patterns, hash them by lowercased key
{
    for(i=10;i<=15&&i<=NF;i+=5){          # visit fields 10 and 15 only
        n=split($i,b," ")                 # split the field into words
        for(j=1;j<=n;j++)                 # iterate through the words
            if(tolower(b[j]) in a)        # if the word is found among the patterns
                sub(b[j],a[tolower(b[j])],$i)  # replace it in the field
    }
}1' pattern master
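
A variant of the same idea, as a hedged sketch: it touches exactly fields 10 and 15, lowercases each word before the lookup, and rebuilds the field word by word instead of calling sub(), so a pattern that appears inside a longer word (european) is left alone. The wrapper name replace_words is hypothetical. The sketch assumes words are separated by blanks and that the pattern file is not empty (the NR==FNR idiom depends on that).

```shell
#!/usr/bin/env bash
# Sketch: update only fields 10 and 15, matching whole words case-insensitively.
# replace_words is a hypothetical wrapper name; arguments: pattern file, master file.

replace_words() {
    awk '
    BEGIN { FS = OFS = "|" }
    NR == FNR { map[tolower($1)] = $2; next }    # first file: lowercased key -> replacement
    {
        for (k = 1; k <= 2; k++) {
            i = (k == 1) ? 10 : 15               # the two target fields
            if (i > NF) continue                 # header and trailer lines are shorter
            n = split($i, words, " ")            # blank-separated words of the field
            out = ""
            changed = 0
            for (j = 1; j <= n; j++) {
                w = tolower(words[j])
                if (w in map) changed = 1
                out = out (j > 1 ? " " : "") ((w in map) ? map[w] : words[j])
            }
            if (changed) $i = out                # rewrite the field only when a word matched
        }
        print
    }' "$1" "$2"
}
```

Fields that contain no matching word are printed unchanged. A rewritten field has its words joined by single spaces, and a word with attached punctuation (for example "europe,") is not matched by this sketch.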

Search for strings in file2 based on file1 and replace them
