Question

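I'm trying to compare two files in a way that allows a line in the second file to begin with a line from the first file, but with unwanted junk appended.

Consider the following code: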

printf '%s\n' 5234 2234 3234 4234 1234 >NumsOnFile.txt
printf '%s\n' 423499 1234 223401 3234 >UserNums.txt

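I want to generate two output files: good.txt, containing the numbers that appear in both files (even if only as a substring), and bad.txt, containing the numbers that are in UserNums.txt but not in NumsOnFile.txt.

Existing implementation attempt

Stage 1: Eliminating lines that are already correct

I'm currently doing this in two stages. My current attempt at the first stage looks like this: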

sort -n UserNums.txt > a 
sort -n NumsOnFile.txt > b
awk '!a[$0]++' a > A
awk '!a[$0]++' b > B
comm -23 A B > bad.txt  
comm -12 A B > good.txt

I expect good.txt to contain the following:

1234
3234

...and bad.txt to contain the following:

423499
223401

Stage 2: Trying to find substrings

Next, I process bad.txt to see whether any matches turn up after removing the last character from each line:

read file
if [ -s bad.txt ]
   then 
    sed 's/.$//' bad.txt > checker.txt # removes last character from each line
    sort -n checker.txt > X
    comm -23 X B > checker.txt 
    comm -12 X B >> good.txt
    cat checker.txt > bad.txt 
else
    echo "File is empty"
fi

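After the second stage, good.txt should now have all the numbers that match in both files (even if they are only substrings of the entries in UserNums.txt):

...while bad.txt should have the original numbers that did not match: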

423499
223401

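What's going wrong?

I think my logic is correct, but I'm either not using the right commands or not using them correctly. The if statement may also be what is tripping things up.

The bad.txt and good.txt files are not being populated with the expected data. Numbers either end up in both files, or some numbers go missing altogether.
good.txt also ends up empty, even though I found two matching numbers by searching manually.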
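
One likely contributor to these symptoms: comm expects its inputs to be sorted lexically, while sort -n sorts numerically, and the two orders differ as soon as the numbers have different lengths (numerically 3234 comes before 223401, lexically it comes after). The following is a minimal sketch of stage 1 under that assumption, using the sample data from the question. The intermediate file names A and B are kept from the original attempt, and pinning LC_ALL=C is an assumption made here so that sort and comm agree on the ordering.

```shell
# Sample data from the question.
printf '%s\n' 5234 2234 3234 4234 1234 >NumsOnFile.txt
printf '%s\n' 423499 1234 223401 3234 >UserNums.txt

# comm needs lexically sorted input; sort -u also drops duplicate lines,
# so the separate awk de-duplication step is not needed.
LC_ALL=C sort -u UserNums.txt >A
LC_ALL=C sort -u NumsOnFile.txt >B

LC_ALL=C comm -12 A B >good.txt   # lines present in both files
LC_ALL=C comm -23 A B >bad.txt    # lines found only in UserNums.txt
```

With the sample data this leaves 1234 and 3234 in good.txt, and 223401 and 423499 in bad.txt, which matches the output expected after stage 1.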

Answer 1

If I understood your question correctly, this may do the trick:

#!/bin/bash

# All files are assumed to be in the same directory. Please modify the paths if necessary.

# Opening files for writing

exec 3>./Bad.txt
exec 4>./Good.txt
exec 5>./correction.sed

# Creating an array for the account numbers.
while IFS= read -r line; do
    accountNumber[$line]=$line
done < ./NumsOnFile.txt

# Comparing the user's file with your account file
while IFS= read -r line; do
    # Takes only the first 4 characters. If your account numbers have a different length, adjust this.
    accUser=${line:0:4}
    if [[ ${accountNumber[$accUser]} -ne $line ]]; then
        # If different, write to the bad file and add a rule to the script that corrects the original file
        echo "$line" >&3
        echo "s|$line|$accUser|g" >&5
    else
        # If same, just write to the good file
        echo "$line" >&4
    fi
done < ./UserNums.txt

# Closing files

exec 3>&-
exec 4>&-
exec 5>&-

# Executing sed script to correct the input file

sed -i.bck --file=./correction.sed ./UserNums.txt

Hope it helps.

Edit

Edited to take Charles's comments into account.
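
The script in this answer assumes every account number is exactly 4 characters long. As a rough sketch of how that assumption could be dropped, the awk program below tries every prefix of each user line, longest first, against the known account numbers. It is an illustration, not part of the original answer: the extra input line 777700 is hypothetical and was added only to exercise the no-match branch, and writing the matched account number (rather than the original user line) to good.txt is a design choice made here.

```shell
# Sample data from the question, plus one hypothetical line (777700)
# that matches no account number.
printf '%s\n' 5234 2234 3234 4234 1234 >NumsOnFile.txt
printf '%s\n' 423499 1234 223401 3234 777700 >UserNums.txt

# Start from empty output files so stale content never survives a run.
: >good.txt
: >bad.txt

awk '
    # First file: remember every known account number.
    NR == FNR { known[$0]; next }

    # Second file: try each prefix of the line, longest first.
    {
        for (n = length($0); n > 0; n--) {
            if (substr($0, 1, n) in known) {
                print substr($0, 1, n) > "good.txt"   # matched account number
                next
            }
        }
        print > "bad.txt"                             # no prefix matched
    }
' NumsOnFile.txt UserNums.txt
```

Each user line is written to exactly one of the two files, so a number cannot end up in both.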

Answer 2

You can use the following commands to produce the output:

cat NumsOnFile.txt UserNums.txt | cut -c1-4 |sort | uniq -d > good.txt
grep -vFxf NumsOnFile.txt UserNums.txt > bad.txt
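
Note that the two commands above use different criteria: the first matches on the first four characters, while the second requires an exact whole-line match, so with the question's sample data 423499 and 223401 land in bad.txt even though their prefixes appear in good.txt. A possible variant that applies the same prefix test to both outputs is sketched below. It assumes the account numbers contain only digits, so they are safe to use as regular expressions; the file name patterns.txt and the extra input line 777700 are hypothetical and exist only for this illustration.

```shell
# Sample data from the question, plus one hypothetical line (777700)
# that matches no account number.
printf '%s\n' 5234 2234 3234 4234 1234 >NumsOnFile.txt
printf '%s\n' 423499 1234 223401 3234 777700 >UserNums.txt

# Turn every account number into an anchored pattern such as ^4234.
sed 's/^/^/' NumsOnFile.txt >patterns.txt

# User lines that start with some account number go to good.txt ...
grep -f patterns.txt UserNums.txt >good.txt
# ... and all remaining user lines go to bad.txt.
grep -v -f patterns.txt UserNums.txt >bad.txt
```

Here good.txt keeps the user lines as they were written (including the trailing junk), which differs from the cut-based command above that records only the 4-character prefix.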

