Question

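I'm learning shell scripting, and one exercise asks me to compute the md5 hash of every file in a folder. It also asks that, if two files have the same hash, their names be printed in the terminal. My code does that, but once it finds a match it prints it twice. I can't figure out how to exclude the first file name from the next iteration. One more thing: creating any temporary files to help with the task is forbidden.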

#!/bin/bash

ifs=$IFS
IFS=$'\n'

echo "Verifying the files inside the directory..."

for file1 in $(find . -maxdepth 1 -type f | cut -d "/" -f2); do
  md51=$(md5sum $file1  | cut -d " " -f1)
  for file2 in $(find . -maxdepth 1 -type f | cut -d "/" -f2 | grep -v "$file1"); do
    md52=$(md5sum $file2 | cut -d " " -f1)
    if [ "$md51" == "$md52" ]; then
      echo "Files $file1 e $file2 are the same."
    fi
  done
done

I'd also like to know whether there is a more efficient way to do this task.

Answer 1

This:

mapfile -t list < <(find . -maxdepth 1 -type f -exec md5sum {} + | sort)
mapfile -t dups < <(printf "%s\n" "${list[@]}" | grep -f <(printf "^%s\n" "${list[@]}" | sed 's/ .*//' | sort | uniq -d))

# the array dups now contains all duplicates along with their md5sums
# you can print the array with a simple
printf "%s\n" "${dups[@]}"

produces output like this:

3b0332e02daabf31651a5a0d81ba830a  ./f2.txt
3b0332e02daabf31651a5a0d81ba830a  ./fff
c9eb23b681c34412f6e6f3168e3990a4  ./both.txt
c9eb23b681c34412f6e6f3168e3990a4  ./f_out
d41d8cd98f00b204e9800998ecf8427e  ./aa
d41d8cd98f00b204e9800998ecf8427e  ./abc def.xxx
d41d8cd98f00b204e9800998ecf8427e  ./dudu
d41d8cd98f00b204e9800998ecf8427e  ./start
d41d8cd98f00b204e9800998ecf8427e  ./xx_yy

The following addition is only for a fancier printout:

echo "duplicates:"
while read -r md5; do
        echo "$md5"
        printf "%s\n" "${dups[@]}" | grep "$md5" | sed 's/[^ ]* /  /'
done < <(printf "%s\n" "${dups[@]}" | sed 's/ .*//' | sort -u)

which prints something like this:

3b0332e02daabf31651a5a0d81ba830a
   ./f2.txt
   ./fff
c9eb23b681c34412f6e6f3168e3990a4
   ./both.txt
   ./f_out
d41d8cd98f00b204e9800998ecf8427e
   ./aa
   ./abc def.xxx
   ./dudu
   ./start
   ./xx_yy
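
The grouped printout above runs grep once per hash. As a possible alternative, here is a minimal sketch that produces the same layout in a single pass. It assumes its input is already sorted by hash, as the dups array is; print_grouped is a hypothetical helper name, not part of the original answer.

```bash
#!/bin/bash
# Sketch: single-pass grouped print.
# print_grouped is a hypothetical helper; it expects "hash  name" lines
# on stdin, already sorted by hash, with names free of newlines.
print_grouped() {
  local line md5 prev=''
  while IFS= read -r line; do
    md5=${line%% *}                 # everything before the first space
    if [[ $md5 != "$prev" ]]; then  # new hash: print it as a group header
      printf '%s\n' "$md5"
      prev=$md5
    fi
    printf '   %s\n' "${line#*  }"  # name, after the two-space separator
  done
}
```

It could be fed with printf "%s\n" "${dups[@]}" | print_grouped.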

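Warning: this works only if the file names do not contain the \n (newline) character. Adapting the script to the general case requires bash 4.4+, where mapfile supports the -d option.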
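
One way the general case might look is sketched below, using NUL-terminated records instead of lines. It assumes bash 4.4+ (for mapfile -d) and a GNU md5sum recent enough to support -z; list_dups is a hypothetical helper name. Treat it as an illustration rather than a tested replacement for the script above.

```bash
#!/bin/bash
# Sketch: newline-safe duplicate listing.
# Assumptions: bash 4.4+ (mapfile -d) and GNU md5sum with -z/--zero.
# list_dups is a hypothetical helper name.
list_dups() {
  local dir=${1:-.} rec md5 prev_md5='' prev_rec='' last=''
  local -a list
  # Read NUL-terminated "hash  name" records, sorted by hash.
  mapfile -d '' -t list < <(find "$dir" -maxdepth 1 -type f -exec md5sum -z {} + | sort -z)
  for rec in "${list[@]}"; do
    md5=${rec%% *}                  # hash is everything before the first space
    if [[ $md5 == "$prev_md5" ]]; then
      # Print the first record of a group only once.
      [[ $last != "$md5" ]] && printf '%s\n' "$prev_rec"
      printf '%s\n' "$rec"
      last=$md5
    fi
    prev_md5=$md5 prev_rec=$rec
  done
}
```

Calling list_dups . lists the duplicates in the current directory. A name containing a newline is still printed across two lines, but it no longer corrupts the comparison.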

Answer 2

Here is a more efficient approach that does not use any temporary files:

#!/bin/bash

# get the sorted md5sum list of all files into an array in one shot
readarray -t arr < <(find . -maxdepth 1 -type f -exec md5sum {} + | sort)
# loop through the array and compare md5sum of contiguous items
for i in "${arr[@]}"; do
  md5="${i/ */}" # extract md5sum part
  [[ "$md5" = "$prev_md5" ]] && printf '%s\n' "$prev_i" "$i"
  prev_md5="$md5"
  prev_i="$i"
done | sort -u

The sort -u is needed to remove the repeated lines that get printed when more than two files are identical.
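
If GNU coreutils can be assumed, the loop and the trailing sort -u could arguably be replaced by uniq itself. The sketch below relies on the GNU-specific -w and -D options and on file names without newlines; print_dups is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch, assuming GNU coreutils: uniq's -w and -D options are GNU extensions.
# print_dups is a hypothetical helper name; file names must not contain newlines.
print_dups() {
  # -w32 compares only the first 32 characters, i.e. the hex MD5 digest;
  # -D prints every line that belongs to a group of repeated digests.
  find "${1:-.}" -maxdepth 1 -type f -exec md5sum {} + | sort | uniq -w32 -D
}
```

Because uniq prints each line of a group exactly once, no extra deduplication step is needed, even when three or more files share a digest.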

Running md5sum on the files in a directory and checking whether any of them are identical