Question

I'm fairly new to Bash and am looking for some advice on how to even begin with this.

I have a web page that lists a large number of images, like so:

<img src="01.jpg" alt="" width="1920" height="1080" />
<img src="02.jpg" alt="" width="1920" height="1080" />
<img src="03.jpg" alt="" width="1920" height="1080" />

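I want to run a Bash script that reads this web page (it is a local file), picks up the file names, i.e. 01.jpg, 02.jpg and 03.jpg, and then deletes every other .jpg file in the directory that does not match. So, for example, if the folder also contains 04.jpg, that file would be deleted because it is not in the web page.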

Sorry that I haven't posted any code; I just don't understand this well enough to get started.

Thanks in advance.

Answer 1

A solution using Python and BeautifulSoup (a powerful HTML parser module for Python):

python -c '
import sys, glob, bs4
# jpg files in the current directory minus the src values of all <img> tags
print("\n".join(
    set(glob.glob("*.jpg")) -
    set(e["src"] for e in bs4.BeautifulSoup(sys.stdin.read(), "html.parser").find_all("img", src=True))
))' < file.htm | xargs rm

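A note on how it works: it prints the set difference between the .jpg files in the current directory and the file names found in <img src=".."> tags, one item per line, and pipes that list to xargs rm. Because the names pass through xargs, this assumes file names without whitespace or quotes.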
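For comparison, the same set difference can be sketched in Bash alone. This is only a minimal sketch: list_unreferenced is a hypothetical helper name, and it assumes each src value is a bare file name (as in the question's HTML) rather than a path or URL. It only prints the candidates and deletes nothing.

```shell
#!/bin/bash
# Sketch: print the .jpg files in a directory that no <img src="..."> refers to.
# list_unreferenced is a hypothetical name; src values are assumed to be bare file names.
list_unreferenced() {
    local html=$1 dir=${2:-.}
    local refs f name
    # Collect every src="..." value found inside an <img> tag, one per line.
    refs=$(grep -o '<img[^>]*src="[^"]*"' "$html" | sed 's/.*src="\([^"]*\)"/\1/')
    for f in "$dir"/*.jpg; do
        [ -e "$f" ] || continue      # the glob matched nothing
        name=${f##*/}
        # -x: whole-line match, -F: fixed string, so "." is not a wildcard
        printf '%s\n' "$refs" | grep -qxF -- "$name" || printf '%s\n' "$name"
    done
}
```

Calling list_unreferenced file.htm . in a folder that holds 01.jpg to 04.jpg, with the page from the question, prints 04.jpg. Reviewing that list before handing it to rm is the safer order of operations.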

Answer 2

This should work for you:

# Each path is passed to bash as $1 rather than spliced into the script text
find . -maxdepth 1 -name "*.jpg" -type f -exec bash -c \
    'f=${1#./}; if ! grep -qF "img src=\"$f\"" file.html; then rm -- "$f"; echo "Removed $f"; fi' _ {} \;

Answer 3

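There are many ways to approach this. One is to fill an array with all the jpg files in the directory and then selectively delete the jpg files that are not found in the html file.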

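Note: the actual file removal is commented out so that you can confirm the behavior before enabling real deletion. For now, the script only prints what would be kept and what would be deleted: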

#!/bin/bash

[ -z "$1" ] && {
    printf "error: insufficient input. usage:  %s path/to/file.html\n" "${0##*/}"
    exit 1
}

[ -r "$1" ] || {
    printf "error: invalid filename '%s'. usage:  %s path/to/file.html\n" "$1" "${0##*/}"
    exit 1
}

fname=${1##*/}  ## split filename/path
fpath=${1%/*}

[ "$fname" = "$fpath" ] && fpath="./"

jpgarray=( "$fpath"/*.jpg )                 ## read jpg files in directory

for i in "${jpgarray[@]}"; do
    [ -e "$i" ] || continue                 ## glob matched nothing
    tmp=${i##*/}
    if grep -qF -- "$tmp" "$1"; then        ## fixed-string match, so "." is literal
        printf "    file: %s exists in %s -- don't delete\n" "$i" "$1"
    else
        printf "    file: %s does NOT exist in %s -- deleting\n" "$i" "$1"
        # rm -- "$i"                        ## uncomment to actually delete the jpg
    fi
done

exit 0

JPG files in the directory

$ ls -1 dat/*.jpg
dat/01.jpg
dat/02.jpg
dat/03.jpg
dat/04.jpg
dat/05.jpg
dat/06.jpg

Input file

$ cat dat/jpgnames.html
<img src="01.jpg" alt="" width="1920" height="1080" />
<img src="02.jpg" alt="" width="1920" height="1080" />
<img src="03.jpg" alt="" width="1920" height="1080" />

Usage/output

$ bash findjpg.sh dat/jpgnames.html
    file: dat/01.jpg exists in dat/jpgnames.html -- don't delete
    file: dat/02.jpg exists in dat/jpgnames.html -- don't delete
    file: dat/03.jpg exists in dat/jpgnames.html -- don't delete
    file: dat/04.jpg does NOT exist in dat/jpgnames.html -- deleting
    file: dat/05.jpg does NOT exist in dat/jpgnames.html -- deleting
    file: dat/06.jpg does NOT exist in dat/jpgnames.html -- deleting
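
Instead of editing the script to uncomment the rm line, the dry run can be made the default and real deletion an explicit opt-in. The following is a rough sketch of that idea, not a replacement for the script above: prune_jpgs and its --delete argument are hypothetical names chosen for illustration.

```shell
#!/bin/bash
# Sketch: report what would be deleted; only delete when "--delete" is given.
# prune_jpgs and the "--delete" argument are hypothetical, for illustration only.
prune_jpgs() {
    local html=$1 dir=${2:-.} mode=${3:-}
    local f name
    for f in "$dir"/*.jpg; do
        [ -e "$f" ] || continue                  # the glob matched nothing
        name=${f##*/}
        if grep -qF -- "src=\"$name\"" "$html"; then
            printf 'keep:         %s\n' "$f"
        elif [ "$mode" = "--delete" ]; then
            rm -- "$f" && printf 'deleted:      %s\n' "$f"
        else
            printf 'would delete: %s\n' "$f"     # dry run, nothing is removed
        fi
    done
}
```

Running prune_jpgs dat/jpgnames.html dat only reports; adding --delete as a third argument performs the removal.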

Answer 4

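This script only works if you have a single web page to check. There are more efficient scripts in terms of syntax, but I think this one is easier for a beginner to understand: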

#!/bin/bash
## loop through all the files in the image folder
for FILENAME in /path/to/image/folder/*; do

    # for each file, check (case insensitive) if it exists in your web page
    if grep -qiF -- "$(basename "$FILENAME")" /path/to/webpage.html
    then
        # image file found in webpage
        echo "$FILENAME found, not deleting"
    else
        # image file not found in webpage
        echo "$FILENAME not found, moving to trash"
        mv -- "$FILENAME" /path/to/trash/folder
    fi
done

It also moves the files to a trash folder instead of deleting them, in case you need to restore them!
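
If a file turns out to have been moved by mistake, restoring is just the reverse move. A minimal sketch, assuming the trash folder contains only files moved there by the script above; restore_from_trash is a hypothetical helper and both directory arguments are placeholders:

```shell
#!/bin/bash
# Sketch: move everything in the trash folder back into the image folder.
# restore_from_trash is a hypothetical name; both paths are supplied by the caller.
restore_from_trash() {
    local trash=$1 dest=$2
    local f
    for f in "$trash"/*; do
        [ -e "$f" ] || continue      # trash folder is empty
        # -n: never overwrite a file that already exists in the destination
        mv -n -- "$f" "$dest"/
    done
}
```

For example, restore_from_trash /path/to/trash/folder /path/to/image/folder would put the files back where the loop above took them from.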

Bash: delete files of a given type that are not listed in an HTML file
