Question

I have a file sortedurls.txt, produced by crawling a domain and writing out its URLs one per line. sortedurls.txt looks like this:

https://example.com/page1.php
https://example.com/page2.php
https://example.com/page-more.php

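The script loops over sortedurls.txt line by line (URL by URL) and uses wget and hxselect to collect the img tags from each page. Purely for verification, the result is saved to testtagstring.txt, which looks like this: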

<img alt="…" src="/assets/…/image1.jpg">§<img alt="…" src="/assets/…/image11.jpg">
<img alt="…" src="/assets/…/image2.jpg">§

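and so on.

Each line is then split on the delimiter § into the array "tags". The script counts the array elements and appends the result to a file for verification.

The problem: run from a terminal, the script works fine and the output shows the correct number of entries (6, 1, 1, 9…). Run from a cron job, the IFS split doubles the counts to 12, 2, 2, 18….

Do you know why merely running it from cron changes the behavior?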

#!/bin/bash

# Set this script dir path
scriptdirpath=/usr/local/www/apache24/data/mydomain.com/testdir

# Some config variables
useragent=googlebot
searchtag=img
delimiter=§

# Change to the script directory; stop if that fails
cd "$scriptdirpath" || exit 1


# Make files
echo > testtagstring.txt
echo > testimages.txt

# Loop through the sortedurls.txt
while IFS= read -r p; do

tagString=$(wget -qO - --user-agent="$useragent" "$p" | hxnormalize -x | hxselect -s "$delimiter" "$searchtag" )

echo $tagString >> testtagstring.txt

IFS="$delimiter" read -r -a tags <<<"$tagString"

echo "Amount of img tags: ${#tags[@]}" >> "$scriptdirpath/testimages.txt"

done < "$scriptdirpath/sortedurls.txt"

Answer 1

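My script is saved as UTF-8, so under cron, which was configured to use ASCII, its non-ASCII characters were not interpreted correctly. The delimiter § is two bytes in UTF-8, and in a single-byte locale bash treats each byte of IFS as its own delimiter, which is consistent with the doubled counts. Adding the following to my bash script solves the problem without any changes to the cron configuration.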

LC_ALL_SAVED="$LC_ALL"
export LC_ALL=de_DE.UTF-8
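The snippet above saves the previous value in LC_ALL_SAVED but does not show it being put back. If the original locale matters for later steps, a minimal sketch of restoring it could look like this. The variables lc_all_was_set, lc_all_before and lc_all_after are illustrative names, not part of the original script, and the sketch assumes the de_DE.UTF-8 locale is installed (bash prints a warning otherwise).

```shell
#!/bin/bash
# Record the starting state so the restore can be checked afterwards.
lc_all_before="${LC_ALL-<unset>}"

# Remember whether LC_ALL was set at all, and its value if so.
lc_all_was_set="${LC_ALL+yes}"
LC_ALL_SAVED="${LC_ALL-}"

# Assumption: de_DE.UTF-8 (the locale named in the answer) is installed.
export LC_ALL=de_DE.UTF-8

# ... locale-sensitive work, such as the IFS split, would go here ...
echo "working with LC_ALL=$LC_ALL"

# Put the previous state back: restore the old value, or unset the variable.
if [ -n "$lc_all_was_set" ]; then
  export LC_ALL="$LC_ALL_SAVED"
else
  unset LC_ALL
fi

lc_all_after="${LC_ALL-<unset>}"
echo "LC_ALL before: $lc_all_before, after: $lc_all_after"
```

For a script that exits right after the loop, the restore step is optional, since the exported value only affects the script's own process and its children.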

Now everything runs fine from both the CLI and cron. Thanks for your help.
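The doubling can be reproduced without cron by splitting one sample line under different locales. The following is a minimal sketch, not the original script: count_fields is a hypothetical helper, the two img tags are made-up sample data, and the delimiter is written as its two UTF-8 bytes so the demo does not depend on how this file itself is encoded.

```shell
#!/bin/bash
# The delimiter § from the question as its UTF-8 byte sequence (0xC2 0xA7).
delimiter=$'\xc2\xa7'

# Sample data (hypothetical): two tags, each followed by the delimiter,
# in the same shape as a line of testtagstring.txt.
line="<img src=\"a.jpg\">${delimiter}<img src=\"b.jpg\">${delimiter}"

# count_fields LOCALE: split $line on $delimiter under the given locale
# (inside a subshell, so the caller's locale is untouched) and print the
# number of array elements.
count_fields() {
  (
    export LC_ALL="$1"
    IFS="$delimiter" read -r -a tags <<<"$line"
    echo "${#tags[@]}"
  )
}

# In the C locale each of the two bytes acts as its own delimiter, so an
# empty field appears between them and the count doubles.
count_c=$(count_fields C)
echo "LC_ALL=C: $count_c fields"

# Try the first UTF-8 locale the system reports, if there is one.
utf8_locale=$(locale -a 2>/dev/null | grep -i -m1 -E 'utf-?8$')
if [ -n "$utf8_locale" ]; then
  echo "LC_ALL=$utf8_locale: $(count_fields "$utf8_locale") fields"
fi
```

Under LC_ALL=C the two tags are counted as 4 fields. Under a UTF-8 locale the same line should split into 2, which matches the 6 versus 12 pattern seen between the terminal and cron.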

bash IFS behaves differently between terminal and cron execution
