Question

This is driving me crazy. I have the following bash script.

testdir="./test.$$"
echo "Creating a testing directory: $testdir"
mkdir "$testdir"
cd "$testdir" || exit 1

echo "Creating a file word.txt with content á.txt"
echo 'á.txt' > word.txt

fname=$(cat word.txt)
echo "The word.txt contains:$fname"

echo "creating a file $fname with a touch"
touch "$fname"
ls -l

echo "command: bash cycle"
while read -r line
do
    [[ -e "$line" ]] && echo "$line is a file"
done < word.txt

echo "command: find . -name $fname -print"
find . -name "$fname" -print

echo "command: find . -type f -print | grep $fname"
find . -type f -print | grep "$fname"

echo "command: find . -type f -print | fgrep -f word.txt"
find . -type f -print | fgrep -f word.txt

On FreeBSD (and probably on Linux too) it gives this result:

Creating a testing directory: ./test.64511
Creating a file word.txt with content á.txt
The word.txt contains:á.txt
creating a file á.txt with a touch
total 1
-rw-r--r--  1 clt  clt  7  3 júl 12:51 word.txt
-rw-r--r--  1 clt  clt  0  3 júl 12:51 á.txt
command: bash cycle
á.txt is a file
command: find . -name á.txt -print
./á.txt
command: find . -type f -print | grep á.txt
./á.txt
command: find . -type f -print | fgrep -f word.txt
./á.txt

Even on Windows 7 with cygwin installed, running the script gives the correct result.

But when I run this script in bash on OS X, I get this:

Creating a testing directory: ./test.32534
Creating a file word.txt with content á.txt
The word.txt contains:á.txt
creating a file á.txt with a touch
total 8
-rw-r--r--  1 clt  staff  0  3 júl 13:01 á.txt
-rw-r--r--  1 clt  staff  7  3 júl 13:01 word.txt
command: bash cycle
á.txt is a file
command: find . -name á.txt -print
command: find . -type f -print | grep á.txt
command: find . -type f -print | fgrep -f word.txt

So only bash found the file á.txt; neither find nor grep did. :(

I first asked on apple.stackexchange, and one answer suggested using iconv to convert the filename.

$ find . -name $(iconv -f utf-8 -t utf-8-mac <<< á.txt)

While this works on OS X, it is terrible anyway. (It means running another command for every UTF-8 string that enters the terminal.)

I am trying to find a general cross-platform bash programming solution. So the questions are:

Why does bash "find" the file on OS X while find does not?

and

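How do I write a cross-platform bash script where Unicode filenames are stored in a file?
Is the only solution a special version just for OS X, using iconv?
Is there a portable solution in other scripting languages, such as perl?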

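PS: Finally, not really a programming question, but I wonder what the reason is behind Apple's decision to use decomposed filenames, which do not play well with UTF-8 on the command line.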

Edit

A simple od:

$ ls | od -bc
0000000   141 314 201 056 164 170 164 012 167 157 162 144 056 164 170 164
           a   ́    **   .   t   x   t  \n   w   o   r   d   .   t   x   t
0000020   012                                                            
          \n

and

$ od -bc word.txt
0000000   303 241 056 164 170 164 012                                    
           á  **   .   t   x   t  \n                                    
0000007

So

$ while read -r line; do echo "$line" | od -bc; done < word.txt
0000000   303 241 056 164 170 164 012                                    
           á  **   .   t   x   t  \n                                    
0000007

The output from find is the same as from ls:

$ find . -print | od -bc
0000000   056 012 056 057 167 157 162 144 056 164 170 164 012 056 057 141
           .  \n   .   /   w   o   r   d   .   t   x   t  \n   .   /   a
0000020   314 201 056 164 170 164 012                                    
           ́    **   .   t   x   t  \n

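So the content of word.txt differs from the name of the file that was created from that content. That still does not explain why bash found the file.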

Answer 1

Unicode is hard. Repeat that every time you brush your teeth.

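Your á.txt filename contains 5 characters, of which á is the troublesome one. There is more than one way to represent á as a sequence of Unicode code points: there is a precomposed representation and a decomposed one. Unfortunately, most software is not prepared to deal with characters and is set up for code points instead (most software is cr*p). This means that, given the precomposed and decomposed representations of the same character, the software will not recognize them as the same.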

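You have a precomposed á, represented as the Unicode code point U+00E1 LATIN SMALL LETTER A WITH ACUTE. Windows uses the precomposed representation. The Mac filesystem insists on the decomposed representation (well, mostly; utf-8-mac does not decompose certain character ranges, but á is decomposed OK). So on the Mac your á becomes U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT (writing this off the top of my head, no Mac handy). Linux filesystems accept whatever you throw at them.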

If you give find a precomposed á, it will not find a file whose name contains a decomposed á, because it is not prepared to deal with this mess.
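To make the mismatch concrete, here is a minimal sketch that builds both spellings of the name from the byte values shown in the od dumps in the question and compares them. The helper names same_bytes and byte_len are made up for illustration, and the sketch assumes the shell compares strings without any Unicode normalization.

```shell
#!/usr/bin/env bash
# Two spellings of "á.txt", built from the octal bytes in the od dumps.
nfc=$(printf '\303\241.txt')    # precomposed: U+00E1
nfd=$(printf 'a\314\201.txt')   # decomposed: U+0061 U+0301

# Hypothetical helper: plain string equality, no normalization applied.
same_bytes() {
    [ "$1" = "$2" ]
}

# Hypothetical helper: number of bytes in a string (wc pads on some systems).
byte_len() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

if same_bytes "$nfc" "$nfd"; then
    echo "same byte sequence"
else
    echo "different byte sequences"
fi
echo "precomposed length: $(byte_len "$nfc")"
echo "decomposed length:  $(byte_len "$nfd")"
```

Both strings render as á.txt in a terminal, yet a tool that compares them byte by byte treats them as two different names. This would also be consistent with [[ -e ]] succeeding: that test hands the name to the filesystem, which presumably applies its own decomposition on lookup, whereas find compares the pattern with the names it reads back from the directory.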

So what is the solution? There isn't one. If you want to deal with Unicode, you have to work around the deficiencies of common tools.

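Here is a slightly less ugly workaround. Write a small bash function (using iconv or whatever) that, on each system, converts a string to the representation that system accepts, and use it throughout. Let's call it u8: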

find . -name "$(u8 "$myfilename")" -print
find . -type f -print | fgrep "$(u8 "$myfilename")"

And so on. Not pretty, but it should work.
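One possible shape for such a u8 function is sketched below. It is only an illustration: it assumes that uname -s reports Darwin on OS X, that the system iconv there knows the utf-8-mac encoding (as in the command quoted in the question), and that every other platform can take the bytes unchanged.

```shell
#!/usr/bin/env bash
# u8: hypothetical helper. Prints its argument in the representation
# that the local filesystem is assumed to use for filenames.
u8() {
    case "$(uname -s)" in
        Darwin)
            # Assumption: OS X iconv supports utf-8-mac (decomposed form).
            printf '%s' "$1" | iconv -f utf-8 -t utf-8-mac
            ;;
        *)
            # Assumption: elsewhere the name is stored exactly as given.
            printf '%s' "$1"
            ;;
    esac
}

# Example call with a plain ASCII name, which is unchanged everywhere.
u8 'word.txt'
echo
```

Note that find -name still treats the converted string as a pattern, so names containing characters such as * or [ would need extra escaping.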

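Oh, and I think we should all start filing bug reports about this cr*p. Our software should finally make an effort to understand basic human concepts like characters (I haven't even started on strings). Sorry, code points don't cut it, even if they are Unicode code points.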

Answer 2

Creating the file with touch and testing for its existence with [[ -e "$line" ]] both use the same encoding, so the file is found.

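Testing for its existence with find -name and find -print seems to use a different encoding. I suggest piping the output of find -print into a hex dumper (xxd, od -x, or similar). That will probably show which encoding find uses with -print (and it probably uses the same one with -name).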

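The general solution to encoding problems is always the same: use only one encoding. In your case, you should decide at which point it is easier to adapt. You can change the encoding when you create the file (touch "$(iconv -f utf-8 -t utf-8-mac <<< á.txt)" or similar), or change what you pass to find (the solution already given in your question). Since bash itself seems to handle Unicode filenames fine and only find seems to have this problem, I would suggest making the necessary conversion there. Maybe the Mac OS version of find has a configuration option that tells it which encoding to use for -name and -print.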
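A third place to do the conversion is on the output side of find, so that everything downstream sees precomposed names. The sketch below is an assumption-laden illustration, not a tested recipe: to_nfc is a made-up name, and it relies on the OS X iconv accepting utf-8-mac as a source encoding and producing precomposed UTF-8 from it.

```shell
#!/usr/bin/env bash
# to_nfc: hypothetical filter. Reads names on stdin and writes them
# in precomposed form, so they can be compared with word.txt.
to_nfc() {
    if [ "$(uname -s)" = Darwin ]; then
        # Assumption: OS X iconv can read utf-8-mac and emit plain utf-8.
        iconv -f utf-8-mac -t utf-8
    else
        # Assumption: other systems return the bytes that were stored.
        cat
    fi
}

# Intended use, mirroring the last command of the script in the question:
#   find . -type f -print | to_nfc | grep -F -f word.txt
printf './word.txt\n' | to_nfc
```

With this approach the names stored in word.txt stay untouched, and only the listing that comes back from the filesystem is converted.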

Portable (cross-platform) scripts with Unicode filenames
