How to detect invalid UTF-8 Unicode/binary in a text file

Asked: 2015-04-06 04:58:28

Tags: linux bash utf-8 character-encoding

I need to detect corrupted text files that contain invalid (non-ASCII) UTF-8, Unicode, or binary characters.

�>t�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½o��������ï¿ï¿½_��������������������o����������������������￿����ß����������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~�ï¿ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½}���������}w��׿��������������������������������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~������������������������������������_������������������������������������������������������������������������������^����ï¿ï¿½s�����������������������������?�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½}����������ï¿ï¿½ï¿½ï¿½ï¿½y����������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½o�������������������������}��

What I have tried:

iconv -f utf-8 -t utf-8 -c file.csv 

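This converts the file from UTF-8 to UTF-8, with -c telling iconv to skip invalid UTF-8 characters. However, those illegal characters still end up being printed. Is there another solution in bash on Linux, or in another language?

9 answers:

Answer 0 (score: 41)

Assuming your locale is set to UTF-8 (check the output of locale), this works well to identify invalid UTF-8 sequences: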

grep -axv '.*' file.txt
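
For readers without GNU grep or a UTF-8 locale, roughly the same check can be sketched in Python. This is only an illustration, not the answerer's method: it splits the raw bytes on newlines and reports every line that fails strict UTF-8 decoding. The function name invalid_utf8_lines and the sample bytes are made up for the example.

```python
def invalid_utf8_lines(data: bytes):
    """Return (line_number, raw_line) pairs for lines that are not valid UTF-8."""
    bad = []
    # Splitting on b"\n" is safe: byte 0x0A never occurs inside a multi-byte sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")  # strict decoding raises on any invalid sequence
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    # Hypothetical sample: two clean lines and one containing stray bytes.
    sample = b"plain ascii\nvalid: \xc3\xa9\nbroken: \xff\xfe\n"
    for number, line in invalid_utf8_lines(sample):
        print(number, line)
```

Working on bytes rather than opening the file in text mode means a single bad line does not abort the whole read.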

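Answer 1 (score: 10)

I would grep for non-ASCII characters.

Using GNU grep with PCRE support (the -P option, which is not always available; on FreeBSD you can use pcregrep from the pcre2 package), you can do: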

grep -P "[\x80-\xFF]" file

See How Do I grep For all non-ASCII Characters in UNIX for reference. So, in fact, if you only want to check whether the file contains non-ASCII characters, you can just say:

if grep -qP "[\x80-\xFF]" file ; then echo "file contains non-ASCII characters"; fi
#        ^
#        silent grep

To remove these characters, you can use:

sed -i.bak 's/[\d128-\d255]//g' file

This creates a file.bak file as a backup, while the original file has its non-ASCII characters removed. Reference: Remove non-ascii characters from csv.

Answer 2 (score: 4)

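What you are looking at is by definition corrupted. Apparently you are displaying the file as it is rendered in Latin-1; the three characters ï¿½ represent the three byte values 0xEF 0xBF 0xBD. But those are the UTF-8 encoding of the Unicode REPLACEMENT CHARACTER U+FFFD, which is the result of attempting to convert bytes from an unknown or undefined encoding into UTF-8, and which would properly be displayed as � (if you have a browser from this century, you should see something like a black diamond with a question mark in it, though this also depends on the font you are using).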

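So your question about "how to detect" this particular phenomenon is easy: the Unicode code point U+FFFD is a dead giveaway, and the only possible symptom of the process you are implying.

These are not "invalid Unicode" or "invalid UTF-8", in the sense that this is a valid UTF-8 sequence which encodes a valid Unicode code point; it is just that the semantics of this particular code point are "this is a replacement character for a character which could not be represented properly", i.e. invalid input.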
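
As a minimal sketch of that detection step, the snippet below counts occurrences of U+FFFD in data that is otherwise valid UTF-8. The function name count_replacement_chars and the sample string are assumptions made for illustration only.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER, bytes EF BF BD in UTF-8


def count_replacement_chars(data: bytes) -> int:
    """Count U+FFFD characters in data that is otherwise valid UTF-8."""
    # Strict decoding: raises UnicodeDecodeError if the bytes are not valid UTF-8.
    return data.decode("utf-8").count(REPLACEMENT)


if __name__ == "__main__":
    # Hypothetical sample containing two replacement characters.
    sample = "ok \ufffd\ufffd still valid UTF-8".encode("utf-8")
    print(count_replacement_chars(sample))
```

A non-zero count means an earlier conversion step already lost information; the file itself still decodes cleanly.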

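As for how to prevent it in the first place, the answer is really simple but also rather uninformative: you need to identify when and how the incorrect encoding took place, and fix the process which produced this invalid output.

To remove the U+FFFD characters, try something like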

perl -CSD -pe 's/\x{FFFD}//g' file

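But again, the proper solution is to not generate this erroneous output in the first place.

(You have not revealed the encoding of your example data, and it may have additional corruption. If what you are showing us is a copy/paste of the UTF-8 rendering of the data, it has been "double-encoded". In other words, somebody took UTF-8 text that was already corrupted as described above and told the computer to convert it from Latin-1 to UTF-8. Undoing that is easy; just convert it "back" to Latin-1. What you obtain should then be the original UTF-8 data from before the superfluous incorrect conversion.)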
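
The "convert it back to Latin-1" step can be sketched in Python as follows. This is an illustrative sketch under the assumption that exactly one spurious Latin-1 to UTF-8 conversion happened; the function name undo_double_encoding is hypothetical.

```python
def undo_double_encoding(text: str) -> str:
    """Reverse one spurious Latin-1 -> UTF-8 conversion."""
    # Re-encoding as Latin-1 recovers the original bytes, which are then
    # decoded as the UTF-8 they were all along.
    return text.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    original = "être"
    # Simulate the mistake: UTF-8 bytes misread as Latin-1.
    mangled = original.encode("utf-8").decode("latin-1")
    print(mangled)
    print(undo_double_encoding(mangled))
```

The call raises UnicodeEncodeError if the text contains characters outside Latin-1, and UnicodeDecodeError if the recovered bytes are not valid UTF-8, so a failure here suggests the data was damaged in some other way.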

Answer 3 (score: 3)

This Perl program should remove all non-ASCII characters:

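use strict;
use warnings;

foreach my $file (@ARGV) {
    my $tmp = "super-temporary-utf8-replacement-file-which-should-never-be-used-EVER";
    open(my $in,  '<', $file) or die "Cannot open $file: $!";
    open(my $out, '>', $tmp)  or die "Cannot open $tmp: $!";
    while (<$in>) {
        s/[^[:ascii:]]//g;    # delete every non-ASCII byte on this line
        print $out $_;
    }
    close($in);
    close($out);
    # Replace the original file with the cleaned copy.
    rename $tmp, $file or die "Cannot rename $tmp to $file: $!";
}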

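It takes the files as input on the command line, like this:
perl fixutf8.pl foo bar baz
Then, for each line, it replaces every instance of a non-ASCII character with nothing (deletion). It then writes the modified line to super-temporary-utf8-replacement-file-which-should-never-be-used-EVER (named so that it will not overwrite any other file).
Finally, it renames the temporary file to the name of the original file.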

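This accepts all ASCII characters (including DEL, NUL, CR, etc.), in case you have some special use for them. If you want only printable characters, simply replace :ascii: with :print: in the s/// expression (note that :print: excludes control characters such as newline and tab, so those are stripped as well).

I hope this helps! If this is not what you were looking for, please let me know.

Answer 4 (score: 2)

Try this to find non-ASCII characters from the shell.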

Command:

$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/'  utf8.txt

Output:

2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不
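
The output above shows that perfectly valid UTF-8 text is flagged too, because "non-ASCII" is a broader test than "invalid". A rough Python sketch of the same non-ASCII check follows; the function name non_ascii_lines is made up, and bytes.isascii() assumes Python 3.7 or later.

```python
def non_ascii_lines(data: bytes):
    """Return (line_number, raw_line) pairs for lines containing any byte >= 0x80."""
    return [
        (number, line)
        for number, line in enumerate(data.split(b"\n"), start=1)
        if not line.isascii()  # bytes.isascii() is available from Python 3.7
    ]


if __name__ == "__main__":
    # Hypothetical sample: one pure-ASCII line and one valid UTF-8 line with accents.
    sample = "To be or not to be\nPour être ou ne pas être\n".encode("utf-8")
    for number, line in non_ascii_lines(sample):
        print(number, line.decode("utf-8"))
```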

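Answer 5 (score: 1)

I am probably repeating what others have already said, but I think your invalid characters still get printed because they may in fact be valid. The Universal Character Set is an attempt to cover the characters in frequent use worldwide, so that robust software can be written without relying on a particular character set.

So I think your problem may be one of the following two, assuming that your overall goal is to handle this (malicious) input from UTF files in general: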

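  1. There are invalid UTF-8 characters (better called invalid byte sequences; for this I'd like to refer to the corresponding Wikipedia article).
  2. There are characters with no equivalent in your current display font, which are substituted by a special symbol or shown as their binary ASCII equivalent (for this I'd like to refer to the following SO post: UTF-8 special characters don't show up).

So, in my opinion, you have two possible ways to handle this: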

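    1. Transform all characters from UTF-8 into something easier to handle, for example ASCII. This can be done with, for example, iconv -f utf-8 -t ascii -o file_in_ascii.txt file_in_utf8.txt. But be careful: converting from a wider character space (UTF) into a smaller one (such as ASCII) may cause data loss.
    2. Handle UTF(-8) correctly; this is how the world writes things. If you think you might have to rely on ASCII characters because of some limiting post-processing step, stop and rethink. In most cases the post-processor already supports UTF, and it is probably better to find out how to make use of it. That makes your work future-proof and bulletproof.

Handling UTF may seem tricky; the following steps may help you get UTF-ready: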

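      • Be able to display UTF correctly, or ensure that your display stack (OS, terminal and so on) can display an adequate subset of Unicode (which, of course, should meet your needs); in many cases this removes the need for a hex editor. Unfortunately UTF is too big to fit in one font, but a good starting point is this SO post: https://stackoverflow.com/questions/586503/complete-monospaced-unicode-font
      • Be able to filter invalid byte sequences. There are many ways to achieve that, and this post shows a wide variety of them: Filtering invalid utf8. I especially want to point out the 4th answer, which suggests using uconv, which lets you set a callback handler for invalid sequences.
      • Read a bit more about Unicode.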
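
Filtering invalid byte sequences can also be sketched with Python's built-in decoder error handlers, which play a role loosely similar to the callback idea mentioned for uconv. This is only an illustration; the two function names and the sample bytes are hypothetical.

```python
def drop_invalid_utf8(data: bytes) -> str:
    """Decode UTF-8, silently discarding bytes that do not form valid sequences."""
    return data.decode("utf-8", errors="ignore")


def mark_invalid_utf8(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for each invalid sequence."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical sample: valid "café" followed by a stray 0xFF byte.
    sample = b"caf\xc3\xa9 \xff ok"
    print(drop_invalid_utf8(sample))
    print(mark_invalid_utf8(sample))
```

Dropping bytes is lossy, so marking them first and inspecting the result is usually the safer way to find out what went wrong.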

Answer 6 (score: 1)

A very dirty solution in Python 3:

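# Print only the ASCII characters of cur.txt and drop everything else.
# errors="replace" keeps invalid UTF-8 bytes from raising UnicodeDecodeError;
# they become U+FFFD, which is then filtered out like any other non-ASCII character.
with open("cur.txt", "r", encoding="utf-8", errors="replace") as f:
    for line in f:
        for c in line:
            if ord(c) < 128:
                print(c, end="")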

The output should be:

>two_o~}}w~_^s?w}yo}

Answer 7 (score: 1)

The following C program detects invalid UTF-8 characters. It was tested and used on a Linux system.

/*
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.
*/

#include <stdio.h>
#include <stdlib.h>

void usage( void ) {
    printf( "Usage: test_utf8 file ...\n" );

    return;
}

int line_number = 1;
int char_number = 1;
char *file_name = NULL;

void inv_char( void ) {
    printf( "%s: line : %d - char %d\n", file_name, line_number, char_number );

    return;
}

int main( int argc, char *argv[]) {

    FILE *out = NULL;
    FILE *fh = NULL;

//    printf( "argc: %d\n", argc );

    if( argc < 2 ) {
        usage();
        exit( 1 );
    }

//    printf( "File: %s\n", argv[1] );

    file_name = argv[1];

    fh = fopen( file_name, "rb" );
    if( ! fh ) {
        printf( "Could not open file '%s'\n", file_name );
        exit( 1 );
    }

    int utf8_type = 1;
    int utf8_1 = 0;
    int utf8_2 = 0;
    int utf8_3 = 0;
    int utf8_4 = 0;
    int byte_count = 0;
    int expected_byte_count = 0;

    int cin = fgetc( fh );
    while( ! feof( fh ) ) {
        switch( utf8_type ) {
            case 1:
                if( (cin & 0x80) ) {
                    if( (cin & 0xe0) == 0xc0 ) {
                        utf8_1 = cin;
                        utf8_type = 2;
                        byte_count = 1;
                        expected_byte_count = 2;
                        break;
                    }

                    if( (cin & 0xf0) == 0xe0 ) {
                        utf8_1 = cin;
                        utf8_type = 2;
                        byte_count = 1;
                        expected_byte_count = 3;
                        break;
                    }

                    if( (cin & 0xf8) == 0xf0 ) {
                        utf8_1 = cin;
                        utf8_type = 2;
                        byte_count = 1;
                        expected_byte_count = 4;
                        break;
                    }

                    inv_char();
                    utf8_type = 1;
                    break;
                }

                break;

            case 2:
            case 3:
            case 4:
//                printf( "utf8_type - %d\n", utf8_type );
//                printf( "%c - %02x\n", cin, cin );
                if( (cin & 0xc0) == 0x80 ) {
                    if( utf8_type == expected_byte_count ) {
                        utf8_type = 1;
                        break;
                    }

                    byte_count = utf8_type;
                    utf8_type++;

                    if( utf8_type == 5 ) {
                        utf8_type = 1;
                    }

                    break;
                }

                inv_char();
                utf8_type = 1;
                break;

            default:
                inv_char();
                utf8_type = 1;
                break;
        }

        if( cin == '\n' ) {
            line_number ++;
            char_number = 0;
        }

        if( out != NULL ) {
            fputc( cin, out );
        }

//        printf( "lno: %d\n", line_number );

        cin = fgetc( fh );
        char_number++;
    }

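    /* A multi-byte sequence cut off by end of file is also invalid. */
    if( utf8_type != 1 ) {
        inv_char();
    }

    fclose( fh );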

    return 0;
}
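
The program above checks lead-byte and continuation-byte bit patterns only, so it accepts overlong encodings (such as 0xC0 0x80) and encoded surrogates, which the UTF-8 specification also forbids. As a hedged cross-check, a strict decoder can be used to locate the first offending byte. The Python sketch below is not part of the answer, and the name first_invalid_offset is made up.

```python
def first_invalid_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start  # index of the first byte the decoder rejected
    return None


if __name__ == "__main__":
    print(first_invalid_offset("být".encode("utf-8")))  # valid text
    print(first_invalid_offset(b"ab\xc0\x80"))          # overlong encoding of NUL
    print(first_invalid_offset(b"\xed\xa0\x80"))        # encoded surrogate U+D800
```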

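Answer 8 (score: 0)

...I'm trying to detect whether a file has corrupted characters. I'm also interested in deleting them.

This is easy with ugrep and takes just one line: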

ugrep -q -e "." -N "\p{Unicode}" file.csv && echo "file is corrupted"

To remove invalid Unicode characters:

ugrep "\p{Unicode}" --format="%o" file.csv

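The first command matches any character with -e ".", except valid Unicode, which -N "\p{Unicode}" specifies as a "negative pattern" to skip.

The second command matches each Unicode character with "\p{Unicode}" and writes it out with --format="%o".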