I need to detect corrupted text files that contain invalid (non-ASCII) UTF-8, Unicode, or binary characters.
�>t�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½o��������ï¿ï¿½_��������������������o����������������������￿����ß����������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~�ï¿ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½}���������}w��׿��������������������������������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~������������������������������������_������������������������������������������������������������������������������^����ï¿ï¿½s�����������������������������?�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½}����������ï¿ï¿½ï¿½ï¿½ï¿½y����������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½o�������������������������}��
What I have tried:
iconv -f utf-8 -t utf-8 -c file.csv
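This converts the file from UTF-8 to UTF-8, with -c to skip invalid UTF-8 characters. However, the illegal characters still get printed in the end. Are there other solutions, in bash on Linux or in other languages?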
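Answer 0 (score: 41)
Assuming your locale is set to UTF-8, this works well to identify invalid UTF-8 sequences (it prints every line that is not entirely valid text):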
grep -axv '.*' file.txt
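For comparison, here is a minimal Python sketch of the same kind of check. It reads the data as bytes and reports the lines that fail strict UTF-8 decoding. The function name `invalid_utf8_lines` is made up for illustration and is not part of the answer above; unlike the grep command, this does not depend on the locale.

```python
def invalid_utf8_lines(data: bytes):
    """Return (line_number, raw_line) pairs for lines that are not valid UTF-8."""
    bad = []
    # Splitting on b"\n" is safe: the byte 0x0A never occurs inside a
    # multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")  # strict decoding raises on any invalid sequence
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad
```

To use it on a file, read the file in binary mode, for example `invalid_utf8_lines(open("file.txt", "rb").read())`.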
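Answer 1 (score: 10)
For non-ASCII characters, I would use grep.
With GNU grep built with PCRE support (required for -P, which is not always available; on FreeBSD you can use pcregrep from the pcre2 package) you can do: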
grep -P "[\x80-\xFF]" file
Reference: How Do I grep For all non-ASCII Characters in UNIX. So, if you only want to check whether the file contains non-ASCII characters, you can say:
if grep -qP "[\x80-\xFF]" file ; then echo "file contains non-ASCII characters"; fi
#        ^
#        silent grep
To remove these characters, you can use:
sed -i.bak 's/[\d128-\d255]//g' file
This creates a file.bak file as a backup, while the original file has its non-ASCII characters removed. Reference: Remove non-ascii characters from csv.
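If sed is not an option, the same idea can be sketched in Python. This is only an illustration, not what the answer above uses: `strip_non_ascii` is a hypothetical helper that copies the file to a `.bak` backup first and then drops every byte with a value of 0x80 or higher.

```python
import shutil


def strip_non_ascii(path: str) -> int:
    """Remove bytes >= 0x80 from the file at `path`, keeping a `.bak` copy.

    Returns the number of bytes that were removed.
    """
    shutil.copyfile(path, path + ".bak")  # backup first, similar to sed -i.bak
    with open(path, "rb") as f:
        data = f.read()
    cleaned = bytes(b for b in data if b < 0x80)  # keep 7-bit ASCII only
    with open(path, "wb") as f:
        f.write(cleaned)
    return len(data) - len(cleaned)
```

A return value of 0 means the file was already pure ASCII. Note that this deletes legitimate non-ASCII text as well, just like the sed command.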
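Answer 2 (score: 4)
What you are looking at is, by definition, corrupted. Apparently, you are displaying the file as it is rendered in Latin-1; the three characters ï¿½ represent the three byte values 0xEF 0xBF 0xBD. But those are the UTF-8 encoding of the Unicode REPLACEMENT CHARACTER U+FFFD, which is the result of attempting to convert bytes from an unknown or undefined encoding into UTF-8, and which would properly be displayed as � (if you have a browser from this century, you should see something like a black diamond with a question mark in it; but this also depends on the font you are using, etc.).
So the question of "how to detect" this particular phenomenon is easy: the Unicode code point U+FFFD is a dead giveaway, and the only possible symptom of the process you are implying.
This is not "invalid Unicode" or "invalid UTF-8" in the strict sense, because it is a valid UTF-8 sequence that encodes a valid Unicode code point; it is just that the semantics of this particular code point is "this is a replacement character for a character which could not be represented properly", i.e. invalid input.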
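A small Python sketch may make the byte values concrete. It shows that U+FFFD encodes to 0xEF 0xBF 0xBD, that those bytes look like ï¿½ when misread as Latin-1, and one common way the character gets introduced (lossy decoding with `errors="replace"`). The names `REPLACEMENT` and `contains_replacement` are chosen here for illustration only.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def contains_replacement(text: str) -> bool:
    """Return True if the decoded text carries the U+FFFD giveaway."""
    return REPLACEMENT in text


utf8_bytes = REPLACEMENT.encode("utf-8")   # b'\xef\xbf\xbd'
as_latin1 = utf8_bytes.decode("latin-1")   # the three characters ï¿½

# A Latin-1 'é' (0xE9) is not valid UTF-8 on its own; decoding it with
# errors="replace" silently substitutes U+FFFD for the bad byte.
lossy = b"caf\xe9".decode("utf-8", errors="replace")
```

After this runs, `lossy` is perfectly valid Unicode, but the original byte is gone for good, which is why the damage has to be fixed at the source.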
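As for how to prevent it in the first place, the answer is really simple, but also rather uninformative: you need to identify when and how the incorrect encoding took place, and fix the process that produced this invalid output.
To just remove the U+FFFD characters, try something like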
perl -CSD -pe 's/\x{FFFD}//g' file
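But again, the proper solution is to not generate this erroneous output in the first place.
(You have not revealed the encoding of your example data. It may have additional corruption. If what you are showing us is a copy/paste of the UTF-8 rendering of the data, it has been "double-encoded". In other words, somebody took UTF-8 text that was already corrupted as described above and told the computer to convert it from Latin-1 to UTF-8. Undoing that is easy: just convert it "back" to Latin-1. What you obtain should then be the original UTF-8 data as it was before the superfluous incorrect conversion.)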
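The "convert it back" step can be sketched in Python as follows. This assumes the superfluous conversion really was Latin-1 to UTF-8 (and not, say, Windows-1252, which would need a different codec); the function name `undo_double_encoding` is hypothetical.

```python
def undo_double_encoding(data: bytes) -> bytes:
    """Reverse one superfluous Latin-1 -> UTF-8 conversion.

    Raises UnicodeDecodeError if `data` is not valid UTF-8, and
    UnicodeEncodeError if it contains characters outside Latin-1,
    which suggests the data was not double-encoded in this way.
    """
    return data.decode("utf-8").encode("latin-1")


# Simulate the damage: UTF-8 bytes misread as Latin-1 and encoded again.
original = "caf\ufffd".encode("utf-8")                 # b'caf\xef\xbf\xbd'
doubled = original.decode("latin-1").encode("utf-8")   # b'caf\xc3\xaf\xc2\xbf\xc2\xbd'
restored = undo_double_encoding(doubled)               # equal to `original`
```

The restored data still contains U+FFFD wherever the first, lossy conversion happened; only the second layer of damage is reversible.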
Answer 3 (score: 3)
This Perl program should remove all non-ASCII characters:
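use strict;
use warnings;

my $tmp = "super-temporary-utf8-replacement-file-which-should-never-be-used-EVER";
foreach my $file (@ARGV) {
    open(my $in, '<', $file) or die "Cannot open $file: $!";
    open(my $out, '>', $tmp) or die "Cannot open $tmp: $!";
    while (<$in>) {
        s/[^[:ascii:]]//g;    # delete every non-ASCII character
        print $out $_;
    }
    close($in);
    close($out);
    rename $tmp, $file or die "Cannot rename $tmp to $file: $!";
}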
It takes the files as arguments on the command line, like so:
perl fixutf8.pl foo bar baz
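Then, for each line, it replaces every non-ASCII character with nothing (that is, deletes it).
It then writes the modified line to super-temporary-utf8-replacement-file-which-should-never-be-used-EVER (named so that it does not clobber any other file).
Afterwards, it renames the temporary file to the name of the original file.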
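This accepts ALL ASCII characters (including DEL, NUL, CR, etc.), in case you have some special use for them. If you want only printable characters, simply replace :ascii: with :print: in the s/// expression.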
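To illustrate the difference between "all ASCII" and "printable only", here is a rough Python sketch rather than Perl. It spells out the printable ASCII range 0x20 to 0x7E explicitly; the function name and the `keep` parameter (which preserves newline and tab by default) are assumptions made for this example and have no counterpart in the Perl program.

```python
def strip_non_printable_ascii(text: str, keep: str = "\n\t") -> str:
    """Keep printable ASCII (0x20..0x7E) plus any characters listed in `keep`."""
    return "".join(c for c in text if 0x20 <= ord(c) <= 0x7E or c in keep)


sample = "a\x00b\x7f\u00e9\tc\n"
cleaned = strip_non_printable_ascii(sample)             # NUL, DEL and é are dropped
strict = strip_non_printable_ascii(sample, keep="")     # tab and newline are dropped too
```

Without something like `keep`, a printable-only filter also removes line endings, which is rarely what you want for a text file.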
I hope this helps! Please let me know if this is not what you were looking for.
Answer 4 (score: 2)
Try this to find non-ASCII characters from the shell.
Command:
$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/' utf8.txt
Output:
2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不
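Answer 5 (score: 1)
I am probably repeating what others have already said. But I think your invalid characters still get printed because they may actually be valid. The Universal Character Set is an attempt to reference the characters in frequent use worldwide, so that robust software can be written that does not rely on one particular character set.
So I think your problem may be one of the following two, assuming your overall goal is to handle this (malicious) input from UTF files in general:
So, in my opinion, you have two possible ways to handle this: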
iconv -f utf-8 -t ascii -o file_in_ascii.txt file_in_utf8.txt
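But be careful: converting from a wider character space (UTF) to a narrower one (ASCII) may cause data loss. Handling UTF may seem tricky; the following steps may help you become UTF-ready: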
uconv lets you set a callback handler for invalid sequences.
Answer 6 (score: 1)
A very dirty solution in Python 3:
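# errors="ignore" skips undecodable bytes instead of raising UnicodeDecodeError
with open("cur.txt", "r", encoding="utf-8", errors="ignore") as f:
    for line in f:
        for c in line:
            # print only ASCII characters (code points below 128)
            if ord(c) < 128:
                print(c, end="")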
The output should be:
>two_o~}}w~_^s?w}yo}
Answer 7 (score: 1)
The following C program detects invalid UTF-8 characters. It has been tested and used on a Linux system.
/*
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
#include <stdio.h>
#include <stdlib.h>

void usage( void ) {
    printf( "Usage: test_utf8 file ...\n" );
    return;
}

int line_number = 1;
int char_number = 1;
char *file_name = NULL;

void inv_char( void ) {
    printf( "%s: line : %d - char %d\n", file_name, line_number, char_number );
    return;
}

int main( int argc, char *argv[]) {
    FILE *out = NULL;
    FILE *fh = NULL;
    // printf( "argc: %d\n", argc );
    if( argc < 2 ) {
        usage();
        exit( 1 );
    }
    // printf( "File: %s\n", argv[1] );
    file_name = argv[1];
    fh = fopen( file_name, "rb" );
    if( ! fh ) {
        printf( "Could not open file '%s'\n", file_name );
        exit( 1 );
    }
    int utf8_type = 1;
    int utf8_1 = 0;
    int utf8_2 = 0;
    int utf8_3 = 0;
    int utf8_4 = 0;
    int byte_count = 0;
    int expected_byte_count = 0;
    int cin = fgetc( fh );
    while( ! feof( fh ) ) {
        switch( utf8_type ) {
            case 1:
                if( (cin & 0x80) ) {
                    if( (cin & 0xe0) == 0xc0 ) {
                        utf8_1 = cin;
                        utf8_type = 2;
                        byte_count = 1;
                        expected_byte_count = 2;
                        break;
                    }
                    if( (cin & 0xf0) == 0xe0 ) {
                        utf8_1 = cin;
                        utf8_type = 2;
                        byte_count = 1;
                        expected_byte_count = 3;
                        break;
                    }
                    if( (cin & 0xf8) == 0xf0 ) {
                        utf8_1 = cin;
                        utf8_type = 2;
                        byte_count = 1;
                        expected_byte_count = 4;
                        break;
                    }
                    inv_char();
                    utf8_type = 1;
                    break;
                }
                break;
            case 2:
            case 3:
            case 4:
                // printf( "utf8_type - %d\n", utf8_type );
                // printf( "%c - %02x\n", cin, cin );
                if( (cin & 0xc0) == 0x80 ) {
                    if( utf8_type == expected_byte_count ) {
                        utf8_type = 1;
                        break;
                    }
                    byte_count = utf8_type;
                    utf8_type++;
                    if( utf8_type == 5 ) {
                        utf8_type = 1;
                    }
                    break;
                }
                inv_char();
                utf8_type = 1;
                break;
            default:
                inv_char();
                utf8_type = 1;
                break;
        }
        if( cin == '\n' ) {
            line_number ++;
            char_number = 0;
        }
        if( out != NULL ) {
            fputc( cin, out );
        }
        // printf( "lno: %d\n", line_number );
        cin = fgetc( fh );
        char_number++;
    }
    fclose( fh );
    return 0;
}
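The lead-byte masks used in the program above can be illustrated with a compact Python sketch. This is not a port of the C code, only a hedged illustration of the same idea with hypothetical names (`expected_length`, `invalid_offsets`). Like the C program, it checks only the lead-byte and continuation-byte bit patterns; it does not reject overlong encodings or encoded surrogates, which a strict UTF-8 decoder would.

```python
def expected_length(lead: int) -> int:
    """Return the sequence length announced by a lead byte, or 0 if invalid."""
    if (lead & 0x80) == 0x00:
        return 1  # 0xxxxxxx: plain ASCII
    if (lead & 0xE0) == 0xC0:
        return 2  # 110xxxxx
    if (lead & 0xF0) == 0xE0:
        return 3  # 1110xxxx
    if (lead & 0xF8) == 0xF0:
        return 4  # 11110xxx
    return 0      # stray continuation byte, or 0xF8..0xFF


def invalid_offsets(data: bytes) -> list:
    """Return the byte offsets at which a malformed or truncated sequence starts."""
    bad = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        tail = data[i + 1:i + n]
        # Every byte after the lead byte must look like 10xxxxxx.
        if n == 0 or len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            bad.append(i)
            i += 1  # resynchronise on the next byte
            continue
        i += n
    return bad
```

For example, `invalid_offsets(b"a\xffb")` reports offset 1, while valid multi-byte text yields an empty list.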
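Answer 8 (score: 0)
... I am trying to detect whether a file has corrupted characters. I am also interested in removing them.
This is easy with ugrep and takes just one line: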
ugrep -q -e "." -N "\p{Unicode}" file.csv && echo "file is corrupted"
To remove the invalid Unicode characters:
ugrep "\p{Unicode}" --format="%o" file.csv
The first command matches any character with -e "." except valid Unicode characters, which are excluded with -N "\p{Unicode}", a "negative pattern" to skip.
The second command matches each Unicode character with "\p{Unicode}" and writes it out with --format="%o".
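Where ugrep is not installed, a similar "detect, then drop" pair can be sketched in Python using the decoder's error handling. This is an alternative technique, not what ugrep does internally, and the function names are hypothetical. Note that it keeps U+FFFD characters that are already in the file, since those are valid UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def drop_invalid_utf8(data: bytes) -> bytes:
    """Return `data` with every undecodable byte sequence removed."""
    # errors="ignore" skips invalid bytes instead of raising an exception.
    return data.decode("utf-8", errors="ignore").encode("utf-8")
```

For a file, read it in binary mode, call `is_valid_utf8` to decide whether it is corrupted, and write the result of `drop_invalid_utf8` to a new file rather than overwriting the original.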