我有一个使用OCR软件创建的文本文件 - 大小约为1兆字节。 一些不常见的字符出现在整个文档中,其中大多数都是OCR错误。
我想找到文档中使用的所有字符,以便轻松发现错误(例如UNIQ
命令,但是对于字符,而不是行。)
我在Ubuntu上。 我应该使用什么Unix命令来显示文本文件中使用的所有字符?
答案 0 :(得分:9)
这应该做你想要的:
cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c
前提是sed
将文件中的每个字符单独放在一行上,然后通常的sort | uniq -c
序列除了发生的每个唯一字符之外的所有字符,并提供计数每次发生多少次。
此外,您可以将| sort -n
附加到整个序列的末尾,以按输出每个字符的次数对输出进行排序。例如:
$ echo hello | sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n
1
1 e
1 h
1 o
2 l
答案 1 :(得分:1)
这样做:
#!/usr/bin/perl -n
#
# charcounts - show how many times each code point is used
# Tom Christiansen <tchrist@perl.com>
use open ":utf8";
++$seen{ ord() } for split //;
END {
for my $cp (sort {$seen{$b} <=> $seen{$a}} keys %seen) {
printf "%04X %d\n", $cp, $seen{$cp};
}
}
自行运行,该程序产生:
$ charcounts /tmp/charcounts | head
0020 46
0065 20
0073 18
006E 15
000A 14
006F 12
0072 11
0074 10
0063 9
0070 9
如果你想要角色的文字和/或名字,也很容易添加。
如果你想要更复杂的东西,这个程序通过Unicode属性计算出字符。它可能足以满足您的目的,如果没有,您应该能够适应它。
#!/usr/bin/perl
#
# unicats - show character distribution by Unicode character property
# Tom Christiansen <tchrist@perl.com>
use strict;
use warnings qw<FATAL all>;
use open ":utf8";
my %cats;
our %Prop_Table;
build_prop_table();
if (@ARGV == 0 && -t STDIN) {
warn <<"END_WARNING";
$0: reading UTF-8 character data directly from your tty
\tSo please type stuff...
\t and then hit your tty's EOF sequence when done.
END_WARNING
}
while (<>) {
for (split(//)) {
$cats{Total}++;
if (/\p{ASCII}/) { $cats{ASCII}++ }
else { $cats{Unicode}++ }
my $gcat = get_general_category($_);
$cats{$gcat}++;
my $subcat = get_general_subcategory($_);
$cats{$subcat}++;
}
}
my $width = length $cats{Total};
my $mask = "%*d %s\n";
for my $cat(qw< Total ASCII Unicode >) {
printf $mask, $width => $cats{$cat} || 0, $cat;
}
print "\n";
my @catnames = qw[
L Lu Ll Lt Lm Lo
N Nd Nl No
S Sm Sc Sk So
P Pc Pd Ps Pe Pi Pf Po
M Mn Mc Me
Z Zs Zl Zp
C Cc Cf Cs Co Cn
];
#for my $cat (sort keys %cats) {
for my $cat (@catnames) {
next if length($cat) > 2;
next unless $cats{$cat};
my $prop = length($cat) == 1
? ( " " . q<\p> . $cat )
: ( q<\p> . "{$cat}" . "\t" )
;
my $desc = sprintf("%-6s %s", $prop, $Prop_Table{$cat});
printf $mask, $width => $cats{$cat}, $desc;
}
exit;
sub get_general_category {
my $_ = shift();
return "L" if /\pL/;
return "S" if /\pS/;
return "P" if /\pP/;
return "N" if /\pN/;
return "C" if /\pC/;
return "M" if /\pM/;
return "Z" if /\pZ/;
die "not reached one: $_";
}
sub get_general_subcategory {
my $_ = shift();
return "Lu" if /\p{Lu}/;
return "Ll" if /\p{Ll}/;
return "Lt" if /\p{Lt}/;
return "Lm" if /\p{Lm}/;
return "Lo" if /\p{Lo}/;
return "Mn" if /\p{Mn}/;
return "Mc" if /\p{Mc}/;
return "Me" if /\p{Me}/;
return "Nd" if /\p{Nd}/;
return "Nl" if /\p{Nl}/;
return "No" if /\p{No}/;
return "Pc" if /\p{Pc}/;
return "Pd" if /\p{Pd}/;
return "Ps" if /\p{Ps}/;
return "Pe" if /\p{Pe}/;
return "Pi" if /\p{Pi}/;
return "Pf" if /\p{Pf}/;
return "Po" if /\p{Po}/;
return "Sm" if /\p{Sm}/;
return "Sc" if /\p{Sc}/;
return "Sk" if /\p{Sk}/;
return "So" if /\p{So}/;
return "Zs" if /\p{Zs}/;
return "Zl" if /\p{Zl}/;
return "Zp" if /\p{Zp}/;
return "Cc" if /\p{Cc}/;
return "Cf" if /\p{Cf}/;
return "Cs" if /\p{Cs}/;
return "Co" if /\p{Co}/;
return "Cn" if /\p{Cn}/;
die "not reached two: <$_> " . sprintf("U+%vX", $_);
}
sub build_prop_table {
for my $line (<<"End_of_Property_List" =~ m{ \S .* \S }gx) {
L Letter
Lu Uppercase_Letter
Ll Lowercase_Letter
Lt Titlecase_Letter
Lm Modifier_Letter
Lo Other_Letter
M Mark (combining characters, including diacritics)
Mn Nonspacing_Mark
Mc Spacing_Mark
Me Enclosing_Mark
N Number
Nd Decimal_Number (also Digit)
Nl Letter_Number
No Other_Number
P Punctuation
Pc Connector_Punctuation
Pd Dash_Punctuation
Ps Open_Punctuation
Pe Close_Punctuation
Pi Initial_Punctuation (may behave like Ps or Pe depending on usage)
Pf Final_Punctuation (may behave like Ps or Pe depending on usage)
Po Other_Punctuation
S Symbol
Sm Math_Symbol
Sc Currency_Symbol
Sk Modifier_Symbol
So Other_Symbol
Z Separator
Zs Space_Separator
Zl Line_Separator
Zp Paragraph_Separator
C Other (means not L/N/P/S/Z)
Cc Control (also Cntrl)
Cf Format
Cs Surrogate (not usable)
Co Private_Use
Cn Unassigned
End_of_Property_List
my($short_prop, $long_prop) = $line =~ m{
\b
( \p{Lu} \p{Ll} ? )
\s +
( \p{Lu} [\p{L&}_] + )
\b
}x;
$Prop_Table{$short_prop} = $long_prop;
}
}
例如:
$ unicats book.txt
2357232 Total
2357199 ASCII
33 Unicode
1604949 \pL Letter
74455 \p{Lu} Uppercase_Letter
1530485 \p{Ll} Lowercase_Letter
9 \p{Lo} Other_Letter
10676 \pN Number
10676 \p{Nd} Decimal_Number
19679 \pS Symbol
10705 \p{Sm} Math_Symbol
8365 \p{Sc} Currency_Symbol
603 \p{Sk} Modifier_Symbol
6 \p{So} Other_Symbol
111899 \pP Punctuation
2996 \p{Pc} Connector_Punctuation
6145 \p{Pd} Dash_Punctuation
11392 \p{Ps} Open_Punctuation
11371 \p{Pe} Close_Punctuation
79995 \p{Po} Other_Punctuation
548529 \pZ Separator
548529 \p{Zs} Space_Separator
61500 \pC Other
61500 \p{Cc} Control
答案 2 :(得分:0)
至于使用* nix命令,上面的答案是好的,但它没有得到使用统计数据。
但是,如果你真的想要在文件上使用统计数据(比如最稀有,中位数,最常用等),那么Python应该这样做。
def get_char_counts(fname):
f = open(fname)
usage = {}
for c in f.read():
if c not in usage:
usage.update({c:1})
else:
usage[c] += 1
return usage