如何使用Unix命令在文本文件中收集字符使用情况统计信息?

时间:2010-11-13 18:45:12

标签: unix command-line text statistics ocr

我有一个使用OCR软件创建的文本文件 - 大小约为1兆字节。 一些不常见的字符出现在整个文档中,其中大多数都是OCR错误。

我想找到文档中使用的所有字符,以便轻松发现错误(例如UNIQ命令,但是对于字符,而不是行。)

我在Ubuntu上。 我应该使用什么Unix命令来显示文本文件中使用的所有字符?

3 个答案:

答案 0 :(得分:9)

这应该做你想要的:

cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c

前提是sed将文件中的每个字符单独放在一行上,然后通常的sort | uniq -c序列除了发生的每个唯一字符之外的所有字符,并提供计数每次发生多少次。

此外,您可以将| sort -n附加到整个序列的末尾,以按输出每个字符的次数对输出进行排序。例如:

$ echo hello |  sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n
  1 
  1 e
  1 h
  1 o
  2 l

答案 1 :(得分:1)

这样做:

#!/usr/bin/perl -n
#
# charcounts - show how many times each code point is used
# Tom Christiansen <tchrist@perl.com>

use open ":utf8";

++$seen{ ord() } for split //;

END {
    for my $cp (sort {$seen{$b} <=> $seen{$a}} keys %seen) {
        printf "%04X %d\n", $cp, $seen{$cp};
    }
}

自行运行,该程序产生:

$ charcounts /tmp/charcounts | head
0020 46
0065 20
0073 18
006E 15
000A 14
006F 12
0072 11
0074 10
0063 9
0070 9

如果你想要角色的文字和/或名字,也很容易添加。

如果你想要更复杂的东西,这个程序通过Unicode属性计算出字符。它可能足以满足您的目的,如果没有,您应该能够适应它。

#!/usr/bin/perl
#
# unicats - show character distribution by Unicode character property
# Tom Christiansen <tchrist@perl.com>

use strict;
use warnings qw<FATAL all>;

use open ":utf8";

my %cats;

our %Prop_Table;
build_prop_table();

if (@ARGV == 0 && -t STDIN) {
    warn <<"END_WARNING";
$0: reading UTF-8 character data directly from your tty
\tSo please type stuff...
\t and then hit your tty's EOF sequence when done.
END_WARNING

} 

while (<>) {
    for (split(//)) {

        $cats{Total}++;

        if (/\p{ASCII}/) { $cats{ASCII}++   } 
        else             { $cats{Unicode}++ } 

        my $gcat   = get_general_category($_);
        $cats{$gcat}++;

        my $subcat = get_general_subcategory($_);
        $cats{$subcat}++;

    } 
} 

my $width = length $cats{Total};

my $mask = "%*d %s\n";

for my $cat(qw< Total ASCII Unicode >) { 
    printf $mask, $width => $cats{$cat} || 0, $cat; 
}
print "\n";

my @catnames = qw[
    L Lu Ll Lt Lm Lo
    N Nd Nl No
    S Sm Sc Sk So
    P Pc Pd Ps Pe Pi Pf Po
    M Mn Mc Me
    Z Zs Zl Zp
    C Cc Cf Cs Co Cn
];

#for my $cat (sort keys %cats) {
for my $cat (@catnames) {
    next if length($cat) > 2;
    next unless $cats{$cat};

    my $prop = length($cat) == 1 
                 ? ( " " . q<\p> .   $cat          )
                 : (       q<\p> . "{$cat}" . "\t" )
             ;

    my $desc = sprintf("%-6s %s", $prop, $Prop_Table{$cat});

    printf $mask, $width => $cats{$cat}, $desc;
} 

exit;

sub get_general_category {
    my $_ = shift();
    return "L" if /\pL/;
    return "S" if /\pS/;
    return "P" if /\pP/;
    return "N" if /\pN/;
    return "C" if /\pC/;
    return "M" if /\pM/;
    return "Z" if /\pZ/;

    die "not reached one: $_";
} 

sub get_general_subcategory {
    my $_ = shift();

    return "Lu" if /\p{Lu}/;
    return "Ll" if /\p{Ll}/;
    return "Lt" if /\p{Lt}/;
    return "Lm" if /\p{Lm}/;
    return "Lo" if /\p{Lo}/;

    return "Mn" if /\p{Mn}/;
    return "Mc" if /\p{Mc}/;
    return "Me" if /\p{Me}/;

    return "Nd" if /\p{Nd}/;
    return "Nl" if /\p{Nl}/;
    return "No" if /\p{No}/;

    return "Pc" if /\p{Pc}/;
    return "Pd" if /\p{Pd}/;
    return "Ps" if /\p{Ps}/;
    return "Pe" if /\p{Pe}/;
    return "Pi" if /\p{Pi}/;
    return "Pf" if /\p{Pf}/;
    return "Po" if /\p{Po}/;

    return "Sm" if /\p{Sm}/;
    return "Sc" if /\p{Sc}/;
    return "Sk" if /\p{Sk}/;
    return "So" if /\p{So}/;

    return "Zs" if /\p{Zs}/;
    return "Zl" if /\p{Zl}/;
    return "Zp" if /\p{Zp}/;

    return "Cc" if /\p{Cc}/;
    return "Cf" if /\p{Cf}/;
    return "Cs" if /\p{Cs}/;
    return "Co" if /\p{Co}/;
    return "Cn" if /\p{Cn}/;

    die "not reached two: <$_> " . sprintf("U+%vX", $_);

}

sub build_prop_table { 

    for my $line (<<"End_of_Property_List" =~ m{ \S .* \S }gx) {

       L           Letter
       Lu          Uppercase_Letter
       Ll          Lowercase_Letter
       Lt          Titlecase_Letter
       Lm          Modifier_Letter
       Lo          Other_Letter

       M           Mark  (combining characters, including diacritics)
       Mn          Nonspacing_Mark
       Mc          Spacing_Mark
       Me          Enclosing_Mark

       N           Number
       Nd          Decimal_Number (also Digit)
       Nl          Letter_Number
       No          Other_Number

       P           Punctuation
       Pc          Connector_Punctuation
       Pd          Dash_Punctuation
       Ps          Open_Punctuation
       Pe          Close_Punctuation
       Pi          Initial_Punctuation (may behave like Ps or Pe depending on usage)
       Pf          Final_Punctuation (may behave like Ps or Pe depending on usage)
       Po          Other_Punctuation

       S           Symbol
       Sm          Math_Symbol
       Sc          Currency_Symbol
       Sk          Modifier_Symbol
       So          Other_Symbol

       Z           Separator
       Zs          Space_Separator
       Zl          Line_Separator
       Zp          Paragraph_Separator

       C           Other (means not L/N/P/S/Z)
       Cc          Control (also Cntrl)
       Cf          Format
       Cs          Surrogate   (not usable)
       Co          Private_Use
       Cn          Unassigned

End_of_Property_List

            my($short_prop, $long_prop) = $line =~ m{ 
                \b 
                 ( \p{Lu}  \p{Ll}   ? ) 
                \s + 
                 ( \p{Lu} [\p{L&}_] + )
                \b
            }x;

            $Prop_Table{$short_prop} = $long_prop;

    }

}

例如:

$ unicats book.txt
2357232 Total
2357199 ASCII
     33 Unicode

1604949  \pL   Letter
  74455 \p{Lu}   Uppercase_Letter
1530485 \p{Ll}   Lowercase_Letter
      9 \p{Lo}   Other_Letter
  10676  \pN   Number
  10676 \p{Nd}   Decimal_Number
  19679  \pS   Symbol
  10705 \p{Sm}   Math_Symbol
   8365 \p{Sc}   Currency_Symbol
    603 \p{Sk}   Modifier_Symbol
      6 \p{So}   Other_Symbol
 111899  \pP   Punctuation
   2996 \p{Pc}   Connector_Punctuation
   6145 \p{Pd}   Dash_Punctuation
  11392 \p{Ps}   Open_Punctuation
  11371 \p{Pe}   Close_Punctuation
  79995 \p{Po}   Other_Punctuation
 548529  \pZ   Separator
 548529 \p{Zs}   Space_Separator
  61500  \pC   Other
  61500 \p{Cc}   Control

答案 2 :(得分:0)

至于使用* nix命令,上面的答案是好的,但它没有得到使用统计数据。

但是,如果你真的想要在文件上使用统计数据(比如最稀有,中位数,最常用等),那么Python应该这样做。

def get_char_counts(fname):
    f = open(fname)
    usage = {}
    for c in f.read():
        if c not in usage:
            usage.update({c:1})
        else:
            usage[c] += 1
    return usage