Question

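Does anyone have a code sample for a Unicode-aware strings program? The programming language doesn't matter. I essentially want something that does the same thing as the Unix command "strings", but that also works on Unicode text (UTF-16 or UTF-8), pulling out runs of English-language characters and punctuation. (I only care about English characters, not any other alphabet.)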

Thanks!

Answer 1

Do you just want to use it, or do you insist on having the code for some reason?

On my Debian system, the strings command seems to handle this out of the box. See this excerpt from the man page:

  --encoding=encoding
       Select the character encoding of the strings that are to be found.  Possible values for encoding are: s = single-7-bit-byte characters (ASCII, ISO  8859,
       etc.,  default),  S  = single-8-bit-byte characters, b = 16-bit bigendian, l = 16-bit littleendian, B = 32-bit bigendian, L = 32-bit littleendian. Useful
       for finding wide character strings.

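Edit: OK. I don't know C#, so this may be a bit hairy, but basically you need to search for sequences of alternating zero bytes and English characters.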

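byte b;
int i = 0;                        // length of the current run, in characters
while (!endOfInput()) {
  b = getNextByte();
LoopBegin:
  if (!isEnglish(b)) {
    if (i > 0) { /* report successful match of length i */ }
    i = 0;
    continue;
  }
  if (endOfInput()) break;
  if ((b = getNextByte()) != 0) { // high byte is not zero: the run ends here
    if (i > 0) { /* report successful match of length i */ }
    i = 0;
    goto LoopBegin;               // re-examine b as a possible low byte
  }
  i++;                            // found another character
}
if (i > 0) { /* report a run that reaches the end of input */ }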

This should work for little-endian input.
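
The same idea can be sketched with a regular expression instead of a byte-by-byte loop. This is only a minimal Python sketch, not the answerer's code: the function name utf16_ascii_runs, the min_len default of 4, and the choice of 0x20 to 0x7E as "English characters" are all assumptions made for illustration.

```python
import re

def utf16_ascii_runs(data: bytes, min_len: int = 4, little_endian: bool = True):
    """Return (offset, text) pairs for runs of printable ASCII stored as UTF-16."""
    char = rb"[\x20-\x7e]"  # assumed definition of an "English" character
    # One UTF-16 code unit: the ASCII byte plus a zero byte, ordered by endianness.
    unit = char + rb"\x00" if little_endian else rb"\x00" + char
    pattern = re.compile(rb"(?:" + unit + rb"){" + str(min_len).encode() + rb",}")
    codec = "utf-16-le" if little_endian else "utf-16-be"
    return [(m.start(), m.group().decode(codec)) for m in pattern.finditer(data)]

if __name__ == "__main__":
    blob = b"\x01\x02" + "Hello, world".encode("utf-16-le") + b"\xff\xfe"
    for offset, text in utf16_ascii_runs(blob):
        print(offset, text)
```

Note that the pattern matches at any byte offset, so scanning little-endian data with little_endian=False can still report shifted fragments of a string; picking the right byte order is left to the caller.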

Answer 2

I had a similar problem and tried "strings -e ...", but I only found options for fixed-width character encodings. (UTF-8 is a variable-width encoding.)

Remember that, by default, characters outside ASCII need extra strings options. This covers almost all strings in languages other than English.

However, the "-e S" (single 8-bit characters) output does include UTF-8 characters.
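
To see why an 8-bit mode is needed, here is a small Python sketch (the helper name high_bit_bytes is made up for this illustration). In UTF-8, every byte of a non-ASCII character has its high bit set, so a 7-bit scan breaks the run at each accented letter.

```python
def high_bit_bytes(text: str) -> list[int]:
    """Return the UTF-8 bytes of text that fall outside 7-bit ASCII."""
    return [b for b in text.encode("utf-8") if b >= 0x80]

if __name__ == "__main__":
    word = "ação"
    print(word.encode("utf-8"))                    # the raw UTF-8 bytes
    print([hex(b) for b in high_bit_bytes(word)])  # the bytes a 7-bit scan rejects
```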

I wrote a very simple (opinionated) Perl script that applies "strings -e S ... | iconv ..." to the input files.

I think it is easy to adapt to your specific restrictions. Usage: utf8strings [options] file*

#!/usr/bin/perl -s

our ($all,$windows,$enc);   ## use -all to ignore the "3-letter word" restriction
use strict;
use utf8::all;

$enc = "ms-ansi" if     $windows;  ##
$enc = "utf8"    unless $enc    ;  ## default encoding=utf8
my $iconv = "iconv -c -f $enc -t utf8 |";

for (@ARGV){ s/(.*)/strings -e S '$1'| $iconv/;}

my $word=qr/[a-zçáéíóúâêôàèìòùüãõ]{3}/i;   # adapt this to your case

while(<>){
   # next if /regular expressions for common garbage/; 
   print    if ($all or /$word/);
}

In some cases, this approach produces some extra garbage.
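
For readers without Perl, the pipeline can be approximated in pure Python. This is a rough sketch under stated assumptions, not a port of the script: the name utf8_strings, the minimum run length of 4 bytes, and the byte ranges that stand in for "strings -e S" are my own choices, and decoding with errors="ignore" only loosely mirrors "iconv -c".

```python
import re

# Runs of printable ASCII or high-bit bytes, roughly what "strings -e S" keeps.
RUN = re.compile(rb"[\x20-\x7e\x80-\xff]{4,}")
# Same "3 letters word" filter as the Perl script; adapt it to your case.
WORD = re.compile(r"[a-zçáéíóúâêôàèìòùüãõ]{3}", re.IGNORECASE)

def utf8_strings(data: bytes, keep_all: bool = False) -> list[str]:
    """Return decoded runs from data, optionally filtered by the word pattern."""
    found = []
    for m in RUN.finditer(data):
        # Drop bytes that are not valid UTF-8, similar in spirit to "iconv -c".
        text = m.group().decode("utf-8", errors="ignore")
        if text and (keep_all or WORD.search(text)):
            found.append(text)
    return found

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            for line in utf8_strings(fh.read()):
                print(line)
```

As with the Perl version, byte sequences that happen to decode as valid UTF-8 will still slip through as garbage.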
