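Is there a way to enumerate all of a character's Unicode properties in Ruby?
I can use Ruby 1.9's Regexp class to test whether a given character has a particular property (for example, some_char =~ /\p{P}/ to test whether some_char is punctuation), but since a character can have several properties (for example, ( is both punctuation and ASCII), it would be nice to be able to get a list of all of a character's properties.
I could do this by hand with unicode_data.txt, or whatever it's called, but it seems like something that has probably already been done somewhere. UnicodeUtils doesn't seem to have anything for this, and a Google search turned up nothing obvious. Thanks!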
Answer 0 (score: 5)
You can run my uniprops script.
$ uniprops -p delta greek:delta Greek:Delta
U+1E9F ‹ẟ› \N{ LATIN SMALL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+03B4 ‹δ› \N{ GREEK SMALL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+0394 ‹Δ› \N{ GREEK CAPITAL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
$ uniprops \# ç π
U+0023 ‹#› \N{ NUMBER SIGN }:
\pP \p{Po}
All Any ASCII Assigned Common Zyyy Po P Gr_Base
Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
Pattern_Syntax PatSyn PosixGraph PosixPrint PosixPunct
Print Punctuation
U+00E7 ‹ç› \N{ LATIN SMALL LETTER C WITH CEDILLA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased
Cased_Letter LC Changes_When_Casemapped CWCM
Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC
ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower
Lowercase Print Word XID_Continue XIDC XID_Start XIDS
U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek
InGreek Cased Cased_Letter LC Changes_When_Casemapped CWCM
Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
L Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic
ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter
Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
$ uniprops -a 'MICRO SIGN'
U+00B5 ‹µ› \N{MICRO SIGN}
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM
Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Common Zyyy Ll L Gr_Base Grapheme_Base
Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word
XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Latin_1 Block=Latin_1_Supplement BLK=Latin1 Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=Com
Decomposition_Type=Compat DT=Com Decomposition_Type=Non_Canon Decomposition_Type=Non_Canonical DT=NonCanon East_Asian_Width=Neutral
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic
LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1
Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0
Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=LO Sentence_Break=Lower SB=LO
Word_Break=ALetter WB=LE Word_Break=LE _X_Begin
$ uniprops -a 2011
U+2011 ‹‑› \N{NON-BREAKING HYPHEN}
\pP \p{Pd}
All Any Assigned InGeneralPunctuation Changes_When_NFKC_Casefolded CWKCF Common Zyyy Dash Dash_Punctuation Pd P General_Punctuation
Gr_Base Grapheme_Base Graph GrBase Punct Pat_Syn Pattern_Syntax PatSyn Print Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct
Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=General_Punctuation Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=Nb
Decomposition_Type=Nobreak DT=Nb Decomposition_Type=Non_Canon Decomposition_Type=Non_Canonical DT=NonCanon East_Asian_Width=Neutral
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=GL Line_Break=Glue LB=GL
Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0
IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1
IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX Word_Break=Other
WB=XX Word_Break=XX _X_Begin
$ uniprops -l | grep Greek | sort -dfu
Blk=Greek
Block:Ancient_Greek_Musical_Notation
Block:Ancient_Greek_Numbers
Block:Greek
Block=Greek_And_Coptic
Block:Greek_Extended
Greek
Greek_And_Coptic
InAncientGreekMusicalNotation
InAncientGreekNumbers
InGreek
InGreekExtended
Is_Greek
Script=Greek
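For a rough, Ruby-only approximation of this kind of listing, one option is to try a character against a list of property names using the \p{...} test from the question. This is only a sketch: properties_of and CANDIDATE_PROPERTIES are hypothetical names, and the candidate list is hand-picked for illustration, so it reports far fewer properties than uniprops does.

```ruby
# encoding: utf-8

# Hypothetical, hand-picked list of property names to try. It is NOT the
# complete set of names that Ruby's regex engine understands.
CANDIDATE_PROPERTIES = %w[
  Any Assigned ASCII Alnum Alpha Digit Lower Upper Punct Space Word Graph Print
  L Ll Lu N Nd P Pd Po Ps S Latin Greek Common
].freeze

# Returns the candidate names whose \p{...} class matches the single
# character in `char`. Names the engine rejects are silently skipped.
def properties_of(char, candidates = CANDIDATE_PROPERTIES)
  candidates.select do |name|
    begin
      !(Regexp.new("\\A\\p{#{name}}\\z") =~ char).nil?
    rescue RegexpError
      false # unknown property name in this Ruby version
    end
  end
end

puts properties_of("(").join(" ")
puts properties_of("π").join(" ")
```

The result is only as complete as the candidate list, and which names are accepted may vary with the Ruby version and the Unicode data it ships with.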
You may also want unichars, so that you can go the other way. Here are some examples of calling it:
$ unichars -gns '\p{Cased}' '\p{Number}'
$ unichars '\R'
$ unichars '\S' '[\v\h]'
$ unichars '\S' '\p{space}'
$ unichars '\pL' '\p{Greek}'
$ unichars '\pL' '\p{Greek}' | um
$ unichars '\p{Age=6.0}' | um
$ unichars '\p{Lowercase}' '\P{Lowercase_Letter}'
$ unichars '\p{Lower}' '\P{Ll}' # same but easier to type
$ unichars -a '\p{alphabetic}' '\P{Letter}' | wc -l # 1006 code points
$ unichars -gas '\PL' '\p{Cased}'
$ unichars -gas '\P{MARK}' '\p{diacritic}' # 209 code points
$ unichars -gas '\pM' '\P{BC=NSM}'
$ unichars -gas '\p{Cased}' '[^\p{CWL}\p{CWT}\p{CWU}]'
$ unichars -gas '\p{Dash}'
$ unichars -gas '\p{mark}' '\P{DIACRITIC}' # 1068 code points
$ unichars -gas 'grep { length > 1 } lc, ucfirst, uc'
$ unichars -gas 'uc ne ucfirst'
$ unichars -gasn NUM
Here is a sample of the output:
$ unichars -gsn NUM 'int NUM ne NUM'
0 U+0030 GC=Nd 0=NV SC=Common DIGIT ZERO
¼ U+00BC GC=No 1/4=NV SC=Common VULGAR FRACTION ONE QUARTER
½ U+00BD GC=No 1/2=NV SC=Common VULGAR FRACTION ONE HALF
¾ U+00BE GC=No 3/4=NV SC=Common VULGAR FRACTION THREE QUARTERS
٠ U+0660 GC=Nd 0=NV SC=Common ARABIC-INDIC DIGIT ZERO
۰ U+06F0 GC=Nd 0=NV SC=Arabic EXTENDED ARABIC-INDIC DIGIT ZERO
߀ U+07C0 GC=Nd 0=NV SC=Nko NKO DIGIT ZERO
० U+0966 GC=Nd 0=NV SC=Devanagari DEVANAGARI DIGIT ZERO
০ U+09E6 GC=Nd 0=NV SC=Bengali BENGALI DIGIT ZERO
৴ U+09F4 GC=No 1/16=NV SC=Bengali BENGALI CURRENCY NUMERATOR ONE
৵ U+09F5 GC=No 1/8=NV SC=Bengali BENGALI CURRENCY NUMERATOR TWO
৶ U+09F6 GC=No 3/16=NV SC=Bengali BENGALI CURRENCY NUMERATOR THREE
৷ U+09F7 GC=No 1/4=NV SC=Bengali BENGALI CURRENCY NUMERATOR FOUR
৸ U+09F8 GC=No 3/4=NV SC=Bengali BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR
੦ U+0A66 GC=Nd 0=NV SC=Gurmukhi GURMUKHI DIGIT ZERO
૦ U+0AE6 GC=Nd 0=NV SC=Gujarati GUJARATI DIGIT ZERO
୦ U+0B66 GC=Nd 0=NV SC=Oriya ORIYA DIGIT ZERO
୲ U+0B72 GC=No 1/4=NV SC=Oriya ORIYA FRACTION ONE QUARTER
୳ U+0B73 GC=No 1/2=NV SC=Oriya ORIYA FRACTION ONE HALF
୴ U+0B74 GC=No 3/4=NV SC=Oriya ORIYA FRACTION THREE QUARTERS
୵ U+0B75 GC=No 1/16=NV SC=Oriya ORIYA FRACTION ONE SIXTEENTH
୶ U+0B76 GC=No 1/8=NV SC=Oriya ORIYA FRACTION ONE EIGHTH
୷ U+0B77 GC=No 3/16=NV SC=Oriya ORIYA FRACTION THREE SIXTEENTHS
etc.
I described these in the first of my OSCON Unicode talks. These are just two of a few dozen tools there.
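Going "the other way" (from properties to characters) can also be approximated in plain Ruby by scanning a range of code points and keeping those that match every pattern given. This is a minimal sketch rather than a port of unichars: chars_matching and SURROGATES are hypothetical names, and a scan of the full range 0..0x10FFFF will be slow.

```ruby
# encoding: utf-8

# Surrogate code points cannot be encoded as valid UTF-8 characters,
# so they are skipped before any regex is applied.
SURROGATES = (0xD800..0xDFFF)

# Returns every character in `range` that matches ALL of `patterns`.
def chars_matching(patterns, range = 0..0x10FFFF)
  range.each_with_object([]) do |cp, found|
    next if SURROGATES.cover?(cp)
    char = [cp].pack("U") # UTF-8 string holding one code point
    found << char if patterns.all? { |re| re =~ char }
  end
end

# Roughly analogous to: unichars '\pL' '\p{Greek}', limited to one block.
greek_letters = chars_matching([/\p{L}/, /\p{Greek}/], 0x0370..0x03FF)
greek_letters.each { |c| printf("U+%04X %s\n", c.ord, c) }
```

Restricting the range, as in the usage above, keeps the scan fast when the block of interest is known in advance.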
Answer 1 (score: 0)
There is a unicode_data.txt interface by runpaint, which works well but describes itself as a "very early draft".
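For the do-it-by-hand route mentioned in the question, each record in the Unicode Character Database's UnicodeData.txt is a single line of 15 semicolon-separated fields. The sketch below parses one such record. It does not use runpaint's interface: the field labels in UCD_FIELDS and the parse_unicode_data_line helper are hypothetical names chosen for illustration, and the sample line is the MICRO SIGN entry.

```ruby
# encoding: utf-8

# Hypothetical labels for the 15 fields of a UnicodeData.txt record.
UCD_FIELDS = [
  :code, :name, :general_category, :combining_class, :bidi_class,
  :decomposition, :decimal, :digit, :numeric, :bidi_mirrored,
  :unicode_1_name, :iso_comment, :uppercase, :lowercase, :titlecase
].freeze

# Parses one record into a Hash keyed by the labels above.
def parse_unicode_data_line(line)
  values = line.chomp.split(";", -1) # -1 keeps trailing empty fields
  record = Hash[UCD_FIELDS.zip(values)]
  record[:code] = record[:code].to_i(16) # hex string to Integer
  record
end

SAMPLE = "00B5;MICRO SIGN;Ll;0;L;<compat> 03BC;;;;N;;;039C;;039C"
micro = parse_unicode_data_line(SAMPLE)
puts micro[:name]
puts micro[:general_category]
puts micro[:decomposition]
```

This file covers only part of what uniprops reports. Scripts and blocks, for instance, are kept in separate files of the database, so a full listing would need to combine several of them.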