How can I find information about a Unicode character (for example, the character set it belongs to) in JavaScript?
For example:
00e9 LATIN SMALL LETTER E WITH ACUTE
0bf2 TAMIL NUMBER ONE THOUSAND
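I know that in Python the unicodedata library can look up detailed information about a Unicode code point. Is there a way to find this information in JS?
PS: I am using this for Chrome extension development, so a solution that uses its API would be fine too.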
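Answer 0 (score: 2)
Unfortunately, this is nearly impossible. Defining the set of characters that each language uses is very difficult. (English, for example, certainly uses many characters outside \u0000 to \u007F, such as dashes and the "é" in many words of French origin. Where do you draw the line?) The CLDR database defines some character collections for languages, but the choices made there are open to question. For many languages the collections are so large and so sparse (in terms of the Unicode coding space) that any regular expression for them would be very long.
So hard-coding ranges is not even enough; you need a set of ranges plus individual characters.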
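The "ranges plus individual characters" idea can be sketched roughly as follows. The names `RANGES`, `SINGLES`, `isAllowedCodePoint`, and `isAllowedText` are hypothetical, and the listed values are purely illustrative, not a vetted character set for English or any other language:

```javascript
// Hypothetical allow-list: contiguous ranges plus isolated code points.
// The values are illustrative only, not a vetted set for any language.
const RANGES = [
  [0x0020, 0x007e], // printable ASCII
  [0x2018, 0x201d], // quotation marks U+2018..U+201D
];
// Isolated code points: é, ï, ö, en dash, em dash.
const SINGLES = new Set([0x00e9, 0x00ef, 0x00f6, 0x2013, 0x2014]);

function isAllowedCodePoint(cp) {
  return SINGLES.has(cp) || RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
}

function isAllowedText(text) {
  // for...of iterates by code point, so astral characters are not split.
  for (const ch of text) {
    if (!isAllowedCodePoint(ch.codePointAt(0))) return false;
  }
  return true;
}

console.log(isAllowedText("na\u00efve caf\u00e9")); // true
console.log(isAllowedText("\u041f\u0440\u0438\u0432\u0435\u0442")); // false (Cyrillic)
```

Even this toy list shows the maintenance problem: every new loanword or punctuation mark means another entry.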
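Perhaps the most important question is: what will you do with the result? Any technique has to be evaluated against that purpose. In general, JavaScript is quite primitive and limited when it comes to internationalization.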
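Answer 1 (score: 1)
Regular expressions can have powerful Unicode support: http://www.regular-expressions.info/unicode.html
However, these features are supported in JavaScript only from ES6 onward, and even Chrome does not implement them yet. Perhaps they will be implemented by the time you finish your code.
Also, things are not that simple even for English: café, naïve, coördinator.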
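This answer predates current engines. ES6 added the `u` flag, while Unicode property escapes (`\p{…}`) were standardized later, in ES2018, and are available in recent versions of Chrome and Node.js. A minimal sketch under that assumption, using the words above (the names `asciiWord` and `unicodeWord` are illustrative):

```javascript
// ASCII-only test: rejects ordinary English words that carry diacritics.
const asciiWord = /^[A-Za-z]+$/;
// Any Unicode letter or combining mark; property escapes require the `u` flag.
const unicodeWord = /^[\p{L}\p{M}]+$/u;

for (const word of ["cafe", "caf\u00e9", "na\u00efve", "co\u00f6rdinator"]) {
  console.log(word, asciiWord.test(word), unicodeWord.test(word));
}
// "cafe" passes both tests; the three accented words pass only unicodeWord.
```

Note that `\p{L}` accepts letters from every script, so it answers "is this a word?" rather than "is this English?".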
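Answer 2 (score: 1)
English text is dominated by code points from the Latin, Common, and Inherited scripts and, in some corpora, from Greek as well.
For example, the PubMed Open Access collection is a very large corpus of all-English text that is full of non-ASCII code points. Fully 90% of those occurrences are accounted for by just 36 distinct code points, as follows: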
rank percent cumulative code glyph GC=?? Name
---------------------------------------------------------------------
1 18.553% 18.553% U+02013 ‹–› GC=Pd EN DASH
2 7.422% 25.974% U+000A0 ‹ › GC=Zs NO-BREAK SPACE
3 7.033% 33.007% U+000B1 ‹±› GC=Sm PLUS-MINUS SIGN
4 5.461% 38.469% U+02212 ‹−› GC=Sm MINUS SIGN
5 4.196% 42.664% U+02003 ‹ › GC=Zs EM SPACE
6 3.682% 46.346% U+003BC ‹μ› GC=Ll GREEK SMALL LETTER MU
7 3.619% 49.965% U+003B2 ‹β› GC=Ll GREEK SMALL LETTER BETA
8 3.568% 53.534% U+003B1 ‹α› GC=Ll GREEK SMALL LETTER ALPHA
9 3.426% 56.959% U+0200A ‹ › GC=Zs HAIR SPACE
10 3.221% 60.181% U+000B0 ‹°› GC=So DEGREE SIGN
11 2.931% 63.112% U+02009 ‹ › GC=Zs THIN SPACE
12 2.620% 65.732% U+02019 ‹’› GC=Pf RIGHT SINGLE QUOTATION MARK
13 2.506% 68.238% U+02032 ‹′› GC=Po PRIME
14 2.441% 70.679% U+000D7 ‹×› GC=Sm MULTIPLICATION SIGN
15 2.042% 72.722% U+0201D ‹”› GC=Pf RIGHT DOUBLE QUOTATION MARK
16 2.039% 74.761% U+0201C ‹“› GC=Pi LEFT DOUBLE QUOTATION MARK
17 1.536% 76.296% U+00394 ‹Δ› GC=Lu GREEK CAPITAL LETTER DELTA
18 1.415% 77.712% U+000B5 ‹µ› GC=Ll MICRO SIGN
19 1.337% 79.049% U+003B3 ‹γ› GC=Ll GREEK SMALL LETTER GAMMA
20 1.210% 80.259% U+000E9 ‹é› GC=Ll LATIN SMALL LETTER E WITH ACUTE
21 1.152% 81.410% U+02014 ‹—› GC=Pd EM DASH
22 1.135% 82.546% U+02018 ‹‘› GC=Pi LEFT SINGLE QUOTATION MARK
23 0.998% 83.543% U+000A9 ‹©› GC=So COPYRIGHT SIGN
24 0.710% 84.253% U+02265 ‹≥› GC=Sm GREATER-THAN OR EQUAL TO
25 0.600% 84.853% U+000F6 ‹ö› GC=Ll LATIN SMALL LETTER O WITH DIAERESIS
26 0.599% 85.452% U+000B7 ‹·› GC=Po MIDDLE DOT
27 0.597% 86.049% U+02022 ‹•› GC=Po BULLET
28 0.594% 86.644% U+0223C ‹∼› GC=Sm TILDE OPERATOR
29 0.573% 87.217% U+003BA ‹κ› GC=Ll GREEK SMALL LETTER KAPPA
30 0.569% 87.785% U+000FC ‹ü› GC=Ll LATIN SMALL LETTER U WITH DIAERESIS
31 0.493% 88.278% U+02264 ‹≤› GC=Sm LESS-THAN OR EQUAL TO
32 0.440% 88.718% U+000AE ‹®› GC=So REGISTERED SIGN
33 0.433% 89.152% U+000E4 ‹ä› GC=Ll LATIN SMALL LETTER A WITH DIAERESIS
34 0.422% 89.573% U+02020 ‹†› GC=Po DAGGER
35 0.407% 89.980% U+003B4 ‹δ› GC=Ll GREEK SMALL LETTER DELTA
One way to detect these is to use a Unicode regular expression that requires every character to come from the Latin, Greek, Common, or Inherited script.
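In JavaScript, such an expression might look like the sketch below. It assumes an engine with ES2018 property escapes; `FOUR_SCRIPTS` and `usesOnlyFourScripts` are hypothetical names, and the sample strings are made up:

```javascript
// True when every code point is in the Latin, Greek, Common, or Inherited script.
const FOUR_SCRIPTS =
  /^[\p{Script=Latin}\p{Script=Greek}\p{Script=Common}\p{Script=Inherited}]*$/u;

function usesOnlyFourScripts(text) {
  return FOUR_SCRIPTS.test(text);
}

// Digits, spaces, and symbols such as the plus-minus sign are Common; mu is Greek.
console.log(usesOnlyFourScripts("5 \u03bcm \u00b1 2\u00b0")); // true
// Two Cyrillic letters.
console.log(usesOnlyFourScripts("\u0424\u043a")); // false
```

The `u` flag matters here: without it, `\p{…}` is not interpreted as a property escape and characters outside the Basic Multilingual Plane are matched as two separate code units.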
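In this corpus, those four scripts cover more than 99% of the code points. However, the dataset also contains many very low-frequency code points that do not belong to those four scripts (for example Cyrillic, Han, Kana, Hangul, and so on). If you restrict the input to the four very common scripts listed above, you will reject these as false positives. There are 239 such distinct code points in this dataset, and the 50 most common of them are as follows: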
rank percent cumulative code glyph GC=?? Name
---------------------------------------------------------------------
295 0.002% 99.828% U+00424 ‹Ф› GC=Lu CYRILLIC CAPITAL LETTER EF
381 0.001% 99.916% U+0043A ‹к› GC=Ll CYRILLIC SMALL LETTER KA
454 0.000% 99.949% U+00413 ‹Г› GC=Lu CYRILLIC CAPITAL LETTER GHE
491 0.000% 99.959% U+0AD6D ‹국› GC=Lo HANGUL SYLLABLE GUG
499 0.000% 99.961% U+003EC ‹Ϭ› GC=Lu COPTIC CAPITAL LETTER SHIMA
513 0.000% 99.965% U+00406 ‹І› GC=Lu CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
528 0.000% 99.968% U+00416 ‹Ж› GC=Lu CYRILLIC CAPITAL LETTER ZHE
534 0.000% 99.969% U+00430 ‹а› GC=Ll CYRILLIC SMALL LETTER A
539 0.000% 99.970% U+0041F ‹П› GC=Lu CYRILLIC CAPITAL LETTER PE
545 0.000% 99.971% U+00421 ‹С› GC=Lu CYRILLIC CAPITAL LETTER ES
553 0.000% 99.972% U+0D55C ‹한› GC=Lo HANGUL SYLLABLE HAN
555 0.000% 99.972% U+00404 ‹Є› GC=Lu CYRILLIC CAPITAL LETTER UKRAINIAN IE
566 0.000% 99.974% U+0C5B4 ‹어› GC=Lo HANGUL SYLLABLE EO
567 0.000% 99.974% U+0041A ‹К› GC=Lu CYRILLIC CAPITAL LETTER KA
568 0.000% 99.974% U+0041B ‹Л› GC=Lu CYRILLIC CAPITAL LETTER EL
571 0.000% 99.975% U+0B2C8 ‹니› GC=Lo HANGUL SYLLABLE NI
575 0.000% 99.975% U+0AE4C ‹까› GC=Lo HANGUL SYLLABLE GGA
578 0.000% 99.976% U+00428 ‹Ш› GC=Lu CYRILLIC CAPITAL LETTER SHA
579 0.000% 99.976% U+00454 ‹є› GC=Ll CYRILLIC SMALL LETTER UKRAINIAN IE
585 0.000% 99.977% U+00418 ‹И› GC=Lu CYRILLIC CAPITAL LETTER I
587 0.000% 99.977% U+0B2E4 ‹다› GC=Lo HANGUL SYLLABLE DA
600 0.000% 99.978% U+00440 ‹р› GC=Ll CYRILLIC SMALL LETTER ER
610 0.000% 99.980% U+00457 ‹ї› GC=Ll CYRILLIC SMALL LETTER YI
614 0.000% 99.980% U+0C74C ‹음› GC=Lo HANGUL SYLLABLE EUM
623 0.000% 99.981% U+0BD80 ‹부› GC=Lo HANGUL SYLLABLE BU
624 0.000% 99.981% U+0C545 ‹악› GC=Lo HANGUL SYLLABLE AG
625 0.000% 99.981% U+0C778 ‹인› GC=Lo HANGUL SYLLABLE IN
640 0.000% 99.982% U+0C5D0 ‹에› GC=Lo HANGUL SYLLABLE E
641 0.000% 99.983% U+0C744 ‹을› GC=Lo HANGUL SYLLABLE EUL
645 0.000% 99.983% U+00438 ‹и› GC=Ll CYRILLIC SMALL LETTER I
664 0.000% 99.984% U+0041C ‹М› GC=Lu CYRILLIC CAPITAL LETTER EM
665 0.000% 99.984% U+00436 ‹ж› GC=Ll CYRILLIC SMALL LETTER ZHE
674 0.000% 99.985% U+0C774 ‹이› GC=Lo HANGUL SYLLABLE I
678 0.000% 99.985% U+00431 ‹б› GC=Ll CYRILLIC SMALL LETTER BE
679 0.000% 99.986% U+00435 ‹е› GC=Ll CYRILLIC SMALL LETTER IE
689 0.000% 99.986% U+0B300 ‹대› GC=Lo HANGUL SYLLABLE DAE
690 0.000% 99.986% U+0BD84 ‹분› GC=Lo HANGUL SYLLABLE BUN
691 0.000% 99.986% U+0C678 ‹외› GC=Lo HANGUL SYLLABLE OE
696 0.000% 99.987% U+005DB ‹כ› GC=Lo HEBREW LETTER KAF
703 0.000% 99.987% U+0B85C ‹로› GC=Lo HANGUL SYLLABLE RO
711 0.000% 99.988% U+0041D ‹Н› GC=Lu CYRILLIC CAPITAL LETTER EN
712 0.000% 99.988% U+004D9 ‹ә› GC=Ll CYRILLIC SMALL LETTER SCHWA
725 0.000% 99.988% U+0B294 ‹는› GC=Lo HANGUL SYLLABLE NEUN
726 0.000% 99.988% U+0B9CC ‹만› GC=Lo HANGUL SYLLABLE MAN
727 0.000% 99.988% U+0C11C ‹서› GC=Lo HANGUL SYLLABLE SEO
728 0.000% 99.989% U+0C2B5 ‹습› GC=Lo HANGUL SYLLABLE SEUB
729 0.000% 99.989% U+0C601 ‹영› GC=Lo HANGUL SYLLABLE YEONG
741 0.000% 99.989% U+00441 ‹с› GC=Ll CYRILLIC SMALL LETTER ES
742 0.000% 99.989% U+00444 ‹ф› GC=Ll CYRILLIC SMALL LETTER EF
743 0.000% 99.989% U+004B0 ‹Ұ› GC=Lu CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE
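To see which code points in your own data fall outside those four scripts, you could tally them. This is only a sketch, assuming an engine with ES2018 property escapes; `OUTLIER` and `tallyOutliers` are hypothetical names, the sample string is made up, and this is not the tool that produced the tables here:

```javascript
// Matches one code point that is NOT in Latin, Greek, Common, or Inherited.
const OUTLIER =
  /[^\p{Script=Latin}\p{Script=Greek}\p{Script=Common}\p{Script=Inherited}]/gu;

function tallyOutliers(text) {
  const counts = new Map();
  // String.prototype.match returns null when nothing matches.
  for (const ch of text.match(OUTLIER) || []) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  // Most frequent first; codes are formatted as U+XXXX (at least four hex digits).
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([ch, count]) => ({
      code: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
      glyph: ch,
      count,
    }));
}

console.log(tallyOutliers("pH 7 \u0424\u043a\u0424 \ud55c"));
```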
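Of these 239 distinct non-ASCII code points, 59 also lie outside Unicode's Basic Multilingual Plane, so any processing must be able to handle the full range of Unicode. All but one of them are mathematical letters. These are the first 20: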
rank percent cumulative code glyph GC=?? Name
---------------------------------------------------------------------
227 0.004% 99.660% U+1D49E ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL C
240 0.003% 99.704% U+1D4AF ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL T
252 0.003% 99.738% U+1D4AE ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL S
275 0.002% 99.791% U+1D49F ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL D
279 0.002% 99.799% U+1D4B3 ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL X
289 0.002% 99.818% U+1D4A9 ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL N
291 0.002% 99.821% U+1D4AB ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL P
292 0.002% 99.823% U+1D4A2 ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL G
313 0.001% 99.854% U+1D49C ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL A
316 0.001% 99.858% U+1D53C ‹› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL E
341 0.001% 99.884% U+1D4AA ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL O
430 0.000% 99.941% U+1D4A5 ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL J
450 0.000% 99.948% U+1D4A6 ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL K
458 0.000% 99.950% U+1D4B1 ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL V
461 0.000% 99.951% U+1D4B2 ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL W
468 0.000% 99.953% U+1D4B4 ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL Y
469 0.000% 99.954% U+1D4B5 ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL Z
500 0.000% 99.962% U+1D4B0 ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL U
518 0.000% 99.966% U+1D4AC ‹› GC=Lu MATHEMATICAL SCRIPT CAPITAL Q
560 0.000% 99.973% U+1D54A ‹› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL S
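In JavaScript, handling the full range means working with code points rather than UTF-16 code units. A small sketch using the first entry of the table above (`codePoints` and `isOutsideBmp` are hypothetical helper names):

```javascript
const scriptC = "\u{1D49E}"; // MATHEMATICAL SCRIPT CAPITAL C

console.log(scriptC.length); // 2 (UTF-16 code units)
console.log([...scriptC].length); // 1 (code points)
console.log(scriptC.charCodeAt(0).toString(16)); // d835 (high surrogate only)
console.log(scriptC.codePointAt(0).toString(16)); // 1d49e

// Array.from iterates a string by code point, not by code unit.
function codePoints(text) {
  return Array.from(text, (ch) => ch.codePointAt(0));
}

function isOutsideBmp(cp) {
  return cp > 0xffff;
}

console.log(codePoints("A" + scriptC).filter(isOutsideBmp).length); // 1
```

Code that indexes with `charCodeAt` or loops over `length` sees such characters as two surrogate halves, which is why `for...of`, spread, `Array.from`, and the regex `u` flag are the safer choices.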
Other corpora will differ. You have to know your dataset.