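Why can't I use accented characters next to a word boundary?

Time: 2010-03-15 19:15:48

Tags: javascript regex unicode replace diacritics

I'm trying to build a dynamic regular expression that matches people's names. It works fine on most names, until I run into an accented character at the end of a name.

Example: Some Fancy Namé

The regex I've used so far is: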

/\b(Fancy Namé|Namé)\b/i

Used like this:

"Goal: Some Fancy Namé. Awesome.".replace(/\b(Fancy Namé|Namé)\b/i, '<a href="#">$1</a>');

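This doesn't match at all. If I replace the é with e, it matches fine. If I try to match a name such as "Some Fancy Naméa", it works. If I remove the last word-boundary anchor, it also works.

Why doesn't the word-boundary anchor work here? Any suggestions on how to get around it?

I've considered using something like this instead, but I'm not sure what the performance penalty would be: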

"Some fancy namé. Allow me to ellaborate.".replace(/([\s.,!?])(fancy namé|namé)([\s.,!?]|$)/g, '$1<a href="#">$2</a>$3')

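Thoughts? Ideas?

7 answers:

Answer 0: (score: 14)

JavaScript's regular-expression implementation is not Unicode-aware. It only knows the "word characters" of standard low-byte ASCII, which do not include é or any other accented or non-English letter.

Because é is not a word character to JS, an é followed by a space can never be treated as a word boundary. (\b would match there if the é were in the middle of a word, as in Namés.)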
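A quick way to see this, as a minimal sketch (the sample strings are made up for illustration):

```javascript
// \b is defined in terms of \w, which here means [A-Za-z0-9_] only.
const re = /\bNamé\b/;

// é followed by ".": both are non-word characters, so there is no boundary.
const beforePunctuation = re.test("Some Fancy Namé. Awesome.");

// é followed by "s": non-word then word character, so \b holds after é.
const insideWord = re.test("Namés");

console.log(beforePunctuation, insideWord);
```

The first test fails and the second succeeds, which is the opposite of what the name-matching use case needs.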

  

/([\s.,!?])(fancy namé|namé)([\s.,!?]|$)/

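Yes, that would be the usual workaround for JS (though probably with more punctuation characters). In other languages you would generally use lookahead/lookbehind to avoid consuming the boundary characters before and after the match, but these are poorly supported or buggy in JS, so they are best avoided.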
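A minimal sketch of that workaround, assuming the delimiter set `[\s.,!?]` is enough for your text (it is only an example). The `^` alternative is my addition so that a name at the very start of the string is also found; `linkNames` is a hypothetical helper name.

```javascript
// The delimiters are consumed by the match, so they are captured
// in groups 1 and 3 and written back around the link.
const nameRe = /(^|[\s.,!?])(fancy namé|namé)([\s.,!?]|$)/gi;

function linkNames(text) {
  return text.replace(nameRe, '$1<a href="#">$2</a>$3');
}

console.log(linkNames("Some fancy namé. Allow me to elaborate."));
console.log(linkNames("Namé"));
console.log(linkNames("Namée"));
```

Because the trailing delimiter is consumed, two names separated by a single delimiter will not both be linked in one pass; that is the price of avoiding lookahead.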

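Answer 1: (score: 7)

Rob is correct. Quoting from ECMAScript 3rd edition:

15.10.2.6 Assertion:

The production Assertion :: \b evaluates by ...

2. Call IsWordChar(e-1) and let a be the boolean result.
3. Call IsWordChar(e) and let b be the boolean result.

The internal helper function IsWordChar ... performs the following:

3. If c is one of the sixty-three characters in the table below, return true.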

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9 _

Since é is not one of these 63 characters, the position between é and a is considered a word boundary.
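The quoted steps can be sketched roughly like this. The function names and the handling of out-of-range positions are my own paraphrase of the excerpt, not the specification's wording.

```javascript
// The 63 characters listed in the table above.
const WORD_CHARS = new Set(
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
);

// Positions outside the string count as non-word characters.
function isWordChar(str, index) {
  if (index < 0 || index >= str.length) return false;
  return WORD_CHARS.has(str[index]);
}

// \b holds at position e when exactly one neighbour is a word character.
function isWordBoundary(str, e) {
  return isWordChar(str, e - 1) !== isWordChar(str, e);
}

console.log(isWordBoundary("Namé.", 4));  // between "é" and "."
console.log(isWordBoundary("Naméa", 4));  // between "é" and "a"
console.log(isWordBoundary("Namé.", 3));  // between "m" and "é"
```

Only the last two positions are boundaries, so `Namé\b` cannot match before the full stop but can match before the `a`.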

If you know the class of characters involved, you can use a negative lookahead assertion, e.g.

/(^|[^\wÀ-ÖØ-öø-ſ])(Fancy Namé|Namé)(?![\wÀ-ÖØ-öø-ſ])/
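Used with replace, that pattern might look like the sketch below. The leading delimiter is consumed, so it is captured as `$1` and written back; the `i` flag and the link markup are carried over from the question, and the sample strings are only illustrations.

```javascript
// Group 1: start of string or one character that is not a (Latin) letter,
// digit, or underscore. The lookahead rejects such a character after the name.
const nameRe = /(^|[^\wÀ-ÖØ-öø-ſ])(Fancy Namé|Namé)(?![\wÀ-ÖØ-öø-ſ])/i;

function linkName(text) {
  return text.replace(nameRe, '$1<a href="#">$2</a>');
}

console.log(linkName("Goal: Some Fancy Namé. Awesome."));
console.log(linkName("Namée"));
console.log(linkName("Nénamé"));
```

Only the first call changes its input; the other two leave the text untouched because a letter sits directly next to the name.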

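Answer 2: (score: 4)

Know Your Boundaries

Unfortunately, even if Javascript should someday get full and proper Unicode support, you will still have to be extremely careful with word boundaries. It is easy to misunderstand what \b really does.

Here is Perl code that spells out what \b is actually doing, and this holds whether or not your pattern engine has been upgraded for the BNM: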

  # if next is word char:
  #     then last isn't    word
  #     else last isn't nonword

    $word_boundary_before = qr{ (?(?=  \w ) (?<! \w ) | (?<! \W ) ) }x;

  # if last is word:
  #     then next isn't    word
  #     else next isn't nonword

    $word_boundary_after  = qr{ (?(?<= \w ) (?!  \w ) | (?!  \W ) ) }x;

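The first is like a \b before something, and the second is like a \b after it. The construct used is the regex "IF-THEN-ELSE" conditional, which has the general form (?(COND)THEN|ELSE). Here the COND test is a lookahead in the first case but a lookbehind in the second. The THEN and ELSE clauses are negative lookarounds in both cases, so that they take the edges of the string into account.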
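JavaScript regexes have no (?(COND)THEN|ELSE) conditional, but the same logic can be written as an alternation. The sketch below assumes an engine with lookbehind and `\p{…}` support under the `u` flag (ES2018 or later), and it assumes that "word character" should mean letters, marks, numbers, and underscore; both are my assumptions, not something the Perl code prescribes.

```javascript
// Assumed Unicode-aware stand-ins for \w and \W.
const W = "[\\p{L}\\p{M}\\p{N}_]";
const NW = "[^\\p{L}\\p{M}\\p{N}_]";

// if next is word: last isn't word; else: last isn't nonword
const boundaryBefore = `(?:(?=${W})(?<!${W})|(?!${W})(?<!${NW}))`;

// if last is word: next isn't word; else: next isn't nonword
const boundaryAfter = `(?:(?<=${W})(?!${W})|(?<!${W})(?!${NW}))`;

const nameRe = new RegExp(
  `${boundaryBefore}(Fancy Namé|Namé)${boundaryAfter}`,
  "iu"
);

const linked = "Goal: Some Fancy Namé. Awesome."
  .replace(nameRe, '<a href="#">$1</a>');

console.log(linked);
console.log(nameRe.test("Namés"));
```

Nothing is consumed around the name here, so no delimiter groups need to be written back.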

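I explain more about dealing with boundaries and Unicode in regexes here.

Unicode Property Support

The current state of affairs in Javascript's treatment of Unicode seems to be that, just as in Java, Javascript's definition of \w is still crippled by being stuck in the 1960s-era ASCII world. It is a miserable situation, I admit. Even Python, which is quite conservative in these matters (for example, it doesn't even support recursive regexes), does allow its definitions of \w and \s to work correctly on Unicode. That is really the barest minimal level of functionality.

The situation in Javascript is both better and worse than that. That is because you can use some of the very most basic Unicode properties in Javascript (or Java). It looks like you should be able to use the one- and two-character "general category" Unicode properties. That means you should be able to use the short-name versions in the first column below: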

Short Name  Long Name
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
 \pL        \p{Letter}
   \p{Lu}   \p{Uppercase_Letter}
   \p{Ll}   \p{Lowercase_Letter}
   \p{Lt}   \p{Titlecase_Letter}
   \p{Lm}   \p{Modifier_Letter}
   \p{Lo}   \p{Other_Letter}

Short Name  Long Name
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
 \pM       \p{Mark}
   \p{Mn}  \p{Nonspacing_Mark}
   \p{Mc}  \p{Spacing_Mark}
   \p{Me}  \p{Enclosing_Mark}

Short Name  Long Name
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
 \pN       \p{Number}
   \p{Nd}  \p{Decimal_Number}, \p{Digit}
   \p{Nl}  \p{Letter_Number}
   \p{No}  \p{Other_Number}

Short Name  Long Name
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
 \pP       \p{Punctuation}, \p{Punct}
   \p{Pc}  \p{Connector_Punctuation}
   \p{Pd}  \p{Dash_Punctuation}
   \p{Ps}  \p{Open_Punctuation}
   \p{Pe}  \p{Close_Punctuation}
   \p{Pi}  \p{Initial_Punctuation}
   \p{Pf}  \p{Final_Punctuation}
   \p{Po}  \p{Other_Punctuation}

Short Name  Long Name
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
 \pS       \p{Symbol}
   \p{Sm}  \p{Math_Symbol}
   \p{Sc}  \p{Currency_Symbol}
   \p{Sk}  \p{Modifier_Symbol}
   \p{So}  \p{Other_Symbol}

Short Name  Long Name
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
 \pZ       \p{Separator}
   \p{Zs}  \p{Space_Separator}
   \p{Zl}  \p{Line_Separator}
   \p{Zp}  \p{Paragraph_Separator}

Short Name  Long Name
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
 \pC       \p{Other}
   \p{Cc}  \p{Control}, \p{Cntrl}
   \p{Cf}  \p{Format}
   \p{Cs}  \p{Surrogate}
   \p{Co}  \p{Private_Use}
   \p{Cn}  \p{Unassigned}

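You must use only the short names in Java and Javascript, but Perl lets you use the long names as well, which helps readability. As of its 5.12 release, Perl supports around 3,000 Unicode properties. Python still has no Unicode property support worth mentioning, and Ruby is only just starting to get it in its 1.9 release. PCRE has some limited support, mostly like that of Java 1.7.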
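For JavaScript specifically, here is a small sketch of general-category tests. It assumes an engine that implements ES2018 Unicode property escapes, which is newer than this answer; in such engines the `u` flag and the braces are required (`\p{L}`, not `\pL`), and the long category names are accepted too. The helper names are hypothetical.

```javascript
// Each helper tests a single character against one Unicode property.
const isLetter = (ch) => /^\p{L}$/u.test(ch);
const isUpper = (ch) => /^\p{Lu}$/u.test(ch);
const isMathSymbol = (ch) => /^\p{Sm}$/u.test(ch);
const isGreek = (ch) => /^\p{Script=Greek}$/u.test(ch);

console.log(isLetter("é"), isUpper("Σ"), isMathSymbol("∞"), isGreek("Σ"));

// The long form of the same general category.
console.log(/^\p{Uppercase_Letter}$/u.test("Σ"));

// Without the `u` flag, \p is not a property escape at all.
console.log(/^\p{L}$/.test("é"));
```

The last line is there as a reminder that forgetting the flag fails silently rather than throwing.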

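Java6 supports Unicode block properties like \p{InGeneralPunctuation} or \p{Block=GeneralPunctuation}, and Java7 supports Unicode script properties like \p{IsHiragana} or \p{Script=Hiragana}.

However, it still supports nothing close to the full set of Unicode properties, including nearly critical ones like \p{WhiteSpace}, \p{Dash}, and \p{Quotation_Mark}, let alone the two-part ones like \p{Line_Break=Alphabetic}, \p{East_Asian_Width:Narrow}, \p{Numeric_Value=1000}, or \p{Age:5.2}.

The former set is pretty much indispensable, especially given the lack of support for making \s work right, while the latter set is sometimes very useful.

Something else that Java and Javascript do not yet support is user-defined character properties. I use those quite a lot. They let you define things like \p{English::Vowel} or \p{English::Consonant}, which is very handy.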
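Where user-defined properties are missing, one rough substitute is to keep named pattern fragments in strings and splice them into a RegExp. This is only a sketch of that idea: the constant names are hypothetical stand-ins for \p{English::Vowel} and \p{English::Consonant}, and treating every non-vowel ASCII letter (including y) as a consonant is my simplification.

```javascript
// Named fragments playing the role of user-defined properties.
const ENGLISH_VOWEL = "[aeiouAEIOU]";
const ENGLISH_CONSONANT = `(?!${ENGLISH_VOWEL})[A-Za-z]`;

// Compose them like any other pattern piece: consonant, vowel, consonant.
const cvc = new RegExp(
  `^${ENGLISH_CONSONANT}${ENGLISH_VOWEL}${ENGLISH_CONSONANT}$`
);

console.log(cvc.test("cat"), cvc.test("eat"), cvc.test("c4t"));
```

Unlike a real property, a fragment written this way cannot be negated with a single \P{…}, so a separate negated fragment would be needed.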

If you are interested in Unicode properties for regex work, you might want to grab the unitrio suite of programs: uniprops, unichars, and uninames. Here is a demo of each of the three:

$ uninames face
 ፦  4966  1366  ETHIOPIC PREFACE COLON
 ⁙  8281  2059  FIVE DOT PUNCTUATION
        = Greek pentonkion
        = quincunx
        x (die face-5 - 2684)
 ∯  8751  222F  SURFACE INTEGRAL
        # 222E 222E
 ☹  9785  2639 WHITE FROWNING FACE
 ☺  9786  263A WHITE SMILING FACE
        = have a nice day!
 ☻  9787  263B BLACK SMILING FACE
 ⚀  9856  2680 DIE FACE-1
 ⚁  9857  2681 DIE FACE-2
 ⚂  9858  2682 DIE FACE-3
 ⚃  9859  2683 DIE FACE-4
 ⚄  9860  2684 DIE FACE-5
 ⚅  9861  2685 DIE FACE-6
 ⾯  12207 2FAF KANGXI RADICAL FACE
        # 9762
 〠  12320 3020 POSTAL MARK FACE
 龜  64206 FACE CJK COMPATIBILITY IDEOGRAPH-FACE
        : 9F9C

FMTEYEWTK about Unicode properties:

$ uniprops -va LF 85 Greek:Sigma INFINITY BOM U+3000 U+12345

U+000A ‹U+000A› \N{ LINE FEED (LF) }:
    \s \v \R \pC \p{Cc}
    \p{All} \p{Any} \p{ASCII} \p{Assigned} \p{C} \p{Other} \p{Cc} \p{Cntrl} \p{Common} \p{Zyyy} \p{Control} \p{Pat_WS} \p{Pattern_White_Space} \p{PatWS} \p{PerlSpace} \p{PosixCntrl} \p{PosixSpace} \p{Space} \p{SpacePerl} \p{VertSpace} \p{White_Space} \p{WSpace}
    \p{Age:1.1} \p{Block=Basic_Latin} \p{Bidi_Class:B} \p{Bidi_Class=Paragraph_Separator} \p{Bidi_Class:Paragraph_Separator} \p{Bc=B} \p{Block:ASCII} \p{Block:Basic_Latin} \p{Blk=ASCII} \p{Canonical_Combining_Class:0} \p{Canonical_Combining_Class=Not_Reordered}
       \p{Canonical_Combining_Class:Not_Reordered} \p{Ccc=NR} \p{Canonical_Combining_Class:NR} \p{Script=Common} \p{Decomposition_Type:None} \p{Dt=None} \p{East_Asian_Width=Neutral} \p{East_Asian_Width:Neutral} \p{Grapheme_Cluster_Break:LF} \p{GCB=LF} \p{Hangul_Syllable_Type:NA}
       \p{Hangul_Syllable_Type=Not_Applicable} \p{Hangul_Syllable_Type:Not_Applicable} \p{Hst=NA} \p{Joining_Group:No_Joining_Group} \p{Jg=NoJoiningGroup} \p{Joining_Type:Non_Joining} \p{Jt=U} \p{Joining_Type:U} \p{Joining_Type=Non_Joining} \p{Line_Break:LF} \p{Line_Break=Line_Feed}
       \p{Line_Break:Line_Feed} \p{Lb=LF} \p{Numeric_Type:None} \p{Nt=None} \p{Numeric_Value:NaN} \p{Nv=NaN} \p{Present_In:1.1} \p{Age=1.1} \p{In=1.1} \p{Present_In:2.0} \p{In=2.0} \p{Present_In:2.1} \p{In=2.1} \p{Present_In:3.0} \p{In=3.0} \p{Present_In:3.1} \p{In=3.1}
       \p{Present_In:3.2} \p{In=3.2} \p{Present_In:4.0} \p{In=4.0} \p{Present_In:4.1} \p{In=4.1} \p{Present_In:5.0} \p{In=5.0} \p{Present_In:5.1} \p{In=5.1} \p{Present_In:5.2} \p{In=5.2} \p{Script:Common} \p{Sc=Zyyy} \p{Script:Zyyy} \p{Sentence_Break:LF} \p{SB=LF} \p{Word_Break:LF}
       \p{WB=LF}

U+0085 ‹U+0085› \N{ NEXT LINE (NEL) }:
    \s \v \R \pC \p{Cc}
    \p{All} \p{Any} \p{Assigned} \p{InLatin1} \p{C} \p{Other} \p{Cc} \p{Cntrl} \p{Common} \p{Zyyy} \p{Control} \p{Pat_WS} \p{Pattern_White_Space} \p{PatWS} \p{Space} \p{SpacePerl} \p{VertSpace} \p{White_Space} \p{WSpace}
    \p{Age:1.1} \p{Bidi_Class:B} \p{Bidi_Class=Paragraph_Separator} \p{Bidi_Class:Paragraph_Separator} \p{Bc=B} \p{Block:Latin_1} \p{Block=Latin_1_Supplement} \p{Block:Latin_1_Supplement} \p{Blk=Latin1} \p{Canonical_Combining_Class:0} \p{Canonical_Combining_Class=Not_Reordered}
       \p{Canonical_Combining_Class:Not_Reordered} \p{Ccc=NR} \p{Canonical_Combining_Class:NR} \p{Script=Common} \p{Decomposition_Type:None} \p{Dt=None} \p{East_Asian_Width=Neutral} \p{East_Asian_Width:Neutral} \p{Grapheme_Cluster_Break:CN} \p{Grapheme_Cluster_Break=Control}
       \p{Grapheme_Cluster_Break:Control} \p{GCB=CN} \p{Hangul_Syllable_Type:NA} \p{Hangul_Syllable_Type=Not_Applicable} \p{Hangul_Syllable_Type:Not_Applicable} \p{Hst=NA} \p{Joining_Group:No_Joining_Group} \p{Jg=NoJoiningGroup} \p{Joining_Type:Non_Joining} \p{Jt=U}
       \p{Joining_Type:U} \p{Joining_Type=Non_Joining} \p{Line_Break:Next_Line} \p{Lb=NL} \p{Line_Break:NL} \p{Line_Break=Next_Line} \p{Numeric_Type:None} \p{Nt=None} \p{Numeric_Value:NaN} \p{Nv=NaN} \p{Present_In:1.1} \p{Age=1.1} \p{In=1.1} \p{Present_In:2.0} \p{In=2.0}
       \p{Present_In:2.1} \p{In=2.1} \p{Present_In:3.0} \p{In=3.0} \p{Present_In:3.1} \p{In=3.1} \p{Present_In:3.2} \p{In=3.2} \p{Present_In:4.0} \p{In=4.0} \p{Present_In:4.1} \p{In=4.1} \p{Present_In:5.0} \p{In=5.0} \p{Present_In:5.1} \p{In=5.1} \p{Present_In:5.2} \p{In=5.2}
       \p{Script:Common} \p{Sc=Zyyy} \p{Script:Zyyy} \p{Sentence_Break:SE} \p{Sentence_Break=Sep} \p{Sentence_Break:Sep} \p{SB=SE} \p{Word_Break:Newline} \p{WB=NL} \p{Word_Break:NL} \p{Word_Break=Newline}

U+03A3 ‹Σ› \N{ GREEK CAPITAL LETTER SIGMA }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
    \p{All} \p{Any} \p{Alnum} \p{Alpha} \p{Alphabetic} \p{Assigned} \p{Greek} \p{Is_Greek} \p{InGreek} \p{Cased} \p{Cased_Letter} \p{LC} \p{Changes_When_Casefolded} \p{CWCF} \p{Changes_When_Casemapped} \p{CWCM} \p{Changes_When_Lowercased} \p{CWL} \p{Changes_When_NFKC_Casefolded}
       \p{CWKCF} \p{Lu} \p{L} \p{Gr_Base} \p{Grapheme_Base} \p{Graph} \p{GrBase} \p{Grek} \p{Greek_And_Coptic} \p{ID_Continue} \p{IDC} \p{ID_Start} \p{IDS} \p{Letter} \p{L_} \p{Uppercase_Letter} \p{Print} \p{Upper} \p{Uppercase} \p{Word} \p{XID_Continue} \p{XIDC} \p{XID_Start}
       \p{XIDS}
    \p{Age:1.1} \p{Bidi_Class:L} \p{Bidi_Class=Left_To_Right} \p{Bidi_Class:Left_To_Right} \p{Bc=L} \p{Block:Greek} \p{Block=Greek_And_Coptic} \p{Block:Greek_And_Coptic} \p{Blk=Greek} \p{Canonical_Combining_Class:0} \p{Canonical_Combining_Class=Not_Reordered}
       \p{Canonical_Combining_Class:Not_Reordered} \p{Ccc=NR} \p{Canonical_Combining_Class:NR} \p{Decomposition_Type:None} \p{Dt=None} \p{East_Asian_Width:A} \p{East_Asian_Width=Ambiguous} \p{East_Asian_Width:Ambiguous} \p{Ea=A} \p{Grapheme_Cluster_Break:Other} \p{GCB=XX}
       \p{Grapheme_Cluster_Break:XX} \p{Grapheme_Cluster_Break=Other} \p{Script=Greek} \p{Hangul_Syllable_Type:NA} \p{Hangul_Syllable_Type=Not_Applicable} \p{Hangul_Syllable_Type:Not_Applicable} \p{Hst=NA} \p{Joining_Group:No_Joining_Group} \p{Jg=NoJoiningGroup}
       \p{Joining_Type:Non_Joining} \p{Jt=U} \p{Joining_Type:U} \p{Joining_Type=Non_Joining} \p{Line_Break:AL} \p{Line_Break=Alphabetic} \p{Line_Break:Alphabetic} \p{Lb=AL} \p{Numeric_Type:None} \p{Nt=None} \p{Numeric_Value:NaN} \p{Nv=NaN} \p{Present_In:1.1} \p{Age=1.1} \p{In=1.1}
       \p{Present_In:2.0} \p{In=2.0} \p{Present_In:2.1} \p{In=2.1} \p{Present_In:3.0} \p{In=3.0} \p{Present_In:3.1} \p{In=3.1} \p{Present_In:3.2} \p{In=3.2} \p{Present_In:4.0} \p{In=4.0} \p{Present_In:4.1} \p{In=4.1} \p{Present_In:5.0} \p{In=5.0} \p{Present_In:5.1} \p{In=5.1}
       \p{Present_In:5.2} \p{In=5.2} \p{Script:Greek} \p{Sc=Grek} \p{Script:Grek} \p{Sentence_Break:UP} \p{Sentence_Break=Upper} \p{Sentence_Break:Upper} \p{SB=UP} \p{Word_Break:ALetter} \p{WB=LE} \p{Word_Break:LE} \p{Word_Break=ALetter}

U+221E ‹∞› \N{ INFINITY }:
    \pS \p{Sm}
    \p{All} \p{Any} \p{Assigned} \p{InMathematicalOperators} \p{Common} \p{Zyyy} \p{Sm} \p{S} \p{Gr_Base} \p{Grapheme_Base} \p{Graph} \p{GrBase} \p{Math} \p{Math_Symbol} \p{Pat_Syn} \p{Pattern_Syntax} \p{PatSyn} \p{Print} \p{Symbol}
    \p{Age:1.1} \p{Bidi_Class:ON} \p{Bidi_Class=Other_Neutral} \p{Bidi_Class:Other_Neutral} \p{Bc=ON} \p{Block:Mathematical_Operators} \p{Canonical_Combining_Class:0} \p{Canonical_Combining_Class=Not_Reordered} \p{Canonical_Combining_Class:Not_Reordered} \p{Ccc=NR}
       \p{Canonical_Combining_Class:NR} \p{Script=Common} \p{Decomposition_Type:None} \p{Dt=None} \p{East_Asian_Width:A} \p{East_Asian_Width=Ambiguous} \p{East_Asian_Width:Ambiguous} \p{Ea=A} \p{Grapheme_Cluster_Break:Other} \p{GCB=XX} \p{Grapheme_Cluster_Break:XX}
       \p{Grapheme_Cluster_Break=Other} \p{Hangul_Syllable_Type:NA} \p{Hangul_Syllable_Type=Not_Applicable} \p{Hangul_Syllable_Type:Not_Applicable} \p{Hst=NA} \p{Joining_Group:No_Joining_Group} \p{Jg=NoJoiningGroup} \p{Joining_Type:Non_Joining} \p{Jt=U} \p{Joining_Type:U}
       \p{Joining_Type=Non_Joining} \p{Line_Break:AI} \p{Line_Break=Ambiguous} \p{Line_Break:Ambiguous} \p{Lb=AI} \p{Numeric_Type:None} \p{Nt=None} \p{Numeric_Value:NaN} \p{Nv=NaN} \p{Present_In:1.1} \p{Age=1.1} \p{In=1.1} \p{Present_In:2.0} \p{In=2.0} \p{Present_In:2.1} \p{In=2.1}
       \p{Present_In:3.0} \p{In=3.0} \p{Present_In:3.1} \p{In=3.1} \p{Present_In:3.2} \p{In=3.2} \p{Present_In:4.0} \p{In=4.0} \p{Present_In:4.1} \p{In=4.1} \p{Present_In:5.0} \p{In=5.0} \p{Present_In:5.1} \p{In=5.1} \p{Present_In:5.2} \p{In=5.2} \p{Script:Common} \p{Sc=Zyyy}
       \p{Script:Zyyy} \p{Sentence_Break:Other} \p{SB=XX} \p{Sentence_Break:XX} \p{Sentence_Break=Other} \p{Word_Break:Other} \p{WB=XX} \p{Word_Break:XX} \p{Word_Break=Other}

U+FEFF ‹U+FEFF› \N{ ZERO WIDTH NO-BREAK SPACE }:
    \pC \p{Cf}
    \p{All} \p{Any} \p{Assigned} \p{InArabicPresentationFormsB} \p{C} \p{Other} \p{Case_Ignorable} \p{CI} \p{Cf} \p{Format} \p{Changes_When_NFKC_Casefolded} \p{CWKCF} \p{Common} \p{Zyyy} \p{Default_Ignorable_Code_Point} \p{DI} \p{Graph} \p{Print}
    \p{Age:1.1} \p{Bidi_Class:BN} \p{Bidi_Class=Boundary_Neutral} \p{Bidi_Class:Boundary_Neutral} \p{Bc=BN} \p{Block:Arabic_Presentation_Forms_B} \p{Canonical_Combining_Class:0} \p{Canonical_Combining_Class=Not_Reordered} \p{Canonical_Combining_Class:Not_Reordered} \p{Ccc=NR}
       \p{Canonical_Combining_Class:NR} \p{Script=Common} \p{Decomposition_Type:None} \p{Dt=None} \p{East_Asian_Width=Neutral} \p{East_Asian_Width:Neutral} \p{Grapheme_Cluster_Break:CN} \p{Grapheme_Cluster_Break=Control} \p{Grapheme_Cluster_Break:Control} \p{GCB=CN}
       \p{Hangul_Syllable_Type:NA} \p{Hangul_Syllable_Type=Not_Applicable} \p{Hangul_Syllable_Type:Not_Applicable} \p{Hst=NA} \p{Joining_Group:No_Joining_Group} \p{Jg=NoJoiningGroup} \p{Joining_Type:T} \p{Joining_Type=Transparent} \p{Joining_Type:Transparent} \p{Jt=T}
       \p{Line_Break:WJ} \p{Line_Break=Word_Joiner} \p{Line_Break:Word_Joiner} \p{Lb=WJ} \p{Numeric_Type:None} \p{Nt=None} \p{Numeric_Value:NaN} \p{Nv=NaN} \p{Present_In:1.1} \p{Age=1.1} \p{In=1.1} \p{Present_In:2.0} \p{In=2.0} \p{Present_In:2.1} \p{In=2.1} \p{Present_In:3.0}
       \p{In=3.0} \p{Present_In:3.1} \p{In=3.1} \p{Present_In:3.2} \p{In=3.2} \p{Present_In:4.0} \p{In=4.0} \p{Present_In:4.1} \p{In=4.1} \p{Present_In:5.0} \p{In=5.0} \p{Present_In:5.1} \p{In=5.1} \p{Present_In:5.2} \p{In=5.2} \p{Script:Common} \p{Sc=Zyyy} \p{Script:Zyyy}
       \p{Sentence_Break:FO} \p{Sentence_Break=Format} \p{Sentence_Break:Format} \p{SB=FO} \p{Word_Break:FO} \p{Word_Break=Format} \p{Word_Break:Format} \p{WB=FO}

U+3000 ‹U+3000› \N{ IDEOGRAPHIC SPACE }:
    \s \h \pZ \p{Zs}
    \p{All} \p{Any} \p{Assigned} \p{Blank} \p{InCJKSymbolsAndPunctuation} \p{Changes_When_NFKC_Casefolded} \p{CWKCF} \p{Common} \p{Zyyy} \p{Z} \p{Zs} \p{Gr_Base} \p{Grapheme_Base} \p{GrBase} \p{HorizSpace} \p{Print} \p{Separator} \p{Space} \p{Space_Separator} \p{SpacePerl}
       \p{White_Space} \p{WSpace}
    \p{Age:1.1} \p{Bidi_Class:White_Space} \p{Bc=WS} \p{Bidi_Class:WS} \p{Bidi_Class=White_Space} \p{Block:CJK_Symbols_And_Punctuation} \p{Canonical_Combining_Class:0} \p{Canonical_Combining_Class=Not_Reordered} \p{Canonical_Combining_Class:Not_Reordered} \p{Ccc=NR}
       \p{Canonical_Combining_Class:NR} \p{Script=Common} \p{Decomposition_Type:Non_Canon} \p{Decomposition_Type=Non_Canonical} \p{Decomposition_Type:Non_Canonical} \p{Dt=NonCanon} \p{Decomposition_Type:Wide} \p{Dt=Wide} \p{East_Asian_Width:F} \p{East_Asian_Width=Fullwidth}
       \p{East_Asian_Width:Fullwidth} \p{Ea=F} \p{Grapheme_Cluster_Break:Other} \p{GCB=XX} \p{Grapheme_Cluster_Break:XX} \p{Grapheme_Cluster_Break=Other} \p{Hangul_Syllable_Type:NA} \p{Hangul_Syllable_Type=Not_Applicable} \p{Hangul_Syllable_Type:Not_Applicable} \p{Hst=NA}
       \p{Joining_Group:No_Joining_Group} \p{Jg=NoJoiningGroup} \p{Joining_Type:Non_Joining} \p{Jt=U} \p{Joining_Type:U} \p{Joining_Type=Non_Joining} \p{Line_Break:ID} \p{Line_Break=Ideographic} \p{Line_Break:Ideographic} \p{Lb=ID} \p{Numeric_Type:None} \p{Nt=None}
       \p{Numeric_Value:NaN} \p{Nv=NaN} \p{Present_In:1.1} \p{Age=1.1} \p{In=1.1} \p{Present_In:2.0} \p{In=2.0} \p{Present_In:2.1} \p{In=2.1} \p{Present_In:3.0} \p{In=3.0} \p{Present_In:3.1} \p{In=3.1} \p{Present_In:3.2} \p{In=3.2} \p{Present_In:4.0} \p{In=4.0} \p{Present_In:4.1}
       \p{In=4.1} \p{Present_In:5.0} \p{In=5.0} \p{Present_In:5.1} \p{In=5.1} \p{Present_In:5.2} \p{In=5.2} \p{Script:Common} \p{Sc=Zyyy} \p{Script:Zyyy} \p{Sentence_Break:Sp} \p{SB=Sp} \p{Word_Break:Other} \p{WB=XX} \p{Word_Break:XX} \p{Word_Break=Other}

U+12345 ‹U+12345› \N{ CUNEIFORM SIGN URU TIMES KI }:
    \w \pL \p{L_} \p{Lo}
    \p{All} \p{Any} \p{Alnum} \p{Alpha} \p{Alphabetic} \p{Assigned} \p{InCuneiform} \p{Cuneiform} \p{Is_Cuneiform} \p{Xsux} \p{L} \p{Lo} \p{Gr_Base} \p{Grapheme_Base} \p{Graph} \p{GrBase} \p{ID_Continue} \p{IDC} \p{ID_Start} \p{IDS} \p{Letter} \p{L_} \p{Other_Letter} \p{Print}
       \p{Word} \p{XID_Continue} \p{XIDC} \p{XID_Start} \p{XIDS}
    \p{Age:5.0} \p{Bidi_Class:L} \p{Bidi_Class=Left_To_Right} \p{Bidi_Class:Left_To_Right} \p{Bc=L} \p{Block:Cuneiform} \p{Canonical_Combining_Class:0} \p{Canonical_Combining_Class=Not_Reordered} \p{Canonical_Combining_Class:Not_Reordered} \p{Ccc=NR}
       \p{Canonical_Combining_Class:NR} \p{Script=Cuneiform} \p{Block=Cuneiform} \p{Decomposition_Type:None} \p{Dt=None} \p{East_Asian_Width=Neutral} \p{East_Asian_Width:Neutral} \p{Grapheme_Cluster_Break:Other} \p{GCB=XX} \p{Grapheme_Cluster_Break:XX}
       \p{Grapheme_Cluster_Break=Other} \p{Hangul_Syllable_Type:NA} \p{Hangul_Syllable_Type=Not_Applicable} \p{Hangul_Syllable_Type:Not_Applicable} \p{Hst=NA} \p{Joining_Group:No_Joining_Group} \p{Jg=NoJoiningGroup} \p{Joining_Type:Non_Joining} \p{Jt=U} \p{Joining_Type:U}
       \p{Joining_Type=Non_Joining} \p{Line_Break:AL} \p{Line_Break=Alphabetic} \p{Line_Break:Alphabetic} \p{Lb=AL} \p{Numeric_Type:None} \p{Nt=None} \p{Numeric_Value:NaN} \p{Nv=NaN} \p{Present_In:5.0} \p{In=5.0} \p{Present_In:5.1} \p{In=5.1} \p{Present_In:5.2} \p{In=5.2}
       \p{Script:Cuneiform} \p{Sc=Xsux} \p{Script:Xsux} \p{Sentence_Break:LE} \p{Sentence_Break=OLetter} \p{Sentence_Break:OLetter} \p{SB=LE} \p{Word_Break:ALetter} \p{WB=LE} \p{Word_Break:LE} \p{Word_Break=ALetter}

Or, going the other way:

$ unichars '\pN' '\D' '\p{Latin}'
 Ⅰ      8544  02160  ROMAN NUMERAL ONE
 Ⅱ      8545  02161  ROMAN NUMERAL TWO
 Ⅲ      8546  02162  ROMAN NUMERAL THREE
 Ⅳ      8547  02163  ROMAN NUMERAL FOUR
 Ⅴ      8548  02164  ROMAN NUMERAL FIVE
 Ⅵ      8549  02165  ROMAN NUMERAL SIX
 Ⅶ      8550  02166  ROMAN NUMERAL SEVEN
 Ⅷ      8551  02167  ROMAN NUMERAL EIGHT
 (etc)

$ unichars -a '\pL' '\p{Greek}' 'NFD ne NFKD' 'NAME =~ /SYMBOL/'
 ϐ       976  3D0  GREEK BETA SYMBOL
 ϑ       977  3D1  GREEK THETA SYMBOL
 ϒ       978  3D2  GREEK UPSILON WITH HOOK SYMBOL
 ϓ       979  3D3  GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
 ϔ       980  3D4  GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
 ϕ       981  3D5  GREEK PHI SYMBOL
 ϖ       982  3D6  GREEK PI SYMBOL
 ϰ      1008  3F0  GREEK KAPPA SYMBOL
 ϱ      1009  3F1  GREEK RHO SYMBOL
 ϲ      1010  3F2  GREEK LUNATE SIGMA SYMBOL
 ϴ      1012  3F4  GREEK CAPITAL THETA SYMBOL
 ϵ      1013  3F5  GREEK LUNATE EPSILON SYMBOL
 Ϲ      1017  3F9  GREEK CAPITAL LUNATE SIGMA SYMBOL

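Oh, and BNM stands for "Brave New Millennium", a reference to our modern post-ASCII world in which characters are no longer a mere seven puny bits wide. ☺

Answer 3: (score: 1)

String.replace() accepts a callback function as its second argument. (No idea why so many JS tutorials leave out this useful feature.) So we can write our own word-boundary test.

The solution proposed elsewhere, using the regex /(\W|^)(fancy namé|namé)(\W|$)/ig, yields false positives on text such as "naméé".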
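The false positive is easy to reproduce. In this sketch the replacement string is borrowed from the question; the point is only that \W happily accepts the second é as a "delimiter".

```javascript
const naive = /(\W|^)(fancy namé|namé)(\W|$)/ig;

// "naméé" is a different word, yet its first four letters get linked.
const out = "naméé".replace(naive, '$1<a href="#">$2</a>$3');

console.log(out);
```

The output wraps "namé" in a link and leaves the trailing é outside it.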

String.prototype.isWordCharAt = function(i) {
    // should work for European languages and Unicode
    return (this.charAt(i) >= 'A' && this.charAt(i) <= 'Z')
        || (this.charAt(i) >= 'a' && this.charAt(i) <= 'z')
        || (this.charCodeAt(i) >= 0xC0 && this.charCodeAt(i) < 0x2000)
    ;
};

"Namé. Goal: Some Fancy Namé. Namé. Nénamé. Namée. Nénamée. Namé"
.replace(/(Namé|Fancy Namé)/ig, function(
match, part1, /* part2, part3, ... */ offset, fullText) {
  // Keep in mind that the number of arguments changes
  // if the number of capturing parenthesis in regexp changes.
  // We could use 'arguments' pseudo-array instead.
  var len1 = part1.length;
  var leftWordBoundary;
  var rightWordBoundary;

  if (offset === 0) {
    leftWordBoundary = fullText.isWordCharAt(offset);
  }
  else {
    leftWordBoundary = (fullText.isWordCharAt(offset - 1)
      != fullText.isWordCharAt(offset));
  }

  if (offset + len1 == fullText.length) {
    rightWordBoundary = fullText.isWordCharAt(offset + len1 - 1);
  }
  else {
    rightWordBoundary = (fullText.isWordCharAt(offset + len1 - 1)
      != fullText.isWordCharAt(offset + len1));
  }

  if (leftWordBoundary && rightWordBoundary) {
    return '<a href="#">' + part1 + '</a>';
  }
  else {
    return part1;
  }
});
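The same callback idea can be sketched without extending String.prototype. This variant assumes an ES2018 engine so that `\p{…}` can define the word characters; that definition (letters, marks, numbers, underscore) and the helper names are my choices, and it only looks at single UTF-16 code units, so characters outside the BMP are not handled.

```javascript
// Assumed word definition: letters, combining marks, numbers, underscore.
const WORD_CHAR = /[\p{L}\p{M}\p{N}_]/u;

// Positions outside the string count as non-word.
function isWordCharAt(text, i) {
  return i >= 0 && i < text.length && WORD_CHAR.test(text.charAt(i));
}

function linkNames(text) {
  // No capture groups, so the callback receives (match, offset, fullText).
  return text.replace(/Fancy Namé|Namé/gi, (match, offset, fullText) => {
    const end = offset + match.length;
    // Both ends of the match must sit on a word/non-word transition.
    const left =
      isWordCharAt(fullText, offset - 1) !== isWordCharAt(fullText, offset);
    const right =
      isWordCharAt(fullText, end - 1) !== isWordCharAt(fullText, end);
    return left && right ? `<a href="#">${match}</a>` : match;
  });
}

console.log(linkNames("Namé. Nénamé. Namée."));
```

Dropping the capture group avoids the shifting-argument problem that the comments in the code above warn about.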

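Answer 4: (score: 0)

If you want to match "my_word", you can use a negative lookbehind, ?<!, in front of it and a negative lookahead, ?!, after it.

This checks that the word has no letter directly before it and no letter directly after it:

new RegExp(`(?<![A-Za-zÀ-ÖØ-öø-ÿ])my_word(?![A-Za-zÀ-ÖØ-öø-ÿ])`, "gi");

The - denotes a range of character codes. The table here shows exactly which codes you need: http://seamons.com/projects/js/ascii_table.html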
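Since the question is about a dynamic regex, here is a sketch of building that pattern from a list of names. It assumes an engine with lookbehind support (ES2018 or later); `escapeRegExp`, `buildNameRegex`, and the sample names are hypothetical.

```javascript
// Escape regex metacharacters so that names are matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const LETTER = "[A-Za-zÀ-ÖØ-öø-ÿ]";

// Longer names go first so that they win over their own suffixes.
function buildNameRegex(names) {
  const alternatives = [...names]
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join("|");
  return new RegExp(`(?<!${LETTER})(${alternatives})(?!${LETTER})`, "gi");
}

const re = buildNameRegex(["Namé", "Fancy Namé", "J. R. Namé"]);
const out = "Fancy Namé met J. R. Namé, not Namée.".replace(re, "[$1]");

console.log(out);
```

Escaping matters for names containing characters such as "." that would otherwise act as wildcards.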

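Answer 5: (score: 0)

As other answerers have already pointed out, the JS regex engine does not consider "é" a word character. Since that is the case, and you want to match that letter followed by another non-word character, you can use the \B assertion here: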

> "Goal: Some Fancy Namé. Awesome.".replace(/\b(Fancy Namé|Namé)\B/i, '<a href="#">$1</a>');
'Goal: Some <a href="#">Fancy Namé</a>. Awesome.'

This may not be the best code if you want its intent to be obvious, but it works in this case, heh.
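For a dynamic pattern, the choice between \b and \B has to depend on the character at each edge of the name. A rough sketch, with hypothetical helper names and assuming the name contains no regex metacharacters:

```javascript
// \b next to an ASCII word character, \B next to anything else.
function edgeAssertion(ch) {
  return /\w/.test(ch) ? "\\b" : "\\B";
}

function nameRegex(name) {
  const first = name[0];
  const last = name[name.length - 1];
  return new RegExp(
    edgeAssertion(first) + "(" + name + ")" + edgeAssertion(last),
    "i"
  );
}

console.log(nameRegex("Namé").test("Some Fancy Namé. Awesome."));
console.log(nameRegex("Name").test("Some Fancy Name. Awesome."));
console.log(nameRegex("Namé").test("Namés"));
console.log(nameRegex("Namé").test("Naméé"));
```

The last call still matches, because é followed by é is not a boundary for JS either; the trick only guards against ASCII letters following the name.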

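Answer 6: (score: -1)

When working with the regex, maybe try using the \o or \x escapes.

The end of this reference for Javascript regular expressions may help you.

As for the actual octal/hexadecimal value associated with é, I'm not sure.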
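For what it's worth, é is U+00E9, so the hexadecimal escape would be \xE9. A short sketch of what that does and does not change:

```javascript
// \xE9 is just another way of writing é in the pattern.
const plain = /Nam\xE9/.test("Namé");

// The escape changes how the pattern is spelled, not what \b means,
// so the original problem remains.
const withBoundary = /\bNam\xE9\b/.test("Namé.");

console.log(plain, withBoundary);
```

So the escape is useful for keeping source files ASCII-only, but it does not address the word-boundary issue.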