Testing the filtering of illegal characters from a string

Time: 2015-06-10 12:54:42

Tags: regex unicode amazon-cloudsearch

I need to filter illegal Unicode characters out of strings, following the guide on preparing data for Amazon CloudSearch:

Both JSON and XML batches can only contain UTF-8 characters that are valid in 
XML. Valid characters are the control characters tab (0009), carriage return 
(000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC 
10646. FFFE, FFFF, and the surrogate blocks D800–DBFF and DC00–DFFF are 
invalid and will cause errors. (For more information, see Extensible Markup 
Language (XML) 1.0 (Fifth Edition).) 

You can use the following regular expression to match invalid characters 
so you can remove them: /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/ .

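I'm trying to write tests for both the success and the failure cases, and I'm having trouble writing Unicode characters that fall in the forbidden ranges.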

Edit 2: JavaScript is the language I'm trying to write the tests in.

Edit 1: Link to the Amazon CloudSearch documentation: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/preparing-data.html

1 answer:

Answer 0 (score: 2)

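In JavaScript, you can produce these invalid characters in a string with Unicode escape sequences, for example "\uFFFE", "\uFFFF", "\uD800", and so on. But be careful: "\uD83C\uDF4C" is a JavaScript string that represents "🍌", the banana character, Unicode code point U+1F34C. What the Amazon API forbids is a lone surrogate encoded directly as UTF-8. The banana character (U+1F34C) encoded as UTF-8 is valid (the bytes F0 9F 8D 8C), so that surrogate pair is valid. What is invalid is the UTF-8 encoding of D83C on its own, that is, the bytes ED A0 BC.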
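
To make this concrete, here is a minimal sketch of success and failure test cases built from escape sequences. `stripInvalid`, `utf8Hex`, and `CODE_POINT_REGEX` are hypothetical names introduced for illustration and are not part of any AWS SDK; the code assumes a runtime with a global `TextEncoder` (a recent Node.js or a browser). Note that without the `u` flag a JavaScript regex works on UTF-16 code units, so the regex from the guide, used as written, also removes both halves of a valid surrogate pair. The second regex is one possible variant, my assumption rather than something the guide states, that keeps well-formed pairs and removes only lone surrogates.

```javascript
// Regex quoted in the question (from the CloudSearch guide).
// The "g" flag is added here so that every match is removed, not just the first.
const GUIDE_REGEX = /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g;

// Assumed variant: the "u" flag makes the regex match whole code points, and the
// extra range U+10000-U+10FFFF keeps characters written as surrogate pairs.
const CODE_POINT_REGEX =
  /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu;

// Hypothetical helper: remove every character the given regex matches.
function stripInvalid(text, regex = GUIDE_REGEX) {
  return text.replace(regex, "");
}

// Hypothetical helper: UTF-8 bytes of a string as upper-case hex.
function utf8Hex(text) {
  return Array.from(new TextEncoder().encode(text), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

// Failure cases: each of these should be removed by either regex.
const invalidSamples = ["\uFFFE", "\uFFFF", "\uD800", "\uDFFF", "\u0000", "\u001F"];

// Success cases: each of these should survive either regex unchanged.
const validSamples = ["\u0009", "\u000A", "\u000D", " ", "abc", "\uD7FF", "\uE000", "\uFFFD"];

// The banana character U+1F34C, written as a surrogate pair.
const banana = "\uD83C\uDF4C";

for (const sample of invalidSamples) {
  console.assert(stripInvalid("a" + sample + "b") === "ab", "should be stripped");
}
for (const sample of validSamples) {
  console.assert(stripInvalid(sample) === sample, "should be kept");
}

console.log(utf8Hex(banana));                                  // F0 9F 8D 8C
console.log(stripInvalid(banana) === "");                      // true: guide regex drops the pair
console.log(stripInvalid(banana, CODE_POINT_REGEX) === banana); // true: variant keeps the pair
console.log(stripInvalid("a\uD83Cb", CODE_POINT_REGEX));       // ab (lone surrogate removed)
```

`TextEncoder` cannot be used to produce the invalid bytes ED A0 BC: it replaces a lone surrogate with U+FFFD before encoding, so a test for that byte sequence would have to construct the bytes by hand.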