Question

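A given XML file, declared as UTF-8 encoded, fails xmllint. On the assumption that non-UTF-8 characters were causing the error, the following sed command was run against the file:

sed 's/[^\x00-\x7F]//g' file.xml

Either the command is wrong or non-UTF-8 characters are not the problem, because xmllint still fails after sed has been run. The first question is: is the sed regular expression correct?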

= = = = =

Here is the output from xmllint:

$ xmllint file.xml
file.xml:35533: parser error : CData section not finished
<img alt="Diets of 2013" src="h
What You Eat: Foods low in sugar and carbs and high in fat—80% of cal
^
file.xml:35533: parser error : PCDATA invalid Char value 31
What You Eat: Foods low in sugar and carbs and high in fat—80% of cal
^
file.xml:35588: parser error : Sequence ']]>' not allowed in content
as.people.com/2013/11/07/kerry-washington-pregnant-diet-green-smoothie-recipe/"]
^

= = = = =

Update: When the file is viewed in TextMate, one character is displayed as <US>. If that character is deleted from the file by hand, the file passes xmllint.

Answer 1

Removing specific code points from the Unicode table is a bit tricky with sed.
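As a rough check of why the expression in the question leaves the reported error in place: the character xmllint complains about (Char value 31, shown as <US>) is itself inside the range 0x00-0x7F, so a filter that only deletes bytes outside that range never touches it. The sketch below uses tr as a byte-level stand-in for the sed expression, since the \xHH escapes assume GNU sed; the sample string and variable names are made up for illustration.

```shell
# Sample text containing US (octal 037, decimal 31) and a UTF-8 em dash (bytes E2 80 94).
sample=$(printf 'fat\037\342\200\22480%%')

# Byte-level stand-in for the question's sed expression:
# delete every byte outside the range 0x00-0x7F.
ascii_only=$(printf '%s' "$sample" | LC_ALL=C tr -cd '\000-\177')

# Count how many US bytes are still present after filtering.
us_left=$(printf '%s' "$ascii_only" | LC_ALL=C tr -cd '\037' | wc -c)
echo "US bytes left: $((us_left))"
```

The em dash is removed, but the US byte is still there, which matches xmllint continuing to fail after the sed run.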

If you need to target specific Unicode character categories, Perl makes more sense.

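perl -CSD -i -pe 's/(?![\t\n\r])\p{Cc}//g' file

This removes all control characters except TAB, CR, and LF. Here -i edits the file in place, and -CSD makes Perl decode the file as UTF-8 so that \p{Cc} is tested against characters rather than individual bytes.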
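If Perl is not available, a narrower byte-level filter may be enough for the error shown in the question. The sketch below is a different technique from the Perl one-liner: it only deletes the C0 control bytes that XML 1.0 does not allow (0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F) and leaves DEL and the C1 range alone. Because every byte of a multi-byte UTF-8 sequence is 0x80 or higher, it should not damage non-ASCII text. The function name strip_xml_controls is hypothetical.

```shell
# Hypothetical helper: strip the C0 control bytes that XML 1.0 forbids,
# keeping TAB (011), LF (012) and CR (015). Reads stdin, writes stdout.
strip_xml_controls() {
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Demonstration on a small sample instead of the real file.
cleaned=$(printf 'high in fat\037 80%%\tof cal\n' | strip_xml_controls)
printf '%s\n' "$cleaned"
```

On the real file this could be used as strip_xml_controls < file.xml > cleaned.xml, followed by xmllint --noout cleaned.xml to re-check it (cleaned.xml is an assumed output name).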
