Question

I am trying to grep for the hexadecimal values of a range of UTF-8 encoded characters, and I want only that specific range of characters to be returned. I currently have this:

grep -P -n "[\xB9-\xBF]" $str_st_location >> output_st.txt

But this returns every character that has any of those hex values in its hex representation, i.e. it returns anything from 00B9 to FFB9 as long as the B9 is present.

Is there a way I can specify using grep that I only want the exact/specific hex value range I search for?

Sample Input:

STRING_OPEN
Open
æ–å¼€
Ouvert
Abierto
ÐžÑ‚ÐºÑ€Ñ‹Ñ‚Ð¾
Abrir

With my grep statement, it should return only the 3rd and 6th lines, but it also returns Russian and Chinese text in my file, because the byte ranges for those languages include the hex values I'm searching for, like these:

断开
Открыто

I can't give out more sample input unfortunately as it's work related.

EDIT: Actually the below code snippet worked!

grep -P  -n "[\x{00B9}-\x{00BF}]" $str_st_location > output_st.txt

It found all the corrupted characters and there were no false positives. The only issue now is that the lines with the corrupted characters automatically get "uncorrupted", i.e. when I open the file, grep's output is the corrected version of the corrupted characters. For example, it finds æ–å¼€ and in the text file it is shown as 断开.

Answer 1

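Since you are using -P, you are probably using GNU grep, because -P is a GNU grep extension. Your command works with GNU grep 2.21 and pcre 8.37 in a UTF-8 locale, but there have been bugs in the past involving multibyte characters and character ranges. You are probably using an older version, or your locale may be set to one that uses single-byte characters.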
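
A quick way to check both possibilities, as a rough sketch: the first command shows which grep is installed, and `locale charmap` shows the character encoding of the active locale. The assumption here is that the \x{00B9} form only matches whole characters when that encoding is UTF-8.

```shell
# Show which grep is installed (GNU grep prints its version on the first line).
grep --version | head -n 1

# Show the character encoding of the active locale, e.g. UTF-8.
locale charmap

# For comparison: the charset reported when the C locale is forced.
LC_ALL=C locale charmap
```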

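If you don't want to upgrade, you can match this character range by matching individual bytes, which should work in older versions. You need to convert the characters to bytes and search for the byte values. Assuming UTF-8, U+00B9 is C2 B9 and U+00BF is C2 BF. Setting LC_CTYPE to a locale that uses single-byte characters (such as C) ensures that grep matches individual bytes even in versions that correctly support multibyte characters.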

LC_CTYPE=C grep -P -n "\xC2[\xB9-\xBF]" $str_st_location >> output_st.txt
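
As an illustration of where the C2 B9 and C2 BF bytes come from, here is a minimal sketch. The helper name `utf8_hex` is hypothetical, and it assumes `iconv` and `od` are available. It relies on the fact that for U+0080 to U+00FF the code point equals the ISO-8859-1 byte value. The demonstration at the end uses LC_ALL=C rather than LC_CTYPE=C, on the assumption that LC_ALL may already be set in the environment, in which case it would override LC_CTYPE.

```shell
# Hypothetical helper: print the UTF-8 bytes (as hex) of a Latin-1 character,
# given its byte value in octal (octal escapes are portable in printf).
utf8_hex() {
    printf "\\$1" | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

utf8_hex 271; echo   # U+00B9 (octal 271), prints c2b9
utf8_hex 277; echo   # U+00BF (octal 277), prints c2bf

# Demonstration on a throwaway file:
#   line 1 holds U+00BC, encoded as C2 BC
#   line 2 holds U+5F00, encoded as E5 BC 80 (shares the byte BC, no C2 prefix)
tmp=$(mktemp)
printf '\302\274\n\345\274\200\n' > "$tmp"

# Byte-wise match; this should list line 1 only.
LC_ALL=C grep -P -n '\xC2[\xB9-\xBF]' "$tmp"

rm -f "$tmp"
```

Requiring the C2 lead byte is what removes the false positives: a lone [\xB9-\xBF] also hits continuation bytes inside longer sequences such as E5 BC 80.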

How to grep for exact hexadecimal value of characters
