Where are the two missing characters when counting with awk?

Date: 2017-04-10 03:30:40

Tags: bash awk

Here is the sample test file, rime.txt.

file rime.txt
rime.txt: UTF-8 Unicode text

rime.txt

wc  -c  rime.txt
25483 rime.txt
awk '{num=num+length($0)}END{print num}' rime.txt
24648

length($0) counts whitespace but does not count the trailing newline (0x0a).
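A quick check of that claim, as a minimal sketch on a hypothetical one-line input (the sample string is an assumption, not taken from rime.txt):

```shell
# Hypothetical input: "a b" plus a newline, i.e. 4 bytes in total.
sample='a b'

# wc -c counts every byte, including the trailing newline.
bytes=$(( $(printf '%s\n' "$sample" | wc -c) ))

# length($0) counts the record without its newline terminator.
len=$(printf '%s\n' "$sample" | awk '{ print length($0) }')

echo "bytes=$bytes length=$len"
```

The space is included in the length; only the newline is left out, so the two numbers differ by one per line.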

awk 'END{print NR}' rime.txt
833

There are 833 newlines (0x0a) in rime.txt.

echo "25483-24648-833" |bc
2

Where are the two characters that awk doesn't count?

wc -m rime.txt
25481 rime.txt

There are two bytes that don't map to any character. What are they?
How can I find them?

2 answers:

Answer 0: (score: 0)

wc -c counts bytes, not characters. With a multibyte encoding (for example, any UTF variant), it will not give you the correct character count.

Whatever the encoding, use wc -m to get the character count (it decodes the file according to the current locale):

wc -m rime.txt

As you might expect, for the ASCII character set (more precisely, for any single-byte encoding), wc -c and wc -m give the same count.
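A minimal sketch of the difference, using a hypothetical sample line rather than rime.txt. Because wc -m depends on the current locale, the character count here is derived instead by subtracting UTF-8 continuation bytes (0x80 to 0xBF) from the byte count, which assumes the input is valid UTF-8:

```shell
# Hypothetical sample: "Contrée" plus a newline; \303\251 is é in UTF-8.
tmp=$(mktemp)
printf 'Contr\303\251e\n' > "$tmp"

# Total bytes, as wc -c reports them.
bytes=$(( $(LC_ALL=C wc -c < "$tmp") ))

# Continuation bytes (octal 200-277): every byte of a multibyte
# character after its first one.
cont=$(( $(LC_ALL=C tr -cd '\200-\277' < "$tmp" | wc -c) ))

# For valid UTF-8, characters = bytes - continuation bytes.
chars=$(( bytes - cont ))

echo "bytes=$bytes chars=$chars"
rm -f "$tmp"
```

For a file that is pure ASCII, cont is zero and the two counts coincide.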

Answer 1: (score: 0)

wc -c rime.txt
25482 rime.txt
wc -m  rime.txt
25480 rime.txt
grep -P "[^\x00-\x7F]" rime.txt
       That come from a far Contrée.
     And now all in mine own Countrée
awk '/[^\x00-\x7F]/{print}' rime.txt
       That come from a far Contrée.
     And now all in mine own Countrée

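The character é is U+00E9 (e9 in hex). In a UTF-8 file it is stored as two bytes, c3 a9.
wc -c (byte mode) counts both bytes; wc -m (character mode) counts each é as a single character. rime.txt contains two é,
so wc -c rime.txt reports two more than wc -m rime.txt. In the question, length($0) plus the 833 newlines equals the wc -m result, so awk was counting characters as well.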
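To see the bytes themselves, a hex dump helps. A small sketch, assuming od and tr are available and using a hypothetical two-line stand-in for the matching lines of rime.txt:

```shell
# Hypothetical stand-in for the two lines that contain é.
tmp=$(mktemp)
printf 'a far Contr\303\251e.\nmine own Countr\303\251e\n' > "$tmp"

# Delete all ASCII bytes (octal 000-177) and count what is left.
nonascii=$(( $(LC_ALL=C tr -d '\000-\177' < "$tmp" | wc -c) ))

# Dump the remaining bytes as hex, with od's spacing stripped.
hex=$(LC_ALL=C tr -d '\000-\177' < "$tmp" | od -An -tx1 | tr -d ' \n')

echo "non-ASCII bytes: $nonascii ($hex)"
rm -f "$tmp"
```

Each é appears as the byte pair c3 a9. Four non-ASCII bytes encode two characters, which accounts for a byte count that is two higher than the character count.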