Question

Here is a test sample file, rime.txt.

file rime.txt
rime.txt: UTF-8 Unicode text

rime.txt

wc  -c  rime.txt
25483 rime.txt
awk '{num=num+length($0)}END{print num}' rime.txt
24648

length($0) includes whitespace but not the newline (0x0a).

awk 'END{print NR}' rime.txt
833

There are 833 newlines (0x0a) in rime.txt.

echo "25483-24648-833" |bc
2

Where are the two characters that awk can't count?

wc -m rime.txt
25481 rime.txt

There are two bytes that don't map to any character. What are they?
How can I find them?

Answer 1

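wc -c counts bytes, not characters. If the file uses a multibyte encoding (for example, any UTF variant), it will not give you the correct character count.

Regardless of the encoding, you need to use wc -m to get the character count:

wc -m rime.txt

As you would expect, for the ASCII character set (more precisely, for any single-byte encoding), wc -c and wc -m give the same count.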
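
To make the byte/character distinction concrete, here is a minimal sketch on a one-line sample rather than rime.txt itself. The temporary file, the helper name awk_bytes, and the locale name C.UTF-8 are assumptions for illustration; substitute any UTF-8 locale that `locale -a` lists on your system.

```shell
# Hypothetical one-line sample; \303\251 is é written as its two UTF-8 bytes.
sample=$(mktemp)
printf 'Contr\303\251e\n' > "$sample"

# wc -c counts bytes regardless of locale: 5 + 2 + 1 + newline = 9.
wc -c < "$sample"

# wc -m counts characters as decoded in the current locale.
# C.UTF-8 is an assumption; with a UTF-8 locale in effect this reports 8.
LC_ALL=C.UTF-8 wc -m < "$sample"

# In the C locale awk's length() counts bytes, so adding one newline per
# record reproduces wc -c (for a file whose last line ends in a newline).
awk_bytes() {
    LC_ALL=C awk '{ n += length($0) } END { print n + NR }' "$1"
}
awk_bytes "$sample"   # prints 9
```

Applied to the question, this suggests that the awk run there counted characters (a UTF-8 locale was in effect), which is why its total plus the 833 newlines matches wc -m rather than wc -c.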

Answer 2

wc -c rime.txt
25482 rime.txt
wc -m  rime.txt
25480 rime.txt
grep -P "[^\x00-\x7F]" rime.txt
       That come from a far Contrée.
     And now all in mine own Countrée
awk '/[^\x00-\x7F]/{print}' rime.txt
       That come from a far Contrée.
     And now all in mine own Countrée

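The character é is code point U+00E9 (e9 in Latin-1); in a UTF-8 file it is stored as the two bytes c3 a9.
wc -c (byte mode) counts both of those bytes, while wc -m (character mode) counts é as a single character. rime.txt contains two é.
So wc -c rime.txt reports two more than wc -m rime.txt.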
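
To look at the bytes themselves rather than the matching lines, one option is to strip everything in the 7-bit ASCII range and inspect what is left. This is only a sketch: the function names and the two-line sample file are hypothetical, and the sample stands in for rime.txt.

```shell
# Hypothetical two-line sample standing in for rime.txt;
# \303\251 is é written as its two UTF-8 bytes.
sample=$(mktemp)
printf 'That come from a far Contr\303\251e.\nAnd now all in mine own Countr\303\251e\n' > "$sample"

# In the C locale tr works on bytes, so deleting \000-\177 leaves only
# the bytes outside 7-bit ASCII. Count them:
count_nonascii_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Same filter, but dump the surviving bytes in hex.
dump_nonascii_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | od -An -tx1
}

count_nonascii_bytes "$sample"   # prints 4: two bytes for each é
dump_nonascii_bytes "$sample"    # prints c3 a9 c3 a9
```

If the two é are the only non-ASCII characters in rime.txt, the same functions should report four bytes there as well. Each é is one character stored in two bytes, so the byte count exceeds the character count by exactly two.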

Where are the two characters that go missing when counting with awk?
