Here is a test sample file, rime.txt.
file rime.txt
rime.txt: UTF-8 Unicode text
wc -c rime.txt
25483 rime.txt
awk '{num=num+length($0)}END{print num}' rime.txt
24648
length($0) counts whitespace but not the trailing newline (0x0a).
awk 'END{print NR}' rime.txt
833
There are 833 newlines (0x0a) in rime.txt.
echo "25483-24648-833" |bc
2
Where are the two characters that awk does not count?
wc -m rime.txt
25481 rime.txt
There are two bytes that do not map to any character. What are they?
How can I find them?
Answer 0 (score: 0)
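wc -c counts bytes, not characters. With a multibyte encoding (for example, any UTF variant), it will not give the correct character count.
To get the number of characters regardless of encoding, use wc -m:
wc -m rime.txt
As you would expect, for ASCII text (more precisely, for any single-byte encoding), wc -c and wc -m return the same count.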
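The gap between the two counts can be reproduced on a tiny sample without depending on the current locale. This is a minimal sketch, assuming UTF-8 input; the two-line sample and the variable names are illustrative and are not taken from rime.txt:

```shell
# Hypothetical two-line sample; \303\251 is the UTF-8 encoding of é.
sample=$(mktemp)
printf 'far Contr\303\251e.\nown Countr\303\251e\n' > "$sample"

# Total bytes, the same figure wc -c reports for the file.
bytes=$(wc -c < "$sample" | tr -d ' ')

# UTF-8 continuation bytes (0x80-0xBF) never start a character,
# so bytes minus continuation bytes gives the character count.
cont=$(LC_ALL=C tr -cd '\200-\277' < "$sample" | wc -c | tr -d ' ')
chars=$((bytes - cont))

echo "bytes=$bytes chars=$chars extra=$cont"
rm -f "$sample"
```

For this sample the byte count exceeds the character count by 2, one extra byte per é, which mirrors the difference seen in rime.txt.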
Answer 1 (score: 0)
wc -c rime.txt
25482 rime.txt
wc -m rime.txt
25480 rime.txt
grep -P "[^\x00-\x7F]" rime.txt
That come from a far Contrée.
And now all in mine own Countrée
awk '/[^\x00-\x7F]/{print}' rime.txt
That come from a far Contrée.
And now all in mine own Countrée
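The character é is U+00E9 (hex e9), which lies outside the ASCII range. In UTF-8 it is encoded as two bytes, c3 a9, so wc -c (byte mode) counts two bytes for it while wc -m (character mode) counts one character. rime.txt contains two é characters.
So wc -c rime.txt reports two more than wc -m rime.txt.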
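To look at the two bytes directly, od can dump a single é. A small sketch, writing the character with octal escapes so that the result does not depend on how the script itself is encoded; the variable name hex is only for illustration:

```shell
# \303\251 is é (U+00E9) encoded in UTF-8.
# od -An -tx1 prints each byte as two hex digits without an address column;
# tr strips the spaces and newline that od adds around them.
hex=$(printf '\303\251' | od -An -tx1 | tr -d ' \n')
echo "$hex"
```

The dump shows two bytes for one character, which is why each é adds one to the difference between wc -c and wc -m.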