Question

I am new to Unix commands and regular expressions. I applied the following commands to this English corpus, and I am not sure they are correct.

a. Count the total number of words (tokens). I got 2685545

wc -w  testFile.txt

b. Count the total number of unique words (types). I wrote two different commands and am not sure which one is correct. Number of types: 657286 or 74066

cat  testFile.txt |perl -pe 's/\s/\n/g;' |sort  |uniq -c   

or 

cat  testFile.txt |perl -pe 's/\s/\n/g;' |sort |uniq -c |wc -w

c. Count the total number of unique words, ignoring case. I got 1910951

cat testFile.txt |perl -pe 's/[a-z]\w+/\n/g;' |sort |uniq -c

d. Count the total number of purely numeric tokens.

cat  testFile.txt |perl -pe 's/\s/\n/g;' |grep '[0-9]{1,}' |sort |uniq -c |wc -w

e. Count the total number of numbers that contain non-word characters (e.g. 8,000.00). I got 18666230

wc -c  testFile.txt |perl -pe ’s/[0-9]{1,}\W+[0-9]{1,}\W+[0-9]
{1,}/\n/g;’

f. Count the total number of words that start with a capital letter. I got 1048

cat  testFile.txt |perl -pe 's/[A-Z]\w+/\n/g;' |egrep '[A-Z]\w+' |wc -w

g. What are the 15 most frequent sentences?

cat testFile.txt |perl -pe 's/\s/\n/g;' |sort |uniq -c |sort -nr 
|head -15

h. What are the most frequent capitalized words (not sentence-initial)?

perl -nE 'say $1 while /(\w*[A-Z]+\w*)/g' testFile.txt

I got this list (the screenshot shows part of the output):

i. Count all occurrences of Roman numerals. I got 2684068

cat  testFile.txt |egrep -i '[IX|IV|V?I{1,3}]' |wc -w

Thank you very much for your help!

Answer 1

Let's look at the first few lines of the text you posted:

$ cat file.txt
RESOLUTION 55/100 
Adopted at the 81st plenary meeting, on 4 December 2000, on the recommendation of the Committee (A/55/602/Add.2 and Corr.1, para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as follows: 
55/100. Respect for the right to universal freedom of travel and the vital importance of family reunification 
The General Assembly, 
Reaffirming that all human rights and fundamental freedoms are universal, indivisible, interdependent and interrelated, 
Recalling the provisions of the Universal Declaration of Human Rights, as well as article 12 of the International Covenant on Civil and Political Rights, 
Stressing that, as stated in the Programme of Action of the International Conference on Population and Development, family reunification of documented migrants is an important factor in international migration and that remittances by documented migrants to their countries of origin often constitute a very important source of foreign exchange and are instrumental in improving the well-being of relatives left behind, 
Recalling its resolution 54/169 of 17 December 1999, 
1. Once again calls upon all States to guarantee the universally recognized freedom of travel to all foreign nationals legally residing in their territory; 
2. Reaffirms that all Governments, in particular those of receiving countries, must recognize the vital importance of family reunification and promote its incorporation into national legislation in order to ensure protection of the unity of families of documented migrants; 
3. Calls upon all States to allow, in conformity with international legislation, the free flow of financial remittances by foreign nationals residing in their territory to their relatives in the country of origin; 
4. Also calls upon all States to refrain from enacting, and to repeal if it already exists, legislation intended as a coercive measure that discriminates against individuals or groups of legal migrants by adversely affecting family reunification and the right to send financial remittance to relatives in the country of origin; 
5. Decides to continue its consideration of this question at its fifty-seventh session under the item entitled "Human rights questions".

The first thing to do is define what a 'word' is.

RESOLUTION clearly is one; what about 55/100?
What about text inside quotes or parentheses?
What about 'big'? Are 'big', 'Big', 'big!' and 'big?' the same word or four different words? wc treats them as four different words.
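To make the point about surface forms concrete, here is a small sketch. The sample string is invented for illustration and is not taken from the corpus. It shows that splitting on whitespace alone keeps every spelling and punctuation variant apart:

```shell
# Hypothetical sample: four surface forms of one word, separated by spaces.
sample='big Big big! big?'

# wc -w counts whitespace-delimited tokens.
tokens=$(printf '%s\n' "$sample" | wc -w | tr -d ' ')

# One token per line, then de-duplicate byte-wise (LC_ALL=C avoids
# locale-dependent collation treating some forms as equal).
raw_types=$(printf '%s\n' "$sample" | tr -s ' ' '\n' | LC_ALL=C sort -u | wc -l | tr -d ' ')

echo "tokens: $tokens, raw types: $raw_types"
```

Without any normalization, both numbers come out the same here, because no two forms are byte-identical.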

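Assuming that by 'word' you mean the lowercase version with all non-word characters stripped, you can use a regular expression to find all the words and then lowercase them so that they compare equal.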

In Perl, you can use the regex /\b(\p{L}+)/ to find words.

$ perl -lne 'while (/\b(\p{L}+)/g) {$h{lc($1)}++;} END{foreach (sort { $h{$b} <=> $h{$a} } keys(%h)) {print "$_: $h{$_}"}}' file.txt
of: 25
the: 20
to: 13
and: 11
in: 10
all: 6
that: 5
as: 5
migrants: 4
reunification: 4
family: 4
their: 4
its: 4
by: 4
rights: 4
on: 4
international: 4
...
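If you would rather stay with plain Unix tools, a rough equivalent of the frequency table above can be sketched as a pipeline. The function name `word_freq` is made up for this example, and it assumes ASCII English text, so it only approximates `\p{L}+`. For instance, it turns "81st" into the word "st", which the Perl regex skips because there is no word boundary before the "s".

```shell
# word_freq FILE: print "count word" lines, most frequent first.
# Assumption: ASCII letters only (an approximation of \p{L}+).
word_freq() {
  LC_ALL=C tr -cs 'A-Za-z' '\n' < "$1" |   # each run of non-letters becomes one newline
    LC_ALL=C tr 'A-Z' 'a-z' |              # lowercase so case variants merge
    grep -v '^$' |                         # drop the empty line left by a leading non-letter
    LC_ALL=C sort | uniq -c |              # count identical words
    LC_ALL=C sort -k1,1nr -k2,2            # by count descending, then alphabetically
}
```

You would call it as `word_freq file.txt | head -15`. Ties are ordered alphabetically here, so the order of equally frequent words may differ from the Perl output.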

Adding the unique word count and the total word count for your file, I get:

$ perl -lne 'while (/\b(\p{L}+)/g) {$h{lc($1)}++;} 
                  END{print "unique words: ".scalar keys %h; 
                  foreach (values %h) { $s+=$_ }
                  print "total words: $s";
                  foreach (sort { $h{$b} <=> $h{$a} } keys(%h)) {print "$_: $h{$_}"}}' testFile.txt | head
unique words: 11263
total words: 2616047
the: 272618
of: 176015
and: 138295
to: 101670
in: 67440
on: 36025
for: 32558
a: 24742
