Question

Donald Knuth was once given the task of writing a literate program that computes the word frequencies of a file.

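Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.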

Doug McIlroy famously rewrote the 10 pages of Pascal in a few lines of sh:

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

As a little exercise, I converted that to PowerShell:

(-split ((Get-Content -Raw test.txt).ToLower() -replace '[^a-zA-Z]',' ')) |
  Group-Object |
  Sort-Object -Property count -Descending |
  Select-Object -First $Args[0] |
  Format-Table count, name

I like that PowerShell combines sort | uniq -c into a single Group-Object.

The first line looks ugly, so I wonder whether it can be written more elegantly. Maybe there is a way to load the file with a regex delimiter?

One obvious way to shorten the code would be to use aliases, but that does not help readability.
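For reference when comparing the two versions, the sh pipeline quoted above can be read stage by stage. The sketch below wraps it in a function so that it can be tried on sample input. The function name wordfreq is hypothetical, and the count argument is quoted as "${1}q" rather than the bare ${1}q of the original.

```shell
# wordfreq N: print the N most frequent words read from stdin.
# The name "wordfreq" is an assumption made for this illustration.
wordfreq() {
  tr -cs A-Za-z '\n' |   # replace each run of non-letters with one newline
  tr A-Z a-z |           # fold everything to lower case
  sort |                 # bring identical words next to each other
  uniq -c |              # collapse each run into "count word"
  sort -rn |             # sort numerically, highest count first
  sed "${1}q"            # quit after printing the first N lines
}

# Sample input: "two" occurs four times, "one" and "three" twice each.
printf 'One one\ntwo\ntwo\nthree,three.\ntwo;two\n' | wordfreq 1
```

With this sample the call prints a single line holding the count 4 and the word two. The amount of padding in front of the count depends on the uniq implementation.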

Answer 1

I would do it this way.

PS C:\users\me> Get-Content words.txt
One one
two
two
three,three.
two;two


PS C:\users\me> (Get-Content words.txt) -Split '\W' | Group-Object

Count Name                      Group
----- ----                      -----
    2 One                       {One, one}
    4 two                       {two, two, two, two}
    2 three                     {three, three}
    1                           {}
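The row with an empty Name presumably comes from the line "three,three.", which ends in a delimiter, so splitting on a single non-word character leaves an empty string behind. The same effect can be reproduced with tr from the sh version in the question. This is only an analogue sketched in sh, not PowerShell behaviour, and the function names split_each and split_runs are hypothetical.

```shell
# Without -s every non-letter becomes its own newline, so the "." followed
# by the line's own newline leaves one empty line in the output.
split_each() { tr -c A-Za-z '\n'; }

# With -s a run of non-letters is squeezed into a single newline,
# so no empty line appears.
split_runs() { tr -cs A-Za-z '\n'; }

printf 'three,three.\n' | split_each | wc -l   # 3 lines, the last one empty
printf 'three,three.\n' | split_runs | wc -l   # 2 lines, none empty
```

Matching a run of delimiters instead of a single one (for example '\W+' instead of '\W') is the regex counterpart of the -s flag.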

Answer 2

Thanks to js2010 and LotPings for the important hints. To document what is probably the best solution:

$Input -split '\W+' |
  Group-Object -NoElement |
  Sort-Object count -Descending |
  Select-Object -First $Args[0]

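What I learned:

$Input contains stdin. This is closer to McIlroy's code than Get-Content file.
-split can actually take a regex delimiter.
The -NoElement parameter lets me get rid of the Format-Table line.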

Answer 3

Windows 10 64-bit. PowerShell 5

How to find which whole word (the, not -the- or wea the r) is used most often in a text file, regardless of case, and how many times it occurs, using PowerShell:

Replace 1.txt with your file.

$z = gc 1.txt -raw
-split $z | group -n | sort c* | select -l 1

Result:

Count Name
----- ----
30    THE

在Powershell中优雅地显示词频
