Question

Suppose I have a text file test.txt on the C: drive.

On the face of things, we seem to be merely talking about text-based files, containing only 
the letters of the English Alphabet (and the occasional punctuation mark).
On deeper inspection, of course, this isn't quite the case. What this site
offers is a glimpse into the history of writers and artists bound by the 128 
characters that the American Standard Code for Information Interchange (ASCII) 
 allowed them. The focus is on mid-1980's textfiles and the world as it was then, 
but even these files are sometime retooled 1960s and 1970s works, and offshoots 
 of this culture exist to this day.

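I want to split all the lines into words and save them as a new file. In the new file, each line should contain only one word.

So the new file would be: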

       On
       the
       face
       of
       things
       we
       seem
       to
       ....

The delimiter is a single space, and all punctuation marks should be skipped.

Answer 1

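You didn't try anything yourself. Next time I'm voting to close. PowerShell uses 99% C# syntax and "all" .NET classes, so if you know C#, five minutes on Google and trying out a few commands will get you far in PowerShell.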

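# Create an empty array to collect the words
$words = @()

# Read all lines of the input file
$lines = [System.IO.File]::ReadAllLines("C:\Users\Frode\Desktop\in.txt")

# Split each line on space, comma, and period. The [char[]] cast makes every
# character a separator; without it, PowerShell 6+ binds to the
# Split(string, StringSplitOptions) overload and treats " ,." as one separator.
foreach ($line in $lines) {
    $words += $line.Split([char[]]" ,.", [System.StringSplitOptions]::RemoveEmptyEntries)
}

# Save the words, one per line
[System.IO.File]::WriteAllLines("C:\Users\Frode\Desktop\out.txt", $words)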

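In PowerShell you can also do it like this:

# [char[]] makes each character a separator on all PowerShell versions
Get-Content .\in.txt | ForEach-Object {
    $_.Split([char[]]" ,.", [System.StringSplitOptions]::RemoveEmptyEntries)
} | Set-Content out.txt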

Answer 2

$Text = @'
On the face of things, we seem to be merely talking about
text-based files, containing only  the letters of the English Alphabet
(and the occasional punctuation mark). On deeper inspection, of
course, this isn't quite the case. What this site offers is a glimpse
into the history of writers and artists bound by the 128  characters
that the American Standard Code for Information Interchange (ASCII)  
allowed them. The focus is on mid-1980's textfiles and the world as it
was then,  but even these files are sometime retooled 1960s and 1970s
works, and offshoots   of this culture exist to this day.
'@

[regex]::split($Text, '\W+')
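
A note on the call above: splitting on `\W+` also breaks at apostrophes and hyphens (so `isn't` comes out as `isn` and `t`), and it yields an empty string whenever the text starts or ends with a non-word character. The following is only a minimal sketch of how the empty entries could be dropped and the words saved one per line; the sample string and the output file name `out.txt` are assumptions, not part of the answer.

```powershell
# Made-up sample input; in practice this would be the $Text here-string above
# or the file content read with Get-Content -Raw.
$Text = "On the face of things, we seem (mostly) fine."

# Split on runs of non-word characters, then drop empty strings
# (the trailing '.' produces one empty entry at the end).
$Words = [regex]::Split($Text, '\W+') | Where-Object { $_ -ne '' }

# Write one word per line; 'out.txt' is a hypothetical output path.
$Words | Set-Content -Path out.txt
```

Whether the apostrophe handling matters depends on how strictly "skip all punctuation" is meant to be read.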

Answer 3

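Here is a solution using regular expressions that will:

Remove special characters
Parse out words based on word boundaries (\b in regular expressions)

Code: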

$Text = @'
On the face of things, we seem to be merely talking about text-based files, containing only 
the letters of the English Alphabet (and the occasional punctuation mark).
On deeper inspection, of course, this isn't quite the case. What this site
offers is a glimpse into the history of writers and artists bound by the 128 
characters that the American Standard Code for Information Interchange (ASCII) 
 allowed them. The focus is on mid-1980's textfiles and the world as it was then, 
but even these files are sometime retooled 1960s and 1970s works, and offshoots 
 of this culture exist to this day.
'@;

# Remove special characters
$Text = $Text -replace '\(|\)|''|\.|,','';
# Match words
$MatchList = ([Regex]'(?<word>\b\w+\b)').Matches($Text);
# Get just the text values of the matches
$WordList = $MatchList | % { $PSItem.Groups['word'].Value; };
# Examine the 'Count' of words
$WordList.Count

The result looks like this:

$WordList[0..9];
On
the
face
of
things
we
seem
to
be
merely
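
To see what the two steps do to the trickier tokens, here is a small sketch on a made-up string (`$Sample` is not from the answer). It uses a character class instead of the alternation and a plain `\w+` pattern, which should behave the same way for this input: the apostrophe is deleted before matching, so `isn't` becomes `isnt`, while the hyphen is left in place and `text-based` is therefore matched as two words.

```powershell
# Hypothetical sample containing an apostrophe, parentheses, and a hyphen.
$Sample = "this isn't quite (ASCII) text-based."

# Step 1: delete parentheses, apostrophes, periods, and commas.
$Clean = $Sample -replace "[()'.,]", ''

# Step 2: collect every run of word characters.
$Words = [regex]::Matches($Clean, '\w+') | ForEach-Object { $_.Value }

$Words
```

Piping `$Words` to `Set-Content` would then write one word per line, as the question asks.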

Answer 4

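I wouldn't bother splitting the string, since you're going to write the result back to a file anyway. Just replace all punctuation (and perhaps also the parentheses) with spaces, replace each run of consecutive whitespace with a newline, and write everything back to a file: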

$in  = 'C:\test.txt'
$out = 'C:\test2.txt'

(Get-Content $in | Out-String) -replace '[.,;:?!()]',' ' -replace '\s+',"`r`n" |
  Set-Content $out
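
The same two replacements can be tried in memory on a short string. This is a sketch under assumptions: `$Sample` is made up, and the `.Trim()` call is an addition that is not in the answer. It is there because whitespace left at the end of the text (for example the newline that `Out-String` appends) would otherwise be turned into one more line break, leaving a blank line at the end of the output.

```powershell
# Hypothetical sample line.
$Sample = "On the face of things, we seem (mostly) fine."

# Punctuation -> space, trim the ends, then every whitespace run -> CRLF.
$Result = ($Sample -replace '[.,;:?!()]', ' ').Trim() -replace '\s+', "`r`n"

# Split again only to inspect the result line by line.
$Lines = $Result -split "`r`n"
$Lines
```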
