Split lines into words, then save as a new file

Time: 2014-01-12 20:25:02

Tags: powershell

Suppose I have a text file, test.txt, on the C: drive.

On the face of things, we seem to be merely talking about text-based files, containing only 
the letters of the English Alphabet (and the occasional punctuation mark).
On deeper inspection, of course, this isn't quite the case. What this site
offers is a glimpse into the history of writers and artists bound by the 128 
characters that the American Standard Code for Information Interchange (ASCII) 
 allowed them. The focus is on mid-1980's textfiles and the world as it was then, 
but even these files are sometime retooled 1960s and 1970s works, and offshoots 
 of this culture exist to this day.

I want to split all the lines into words and then save them as a new file. In the new file, each line contains only one word.

So the new file would be:

       On
       the
       face
       of
       things
       we
       seem
       to
       ....

The delimiter is a single space; please skip all punctuation.

4 answers:

Answer 0: (score: 2)

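You haven't tried anything. Next time I'll vote to close the question. PowerShell uses 99% C# syntax and "all" .NET classes, so if you know C#, spending 5 minutes on Google and trying out some commands will get you far in PowerShell.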

#create array
$words = @()

#read file
$lines = [System.IO.File]::ReadAllLines("C:\Users\Frode\Desktop\in.txt")

#split words (separators passed as a char array so each character is its own delimiter)
foreach ($line in $lines) {
    $words += $line.Split(" ,.".ToCharArray(), [System.StringSplitOptions]::RemoveEmptyEntries)
}

#save words
[System.IO.File]::WriteAllLines("C:\Users\Frode\Desktop\out.txt", $words)

In PowerShell you can also do it this way:

Get-Content .\in.txt | ForEach-Object { 
    $_.Split(" ,.".ToCharArray(), [System.StringSplitOptions]::RemoveEmptyEntries) 
} | Set-Content out.txt

Answer 1: (score: 1)

$Text = @'
On the face of things, we seem to be merely talking about
text-based files, containing only  the letters of the English Alphabet
(and the occasional punctuation mark). On deeper inspection, of
course, this isn't quite the case. What this site offers is a glimpse
into the history of writers and artists bound by the 128  characters
that the American Standard Code for Information Interchange (ASCII)  
allowed them. The focus is on mid-1980's textfiles and the world as it
was then,  but even these files are sometime retooled 1960s and 1970s
works, and offshoots   of this culture exist to this day.
'@

[regex]::split($Text, '\W+')
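
One thing to watch with a '\W+' split: when the text starts or ends with punctuation, the result includes an empty string, and apostrophes and hyphens also act as separators (so "isn't" becomes "isn" and "t"). The following is only a minimal sketch of filtering out the empty entries and writing one word per line; the sample text and the temp-file output path are assumptions made for illustration, not part of the answer.

```powershell
# Hypothetical sample text; in practice this could be the $Text here-string above.
$Text = "On the face of things, we seem (mostly) fine."

# Split on runs of non-word characters, then drop empty strings
# (one appears here because the text ends with punctuation).
$Words = [regex]::Split($Text, '\W+') | Where-Object { $_ -ne '' }

# Write one word per line; the output location is an assumed temp file.
$OutFile = Join-Path ([System.IO.Path]::GetTempPath()) 'words.txt'
$Words | Set-Content -Path $OutFile
```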

Answer 2: (score: 0)

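Here is a solution using regular expressions that will:

  • Remove special characters
  • Parse out words based on word boundaries (\b in regular expressions)

Code: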

$Text = @'
On the face of things, we seem to be merely talking about text-based files, containing only 
the letters of the English Alphabet (and the occasional punctuation mark).
On deeper inspection, of course, this isn't quite the case. What this site
offers is a glimpse into the history of writers and artists bound by the 128 
characters that the American Standard Code for Information Interchange (ASCII) 
 allowed them. The focus is on mid-1980's textfiles and the world as it was then, 
but even these files are sometime retooled 1960s and 1970s works, and offshoots 
 of this culture exist to this day.
'@;

# Remove special characters
$Text = $Text -replace '\(|\)|''|\.|,','';
# Match words
$MatchList = ([Regex]'(?<word>\b\w+\b)').Matches($Text);
# Get just the text values of the matches
$WordList = $MatchList | % { $PSItem.Groups['word'].Value; };
# Examine the 'Count' of words
$WordList.Count

The result looks like this:

$WordList[0..9];
On
the
face
of
things
we
seem
to
be
merely

Answer 3: (score: 0)

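I wouldn't bother splitting the string, since you're writing the result back to a file anyway. Just replace all punctuation (and possibly parentheses) with spaces, replace every run of consecutive whitespace with a newline, and write everything back to a file: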

$in  = 'C:\test.txt'
$out = 'C:\test2.txt'

(Get-Content $in | Out-String) -replace '[.,;:?!()]',' ' -replace '\s+',"`r`n" |
  Set-Content $out
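
Because the joined text typically ends with a line break, the replace-only approach may leave a blank line at the end of the output file (and one at the start if the input begins with whitespace). As a hedged variation, assuming that matters for your use case, the same punctuation replacement can be followed by the -split operator and a filter that drops empty entries. The temp-file paths and sample lines below are hypothetical stand-ins for C:\test.txt and C:\test2.txt.

```powershell
# Hypothetical temp files stand in for the real input and output paths.
$in  = Join-Path ([System.IO.Path]::GetTempPath()) 'split-demo-in.txt'
$out = Join-Path ([System.IO.Path]::GetTempPath()) 'split-demo-out.txt'
Set-Content -Path $in -Value 'On the face of things, we seem', ' to be (merely) talking.'

# Replace punctuation with spaces, split on whitespace runs,
# and discard the empty strings that leading/trailing whitespace produces.
$words = (Get-Content $in | Out-String) -replace '[.,;:?!()]', ' ' -split '\s+' |
    Where-Object { $_ -ne '' }

# Set-Content writes each array element on its own line.
$words | Set-Content $out
```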