Question

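I have a txt file that I open with Python. I am trying to remove the symbols and sort the remaining words alphabetically. Removing periods, commas and so on is not a problem. However, I can't seem to remove the dash surrounded by whitespace when I add it to the list together with the other symbols.

This is an example of what I open: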

content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."

This is what I want (the period removed, along with the dashes that are not attached to a word):

content = "The quick brown fox who was hungry jumps over the 7-year old lazy dog"

But I get either this (all dashes removed):

content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"

Or this (dashes not removed):

content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog"

This is my entire code. Adding content.replace() works, but that is not what I want:

f = open("article.txt", "r")

# Create variable (Like this removing " - " works)
content = f.read()
content = content.replace(" - ", " ")

# Create list
wordlist = content.split()

# Which symbols (If I remove the line "content = content.replace(" - ", " ")", the " - " in this list doesn't get removed here)
chars = [",", ".", "'", "(", ")", "‘", "’", " - "]

# Remove symbols
words = []
for element in wordlist:
    temp = ""
    for ch in element:
        if ch not in chars:
            temp += ch
    words.append(temp)

# Print words, sort alphabetically and do not print duplicates
for word in sorted(set(words)):
    print(word)

It works like this. But when I remove content = content.replace(" - ", " "), the "whitespace + dash + whitespace" entry in chars does not get removed.

If I replace it with "-" (no whitespace), I get this, which I don't want:

content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"

Is it possible to do this with a list like chars, or is .replace() my only option?

Also, is there a particular reason why Python sorts capitalized words alphabetically first, and then the non-capitalized words separately?

Like this (I just added the letters A, B and C to emphasize what I mean):

7-year
A
B
C
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who

Answer 1

You can use re.sub like this:

>>> import re
>>> strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")  # raw string, so \s reaches the regex engine unchanged
>>> content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
>>> strip_chars.sub("", content)
'The quick brown fox who was hungry jumps over the 7-year old lazy dog'
>>> strip_chars.sub("", content).split()
['The', 'quick', 'brown', 'fox', 'who', 'was', 'hungry', 'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog']
>>> print(*sorted(strip_chars.sub("", content).split()), sep='\n')
7-year
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
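
To make it easier to see what each half of the pattern does, here is a small sketch that applies the two alternatives one after the other. The names punctuation and dash_space are made up for this illustration and are not part of the code above.

```python
import re

content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."

# First alternative: single punctuation characters, removed wherever they occur.
punctuation = re.compile(r"[,.'()‘’]")
# Second alternative: a dash or comma, but only when whitespace follows it.
dash_space = re.compile(r"[-,]\s")

no_punct = punctuation.sub("", content)
no_dash = dash_space.sub("", no_punct)

print(no_punct)
print(no_dash)
```

The dash in "7-year" survives because it is followed by a letter rather than whitespace. Note that the second alternative only looks at what comes after the dash, so a dash followed by whitespace would be removed even when it is attached to the preceding word.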

To summarize my comments and put it all together:

from pathlib import Path
from collections import Counter
import re

strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")  # raw string keeps \s intact

article = Path('/path/to/your/article.txt')

content = article.read_text()

words = Counter(strip_chars.sub('', content).split())

for word in sorted(words, key=lambda x: x.lower()):
    print(word)
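
On the ordering question: Python compares strings by Unicode code point, and the uppercase ASCII letters (65 to 90) come before the lowercase ones (97 to 122), which is why capitalized words sort first by default. A minimal sketch with a made-up word list shows the difference a key function makes:

```python
words = ["the", "The", "brown", "A", "7-year", "B"]

# Default ordering compares code points: digits, then uppercase, then lowercase.
default_order = sorted(words)
# Comparing lowercased copies gives a case-insensitive ordering.
caseless_order = sorted(words, key=str.lower)

print(ord("T"), ord("b"))
print(default_order)
print(caseless_order)
```

Because sorted() is stable, "the" and "The" keep their original relative order when their lowercased keys are equal.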

If, for example, The and the should count as duplicate words, just convert content to lowercase. The code then becomes:

from pathlib import Path
from collections import Counter
import re

strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")  # raw string keeps \s intact

article = Path('/path/to/your/article.txt')

content = article.read_text().lower()

words = Counter(strip_chars.sub('', content).split())

for word in sorted(words):
    print(word)

Finally, as a nice side effect of using collections.Counter, you also get a word count in words and can answer questions such as "What are the ten most common words?" like this:

words.most_common(10)
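
As a small illustration of what most_common returns, here is a sketch on a made-up sentence (the sample string is an assumption, not the article from the question):

```python
from collections import Counter

sample = "the fox and the dog and the cat"
words = Counter(sample.split())

# most_common(n) returns the n highest counts as (word, count) pairs.
top_two = words.most_common(2)
print(top_two)
```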

Answer 2

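After

wordlist = content.split()

your list no longer contains anything with leading or trailing whitespace. str.split() splits on runs of consecutive whitespace, so there is no ' - ' in your split list.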

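Docs: https://docs.python.org/3/library/stdtypes.html#str.split

str.split(sep=None, maxsplit=-1)

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.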
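
A minimal sketch of the consequence, using a shortened sample string and a shortened chars list (both are assumptions for illustration): the loop in the question tests one character at a time, so a three-character entry such as " - " can never match.

```python
content = "fox - who was 7-year old"
wordlist = content.split()
print(wordlist)

chars = [",", ".", " - "]
# Each ch is a single character, so it can never equal the 3-character " - ".
matches = [ch for element in wordlist for ch in element if ch in chars]
print(matches)
```

The standalone dash ends up as its own list element, which is what makes it possible to skip it by comparing the whole element.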

Replacing ' - ' seems correct. Another approach that stays closer to your code is to drop a bare '-' from the split list entirely:

content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
wordlist = content.split()

print(wordlist)

chars = [",", ".", "'", "(", ")", "‘", "’"]   # modified

words = []
for element in wordlist:
    temp = ""
    if element == '-':             # skip pure -
        continue
    for ch in element:             # handle characters to be removed
        if ch not in chars:
            temp += ch
    words.append(temp)

# print the unique words in sorted order
for word in sorted(set(words)):
    print(word)

Output:

['The', 'quick', 'brown', 'fox', '-', 'who', 'was', 'hungry', '-', 
 'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog.']

7-year
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who

Trying to remove a symbol with whitespace (" - ") while keeping the symbol without whitespace ("-")
