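I have a txt file that I open with Python. I am trying to remove symbols and sort the remaining words alphabetically. Removing periods, commas and so on is not a problem. However, when I add a dash surrounded by spaces to the list with the other symbols, it does not get removed.
This is an example of what I open:
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
This is what I want (the period and the dashes that are not attached to a word removed):
content = "The quick brown fox who was hungry jumps over the 7-year old lazy dog"
But I either get this (all dashes removed):
content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"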
Or this (dashes not removed):
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog"
This is my entire code. It works once I add content.replace(), but that is not what I want:
f = open("article.txt", "r")
# Create variable (removing " - " like this works)
content = f.read()
content = content.replace(" - ", " ")
# Create list
wordlist = content.split()
# Which symbols (if I remove the line 'content = content.replace(" - ", " ")', the " - " in this list doesn't get removed here)
chars = [",", ".", "'", "(", ")", "‘", "’", " - "]
# Remove symbols
words = []
for element in wordlist:
    temp = ""
    for ch in element:
        if ch not in chars:
            temp += ch
    words.append(temp)
# Print words, sort alphabetically and do not print duplicates
for word in sorted(set(words)):
    print(word)
It works like this. But when I remove the line content = content.replace(" - ", " "), the "whitespace + dash + whitespace" entry in chars does not get removed.
If I replace it with "-" (no spaces), I get what I do not want:
content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"
Is it possible to do this with a list like chars, or is my only option to use .replace() for this?
Is there a particular reason why Python sorts the capitalized words alphabetically first and then the lowercase words separately?
Like this (I just added the letters A, B and C to emphasize what I mean):
7-year
A
B
C
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
Answer 0 (score: 1)
You can use re.sub like this:
>>> import re
>>> strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")
>>> content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
>>> strip_chars.sub("", content)
'The quick brown fox who was hungry jumps over the 7-year old lazy dog'
>>> strip_chars.sub("", content).split()
['The', 'quick', 'brown', 'fox', 'who', 'was', 'hungry', 'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog']
>>> print(*sorted(strip_chars.sub("", content).split()), sep='\n')
7-year
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
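One caveat with the pattern above: its second alternative, [-,]\s, removes any dash that is followed by whitespace, so a trailing hyphen such as the one in "pre- and post-war" would be removed as well. If only dashes that stand alone between whitespace should go, a lookaround-based variant is one option. This is a minimal sketch, and the names standalone_dash, punctuation and clean are made up for illustration:

```python
import re

# A dash that is not attached to a word on either side: it is neither
# preceded nor followed by a non-whitespace character.
standalone_dash = re.compile(r"(?<!\S)-(?!\S)")
# Punctuation that is always removed (same characters as in the question).
punctuation = re.compile(r"[,.'()‘’]")


def clean(text):
    """Remove punctuation and standalone dashes, keep dashes inside words."""
    text = punctuation.sub("", text)
    text = standalone_dash.sub("", text)
    # split() followed by join() collapses the double spaces left behind
    return " ".join(text.split())


content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
print(clean(content))
```

The split/join step at the end is there because removing a bare dash leaves two adjacent spaces; if you call .split() on the result anyway, it is not needed.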
To summarize my comments and put it all together:
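from pathlib import Path
from collections import Counter
import re
# Raw string, so that \s reaches the regex engine unchanged
strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")
article = Path('/path/to/your/article.txt')
content = article.read_text()
# Counter keeps each distinct word once, together with its count
words = Counter(strip_chars.sub('', content).split())
# Sort case-insensitively, but print the words as they appear in the text
for word in sorted(words, key=lambda x: x.lower()):
    print(word)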
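As for why a plain sorted() puts the capitalized words first: Python compares strings by the Unicode code point of each character, and A-Z (65-90) all come before a-z (97-122), with the digits before both. A key function only changes what is compared, not what is returned. A small illustration, with a word list made up for this example (str.casefold is a slightly stricter alternative to lower()):

```python
words = ["the", "The", "brown", "A", "7-year", "dog"]

# Default ordering compares code points: digits < uppercase < lowercase.
print(ord("7"), ord("A"), ord("T"), ord("b"))  # 55 65 84 98
print(sorted(words))
# ['7-year', 'A', 'The', 'brown', 'dog', 'the']

# Case-insensitive ordering: compare casefolded copies, return the originals.
# The sort is stable, so "the" stays ahead of "The" as in the input.
print(sorted(words, key=str.casefold))
# ['7-year', 'A', 'brown', 'dog', 'the', 'The']
```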
If, for example, The and the should count as duplicate words, just convert content to lowercase. The code then becomes:
from pathlib import Path
from collections import Counter
import re
strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")
article = Path('/path/to/your/article.txt')
# Lowercase everything first so that "The" and "the" become the same word
content = article.read_text().lower()
words = Counter(strip_chars.sub('', content).split())
for word in sorted(words):
    print(word)
Finally, as a nice side effect of using collections.Counter, words is also a word counter, so you can answer questions such as "what are the ten most common words?" like this:
words.most_common(10)
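To make the counting concrete without a file on disk, here is a small self-contained run on an inline string. The sample text is made up, and the pattern is the one from this answer written as a raw string:

```python
from collections import Counter
import re

strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")

text = "The dog - the lazy dog - saw the 7-year old fox."
words = Counter(strip_chars.sub("", text.lower()).split())

# most_common(n) returns (word, count) pairs, highest count first
print(words.most_common(2))  # [('the', 3), ('dog', 2)]
```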
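Answer 1 (score: 0)
After
wordlist = content.split()
your list no longer contains anything with leading or trailing whitespace. str.split() treats runs of consecutive whitespace as one separator and drops them, so there is no ' - ' in your split list.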
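Docs: https://docs.python.org/3/library/stdtypes.html#str.split
str.split(sep=None, maxsplit=-1)
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.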
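Both effects can be checked directly. The snippet below is only an illustration on a shortened sentence: it shows that split() never yields a token containing whitespace, and that a loop over a token compares one character at a time, so a three-character entry such as " - " in chars can never match.

```python
content = "fox - who was hungry - jumps"

tokens = content.split()
print(tokens)  # ['fox', '-', 'who', 'was', 'hungry', '-', 'jumps']

# No token contains a space, so " - " cannot appear in the list.
print(" - " in tokens)  # False
print("-" in tokens)    # True

# Iterating over a string yields single characters; none of them can
# equal the three-character string " - ".
chars = [",", ".", " - "]
print([ch in chars for ch in "-"])  # [False]
```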
Replacing ' - ' seems right. Another approach that stays close to your code is to drop '-' from the split list entirely:
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
wordlist = content.split()
print(wordlist)
chars = [",", ".", "'", "(", ")", "‘", "’"]  # modified: " - " removed
words = []
for element in wordlist:
    temp = ""
    if element == '-':  # skip tokens that are a bare dash
        continue
    for ch in element:  # drop the characters listed in chars
        if ch not in chars:
            temp += ch
    words.append(temp)
# Print the unique words in sorted order, as in the question
for word in sorted(set(words)):
    print(word)
Output:
['The', 'quick', 'brown', 'fox', '-', 'who', 'was', 'hungry', '-',
'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog.']
7-year
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
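If you want to stay with a list of characters, the same idea can be written more compactly. This is a sketch rather than a drop-in replacement: str.maketrans and str.translate are standard string methods, while the names table and token are just chosen for this example.

```python
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."

chars = [",", ".", "'", "(", ")", "‘", "’"]
# Translation table that deletes every character listed in chars
table = str.maketrans("", "", "".join(chars))

# Drop tokens that are a bare dash, strip the symbols from the rest
words = [token.translate(table) for token in content.split() if token != "-"]

for word in sorted(set(words)):
    print(word)
```

The dash in 7-year survives because only whole tokens equal to "-" are skipped, and "-" itself is not in chars.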