Trying to remove the symbol ("-") with whitespace while keeping the symbol ("-") without whitespace

Time: 2019-12-29 13:09:17

Tags: python python-3.x whitespace symbols alphabetical-sort

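I have a txt file that I open in Python. I am trying to remove symbols and sort the remaining words alphabetically. Removing periods, commas and so on is not a problem. However, when I add the dash with surrounding whitespace to the list together with the other symbols, I can't seem to get it removed.

This is an example of what I open: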

content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."

This is what I want (the period removed, as well as the dashes that are not attached to a word):

content = "The quick brown fox who was hungry jumps over the 7-year old lazy dog"

But I either get this (all dashes removed):

content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"

Or this (dashes not removed):

content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog"

This is my entire code. Adding a content.replace() call works, but that is not what I want:

f = open("article.txt", "r")

# Create variable (Like this removing " - " works)
content = f.read()
content = content.replace(" - ", " ")

# Create list
wordlist = content.split()

# Which symbols (If I remove the line "content = content.replace(" - ", " ")", the " - " in this list doesn't get removed here)
chars = [",", ".", "'", "(", ")", "‘", "’", " - "]

# Remove symbols
words = []
for element in wordlist:
    temp = ""
    for ch in element:
        if ch not in chars:
            temp += ch
    words.append(temp)

# Print words, sort alphabetically and do not print duplicates
for word in sorted(set(words)):
    print(word)

It works like this. But when I remove content = content.replace(" - ", " "), the "whitespace + dash + whitespace" entry in chars does not get removed.

If I replace it with "-" (no whitespace), I get this, which I don't want:

content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"

Is it possible to do this with a list like chars, or is .replace() my only option?

Also, is there a particular reason why Python sorts the capitalized words alphabetically first and then the uncapitalized words separately?

Like this (I just added the letters A, B and C to emphasize what I mean):

7-year A B C The brown dog fox hungry jumps lazy old over quick the was who

2 answers:

Answer 0 (score: 1)

You can use re.sub like this:

>>> import re
>>> strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")
>>> content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
>>> strip_chars.sub("", content)
'The quick brown fox who was hungry jumps over the 7-year old lazy dog'
>>> strip_chars.sub("", content).split()
['The', 'quick', 'brown', 'fox', 'who', 'was', 'hungry', 'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog']
>>> print(*sorted(strip_chars.sub("", content).split()), sep='\n')
7-year
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
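
For readers less familiar with regular expressions, the pattern above can be spelled out with re.VERBOSE. This is only a sketch of the same idea: the name STRIP_CHARS is illustrative, and the pattern is written as a raw string so that \s is passed to the regex engine rather than treated as a string escape.

```python
import re

# The same two alternatives as the pattern above, written verbosely.
STRIP_CHARS = re.compile(
    r"""
    [,.'()‘’]     # any single punctuation character from this set
    |             # or
    [-,]\s        # a dash or comma directly followed by whitespace
    """,
    re.VERBOSE,
)

content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
cleaned = STRIP_CHARS.sub("", content)
print(cleaned)
```

The dash in "7-year" is followed by a letter rather than whitespace, so the second alternative does not match it and the word stays intact.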

To summarize my comments and put it all together:

from pathlib import Path
from collections import Counter
import re

strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")

article = Path('/path/to/your/article.txt')

content = article.read_text()

words = Counter(strip_chars.sub('', content).split())

for word in sorted(words, key=lambda x: x.lower()):
    print(word)
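
The key function is what changes the order here. By default Python compares strings by Unicode code point, and digits come before all uppercase ASCII letters, which in turn come before all lowercase ones. That is why capitalized words are listed first in a plain sort. A small illustration with made-up sample words:

```python
sample = ["the", "brown", "The", "A", "dog", "7-year", "B"]

# Default sort: compares code points, so digits < uppercase < lowercase.
default_order = sorted(sample)
print(default_order)

# Case-insensitive sort: compare the lowercase form of each word.
# Words with equal keys ("the" and "The") keep their input order.
folded_order = sorted(sample, key=str.lower)
print(folded_order)
```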

If, for example, "The" and "the" should count as duplicate words, simply convert content to lowercase. The code then becomes:

from pathlib import Path
from collections import Counter
import re

strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")

article = Path('/path/to/your/article.txt')

content = article.read_text().lower()

words = Counter(strip_chars.sub('', content).split())

for word in sorted(words):
    print(word)

Finally, as a nice side effect of using collections.Counter, words also holds a count for each word, so you can answer questions such as "what are the ten most common words?" like this:

words.most_common(10)
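
As a small illustration of that side effect (the sample text is made up for this sketch), most_common(n) returns (word, count) pairs ordered from most to least frequent:

```python
from collections import Counter

text = "the fox and the dog and the cat"
words = Counter(text.split())

# The two most frequent words together with their counts.
top_two = words.most_common(2)
print(top_two)
```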

Answer 1 (score: 0)

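After

wordlist = content.split()

your list no longer contains anything with leading or trailing whitespace.

str.split() splits on runs of consecutive whitespace and discards them, so your split list never contains ' - '.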

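Docs: https://docs.python.org/3/library/stdtypes.html#str.split

str.split(sep=None, maxsplit=-1)

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.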
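
A quick check of what that means for the code in the question, using the sample sentence from the question (the variable names are only for illustration). After splitting, a free-standing dash becomes the one-character token "-", and the character loop only ever compares single characters, so a three-character entry such as " - " can never match:

```python
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
wordlist = content.split()

# The free-standing dashes survive as bare "-" tokens, without spaces.
has_spaced_dash = " - " in wordlist
has_bare_dash = "-" in wordlist
print(has_spaced_dash, has_bare_dash)

# Iterating over a token yields single characters, so only the
# one-character entries of chars can ever be found.
chars = [",", ".", " - "]
hits = [ch for token in wordlist for ch in token if ch in chars]
print(hits)
```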


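Replacing ' - ' seems like the right thing to do. Another approach, closer to your code, is to drop a bare '-' from the split list entirely: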

content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
wordlist = content.split()

print(wordlist)

chars = [",", ".", "'", "(", ")", "‘", "’"]   # modified

words = []
for element in wordlist:
    temp = ""
    if element == '-':             # skip pure -
        continue
    for ch in element:             # handle characters to be removed
        if ch not in chars:
            temp += ch
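    words.append(temp)

# Print the unique words in alphabetical order (as in the question)
for word in sorted(set(words)):
    print(word)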

Output:

['The', 'quick', 'brown', 'fox', '-', 'who', 'was', 'hungry', '-', 
 'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog.']

7-year
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who