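I'm trying to build a set of words from a .txt file, where a word consists strictly of letters of the alphabet. The file can contain every possible character, including non-printable ones.
No re or collections library. Python 3.
For example, given a .txt file that reads
*eBooks$ Readable By Both Humans and By Computers, Since 1971**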
*These# eBooks@ Were Prepared By Thousands of Volunteers!
I need my set to be
{'eBooks', 'Readable', 'By', 'Both', 'Humans', 'and', 'Computers', 'Since', 'These', 'Were', 'Prepared', 'Thousands', 'of', 'Volunteers'}
Here is what I have so far, but I still get special characters and digits in my set. I only want letters.
import string
filecontent = []
word_set = {}
with open("small.txt") as myFile:
    for line in myFile:
        line = line.rstrip()
        line = line.replace("\t","")
        for character in line:
            if character in string.digits or character in string.punctuation:
                line = line.replace(character, "")
        if line != "":
            filecontent.append(line)
lowerCase = [x.lower() for x in filecontent]
word_set = {word for line in lowerCase for word in line.split()}
Answer 0 (score: 2)
You can do it like this (Python 2):
>>> from string import punctuation
>>> def solve(s):
...     for line in s.splitlines():
...         for word in line.split():
...             word = word.strip(punctuation)
...             if word.translate(None, punctuation).isalpha():
...                 yield word
...
>>> s = '''*eBooks$ Readable By Both Humans and By Computers, Since 1971**
*These# eBooks@ Were Prepared By Thousands of Volunteers!'''
>>> set(solve(s))
set(['and', 'Both', 'Since', 'These', 'Readable', 'Computers', 'Humans', 'Prepared', 'of', 'Were', 'Volunteers', 'Thousands', 'By', 'eBooks'])
If you are using Python 3, you need to replace the str.translate part with:
    table = dict.fromkeys(map(ord, punctuation))  # add this at the top of the function
    ...
            if word.translate(table).isalpha():
    ...
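For reference, here is how the two Python 3 fragments above might fit together into one runnable piece. This is only a sketch: the module-level name `TABLE` is my own choice, and the sample string stands in for the contents of your file.

```python
from string import punctuation

# Map every punctuation code point to None so that str.translate deletes it.
TABLE = dict.fromkeys(map(ord, punctuation))

def solve(s):
    """Yield words that consist only of letters once punctuation is ignored."""
    for line in s.splitlines():
        for word in line.split():
            # Drop punctuation around the word, e.g. '*eBooks$' -> 'eBooks'.
            word = word.strip(punctuation)
            # Keep the word only if what is left besides punctuation is alphabetic.
            if word.translate(TABLE).isalpha():
                yield word

s = '''*eBooks$ Readable By Both Humans and By Computers, Since 1971**
*These# eBooks@ Were Prepared By Thousands of Volunteers!'''
words = set(solve(s))
```

Two things to keep in mind: a word with punctuation inside it (such as `don't`) passes the check and is yielded with the apostrophe intact, and `str.isalpha` also accepts non-ASCII letters.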
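Answer 1 (score: 0)
Here is a solution using the re regular expression module. It also gives you word counts, but if you don't want those you can use just the keys, or swap the Counter for a set.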
text = """*eBooks$ Readable By Both Humans and By Computers, Since 1971**
*These# eBooks@ Were Prepared By Thousands of Volunteers!"""
import re
from collections import Counter
words = Counter()
regex = re.compile(r"[a-zA-Z]+")
matches = regex.findall(text)
for match in matches:
    words[match.lower()] += 1  # count each word case-insensitively
print(words)
Or, if you have the text in a file:
with open("fileName") as textFile:
    text = textFile.read()  # Read the whole file as one long string rather than a list of lines.
matches = regex.findall(text)
for match in matches:
    words[match.lower()] += 1
Which gives
Counter({'by': 3, 'ebooks': 2, 'and': 1, 'both': 1, 'since': 1, 'these': 1, 'readable': 1, 'computers': 1, 'humans': 1, 'prepared': 1, 'of': 1, 'were': 1, 'volunteers': 1, 'thousands': 1})
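If you only need the set and not the counts, something like the following should be enough. It is a sketch of the "use just the keys" option; `word_set` is a name I picked for illustration, and the words are lower-cased as in the code above.

```python
import re
from collections import Counter

text = """*eBooks$ Readable By Both Humans and By Computers, Since 1971**
*These# eBooks@ Were Prepared By Thousands of Volunteers!"""

# Count every run of ASCII letters, ignoring case.
words = Counter(match.lower() for match in re.findall(r"[a-zA-Z]+", text))

# Iterating over a Counter yields its keys, so set() drops the counts.
word_set = set(words)
```

Because the pattern matches letters only, `1971` never makes it into the result.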
Answer 2 (score: 0)
If I were you, I would use re.findall:
import re
s = '''*eBooks$ Readable By Both Humans and By Computers, Since 1971**
*These# eBooks@ Were Prepared By Thousands of Volunteers!'''
set(re.findall('[a-zA-Z]+',s))
Output:
set(['and', 'Both', 'Since', 'These', 'Readable', 'Computers', 'Humans', 'Prepared', 'of', 'Were', 'Volunteers', 'Thousands', 'By', 'eBooks'])
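Since the question rules out re, here is a rough sketch of what `[a-zA-Z]+` does, written with plain string operations. The helper name `letter_runs` is made up for this example, and it assumes that "letters" means ASCII letters only.

```python
from string import ascii_letters

LETTERS = set(ascii_letters)

def letter_runs(text):
    """Yield each maximal run of ASCII letters found in text."""
    run = []
    for ch in text:
        if ch in LETTERS:
            run.append(ch)
        elif run:
            # A non-letter ends the current run.
            yield "".join(run)
            run = []
    if run:
        # Flush a run that reaches the end of the text.
        yield "".join(run)

s = '''*eBooks$ Readable By Both Humans and By Computers, Since 1971**
*These# eBooks@ Were Prepared By Thousands of Volunteers!'''
word_set = set(letter_runs(s))
```

Like the regex, this splits on every non-letter, so `don't` comes out as `don` and `t`.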