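I'm new to Python. I want to remove duplicate words.
Apart from English words, I want to remove everything else, including blank lines.
I only want to extract words made up purely of English letters.
I have some text files with contents like the following: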
aaa
bbb
aaa223
aaa
ccc
ddd
kei60:
sj@6999
jack02
jparkj
So after removing the duplicates, I want to get the following result:
aaa
bbb
ccc
ddd
jparkj
Below is the source code of the script I tried.
I would greatly appreciate any help. Thanks!
# read a text file, replace multiple words specified in a dictionary
# write the modified text back to a file
import re

def replace_words(text, word_dic):
    """
    take a text and replace words that match a key in a dictionary with
    the associated value, return the changed text
    """
    rc = re.compile('|'.join(map(re.escape, word_dic)))
    def translate(match):
        return word_dic[match.group(0)]
    return rc.sub(translate, text)

def main():
    test_file = "prxtest.txt"
    # read the file
    fin = open(test_file, "r")
    str2 = fin.read()
    fin.close()
    # the dictionary has target_word:replacement_word pairs
    word_dic = {
        '.': '\n',
        '"': '\n',
        '<': '\n',
        '>': '\n',
        '!': '\n',
        "'": '\n',
        '(': '\n',
        ')': '\n',
        '[': '\n',
        ']': '\n',
        '@': '\n',
        '#': '\n',
        '$': '\n',
        '%': '\n',
        '^': '\n',
        "&": '\n',
        '*': '\n',
        '_': '\n',
        '+': '\n',
        '-': '\n',
        '=': '\n',
        '}': '\n',
        '{': '\n',
        '"': '\n',
        ";": '\n',
        ':': '\n',
        '?': '\n',
        ',': '\n',
        '`': '\n',
        '~': '\n',
        '1': '\n',
        '2': '\n',
        '3': '\n',
        '4': '\n',
        "5": '\n',
        '6': '\n',
        '7': '\n',
        '8': '\n',
        '9': '\n',
        '0': '\n',
        ' ': '\n'}
    # call the function and get the changed text
    str3 = replace_words(str2, word_dic)
    # write changed text back out
    fout = open("clean.txt", "w")
    fout.write(str3)
    fout.close()

if __name__ == "__main__":
    main()
Answer 0 (score: 2)
This will keep only the lines that consist entirely of letters, writing each one once:
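test_file = 'prxtest.txt'
seen = set()
with open(test_file, 'r') as fin, open('clean.txt', 'w') as fout:
    for line in fin:
        # strip the newline so a last line without one is not treated as new
        word = line.strip()
        # keep the line only if it is all letters and not seen before
        if word.isalpha() and word not in seen:
            seen.add(word)
            fout.write(word + '\n')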
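One caveat with this approach: `str.isalpha()` is true for any Unicode letter, not only a-z, so a line such as `中文` would also pass. The following is a minimal sketch of a stricter check, assuming Python 3.7+ (for `str.isascii()`); the helper name `unique_ascii_words` is hypothetical and not part of the answer above.

```python
def unique_ascii_words(lines):
    """Return the ASCII-letter-only lines, first occurrence only, in input order."""
    seen = set()
    result = []
    for line in lines:
        word = line.strip()
        # isascii() rejects non-English letters that isalpha() alone would accept
        if word and word.isascii() and word.isalpha() and word not in seen:
            seen.add(word)
            result.append(word)
    return result


sample = ["aaa", "bbb", "aaa223", "aaa", "", "ccc", "ddd",
          "kei60:", "sj@6999", "jack02", "jparkj", "中文"]
print(unique_ascii_words(sample))
```

Returning a list rather than writing straight to a file keeps the filter easy to test; the caller can join the result with newlines and write it out.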
Answer 1 (score: 1)
Something like this should work:
import re
test_file = 'prxtest.txt'
found = set()
with open(test_file) as fd:
    for line in fd:
        word = line.strip()
        # Python's re has no [[:alpha:]] class, so use an explicit range
        if word and word not in found and re.search(r'^[A-Za-z]+$', word):
            print(word)
            found.add(word)
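Note that Python's `re` module does not support POSIX bracket classes such as `[[:alpha:]]`; an explicit range like `[A-Za-z]` is the usual substitute. As a small sketch of the matching step on its own, `re.fullmatch` avoids the need for `^` and `$` anchors. The names `LETTERS_ONLY` and `is_plain_word` are illustrative assumptions.

```python
import re

# Explicit ASCII letter range, since re has no [[:alpha:]] class
LETTERS_ONLY = re.compile(r"[A-Za-z]+")


def is_plain_word(text):
    """True if text consists of one or more ASCII letters and nothing else."""
    return LETTERS_ONLY.fullmatch(text) is not None


for candidate in ["aaa", "aaa223", "kei60:", "jparkj", ""]:
    print(repr(candidate), is_plain_word(candidate))
```

Because `fullmatch` must consume the whole string, a trailing newline makes the check fail, so lines read from a file should be stripped first.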
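Answer 2 (score: 0)
I know this is a Python question, but what you are asking seems simpler to do with a *nix one-liner using grep (the -E option is needed so that + means "one or more"):
grep -E '^[a-zA-Z]+$' infile > outfile
If you only want the unique lines that contain alphabetic characters (note that sort -u also sorts the output):
grep -E '^[a-zA-Z]+$' infile | sort -u > outfile
I suppose in Python you could do: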
import re
with open('infile', 'r') as inf:
    for line in inf:
        word = line.strip()  # drop the trailing newline before matching
        if re.match(r'\A[a-zA-Z]+\Z', word):
            print(word)
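For comparison, a rough Python counterpart of the `sort -u` pipeline could look like the sketch below. The function name `sorted_unique_words` is made up for illustration, and Python's default string ordering may differ from what `sort` produces under some locales.

```python
import re


def sorted_unique_words(lines):
    """Keep lines made only of ASCII letters, drop duplicates, return them sorted."""
    pattern = re.compile(r"[A-Za-z]+")
    # a set comprehension removes duplicates; sorted() then orders the result
    unique = {line.strip() for line in lines if pattern.fullmatch(line.strip())}
    return sorted(unique)


sample = ["ccc", "aaa", "aaa223", "aaa", "bbb", "sj@6999"]
print(sorted_unique_words(sample))
```

Like the shell version, this gives up the original line order in exchange for a short implementation.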
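Answer 3 (score: 0)
Some of the strings in your desired output might pass as English words, but others do not appear to be English at all. If you need genuine English words, a slightly more involved approach is suggested: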
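import nltk
from nltk.corpus import words
# requires NLTK's tokenizer data and the 'words' corpus (see nltk.download)
with open('prxtest.txt') as f:
    tokens = nltk.word_tokenize(f.read())
# build the vocabulary once; calling words.words() for every token is very slow
vocab = set(w.lower() for w in words.words())
en_words = [x for x in tokens if x.lower() in vocab]
# en_words now contains only the tokens found in the NLTK word list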
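The same idea can be tried without NLTK by checking candidates against any word list you trust. This is only a sketch: the tiny `vocabulary` set is a placeholder assumption standing in for a real dictionary, and `keep_known_words` is a hypothetical helper name.

```python
def keep_known_words(candidates, vocabulary):
    """Return candidates whose lowercase form is in vocabulary, without duplicates."""
    known = {w.lower() for w in vocabulary}  # a set gives fast membership tests
    seen = set()
    kept = []
    for word in candidates:
        key = word.lower()
        if key in known and key not in seen:
            seen.add(key)
            kept.append(word)
    return kept


# placeholder vocabulary; a real run would load a full word list instead
vocabulary = {"jack", "park", "apple"}
print(keep_known_words(["Jack", "jparkj", "apple", "jack"], vocabulary))
```

With a dictionary-based filter, strings such as `jparkj` are dropped even though they contain only letters, which may or may not be what the question intends.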
Answer 4 (score: 0)
It can be done in two lines:
import re
data ="""aaa
bbb
aaa223
aaa
ccc
ddd
kei60:
sj@6999
jack02
jparkj"""
lines = data.splitlines()  # use f.read().splitlines() instead if reading from a file
# keep only the lines that consist entirely of letters (no digits or punctuation)
words = filter(lambda x: re.match(r'^[^\W\d]+$', x), lines)
# remove duplicates and print out (a set does not preserve the original order)
print('\n'.join(set(words)))
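Since a `set` does not keep the input order, the printed lines may come out shuffled. If order matters, `dict.fromkeys` can be used instead, because dicts preserve insertion order in Python 3.7+. Here is a sketch reusing the same sample data and the same `[^\W\d]+` pattern, which, as an aside, also admits underscores and non-ASCII letters; the variable names are illustrative.

```python
import re

data = """aaa
bbb
aaa223
aaa
ccc
ddd
kei60:
sj@6999
jack02
jparkj"""

# [^\W\d] means "a word character that is not a digit"
alpha_lines = [x for x in data.splitlines() if re.fullmatch(r"[^\W\d]+", x)]
# dict.fromkeys drops duplicates while keeping first-seen order
unique_in_order = list(dict.fromkeys(alpha_lines))
print("\n".join(unique_in_order))
```

This prints the surviving lines in the order they first appear, which matches the result requested in the question.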