I am using the following code to tokenize a string read from stdin.
    import sys

    d = []
    cur = ''
    for i in sys.stdin.readline():
        if i in ' .':
            if cur not in d and (cur != ''):
                d.append(cur)
                cur = ''
        else:
            cur = cur + i.lower()
This gives me a list of words with no repeats. However, some words in my output are not split apart.
My input is
Dan went to the north pole to lead an expedition during summer.
The output array d is
['dan', 'went', 'to', 'the', 'north', 'pole', 'tolead', 'an', 'expedition', 'during', 'summer']
Why are "to" and "lead" joined together as "tolead"?
Answer 0 (score: 3)
Try this:
    import sys

    d = []
    cur = ''
    for i in sys.stdin.readline():
        if i in ' .':
            if cur not in d and (cur != ''):
                d.append(cur)
            cur = ''  # note the different indentation
        else:
            cur = cur + i.lower()
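As a quick check of the fix, here is a minimal sketch that wraps the corrected loop in a function and runs it on the sentence from the question instead of stdin. The function name `tokenize_unique` is made up for this illustration and is not part of the original answer.

```python
def tokenize_unique(line):
    """Return the distinct lowercase words of `line`, in first-seen order."""
    words = []
    cur = ''
    for ch in line:
        if ch in ' .':
            # Record the word only if it is non-empty and not seen before...
            if cur and cur not in words:
                words.append(cur)
            # ...but reset the buffer at every separator, duplicate or not.
            cur = ''
        else:
            cur += ch.lower()
    return words


sentence = "Dan went to the north pole to lead an expedition during summer."
print(tokenize_unique(sentence))
# ['dan', 'went', 'to', 'the', 'north', 'pole', 'lead', 'an', 'expedition', 'during', 'summer']
```

Like the original loop, this only emits a word when it meets a space or a period, so a final word with no trailing separator would be left in the buffer.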
Answer 1 (score: 1)
Try this:
    import sys

    for line in sys.stdin:  # iterate over lines, not over the characters of one line
        res = set(word.lower() for word in line[:-1].split(" "))  # line[:-1] drops the last character
        print res
Example:
    line = "Dan went to the north pole to lead an expedition during summer."
    res = set(word.lower() for word in line[:-1].split(" "))
    print res
    set(['north', 'lead', 'expedition', 'dan', 'an', 'to', 'pole', 'during', 'went', 'summer', 'the'])
Edit, following the comments: this solution preserves the input order and filters out the delimiters.
    import re
    from collections import OrderedDict
    line = "Dan went to the north pole to lead an expedition during summer."
    list(OrderedDict.fromkeys(re.findall(r"[\w']+", line)))
    # ['Dan', 'went', 'to', 'the', 'north', 'pole', 'lead', 'an', 'expedition', 'during', 'summer']
Answer 2 (score: 1)
"to" is already in d. So at the space between "to" and "lead" your loop skips the reset and keeps concatenating; once it reaches the next space, it finds that "tolead" is not in d, so it appends it.
A simpler solution, which also removes all forms of punctuation:
    >>> import string
    >>> set("Dan went to the north pole to lead an expedition during summer.".translate(None, string.punctuation).lower().split())
    set(['summer', 'north', 'lead', 'expedition', 'dan', 'an', 'to', 'pole', 'during', 'went', 'the'])
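The `translate(None, string.punctuation)` call above is Python 2 syntax. For readers on Python 3, the following is a rough equivalent, offered as a sketch and not as the answerer's own code. It uses `str.maketrans` to build the deletion table, and swaps the set for `dict.fromkeys` so that the first-seen order is kept (which assumes Python 3.7 or later).

```python
import string

sentence = "Dan went to the north pole to lead an expedition during summer."

# str.maketrans('', '', chars) builds a table that deletes every character in chars.
strip_punct = str.maketrans('', '', string.punctuation)
words = sentence.translate(strip_punct).lower().split()

# dict.fromkeys keeps only the first occurrence of each word, in insertion order.
unique_words = list(dict.fromkeys(words))
print(unique_words)
# ['dan', 'went', 'to', 'the', 'north', 'pole', 'lead', 'an', 'expedition', 'during', 'summer']
```

Note that `string.punctuation` includes the apostrophe, so a word such as "don't" would come out as "dont".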