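I'm at my wits' end with this problem: basically, I need to remove double spaces between words. My text happens to be in Hebrew, but here is the basic idea: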
TITLE: הלכות השכמת הבוקר
Note the extra space between the first two words (Hebrew reads right to left).
I have tried many, many different approaches. Here are a few:
# tried all these with and without unicode
title = re.sub(u'\s+',u' ',title.decode('utf-8'))
title = title.replace("  "," ")
title = title.replace(u" הלכות",u" הלכות")
In the end I resorted to a really over-the-top approach (some of the formatting got mangled when pasting):
def remove_blanks(s):
    word_list = s.split(" ")
    final_word_list = []
    for word in word_list:
        print "word: " +word
        #tried every qualifier I could think of...
        if not_blank(word) and word!=" " and True != re.match("s*",word):
            print "^NOT BLANK^"
            final_word_list.append(word)
    return ' '.join(final_word_list)

def not_blank(s):
    while " " in s:
        s = s.replace(" ","")
    return (len(s.replace("\n","").replace("\r","").replace("\t",""))!=0);
And, to my surprise, this is what I got back:
word: הלכות
^NOT BLANK^
word: #this should be tagged as Blank!!
^NOT BLANK^
word: השכמת
^NOT BLANK^
word: הבוקר
^NOT BLANK^
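Clearly, my checks are not working. What is going on?
Answer 0: (score: 0)
There was a hidden \xe2\x80\x8e in the string, which is the UTF-8 encoding of U+200E LEFT-TO-RIGHT MARK. It is invisible when printed and is not a whitespace character, so neither \s nor the space replacements touch it, and the "blank" word is not actually empty. Use repr(word) to find it. Thanks @mgilson!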
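For anyone hitting the same thing, here is a minimal sketch of how the mark can be exposed and removed. It is written for Python 3 (the question uses Python 2), and the names `DIRECTIONAL_MARKS` and `collapse_spaces` are made up for this illustration. The sample `title` is an assumed reconstruction of the input, with the mark sitting between two ordinary spaces, which is consistent with the output shown in the question.

```python
import re

# Hypothetical helper constant: the two invisible bidi marks most likely to
# sneak into pasted right-to-left text (LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK).
DIRECTIONAL_MARKS = "\u200e\u200f"


def collapse_spaces(text):
    """Strip the directional marks, then collapse whitespace runs to one space."""
    # str.translate with None as the mapped value deletes those code points.
    without_marks = text.translate({ord(ch): None for ch in DIRECTIONAL_MARKS})
    return re.sub(r"\s+", " ", without_marks).strip()


# Assumed reconstruction of the title: first word, space, U+200E, space, rest.
title = ("\u05d4\u05dc\u05db\u05d5\u05ea \u200e "
         "\u05d4\u05e9\u05db\u05de\u05ea \u05d4\u05d1\u05d5\u05e7\u05e8")

# ascii() escapes every non-ASCII character, so the hidden mark becomes visible.
for word in title.split(" "):
    print(ascii(word))

cleaned = collapse_spaces(title)
print(ascii(cleaned))
```

Only the two marks listed are removed, so other invisible characters (for example a zero-width joiner) would survive. Extend the set if `ascii()` reveals something else in your data.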