How to find approximate words in Python using find

Time: 2012-12-07 06:24:05

Tags: python

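I need to find the first character of a word in a lot of sentences. All of the sentences contain some form of the word "conjecture", i.e. conjecture, conjectured, etc. But I can't use a wildcard inside find, like this: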

firstSpace = mySentence.find('conjecture'*,0)

The sentences look like:

'There is considerable conjecture and debate as to how...'
'He conjectured that the interface was...'

Any ideas how I can do this? Thanks!

3 Answers:

Answer 0: (Score: 4)

You could first try removing the special characters:

x = '“ There is considerable conjecture and debate as to how...'

newx = ''.join(e for e in x.lower() if e.isalnum())

>>> print newx
thereisconsiderableconjectureanddebateastohow

Then use find to locate your word.

Good luck!
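As a rough Python 3 sketch of the strip-then-find idea above (the helper name `find_in_cleaned` is made up for illustration, and the extra `ord(ch) < 128` check is an assumption added because Python 3 treats accented letters such as "â" as alphanumeric):

```python
def find_in_cleaned(sentence, word):
    """Strip everything but ASCII letters and digits, then search the result.

    Returns (cleaned, index); index is -1 when `word` is absent.
    """
    cleaned = ''.join(
        ch for ch in sentence.lower()
        if ord(ch) < 128 and ch.isalnum()
    )
    return cleaned, cleaned.find(word)


sentence = '“ There is considerable conjecture and debate as to how...'
cleaned, pos = find_in_cleaned(sentence, 'conjecture')
print(cleaned)  # thereisconsiderableconjectureanddebateastohow
print(pos)      # 19
```

Note that the index refers to the cleaned string, not to the original sentence, because the spaces and punctuation have been removed.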

Edit:

If you want to find the word that precedes a given word, you can split the sentence. Here is a piece of code that might help:

paragraph = 'The quick brown fox jumps over the lazy dog. I have two big dogs. Furry Dogs are the best. $%^Dogs love me.'
# Lowercase, then keep only letters, digits, whitespace and sentence-ending periods.
paragraph = ''.join(e for e in paragraph.lower() if e.isalnum() or e.isspace() or e == '.')
sentence_list = paragraph.split('.')
prev_word_list = []
for sentence in sentence_list:
    word_list = sentence.split()
    for i, word in enumerate(word_list):
        # A match at index 0 has no preceding word, so skip it.
        if i > 0 and 'dog' in word:
            prev_word_list.append(word_list[i - 1])

This gives:

>>> print prev_word_list
['lazy', 'big', 'furry']

Answer 1: (Score: 2)

  

All of the sentences contain some form of the word "conjecture", i.e. conjecture, conjectured, etc.

The word in string approach shown in the other answers can fail often; for example, it won't find the word community in a sentence that contains communities.
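A small Python 3 illustration of that failure mode (the sample sentence is made up):

```python
sentence = 'Local communities disagreed.'

found_exact = 'community' in sentence    # substring test on the full word
pos_exact = sentence.find('community')   # -1 means "not found"
pos_prefix = sentence.find('communit')   # a shared prefix does match

print(found_exact, pos_exact, pos_prefix)  # False -1 6
```

The plural drops the final "y", so only a common prefix (or a stem) matches both forms.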

In that case you may need a stemming algorithm, such as the ones provided by the nltk.stem package:

from nltk.stem.snowball import EnglishStemmer
from nltk import word_tokenize

stemmer = EnglishStemmer()
stem_word = stemmer.stem

stem = stem_word(u"conjecture")
sentence = u'He conjectured that the interface was...'
words = word_tokenize(sentence)
found_words = [(i, w) for i, w in enumerate(words) if stem_word(w) == stem]
# -> [(1, u'conjectured')]

Depending on your needs, other stemmers and tokenize methods in nltk are also available.
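If pulling in nltk is more than you need, a regular expression built on a hand-picked prefix is a lighter alternative. This is not stemming and not what this answer proposes; it is only a sketch of the wildcard behaviour the question asks for. The names `PATTERN` and `first_match` and the prefix `conjectur` are assumptions chosen for illustration (Python 3):

```python
import re

# Hand-picked prefix (an assumption); \w* plays the role of the wildcard.
PATTERN = re.compile(r'\bconjectur\w*', re.IGNORECASE)


def first_match(sentence):
    """Return (start_index, matched_word), or None if no form of the word occurs."""
    m = PATTERN.search(sentence)
    return (m.start(), m.group()) if m else None


sentences = [
    'There is considerable conjecture and debate as to how...',
    'He conjectured that the interface was...',
]
results = [first_match(s) for s in sentences]
print(results)  # [(22, 'conjecture'), (3, 'conjectured')]
```

Unlike the stemmer, this returns the character offset of the word's first letter in the original sentence, but it also matches any other word that happens to share the prefix.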

  

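However, some of the words start with nasty characters: “ or similar... How can I get rid of them?

The "nasty characters" are the result of incorrectly treating a utf-8 byte sequence as cp1252: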

>>> utf8bytes = u"microsoft smart quote (\u201c)".encode('utf-8')
>>> print utf8bytes.decode('cp1252')
microsoft smart quote (“)
>>> print utf8bytes.decode('utf-8')
microsoft smart quote (“)

You shouldn't blindly remove the garbled characters; fix the character encoding instead.
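When the text has already been decoded incorrectly and you cannot re-read the source, the round trip can sometimes be reversed. A minimal Python 3 sketch, assuming the damage really is UTF-8 read as cp1252 (the helper name `fix_mojibake` is hypothetical):

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-cp1252 mix-up; return text unchanged if it does not round-trip."""
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8 afterwards: leave it alone.
        return text


garbled = '“ There is considerable conjecture'
fixed = fix_mojibake(garbled)
print(fixed)  # starts with a real left double quotation mark (U+201C)
```

Decoding the bytes with the right codec in the first place is still the better fix; this only helps after the fact.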

Why the #AskObama Tweet was Garbled on Screen: Know your UTF-8, Unicode, ASCII and ANSI Decoding Mr. President shows an example of this problem playing out on television.

To learn the basics, read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

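Answer 2: (Score: 1)

Leaving aside the implicit work that is actually done behind the scenes, this should at least accomplish the task you asked about (hopefully).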

import unicodedata
unicodedata.normalize('NFKD', mySentence).encode('ascii', 'ignore').lower().find("conjecture")
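The same idea as a Python 3 sketch, where the encoded bytes have to be decoded back to str before searching (the helper name `ascii_fold` and the sample sentence are assumptions for illustration):

```python
import unicodedata


def ascii_fold(text):
    """Decompose characters (NFKD), drop whatever is still non-ASCII, lowercase."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii').lower()


sentence = '\u201cHe Conjectured that the interface was...'
folded = ascii_fold(sentence)
pos = folded.find('conjecture')
print(folded)  # he conjectured that the interface was...
print(pos)     # 3
```

Because the leading smart quote is dropped, the index is one less than the word's position in the original sentence.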

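Well, to be honest, I was expecting a regular expression to set up a linear search for you, but unicode values usually get split into two "characters".

Instead, here's a hack that at least gets the job done: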

newSentence = ""
for ch in mySentence:
    # Replace every non-ASCII character with a placeholder so the length is preserved.
    if ord(ch) > 127:
        newSentence += '_'
    else:
        newSentence += ch

newSentence.encode("UTF-8").lower().find("conjecture")
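A compact Python 3 sketch of the same placeholder hack (the helper name `mask_non_ascii` and the sample sentence are made up for illustration). Because each non-ASCII character is replaced one-for-one, the index returned by find also applies to the original string:

```python
def mask_non_ascii(text, placeholder='_'):
    """Replace each non-ASCII character with a placeholder, keeping the length unchanged."""
    return ''.join(ch if ord(ch) < 128 else placeholder for ch in text)


sentence = '\u201cThere is considerable conJecture\u201d'
masked = mask_non_ascii(sentence)
pos = masked.lower().find('conjecture')
print(masked)                   # _There is considerable conJecture_
print(pos)                      # 23
print(sentence[pos:pos + 10])   # conJecture
```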

If you'd rather forget about those nasty encoded characters altogether:

mySentence.decode("ascii", "ignore").encode("UTF-8").lower().find("conjecture")



Sample input: >>> newStr = "“32f fWF  3(*&(%FJ   conJectuRe€@!O".decode("ascii", "ignore").encode("UTF-8").lower()
              >>> print newStr
              >>> print newStr.find("conjecture")

Output:       '32f fwf  3(*&(%fj   conjecture@!o'
              20