How do I implement re.search() in my code?

Asked: 2017-03-05 11:27:11

Tags: python regex nlp

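I am working on a binary classification problem with text data. I want to classify the words of a text according to whether they appear in a few well-defined word-class features that I have chosen. So far, I have been looking up every word of the text in each word class and incrementing that class's count on a match. The count is then used to compute the frequency of each word class. Here is my code: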

import nltk
import re

def wordClassFeatures(text):
    home = """woke home sleep today eat tired wake watch
        watched dinner ate bed day house tv early boring
        yesterday watching sit"""

    conversation = """know people think person tell feel friends
talk new talking mean ask understand feelings care thinking
friend relationship realize question answer saying"""


    countHome = countConversation =0

    totalWords = len(text.split())

    text = text.lower()
    text = nltk.word_tokenize(text)
    conversation = nltk.word_tokenize(conversation)
    home = nltk.word_tokenize(home)
    '''
    for word in text:
        if word in conversation: #this is my current approach
            countConversation += 1
        if word in home:
            countHome += 1
    '''

    for word in text:
        if re.search(word, conversation): #this is what I want to implement
            countConversation += 1
        if re.search(word, home):
            countHome += 1

    countConversation /= 1.0*totalWords
    countHome /= 1.0*totalWords

    return(countHome,countConversation)

text = """ Long time no see. Like always I was rewriting it from scratch a couple of times. But nevertheless 
it's still java and now it uses metropolis sampling to help that poor path tracing converge. Btw. I did MLT on 
yesterday evening after 2 beers (it had to be Ballmer peak). Altough the implementation is still very fresh it 
easily outperforms standard path tracing, what is to be seen especially when difficult caustics are involved. 
I've implemented spectral rendering too, it was very easy actually, cause all computations on wavelengths are 
linear just like rgb. But then I realised that even if it does feel more physically correct to do so, whats the 
point? 3d applications are operating in rgb color space, and because I cant represent a rgb color as spectrum 
interchangeably I have to approximate it, so as long as I'm not running a physical simulation or something I don't
see the benefits (please correct me if I'm wrong), thus I abandoned that."""

print(wordClassFeatures(text))

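The drawback of this approach is the extra overhead of stemming every word in all the word classes, because a word in the text has to match exactly to fall into a word class. So I am now trying to use each word of the text as a regular expression and search for it in each word class. This throws the error: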

line 362, in wordClassFeatures
if re.search(conversation, word):
  File "/root/anaconda3/lib/python3.6/re.py", line 182, in search
    return _compile(pattern, flags).search(string)
  File "/root/anaconda3/lib/python3.6/re.py", line 289, in _compile
    p, loc = _cache[type(pattern), pattern, flags]
TypeError: unhashable type: 'list'

I know there is a major error in the syntax, but I could not find it online, because most examples of re.search use the following format:

  

re.search("thank|appreciate|advance", x)

Is there a way to implement this correctly?

1 Answer:

Answer 0: (score: 0)

I believe re.search is looking for a string (or buffer), not the list variable that your code supplies for conversation.
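As a small illustration of that point, the sketch below calls re.search with a string and then with a list in each argument position. The shortened conversation string and the errors list are made up for the demo; the exact wording of the messages depends on the Python version.

```python
import re

# Shortened stand-in for the word class from the question.
conversation = "know people think person tell feel friends"

# A string pattern searched in a string works as expected.
match = re.search("feel", conversation)
print(match.group(0))

errors = []

# A list in place of the searched string raises TypeError.
try:
    re.search("feel", conversation.split())
except TypeError as exc:
    errors.append(("list as searched string", str(exc)))

# A list in place of the pattern also raises TypeError
# (reported as an unhashable type in the traceback above).
try:
    re.search(conversation.split(), "feel")
except TypeError as exc:
    errors.append(("list as pattern", str(exc)))

for label, message in errors:
    print(label, "->", message)
```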

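Also, when you tokenize, you end up searching with every special character in text as well, and punctuation tokens such as "." or "(" are interpreted as regex syntax rather than as literal characters.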

So, first we need to strip the special characters from text:

text = re.sub(r'\W+', ' ', text)  # strip text of all special characters
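To see why the punctuation matters, here is a minimal sketch of what happens when punctuation tokens are used as patterns, and what the substitution leaves behind. The sample string is an excerpt from the question's text; re.escape is shown only as a possible alternative, not as part of this answer's fix.

```python
import re

conversation = "know people think person tell feel friends"

# "." is the regex wildcard, so a lone full stop matches any character.
dot_matches = bool(re.search(".", conversation))
print(dot_matches)  # True

# "(" on its own is not a valid pattern and raises re.error.
try:
    re.search("(", conversation)
    paren_failed = False
except re.error as exc:
    paren_failed = True
    print("invalid pattern:", exc)

# Replacing each run of non-word characters with one space keeps only word tokens.
sample = "beers (it had to be Ballmer peak). Altough"
cleaned = re.sub(r"\W+", " ", sample)
print(cleaned)  # beers it had to be Ballmer peak Altough

# Alternative: escape a token so it is searched for literally.
literal_dot = bool(re.search(re.escape("."), conversation))
print(literal_dot)  # False
```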

Next, we keep the conversation and home variables as strings instead of tokenizing them:

#conversation = nltk.word_tokenize(conversation)
#home = nltk.word_tokenize(home)

We get the desired output:

(0.21301775147928995, 0.20118343195266272)

Full code below:

import nltk
import re

def wordClassFeatures(text):
    home = """woke home sleep today eat tired wake watch
        watched dinner ate bed day house tv early boring
        yesterday watching sit"""

    conversation = """know people think person tell feel friends
talk new talking mean ask understand feelings care thinking
friend relationship realize question answer saying"""

    text = re.sub(r'\W+', ' ', text)  # strip text of all special characters

    countHome = countConversation =0

    totalWords = len(text.split())

    text = text.lower()
    text = nltk.word_tokenize(text)
    #conversation = nltk.word_tokenize(conversation)
    #home = nltk.word_tokenize(home)
    '''
        for word in text:
            if word in conversation: #this is my current approach
                countConversation += 1
            if word in home:
                countHome += 1
    '''

    for word in text:
        if re.search(word, conversation): #this is what I want to implement
            countConversation += 1
        if re.search(word, home):
            countHome += 1

    countConversation /= 1.0*totalWords
    countHome /= 1.0*totalWords

    return(countHome,countConversation)

text = """ Long time no see. Like always I was rewriting it from scratch a couple of times. But nevertheless 
it's still java and now it uses metropolis sampling to help that poor path tracing converge. Btw. I did MLT on 
yesterday evening after 2 beers (it had to be Ballmer peak). Altough the implementation is still very fresh it 
easily outperforms standard path tracing, what is to be seen especially when difficult caustics are involved. 
I've implemented spectral rendering too, it was very easy actually, cause all computations on wavelengths are 
linear just like rgb. But then I realised that even if it does feel more physically correct to do so, whats the 
point? 3d applications are operating in rgb color space, and because I cant represent a rgb color as spectrum 
interchangeably I have to approximate it, so as long as I'm not running a physical simulation or something I don't
see the benefits (please correct me if I'm wrong), thus I abandoned that."""

print(wordClassFeatures(text))
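
One caveat with the code above: re.search(word, conversation) is a substring search, so a short token such as "a" is found inside "talk" and counted as a hit. The sketch below is one possible variation, not part of the original answer. It compiles each word class into a single pattern and counts a token only when it starts with one of the class words, which is an assumption about how the stemming step could be avoided. The class_frequency helper and the shortened word list are hypothetical.

```python
import re

# Shortened stand-in for the word class from the question.
conversation = "know people think person tell feel friends talk new talking"

# Substring search: the one-letter token "a" is found inside "talk".
substring_hit = bool(re.search("a", conversation))
print(substring_hit)  # True

# Whole-word search with \b finds no standalone "a" in the class.
whole_word_hit = bool(re.search(r"\ba\b", conversation))
print(whole_word_hit)  # False


def class_frequency(text, class_words):
    """Return the fraction of tokens in text that start with a class word.

    Hypothetical helper: prefix matching is a rough stand-in for stemming
    and will also accept unrelated words that share a prefix.
    """
    alternatives = "|".join(re.escape(w) for w in class_words.split())
    pattern = re.compile(r"(?:" + alternatives + r")\w*")
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if pattern.fullmatch(token))
    return hits / len(tokens)


# "talking" and "friends" are the only hits among the seven tokens.
frequency = class_frequency("I was talking to my friends yesterday", conversation)
print(frequency)
```

Prefix matching trades precision for convenience: "new" would also accept "news", so a proper stemmer may still be the better choice when accuracy matters.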