Question

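I am analyzing, sentence by sentence and word by word, "Hey there!! This is an excellent movie???"

I have many sentences like the one above. I also have a huge dataset file, shown below, and I need to quickly look up whether a word exists in it. If it does, I analyze it and store the results in a dictionary: for example, the word's score from the file, the score of the last word of the sentence, the first word of the sentence, and so on.

sentence[i] => Hey there!! This is an excellent movie??? sentence[0] = Hey, sentence[1] = there!!, sentence[2] = This, and so on.

Here is the code: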

def unigrams_nrc(file):
   for line in file:
       (term,score,numPos,numNeg) = re.split("\t", line.strip())
       if re.match(sentence[i],term.lower()):
          #presence or absence of unigrams of a target term
          wordanalysis["unigram"] = found
       else:
          found = False
       if found:
          wordanalysis["trail_unigram"] = found if re.match(sentence[(len(sentence)-1)],term.lower()) else not(found)
          wordanalysis["lead_unigram"] = found  if re.match(sentence[0],term.lower()) else not(found)
          wordanalysis["nonzero_sscore"] = float(score) if (float(score) != 0) else 0             
          wordanalysis["sscore>0"] = (float(score) > 0)
          wordanalysis["sscore"] = (float(score) != 0)

       if re.match(tweet[len(sentence)-1],term.lower()):
          wordanalysis["sscore !=0 last token"] = (float(score) != 0)

Here is the file (it contains more than 4000 words):

#fabulous   7.526   2301    2
#excellent  7.247   2612    3
#superb 7.199   1660    2
#perfection 7.099   3004    4
#terrific   6.922   629 1
#magnificent    6.672   490 1
#sensational    6.529   849 2
#heavenly   6.484   2841    7
#ideal  6.461   3172    8
#partytime  6.111   559 2
#excellence 5.875   1325    6
@thisisangel    5.858   217 1
#wonderful  5.727   3428    18
elegant 5.665   537 3
#perfect    5.572   3749    23
#fine   5.423   2389    17
excellence  5.416   279 2
#realestate 5.214   114 1
bicycles    5.205   113 1

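Is there a better way to do this? By better I mean faster, less code, and more elegant. I am new to Python, so I know this is not the best code. I have about 4 files that I have to check scores against, so I would like to implement this function in the best possible way.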

Answer 1

Here are my tips:

Use json.dumps()
Use json.loads()
Separate loading the data from the analysis into separate logical blocks of code, e.g. functions

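Python dicts are much faster for lookups, at O(1) complexity, than iterating over the file, which is O(n). So you will gain some performance as long as you load your data file once up front.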
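
To make the difference concrete, here is a minimal sketch of the two lookup styles. The rows are copied from the question's file; the names `scan_lookup`, `dict_lookup`, and `index` are made up for illustration.

```python
# A few rows copied from the question's file: (term, score, numPos, numNeg).
rows = [
    ("#fabulous", 7.526, 2301, 2),
    ("#excellent", 7.247, 2612, 3),
    ("elegant", 5.665, 537, 3),
    ("bicycles", 5.205, 113, 1),
]

def scan_lookup(term):
    """O(n): walk the rows until the term is found."""
    for row in rows:
        if row[0] == term:
            return row[1:]
    return None

# Built once up front; each later lookup is O(1) on average.
index = {row[0]: row[1:] for row in rows}

def dict_lookup(term):
    """O(1) on average: a single hash lookup."""
    return index.get(term)

print(scan_lookup("elegant"))  # (5.665, 537, 3)
print(dict_lookup("elegant"))  # (5.665, 537, 3)
print(dict_lookup("missing"))  # None
```

Both functions return the same answer; the dict version simply avoids rescanning every row for every word of every sentence.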

Example(s):

from json import dumps, loads


def load_data(filename):
    # Read the whole JSON file and parse it into Python objects
    with open(filename, "r") as f:
        return loads(f.read())

def save_data(filename, data):
    with open(filename, "w") as f:
        f.write(dumps(data))

data = load_data("data.json")

foo = data["word"]  # O(1) lookup of "word"

I would probably store your data as:

data = {
    "fabulous": [7.526, 2301, 2],
    ...
}

Then you would do:

stats = data.get(word, None)
if stats is not None:
    score, x, y = stats
    ...

Note: the ... is not real code but a placeholder where you should fill in the blanks.
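
Putting the tips together, one possible way to build such a dict from the tab-separated file and round-trip it through JSON is sketched below. It assumes every non-blank line has exactly four tab-separated fields, as in the question's sample. `parse_lexicon` is a hypothetical helper name, and the keys are kept as they appear in the file (only lowercased), because the sample contains both `#excellence` and `excellence` as separate entries.

```python
import json

def parse_lexicon(lines):
    """Turn 'term<TAB>score<TAB>numPos<TAB>numNeg' lines into a dict."""
    data = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        term, score, num_pos, num_neg = line.split("\t")
        data[term.lower()] = [float(score), int(num_pos), int(num_neg)]
    return data

# Two lines in the assumed file format, used here instead of a real file.
sample = [
    "#fabulous\t7.526\t2301\t2\n",
    "elegant\t5.665\t537\t3\n",
]
data = parse_lexicon(sample)

# json.dumps / json.loads round-trip the structure unchanged.
restored = json.loads(json.dumps(data))
print(restored["elegant"])  # [5.665, 537, 3]
```

With a real file you would pass the open file object to `parse_lexicon` instead of `sample`, since iterating over a file yields its lines.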

Answer 2

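Maybe load the word/score file into memory once as a dict of dicts, then loop through every word in every sentence and check whether that word is a key in the loaded dict.

Would something like this work?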

import re

def load_words(file):
    # Build the lookup once: term -> {'score', 'numPos', 'numNeg'}
    word_lookup = {}
    for line in file:
        (term,score,numPos,numNeg) = re.split("\t", line.strip())
        if term not in word_lookup:
            word_lookup[term] = {'score': score, 'numPos': numPos, 'numNeg': numNeg}
    return word_lookup

def check_word(word):
    # dict.get returns None when the word is not in the lookup
    return word_lookup.get(word)

def run_sentence(s):
    s = standardize_sentence(s) # Assuming you want to strip punctuation, symbols, convert to lowercase, etc
    words = s.split(' ')
    first = words[0]
    last = words[-1]
    for word in words:
        word_info = check_word(word)
        if word_info:
            # Matched word, use your scores somehow (word_info['score'], etc)
            pass

# Load the file once, then process every sentence against the in-memory dict
word_lookup = load_words(file)
for s in sentences:
    run_sentence(s)
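
The `standardize_sentence` helper is left undefined above. A minimal sketch of one way to write it, together with the lead/trail features from the question, might look like the following. The punctuation handling (strip punctuation from the edges of each token, but keep `#` and `@` because lexicon terms start with them) and the `sentence_features` name are assumptions, not part of the question or the answer.

```python
import string

# Punctuation to strip from token edges; '#' and '@' are kept on purpose.
_STRIP = "".join(c for c in string.punctuation if c not in "#@")

def standardize_sentence(s):
    """Lowercase the sentence and strip punctuation from the edges of each token."""
    tokens = (tok.strip(_STRIP) for tok in s.lower().split())
    return " ".join(tok for tok in tokens if tok)

def sentence_features(s, word_lookup):
    """Collect a few of the question's features using dict lookups only."""
    words = standardize_sentence(s).split()
    scores = {w: word_lookup[w]["score"] for w in words if w in word_lookup}
    return {
        "unigram": bool(scores),  # any word found in the lexicon
        "lead_unigram": bool(words) and words[0] in word_lookup,
        "trail_unigram": bool(words) and words[-1] in word_lookup,
        "scores": scores,
    }

# Tiny lookup in the same shape that load_words builds.
lookup = {"elegant": {"score": 5.665, "numPos": 537, "numNeg": 3}}

print(standardize_sentence("Hey there!! This is an elegant movie???"))
# hey there this is an elegant movie
print(sentence_features("Elegant, isn't it?", lookup))
```

Each sentence then costs one dict lookup per word, regardless of how many terms the lexicon file holds.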

Analyzing and scoring from a file in Python
