Question

SO, I'm hoping someone can bring some clarity to what I'm trying to do.

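Here is what I want to do: alter/extend (distinct and separate) lists with a document's index if that document's wording is found in the (distinct and separate) lists of terms that I specify (hereafter called "terms_list").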

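At the moment, all the code needed to compare against a terms_list and extend an index list lives in a single function. To make sure I look at every document in the database, I call that function once for each terms_list. What I want to do is check every document in our database to see whether it matches any of the terms in a terms_list. If it does, I want to add that document's index to an index list that applies only to, or is associated only with, the terms_list being checked. I hope that makes sense.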

Maybe some code will help clarify a bit:

First, I create my terms_list lists and their associated index lists (empty at first).

something_word_list       = ['term_a', 'term_b', 'term_c']
something_idx_list        = []

another_thing_word_list   = ['term_a', 'term_b', 'term_c']
another_thing_idx_list    = []

Then I create a function that checks the terms_list currently being examined and extends the idx_list if necessary/appropriate, which it always is.

def get_docs_n_build_idx_lists(terms_n_indices):
    subcat_word_list, subcat_idx_list = terms_n_indices[0], terms_n_indices[1]
    ##initialize db cursor:
    db_conn = crawler_library.connect_to_db("events")
    cursor  = db_conn.cursor()
    ##make query:
    query = "SELECT event_title,description,extra_info,venue_name,idx FROM events WHERE 1"
    #execute the query and catch any errors that show up and print them so I am not flying blind
    try:
        cursor.execute(query)
    except MySQLdb.Error, e:
        print "MySQL Error [%d]: %s" % (e.args[0], e.args[1])
        crawler_library.close_db_connection(db_conn)
        return  # the query failed, so there are no rows to process
    #loop through all results in the query set, one row at-a-time


    if cursor.rowcount > 0: #don't bother doing anything if we don't get anything from the database
        data = cursor.fetchall()
        for row in data:
            temp_string = word_tokenize(nltk.clean_html(str(row[0]).strip(string.digits
            + string.punctuation).lower() +" "+str(row[1]).strip(string.digits +
            string.punctuation).lower() +" "+str(row[2]).strip(string.digits +
            string.punctuation).lower()+" "+str(row[3]).strip(string.digits +
            string.punctuation).lower()))

            temp_string     = [' '.join(word.strip(string.punctuation).split()) for
                       word in temp_string if word not in stopwords and len(word) >= 3]

            bigrams         = nltk.bigrams(word_tokenize(str(' '.join(temp_string))))
            all_terms_list  = temp_string + [str(bigram).replace(",","").replace("'",
                              "").strip("()") for bigram in bigrams]

            subcat_idx_list.extend((str(row[4]) for word in subcat_word_list if word in
            all_terms_list))     

    # still inside the function: subcat_idx_list is only defined here
    print subcat_idx_list
    print "______________________________", "\n"
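
To make the matching step in the function above easier to follow, here is a minimal, standard-library-only sketch of it in Python 3. It is not the original code: `STOPWORDS`, `build_terms`, and `matches_any` are hypothetical names, whitespace splitting stands in for NLTK's tokenizer, and no database is involved. It also records a match once per document, whereas the function above extends the list once per matching term.

```python
import string

# Hypothetical stand-in for the stopword list used in the question.
STOPWORDS = {"the", "and", "for", "with"}


def build_terms(text):
    """Return the document's unigrams plus its space-joined bigrams."""
    # Lowercase, split on whitespace, and trim punctuation/digits from each word.
    words = [w.strip(string.punctuation + string.digits) for w in text.lower().split()]
    # Keep words of at least three characters that are not stopwords.
    words = [w for w in words if len(w) >= 3 and w not in STOPWORDS]
    # Pair each remaining word with its successor to form bigrams.
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def matches_any(text, terms):
    """Return True if at least one of `terms` occurs among the document's terms."""
    doc_terms = set(build_terms(text))
    return any(term in doc_terms for term in terms)


print(build_terms("Live Jazz at the Blue Room!"))
print(matches_any("Live Jazz at the Blue Room!", ["blue room"]))
```

Converting the document's terms to a set keeps each membership test cheap when the term lists grow.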

Then I call the get_docs_n_build_idx_lists function, passing it a (terms_list, idx_list) tuple:

for terms_n_indices in terms_n_indices_list:
    get_docs_n_build_idx_lists(terms_n_indices)
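
The construction of `terms_n_indices_list` is not shown above. Assuming it holds tuples that reference the very lists defined earlier, the sketch below (Python 3) illustrates that mutating a list inside a function with `append` or `extend` changes the same object the outer name refers to. `FAKE_DOCS` and `build_idx_list` are hypothetical stand-ins for the database rows and the real function.

```python
something_word_list = ['term_a', 'term_b']
something_idx_list = []

another_thing_word_list = ['term_c']
another_thing_idx_list = []

# Assumed construction; the question does not show this step.
terms_n_indices_list = [
    (something_word_list, something_idx_list),
    (another_thing_word_list, another_thing_idx_list),
]

# Hypothetical in-memory rows: (idx, terms already extracted from the document).
FAKE_DOCS = [
    (1, ['term_a', 'other']),
    (2, ['term_c']),
    (3, ['term_b', 'term_c']),
]


def build_idx_list(terms_n_indices):
    word_list, idx_list = terms_n_indices
    for idx, doc_terms in FAKE_DOCS:
        if any(word in doc_terms for word in word_list):
            # append mutates the caller's list object in place
            idx_list.append(str(idx))


for terms_n_indices in terms_n_indices_list:
    build_idx_list(terms_n_indices)

print(something_idx_list)      # ['1', '3']
print(another_thing_idx_list)  # ['2', '3']
```

Note that rebinding the parameter inside the function (for example `idx_list = idx_list + [str(idx)]`) would create a new list and leave the outer one untouched; only in-place operations carry over.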

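This code prints out the separate index lists, so it appears to be working. What I don't know how to do, however, is *how to tie these idx_lists back to their specific names* (for example, how to extend another_thing_idx_list when I am comparing the terms of the database documents against the terms in another_thing_word_list). I need/want the function to extend one specific idx_list each time, so that I can use it later. Note: global variables didn't work (though I don't entirely know why). Note also: I could rely on a bunch of conditionals, but I'd like to avoid that as a solution because the number of term_lists and idx_lists will almost certainly grow.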
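
One way to picture the name-to-list association described above, offered as a sketch rather than a definitive answer, is a dictionary keyed by category name, so that no conditionals are needed as the number of lists grows. Everything here is hypothetical: the `categories` layout, the `record_matches` helper, and the sample `docs` standing in for database rows.

```python
# Hypothetical registry: each category name maps to its terms and its index list.
categories = {
    "something": {"terms": ["term_a", "term_b"], "indices": []},
    "another_thing": {"terms": ["term_c"], "indices": []},
}


def record_matches(categories, doc_idx, doc_terms):
    """Append doc_idx to every category whose terms appear in doc_terms."""
    doc_terms = set(doc_terms)
    for entry in categories.values():
        if any(term in doc_terms for term in entry["terms"]):
            entry["indices"].append(doc_idx)


# Hypothetical documents standing in for database rows: (idx, extracted terms).
docs = [(10, ["term_a", "jazz"]), (11, ["term_c"]), (12, ["term_b", "term_c"])]

for doc_idx, doc_terms in docs:
    record_matches(categories, doc_idx, doc_terms)

print(categories["something"]["indices"])
print(categories["another_thing"]["indices"])
```

With this layout each document is read once and checked against every category, and a later lookup such as `categories["another_thing"]["indices"]` retrieves the list by name.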

Could someone please explain the simplest or most Pythonic way to do this? Thanks in advance for any help, however small.

Altering different lists outside of a function. How do I make sure I alter the right list, and how do I save it?

0 answers: