Generating n-grams for a specific column in a MySQL database

Time: 2019-06-11 09:37:33

Tags: mysql python-2.7 n-gram

I am writing code that reads a specific column and generates n-grams for every record in the table.

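def extract_from_db(inp_cust_id):
    sql_db = TatDBHelper()
    # Note: formatting values into SQL is injection-prone; prefer bound
    # parameters if the helper supports them.
    t_sql = "select notes from raw_data where customer_id = {0}"
    db_data = sql_db.execute_read(t_sql.format(inp_cust_id))
    for row in db_data:
        # Each row holds a single column (notes), so take the first value.
        text = row.values()
        bi_grams = generate_ngrams(text[0].encode("utf-8"), 2)
        # Prints one list per row.
        print bi_grams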

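import re

def generate_ngrams(sentence, n):
    sentence = sentence.lower()
    # Replace everything except letters, digits and whitespace with a space.
    sentence = re.sub(r'[^a-z0-9\s]', ' ', sentence)
    # split() with no argument splits on any whitespace (tabs and newlines
    # included) and drops empty tokens.
    tokens = sentence.split()
    # Zipping n shifted copies of the token list yields every window of n tokens.
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]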

The output I get is one list per row:

['i highly', 'highly recommend', 'recommend it']
['the penguin', 'penguin encounter', 'encounter was', 'was awesome']

I want the output to be a single list like the one below. Can anyone help me get this?

['i highly',
 'highly recommend',
 'recommend it',
 ...
]

1 answer:

Answer 0 (score: 0)

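Create another list, all_ngrams, before the loop and use .extend() to add each row's bigrams to it. Unlike .append(), which would add each row's list as one nested element, .extend() adds the individual items, so you end up with all the n-grams in one flat list.

Try this: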

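def extract_from_db(inp_cust_id):
    sql_db = TatDBHelper()
    t_sql = "select notes from raw_data where customer_id = {0}"
    db_data = sql_db.execute_read(t_sql.format(inp_cust_id))
    # Accumulates the bigrams from every row.
    all_ngrams = []
    for row in db_data:
        text = row.values()
        bi_grams = generate_ngrams(text[0].encode("utf-8"), 2)
        # extend() adds the items themselves, keeping the list flat.
        all_ngrams.extend(bi_grams)
    # Print once, after the loop, so the output is a single list.
    print all_ngrams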
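
To try the approach without a database, here is a minimal sketch. It assumes the rows behave like dictionaries with a notes key; sample_rows is a hypothetical stand-in for whatever execute_read returns, and the two sentences are made up to match the output in the question. The print function is imported from __future__ so the same code also runs on Python 3.

```python
from __future__ import print_function
import re


def generate_ngrams(sentence, n):
    # Lowercase, replace punctuation with spaces, then split on whitespace.
    sentence = re.sub(r'[^a-z0-9\s]', ' ', sentence.lower())
    tokens = sentence.split()
    # Zipping n shifted copies of the token list yields each window of n tokens.
    return [" ".join(ngram) for ngram in zip(*[tokens[i:] for i in range(n)])]


# Hypothetical stand-in for the rows returned by the database helper.
sample_rows = [
    {"notes": "I highly recommend it!"},
    {"notes": "The penguin encounter was awesome."},
]

all_ngrams = []
for row in sample_rows:
    # extend() adds each bigram individually, so the result stays flat.
    all_ngrams.extend(generate_ngrams(row["notes"], 2))

print(all_ngrams)
```

Replacing extend() with append() in this sketch would produce a list of two nested lists, which is essentially what the original per-row print showed.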