NLTK: generating bigrams and trigrams together only produces the first one

Time: 2018-11-28 21:05:34

Tags: python nltk

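I'm trying to pass text into the script below and have it output both bigrams and trigrams. This is roughly my sixth attempt, and for some reason it behaves like all the others: only the first set of n-grams is generated and the other comes back empty. I've tried switching the order and various other approaches.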

Here is the current script:

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import os
import sys
from datetime import datetime, timedelta
import random
import nltk
from nltk.collocations import *
import re
import json
from pprint import pprint


def bigram_generator(important_words, gram_dict):
    finder = BigramCollocationFinder.from_words(important_words, 2)

    for bigram, count in finder.ngram_fd.items():
          gram_dict[' '.join(bigram)] = count

    return gram_dict

def trigram_generator(important_words, gram_dict):
    finder1 = TrigramCollocationFinder.from_words(important_words, 3)

    for trigram, count in finder1.ngram_fd.items():
          gram_dict[' '.join(trigram)] = count

    return gram_dict

def execute_gram_analysis2(important_words):
    bigram_dict = {}
    for x in range(1,10):
        bigram_dict = bigram_generator(important_words, bigram_dict)

    trigram_dict = {}
    for y in range(1,10):
        trigram_dict = trigram_generator(important_words, trigram_dict)

    return bigram_dict, trigram_dict

def convert_gram_dict_to_json(gram_dict):
    json_grams_dict = json.dumps(gram_dict, ensure_ascii=False)
    return json_grams_dict


stopwords = nltk.corpus.stopwords.words('english')

scraped_url_id = 2

s = scraped_urls.select().where(scraped_urls.c.id==scraped_url_id)
results = monitor_bot_conn.execute(s)
for row in results:
    row_id = row[0]
    text = row[6]

    print (text)

    words = re.findall(r'\w+', text.decode('utf-8'))

    words_lowercase = []
    for word in words:
        words_lowercase.append(word.lower())


    important_words = filter(lambda x: x not in stopwords, words_lowercase)

    bigrams_dict, trigrams_dict = execute_gram_analysis2(important_words)
    json_bigrams_dict = convert_gram_dict_to_json(bigrams_dict)
    print ('\n\n---[ BIGRAMS ]---\n\n')
    pprint (json_bigrams_dict)

    json_trigrams_dict = convert_gram_dict_to_json(trigrams_dict)
    print ('\n\n---[ TRIGRAMS ]---\n\n')
    pprint (json_trigrams_dict)

Running the script above on the source text below, I get the following output:

    ---[ SOURCE TEXT ]---
b'A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing?not even particles and electromagnetic radiation such as light?can escape from inside it.[1] The theory of general relativity predicts that a sufficiently compact mass can deform spacetime to form a black hole.[2][3] The boundary of the region from which no escape is possible is called the event horizon. Although the event horizon has an enormous effect on the fate and circumstances of an object crossing it, no locally detectable features appear to be observed.[4] In many ways a black hole acts like an ideal black body, as it reflects no light.[5][6] Moreover, quantum field theory in curved spacetime predicts that event horizons emit Hawking radiation, with the same spectrum as a black body of a temperature inversely proportional to its mass. This temperature is on the order of billionths of a kelvin for black holes of stellar mass, making it essentially impossible to observe.\n\nObjects whose gravitational fields are too strong for light to escape were first considered in the 18th century by John Michell and Pierre-Simon Laplace.[7] The first modern solution of general relativity that would characterize a black hole was found by Karl Schwarzschild in 1916, although its interpretation as a region of space from which nothing can escape was first published by David Finkelstein in 1958. Black holes were long considered a mathematical curiosity; it was during the 1960s that theoretical work showed they were a generic prediction of general relativity. The discovery of neutron stars in the late 1960s sparked interest in gravitationally collapsed compact objects as a possible astrophysical reality.\n'


---[ BIGRAMS OUTPUT]---

('{"black hole": 4, "hole region": 1, "region spacetime": 1, "spacetime '
 'exhibiting": 1, "exhibiting strong": 1, "strong gravitational": 1, '
 '"gravitational effects": 1, "effects nothing": 1, "nothing even": 1, "even '
 'particles": 1, "particles electromagnetic": 1, "electromagnetic radiation": '
 '1, "radiation light": 1, "light escape": 2, "escape inside": 1, "inside 1": '
 '1, "1 theory": 1, "theory general": 1, "general relativity": 3, "relativity '
 'predicts": 1, "predicts sufficiently": 1, "sufficiently compact": 1, '
 '"compact mass": 1, "mass deform": 1, "deform spacetime": 1, "spacetime '
 'form": 1, "form black": 1, "hole 2": 1, "2 3": 1, "3 boundary": 1, "boundary '
 'region": 1, "region escape": 1, "escape possible": 1, "possible called": 1, '
 '"called event": 1, "event horizon": 2, "horizon although": 1, "although '
 'event": 1, "horizon enormous": 1, "enormous effect": 1, "effect fate": 1, '
 '"fate circumstances": 1, "circumstances object": 1, "object crossing": 1, '
 '"crossing locally": 1, "locally detectable": 1, "detectable features": 1, '
 '"features appear": 1, "appear observed": 1, "observed 4": 1, "4 many": 1, '
 '"many ways": 1, "ways black": 1, "hole acts": 1, "acts like": 1, "like '
 'ideal": 1, "ideal black": 1, "black body": 2, "body reflects": 1, "reflects '
 'light": 1, "light 5": 1, "5 6": 1, "6 moreover": 1, "moreover quantum": 1, '
 '"quantum field": 1, "field theory": 1, "theory curved": 1, "curved '
 'spacetime": 1, "spacetime predicts": 1, "predicts event": 1, "event '
 'horizons": 1, "horizons emit": 1, "emit hawking": 1, "hawking radiation": 1, '
 '"radiation spectrum": 1, "spectrum black": 1, "body temperature": 1, '
 '"temperature inversely": 1, "inversely proportional": 1, "proportional '
 'mass": 1, "mass temperature": 1, "temperature order": 1, "order billionths": '
 '1, "billionths kelvin": 1, "kelvin black": 1, "black holes": 2, "holes '
 'stellar": 1, "stellar mass": 1, "mass making": 1, "making essentially": 1, '
 '"essentially impossible": 1, "impossible observe": 1, "observe objects": 1, '
 '"objects whose": 1, "whose gravitational": 1, "gravitational fields": 1, '
 '"fields strong": 1, "strong light": 1, "escape first": 2, "first '
 'considered": 1, "considered 18th": 1, "18th century": 1, "century john": 1, '
 '"john michell": 1, "michell pierre": 1, "pierre simon": 1, "simon laplace": '
 '1, "laplace 7": 1, "7 first": 1, "first modern": 1, "modern solution": 1, '
 '"solution general": 1, "relativity would": 1, "would characterize": 1, '
 '"characterize black": 1, "hole found": 1, "found karl": 1, "karl '
 'schwarzschild": 1, "schwarzschild 1916": 1, "1916 although": 1, "although '
 'interpretation": 1, "interpretation region": 1, "region space": 1, "space '
 'nothing": 1, "nothing escape": 1, "first published": 1, "published david": '
 '1, "david finkelstein": 1, "finkelstein 1958": 1, "1958 black": 1, "holes '
 'long": 1, "long considered": 1, "considered mathematical": 1, "mathematical '
 'curiosity": 1, "curiosity 1960s": 1, "1960s theoretical": 1, "theoretical '
 'work": 1, "work showed": 1, "showed generic": 1, "generic prediction": 1, '
 '"prediction general": 1, "relativity discovery": 1, "discovery neutron": 1, '
 '"neutron stars": 1, "stars late": 1, "late 1960s": 1, "1960s sparked": 1, '
 '"sparked interest": 1, "interest gravitationally": 1, "gravitationally '
 'collapsed": 1, "collapsed compact": 1, "compact objects": 1, "objects '
 'possible": 1, "possible astrophysical": 1, "astrophysical reality": 1}')

---[ TRIGRAMS OUTPUT ]---

'{}'

I don't understand why I can't get the script to output both the bigrams and the trigrams in the same run.

Thanks in advance for your help!

1 Answer:

Answer 0: (score: 0)

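In Python 3, filter returns an iterator, not a list. Once it has been iterated over, it is exhausted and yields nothing more. In your script the bigram finder consumes important_words on its first call, so by the time the trigram finder runs there is nothing left to read, which is why the trigram output is '{}'. If you want to go over the data more than once, convert it to a list first: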

important_words = list(filter(lambda x: x not in stopwords, words_lowercase))
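
To see the effect in isolation, here is a minimal sketch that does not need NLTK. The count_ngrams helper, the short word list and the toy stopword set are made up for illustration; the helper is only a stand-in for the collocation finders, not how NLTK implements them.

```python
from collections import Counter


def count_ngrams(words, n):
    """Count n-grams in any iterable of words (hypothetical stand-in for the NLTK finders)."""
    tokens = list(words)  # this consumes `words` if it is an iterator
    return Counter(zip(*(tokens[i:] for i in range(n))))


words = ["black", "hole", "region", "spacetime", "black", "hole"]
stopwords = {"a", "is", "of"}  # toy stopword set, assumed for this sketch

# Lazy iterator: the first pass exhausts it, so the second pass sees nothing.
lazy_words = filter(lambda w: w not in stopwords, words)
lazy_bigrams = count_ngrams(lazy_words, 2)
lazy_trigrams = count_ngrams(lazy_words, 3)

# Materialized list: it can be traversed as many times as needed.
word_list = [w for w in words if w not in stopwords]
list_bigrams = count_ngrams(word_list, 2)
list_trigrams = count_ngrams(word_list, 3)

print(len(lazy_bigrams), len(lazy_trigrams))
print(len(list_bigrams), len(list_trigrams))
```

With the iterator, the trigram counter comes back empty, just like in the question. With the list, both counters are filled. The same reasoning explains why swapping the order of the two calls in the script only changes which result is empty.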