How can I improve my n-gram results?

Asked: 2018-11-04 03:06:58

Tags: python n-gram information-extraction

The line below loads the data frame that gram_result will use. It contains the two classes of the dataset ("7" and "others"):

import pandas as pd
df = pd.read_csv("output.txt", sep="|")  # the file name must be a quoted string

This is the code I tried for the n-grams:

import numpy as np
from sklearn.feature_selection import chi2

def gram_result():
    # NOTE: `features`, `labels` and `tfidf` are not defined in this snippet;
    # they are assumed to be the TF-IDF matrix, the cat_id labels and the
    # fitted vectorizer, all created elsewhere.
    N = 5  # number of unigrams, bigrams and trigrams to print per class
    full_dict = {}
    for cat_id in df["cat_id"].sort_values().drop_duplicates():
        # chi2 returns (scores, p-values); argsort orders features by ascending score
        features_chi2 = chi2(features, labels == cat_id)
        indices = np.argsort(features_chi2[0])
        # On scikit-learn < 1.0 use tfidf.get_feature_names() instead
        feature_names = np.array(tfidf.get_feature_names_out())[indices]

        # Split the ranked feature names by n-gram length
        unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
        bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
        trigrams = [v for v in feature_names if len(v.split(' ')) == 3]

        print("# '{}':".format(cat_id))
        print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
        print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))
        print("  . Most correlated trigrams:\n. {}".format('\n. '.join(trigrams[-N:])))

        # Keep the top-N keywords of each n-gram size for this class
        full_dict[cat_id] = (trigrams[-N:], bigrams[-N:], unigrams[-N:])

        print(full_dict[cat_id])

        print()
    return full_dict
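
The function above reads three names that the snippet never defines: `features`, `labels` and `tfidf`. The question does not show how they were built, so the following is only a minimal sketch of one plausible setup. The `text` column name and the toy rows are assumptions made for illustration; only `cat_id` and the two class values come from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for output.txt; the "text" column name is assumed.
df = pd.DataFrame({
    "text": [
        "room air afebrile",
        "water swallow test done",
        "iv cannula insitu",
        "response plans reviewed",
    ],
    "cat_id": ["7", "7", "others", "others"],
})

# ngram_range=(1, 3) puts unigrams, bigrams and trigrams in one vocabulary,
# which is what the len(v.split(' ')) filters in gram_result rely on.
tfidf = TfidfVectorizer(ngram_range=(1, 3))
features = tfidf.fit_transform(df["text"])  # sparse matrix, one row per document
labels = df["cat_id"]                       # aligned row by row with `features`

print(features.shape)
```

Whatever the real setup looks like, `labels` has to stay aligned row by row with `features`, otherwise the chi-square scores are computed against the wrong documents.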

The output of gram_result is:

# '7':
  . Most correlated unigrams:
. immobile
. junl
. tear
. uncommunicative
. indian
  . Most correlated bigrams:
. air afebrile
. water hydration
. ptot keyed
. response plans
. indian nd
  . Most correlated trigrams:
. reassessment indicated immobile
. water swallow test
. total stay standard
. room air afebrile
. iv cannula insitu

# 'others':
  . Most correlated unigrams:
. immobile
. junl
. tear
. uncommunicative
. indian
  . Most correlated bigrams:
. air afebrile
. water hydration
. ptot keyed
. response plans
. indian nd
  . Most correlated trigrams:
. reassessment indicated immobile
. water swallow test
. total stay standard
. room air afebrile
. iv cannula insitu

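The two classes ("7" and "others") contain two different sets of data. However, I am curious why two different classes with different data print exactly the same n-gram keywords. Please take a look at the code of gram_result() and tell me whether the method can be improved.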
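
One likely explanation, offered as a sketch rather than a confirmed diagnosis: with only two classes, `labels == "7"` and `labels == "others"` are exact complements, so both calls split the documents into the same two groups. The chi-square score measures how strongly a feature deviates from independence, not in which direction, so both classes receive the same ranking. The pure-Python function below is written to follow, as far as I understand it, the observed-versus-expected statistic that `sklearn.feature_selection.chi2` computes. The helper name `chi2_scores` and the toy data are hypothetical.

```python
def chi2_scores(rows, is_target):
    """Chi-square score and direction per feature for a binary target.

    rows: per-document lists of non-negative feature weights.
    is_target: bools, True where the document belongs to the class.
    Assumes both groups are non-empty and every feature occurs at least once.
    """
    n_docs = len(rows)
    p_target = sum(is_target) / n_docs
    scores, directions = [], []
    for j in range(len(rows[0])):
        total = sum(row[j] for row in rows)
        obs_target = sum(row[j] for row, t in zip(rows, is_target) if t)
        obs_rest = total - obs_target
        exp_target = p_target * total
        exp_rest = (1 - p_target) * total
        scores.append((obs_target - exp_target) ** 2 / exp_target
                      + (obs_rest - exp_rest) ** 2 / exp_rest)
        # Positive when the feature is over-represented in the target class
        directions.append(obs_target - exp_target)
    return scores, directions


# Hypothetical toy data; columns correspond to feature_names.
feature_names = ["fever", "cannula", "stable"]
rows = [
    [1.0, 0.0, 1.0],  # class "7"
    [1.0, 0.0, 1.0],  # class "7"
    [0.0, 1.0, 1.0],  # class "others"
    [0.0, 1.0, 1.0],  # class "others"
]
labels = ["7", "7", "others", "others"]

results = {}
for cat_id in sorted(set(labels)):
    scores, directions = chi2_scores(rows, [lab == cat_id for lab in labels])
    # Keep only features that occur more often than expected in this class
    positive = [(s, name)
                for s, d, name in zip(scores, directions, feature_names) if d > 0]
    ranked = [name for s, name in sorted(positive, reverse=True)]
    results[cat_id] = (scores, ranked)
    print(cat_id, scores, ranked)
```

On this toy data the raw scores are the same list for both classes, while the direction filter assigns "fever" to class "7" and "cannula" to "others". Applying a similar sign check in gram_result, for example by comparing each feature's summed TF-IDF weight inside the class with its expected share, would be one way to get class-specific keywords.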

0 answers:

No answers