The line below loads the dataframe that gram_result uses. The dataframe contains the two classes of the dataset ("7" and "others"):
import pandas as pd
df = pd.read_csv("output.txt", sep="|")
Here is the code I tried for the n-grams:
import numpy as np
from sklearn.feature_selection import chi2

# features, labels and tfidf are defined elsewhere (not shown here).
def gram_result():
    # N = number of unigrams, bigrams or trigrams to print out per class
    N = 5
    for cat_id in df["cat_id"].sort_values().drop_duplicates():
        # chi2 scores of every feature against "is this row in cat_id?"
        features_chi2 = chi2(features, labels == cat_id)
        indices = np.argsort(features_chi2[0])
        # get_feature_names() was removed in scikit-learn 1.2;
        # newer versions use get_feature_names_out()
        feature_names = np.array(tfidf.get_feature_names())[indices]
        unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
        bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
        trigrams = [v for v in feature_names if len(v.split(' ')) == 3]
        print("# '{}':".format(cat_id))
        print(" . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
        print(" . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))
        print(" . Most correlated trigrams:\n. {}".format('\n. '.join(trigrams[-N:])))
        unigram_dict = {}
        bigram_dict = {}
        trigram_dict = {}
        full_dict = {}
        unigrams_array = []
        bigram_array = []
        trigram_array = []
        # Add keywords into a dictionary
        for item in unigrams[-N:]:
            unigrams_array.append(item)
            unigram_dict[str(cat_id)] = unigrams_array
        for item_bigram in bigrams[-N:]:
            bigram_array.append(item_bigram)
        for item_trigram in trigrams[-N:]:
            trigram_array.append(item_trigram)
        full_dict[cat_id] = (trigram_array, bigram_array, unigrams_array)
        print(full_dict[cat_id])
        print()
The output of the gram_result method is:
# '7':
. Most correlated unigrams:
. immobile
. junl
. tear
. uncommunicative
. indian
. Most correlated bigrams:
. air afebrile
. water hydration
. ptot keyed
. response plans
. indian nd
. Most correlated trigrams:
. reassessment indicated immobile
. water swallow test
. total stay standard
. room air afebrile
. iv cannula insitu
# 'others':
. Most correlated unigrams:
. immobile
. junl
. tear
. uncommunicative
. indian
. Most correlated bigrams:
. air afebrile
. water hydration
. ptot keyed
. response plans
. indian nd
. Most correlated trigrams:
. reassessment indicated immobile
. water swallow test
. total stay standard
. room air afebrile
. iv cannula insitu
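The two classes ("7" and "others") contain different data. However, I am curious why two different classes with different data print out exactly the same n-gram keywords. Please take a look at the gram_result() code and tell me whether the method can be improved.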