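This question is about classifying text based on common words, and I'm not sure whether I'm approaching the problem the right way. I have an Excel file with text in a "Description" column and a unique ID in an "ID" column. I want to iterate over the descriptions, compare them based on the percentage or frequency of the words they have in common, and then classify the descriptions and give them another ID. Please see the example below.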
# importing pandas as pd
import pandas as pd
# creating a dataframe
df = pd.DataFrame({'ID': ['12', '54', '88', '9'],
                   'Description': ['Staphylococcus aureus is a Gram-positive, round-shaped '
                                   'bacterium that is a member of the Firmicutes',
                                   'Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, '
                                   'alpha-hemolytic or beta-hemolytic',
                                   'Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites',
                                   'A television set or television receiver, more commonly called a '
                                   'television, TV, TV set, or telly']})
ID Description
12 Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes
54 Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, round-shaped bacterium that is a member beta-hemolytic
88 Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites
9 A television set or television receiver, more commonly called a television, TV, TV set, or telly
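For example, the descriptions for 12 and 54 have more than 75% of their words in common, so they should get the same ID. The output would look like this: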
ID Description
12 Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes
12 Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, round-shaped bacterium that is a member beta-hemolytic
88 Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites
9 A television set or television receiver, more commonly called a television, TV, TV set, or telly
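The grouping rule described above can be sketched in plain Python. This is only a minimal illustration, not a definitive method: the helper names (`word_set`, `overlap`, `assign_group_ids`), the regex tokenization, and the choice to measure overlap against the smaller of the two word sets are all assumptions, since the post does not define how the shared-word percentage is computed.

```python
import re

def word_set(text):
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def overlap(a, b):
    """Share of the smaller word set that also appears in the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def assign_group_ids(rows, threshold=0.75):
    """rows is a list of (id, description) pairs. Each row takes the ID of the
    first earlier group representative whose overlap reaches the threshold;
    otherwise it keeps its own ID and becomes a new representative."""
    representatives = []  # (group_id, word_set) of rows that started a group
    result = []
    for row_id, text in rows:
        words = word_set(text)
        group_id = row_id
        for rep_id, rep_words in representatives:
            if overlap(words, rep_words) >= threshold:
                group_id = rep_id
                break
        else:
            # no existing group was similar enough
            representatives.append((row_id, words))
        result.append((group_id, text))
    return result

rows = [
    ("12", "Staphylococcus aureus is a Gram-positive, round-shaped bacterium "
           "that is a member of the Firmicutes"),
    ("54", "Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, "
           "round-shaped bacterium that is a member beta-hemolytic"),
    ("88", "Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites"),
    ("9", "A television set or television receiver, more commonly called a "
          "television, TV, TV set, or telly"),
]

# threshold is a tunable assumption; see the note below the block
grouped = assign_group_ids(rows, threshold=0.6)
for group_id, text in grouped:
    print(group_id, text)
```

With this particular tokenization, descriptions 12 and 54 share 9 of the 14 distinct words in the shorter one (about 64%), so the example call uses a threshold of 0.6; a strict 0.75 would leave them in separate groups. Removing stop words or stemming first would change these numbers.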
import codecs
import re
import copy
import collections
import pandas as pd
import numpy as np
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
import matplotlib.pyplot as plt
%matplotlib inline
from nltk.corpus import stopwords
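# creating dataframe 1
df1 = pd.DataFrame({'ID': ['12'],
                    'Description': ['Staphylococcus aureus is a Gram-positive, round-shaped '
                                    'bacterium that is a member of the Firmicutes']})
# creating dataframe 2
df2 = pd.DataFrame({'ID': ['54'],
                    'Description': ['Streptococcus pneumoniae, '
                                    'or pneumococcus, is a Gram-positive, alpha-hemolytic or beta-hemolytic']})
# Assumption: Risk1 and Risk2 are used below but never defined in the post;
# here they are taken to be the two description strings
Risk1 = df1['Description'].iloc[0]
Risk2 = df2['Description'].iloc[0]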
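esw = stopwords.words('english')
word_pattern = re.compile(r"^\w+$")
stemmer = PorterStemmer()
def get_text_counter(text):
    # tokenize first, then lowercase and keep only word tokens that are not stop words
    tokens = WordPunctTokenizer().tokenize(text)
    tokens = [token.lower() for token in tokens]
    tokens = [token for token in tokens if re.match(word_pattern, token) and token not in esw]
    # stem each token individually; stemming the whole text at once treats it as a single word
    tokens = [stemmer.stem(token) for token in tokens]
    return collections.Counter(tokens), len(tokens)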
def make_df(counter, size):
    abs_freq = np.array([el[1] for el in counter])
    rel_freq = abs_freq / size
    index = [el[0] for el in counter]
    df = pd.DataFrame(data=np.array([abs_freq, rel_freq]).T, index=index, columns=['Absolute Frequency', 'Relative Frequency'])
    df.index.name = 'Most_Common_Words'
    return df
Risk1_counter, Risk1_size = get_text_counter(Risk1)
make_df(Risk1_counter.most_common(500), Risk1_size)
Risk2_counter, Risk2_size = get_text_counter(Risk2)
make_df(Risk2_counter.most_common(500), Risk2_size)
all_counter = Risk1_counter + Risk2_counter
# use the combined counter so that words from both texts are included
all_df = make_df(all_counter.most_common(1000), 1)
most_common_words = all_df.index.values
df_data = []
for word in most_common_words:
    Risk1_c = Risk1_counter.get(word, 0) / Risk1_size
    Risk2_c = Risk2_counter.get(word, 0) / Risk2_size
    d = abs(Risk1_c - Risk2_c)
    df_data.append([Risk1_c, Risk2_c, d])
dist_df = pd.DataFrame(data=df_data, index=most_common_words,
                       columns=['Risk1 Relative Freq', 'Risk2 Relative Freq', 'Relative Freq Difference'])
dist_df.index.name = 'Most Common Words'
dist_df.sort_values('Relative Freq Difference', ascending = False, inplace=True)
Answer 0 (score: 2)
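A better approach might be to use a sentence similarity algorithm from NLP. A good starting point is Google's Universal Sentence Encoder, as shown in this Python notebook. If the pretrained Google USE doesn't work well for you, there are other sentence embeddings (e.g. InferSent from Facebook). Another option is to use word2vec and average the vectors obtained for each word in the sentence.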
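The last option (averaging word vectors and comparing the results) can be sketched as follows. This is a minimal illustration only: `TOY_VECTORS` holds made-up 3-dimensional values rather than real word2vec output, and the helper names `sentence_vector` and `cosine` are hypothetical. In practice the vectors would come from a trained embedding model.

```python
import math

# Made-up 3-dimensional vectors purely for illustration; real word2vec
# vectors come from a trained model and have far more dimensions.
TOY_VECTORS = {
    "bacterium":  [0.9, 0.1, 0.0],
    "gram":       [0.8, 0.2, 0.1],
    "positive":   [0.7, 0.3, 0.1],
    "parasites":  [0.6, 0.1, 0.4],
    "television": [0.0, 0.9, 0.2],
    "tv":         [0.1, 0.8, 0.3],
}

def sentence_vector(text, vectors):
    """Average the vectors of the words that have one; None if no word is known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

s1 = sentence_vector("gram positive bacterium", TOY_VECTORS)
s2 = sentence_vector("positive bacterium", TOY_VECTORS)
s3 = sentence_vector("television tv", TOY_VECTORS)

# the two bacterium sentences should score higher than the unrelated pair
print(cosine(s1, s2), cosine(s1, s3))
```

Descriptions whose similarity exceeds a chosen cutoff could then be given the same ID, in the same way as with a shared-word percentage.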