This question is about text classification based on shared words, and I am not sure I am approaching it the right way. I have an Excel file with text in a "Description" column and a unique ID in an "ID" column. I want to iterate over the descriptions and compare them by the percentage (or frequency) of words they have in common, then group the descriptions and give matching ones the same ID. See the example below.
# importing pandas as pd
import pandas as pd

# creating a dataframe
df = pd.DataFrame({
    'ID': ['12 ', '54', '88', '9'],
    'Description': [
        'Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes',
        'Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, alpha-hemolytic or beta-hemolytic',
        'Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites',
        'A television set or television receiver, more commonly called a television, TV, TV set, or telly'
    ]
})
ID Description
12 Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes
54 Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, round-shaped bacterium that is a member beta-hemolytic
88 Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites
9 A television set or television receiver, more commonly called a television, TV, TV set, or telly
For example, descriptions 12 and 54 share more than 75% of their words, so they should receive the same ID. The output would look like this:
ID   Description
12   Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes
12   Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, round-shaped bacterium that is a member beta-hemolytic
88   Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites
9    A television set or television receiver, more commonly called a television, TV, TV set, or telly
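A minimal sketch of the grouping rule described above, using the df built earlier. The overlap_ratio helper, the naive whitespace split, and the 0.75 threshold are illustrative assumptions about what "percentage of common words" means, not a tested solution:

# Naive word-overlap grouping: rows whose descriptions share more than
# 75% of their words (relative to the shorter description) get the same ID.
def overlap_ratio(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / min(len(wa), len(wb))

def assign_group_ids(frame, threshold=0.75):
    new_ids = frame['ID'].tolist()
    descs = frame['Description'].tolist()
    for i in range(len(descs)):
        for j in range(i):
            if overlap_ratio(descs[i], descs[j]) > threshold:
                new_ids[i] = new_ids[j]   # reuse the ID of the earlier, similar row
                break
    out = frame.copy()
    out['ID'] = new_ids
    return out

print(assign_group_ids(df))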
Here is what I have tried. I used two separate DataFrames, Risk1 and Risk2, and I have not yet iterated over all the rows as I would need to:
import codecs
import re
import copy
import collections
import pandas as pd
import numpy as np
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
import matplotlib.pyplot as plt
%matplotlib inline
nltk.download('stopwords')
from nltk.corpus import stopwords
# creating dataframe 1
df1 = pd.DataFrame({'ID': ['12 '],
                    'Description': ['Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes']})
# creating dataframe 2
df2 = pd.DataFrame({'ID': ['54'],
                    'Description': ['Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, alpha-hemolytic or beta-hemolytic']})
# the two description texts that are compared below
Risk1 = df1['Description'][0]
Risk2 = df2['Description'][0]
esw = stopwords.words('english')
esw.append('would')
word_pattern = re.compile(r"^\w+$")

def get_text_counter(text):
    # tokenize, lowercase, and drop punctuation tokens and stopwords
    tokens = WordPunctTokenizer().tokenize(PorterStemmer().stem(text))
    tokens = list(map(lambda x: x.lower(), tokens))
    tokens = [token for token in tokens if re.match(word_pattern, token) and token not in esw]
    return collections.Counter(tokens), len(tokens)
def make_df(counter, size):
    # build a table of absolute and relative word frequencies
    abs_freq = np.array([el[1] for el in counter])
    rel_freq = abs_freq / size
    index = [el[0] for el in counter]
    df = pd.DataFrame(data=np.array([abs_freq, rel_freq]).T, index=index,
                      columns=['Absolute Frequency', 'Relative Frequency'])
    df.index.name = 'Most_Common_Words'
    return df
Risk1_counter, Risk1_size = get_text_counter(Risk1)
make_df(Risk1_counter.most_common(500), Risk1_size)

Risk2_counter, Risk2_size = get_text_counter(Risk2)
make_df(Risk2_counter.most_common(500), Risk2_size)

all_counter = Risk1_counter + Risk2_counter
all_df = make_df(all_counter.most_common(1000), 1)
most_common_words = all_df.index.values
df_data = []
for word in most_common_words:
    Risk1_c = Risk1_counter.get(word, 0) / Risk1_size
    Risk2_c = Risk2_counter.get(word, 0) / Risk2_size
    d = abs(Risk1_c - Risk2_c)
    df_data.append([Risk1_c, Risk2_c, d])

dist_df = pd.DataFrame(data=df_data, index=most_common_words,
                       columns=['Risk1 Relative Freq', 'Risk2 Relative Freq', 'Relative Freq Difference'])
dist_df.index.name = 'Most Common Words'
dist_df.sort_values('Relative Freq Difference', ascending=False, inplace=True)
dist_df.head(500)
Answer 0 (score: 2)
A better approach might be to use a sentence-similarity algorithm from NLP. A good starting point is Google's Universal Sentence Encoder, as shown in this Python notebook. If the pre-trained Google USE does not work well, there are other sentence embeddings (for example, InferSent from Facebook). Another option is to use word2vec and average the vectors obtained for each word in the sentence.
You want to compute the cosine similarity between the sentence embeddings, then relabel the categories whose similarity is above a certain threshold, say 0.8. You will have to experiment with different similarity thresholds to get the best matching performance.
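A minimal sketch of that idea, assuming the tensorflow_hub and scikit-learn packages and the publicly hosted USE model; the 0.8 threshold and the grouping loop are illustrative, not tuned:

# Embed each description with Google's Universal Sentence Encoder, then give rows
# whose embeddings have cosine similarity above a threshold the same ID.
# Assumes: pip install tensorflow tensorflow_hub scikit-learn; threshold is illustrative.
import pandas as pd
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({'ID': ['12', '54', '88', '9'],
                   'Description': [
                       'Staphylococcus aureus is a Gram-positive, round-shaped bacterium that is a member of the Firmicutes',
                       'Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, alpha-hemolytic or beta-hemolytic',
                       'Dicyemida, also known as Rhombozoa, is a phylum of tiny parasites',
                       'A television set or television receiver, more commonly called a television, TV, TV set, or telly']})

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed(df['Description'].tolist()).numpy()   # one vector per description

sim = cosine_similarity(embeddings)                        # pairwise cosine similarities
threshold = 0.8                                            # tune this for your data

new_ids = df['ID'].tolist()
for i in range(len(new_ids)):
    for j in range(i):
        if sim[i, j] > threshold:
            new_ids[i] = new_ids[j]                        # reuse the ID of the first similar row
            break
df['ID'] = new_ids
print(df)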