Question

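I have a DataFrame with 9 columns and 89K rows.

I need to compute TF-IDF using two string columns:


emp_name

text

However, the final result contains only the numeric values from the TF-IDF calculation.

I need to merge the original columns (emp_name and text) with the results of the group-by on emp_name and the TF-IDF calculation.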

emp_name              text
Pushan Mahapatra      meeting done with XYZ
Monalisa Biswas       pqr   app
Monalisa Biswas       Motor insurance
Monalisa Biswas       app installation
Monalisa Biswas       credit customers
Monalisa Biswas       Secure app installation
Amit Verma            meeting with chief
Amit Verma            Meeting with customer
Amit Verma            Meeting With SP for Business
Amit Verma            meeting 
Amit Verma            meeting done

Code

import numpy as np
import pandas as pd

##### Read the CSV file ############
df = pd.read_csv("C:\\Users\\a\\Desktop\\b\\aaa.csv", encoding="ISO-8859-1") 

df['text'] = df.text.str.lower()
df.head()

#### Group-by emp_name and count the text ##########
counts = df.groupby('emp_name')['text']\
  .value_counts()\
  .rename('n_w')\
  .to_frame()
counts.head()

###### word sum #########
word_sum = counts.groupby(level=0)\
    .sum()\
    .rename(columns={'n_w': 'n_d'})
word_sum

###### TF calculation ##########
tf = counts.join(word_sum)
tf['tf'] = tf.n_w/tf.n_d
tf.head()

###### Idf calc #############
idf = df.groupby('text')\
  .emp_name\
  .nunique()\
  .to_frame()\
  .rename(columns={'emp_name':'i_d'})\
  .sort_values('i_d')
idf.head()

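# Assumption: c_d was never defined in the original; here it is taken to be
# the number of documents, i.e. the number of distinct employees.
c_d = df.emp_name.nunique()
idf['idf'] = np.log(c_d/idf.i_d.values)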
idf.head()

tf_idf = tf.join(idf)
tf_idf.head()

########### Tf -idf calc ###########
tf_idf['tf_idf'] = tf_idf.tf * tf_idf.idf
tf_idf.head()
print(tf_idf)

Data types:

tf_idf.dtypes
Out[199]: 
n_w         int64
n_d         int64
tf        float64
i_d         int64
idf       float64
tf_idf    float64
dtype: object


df.dtypes
Out[200]: 
emp_name       object
text          object

dtype: object

How do I merge the results of a "group-by" with the original DataFrame in pandas?
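One possible approach, sketched here rather than offered as a definitive answer: compute the scores as ordinary columns keyed by (emp_name, text), then left-merge them back onto the original frame so every original row keeps its other columns. The helper name `tf_idf_per_row` is hypothetical, the sample rows are a subset of those shown above, and `c_d` is assumed to be the number of distinct employees.

```python
import numpy as np
import pandas as pd

# A few rows from the sample in the question (already lower-cased).
df = pd.DataFrame({
    "emp_name": ["Pushan Mahapatra", "Monalisa Biswas", "Monalisa Biswas",
                 "Amit Verma", "Amit Verma", "Amit Verma"],
    "text": ["meeting done with xyz", "motor insurance", "app installation",
             "meeting with chief", "meeting with customer", "meeting done"],
})


def tf_idf_per_row(df, doc_col="emp_name", term_col="text"):
    """Hypothetical helper: one row per (employee, text) pair with its scores."""
    # n_w: how often each text occurs for each employee
    counts = df.groupby([doc_col, term_col]).size().reset_index(name="n_w")
    # n_d: total number of texts per employee
    counts["n_d"] = counts.groupby(doc_col)["n_w"].transform("sum")
    counts["tf"] = counts["n_w"] / counts["n_d"]
    # i_d: number of distinct employees that used each text
    counts["i_d"] = counts.groupby(term_col)[doc_col].transform("nunique")
    # Assumption: the document count is the number of distinct employees
    c_d = df[doc_col].nunique()
    counts["idf"] = np.log(c_d / counts["i_d"])
    counts["tf_idf"] = counts["tf"] * counts["idf"]
    return counts


scores = tf_idf_per_row(df)

# A left merge keeps every original row (and all other columns of df);
# the (emp_name, text) keys are unique in scores, so no rows are duplicated.
merged = df.merge(scores, on=["emp_name", "text"], how="left")
print(merged)
```

If the scores already live in a frame with a MultiIndex, as `tf_idf` does in the code above, calling `tf_idf.reset_index()` first should turn the index levels into regular columns that can be merged the same way.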

0 answers: