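I have a DataFrame with 9 columns and 89K rows.
I need to compute TF-IDF using 2 string columns:
- emp_name
- text
However, the final result contains only the numeric values from the TF-IDF calculation.
I need to combine the original columns (emp_name and the notes in text) with the TF-IDF results, grouped by emp_name.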
emp_name          text
Pushan Mahapatra  meeting done with XYZ
Monalisa Biswas   pqr app
Monalisa Biswas   Motor insurance
Monalisa Biswas   app installation
Monalisa Biswas   credit customers
Monalisa Biswas   Secure app installation
Amit Verma        meeting with chief
Amit Verma        Meeting with customer
Amit Verma        Meeting With SP for Business
Amit Verma        meeting
Amit Verma        meeting done
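To reproduce this without the CSV file, the sample rows can be built in memory. This is only a sketch: the variable name `sample` is made up here, and the column names are assumed to be emp_name and text, as in the df.dtypes output.

```python
import pandas as pd

# Rows copied from the sample above; column names are assumed to be
# emp_name and text (as reported by df.dtypes).
rows = [
    ("Pushan Mahapatra", "meeting done with XYZ"),
    ("Monalisa Biswas", "pqr app"),
    ("Monalisa Biswas", "Motor insurance"),
    ("Monalisa Biswas", "app installation"),
    ("Monalisa Biswas", "credit customers"),
    ("Monalisa Biswas", "Secure app installation"),
    ("Amit Verma", "meeting with chief"),
    ("Amit Verma", "Meeting with customer"),
    ("Amit Verma", "Meeting With SP for Business"),
    ("Amit Verma", "meeting"),
    ("Amit Verma", "meeting done"),
]
sample = pd.DataFrame(rows, columns=["emp_name", "text"])

# Same normalisation as in the code below: lower-case the notes.
sample["text"] = sample["text"].str.lower()

# Number of notes per employee.
print(sample.groupby("emp_name").size())
```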
Code
import numpy as np
import pandas as pd
##### Read the CSV file ############
df = pd.read_csv("C:\\Users\\a\\Desktop\\b\\aaa.csv", encoding="ISO-8859-1")
df['text'] = df.text.str.lower()
df.head()
#### Group-by emp_name and count the text ##########
counts = df.groupby('emp_name')\
    .text.value_counts()\
    .to_frame('n_w')  # name the count column directly; its default name differs across pandas versions
counts.head()
###### word sum #########
word_sum = counts.groupby(level=0)\
    .sum()\
    .rename(columns={'n_w': 'n_d'})
word_sum
###### TF calculation ##########
tf = counts.join(word_sum)
tf['tf'] = tf.n_w/tf.n_d
tf.head()
###### Idf calc #############
idf = df.groupby('text')\
    .emp_name\
    .nunique()\
    .to_frame()\
    .rename(columns={'emp_name':'i_d'})\
    .sort_values('i_d')
idf.head()
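c_d = df.emp_name.nunique()  # assumption: c_d (not defined above) is the number of distinct employees, i.e. documents
idf['idf'] = np.log(c_d/idf.i_d.values)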
idf.head()
tf_idf = tf.join(idf)
tf_idf.head()
########### Tf -idf calc ###########
tf_idf['tf_idf'] = tf_idf.tf * tf_idf.idf
tf_idf.head()
print(tf_idf)
Data types:
tf_idf.dtypes
Out[199]:
n_w int64
n_d int64
tf float64
i_d int64
idf float64
tf_idf float64
dtype: object
df.dtypes
Out[200]:
emp_name object
text object
dtype: object
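The tf_idf.dtypes output above lists only numeric columns because, after the groupby and join steps, emp_name and text sit in the index of tf_idf rather than in its columns. A minimal sketch of one possible way to get them back next to the scores is shown here. It runs on a made-up miniature of the data, and the names `scores` and `result` are hypothetical; it is not meant as the definitive approach.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data: three employees, a few notes each.
df = pd.DataFrame({
    "emp_name": ["Pushan Mahapatra", "Monalisa Biswas", "Monalisa Biswas",
                 "Amit Verma", "Amit Verma", "Amit Verma"],
    "text": ["meeting done with xyz", "pqr app", "motor insurance",
             "meeting", "meeting done", "meeting"],
})

# Term frequency: share of each note among one employee's notes.
tf = df.groupby(["emp_name", "text"]).size().rename("n_w").to_frame()
tf["n_d"] = tf.groupby(level="emp_name")["n_w"].transform("sum")
tf["tf"] = tf["n_w"] / tf["n_d"]

# Inverse document frequency: log(employees / employees who wrote the note).
idf = df.groupby("text")["emp_name"].nunique().rename("i_d").to_frame()
idf["idf"] = np.log(df["emp_name"].nunique() / idf["i_d"])

# Join on the shared index level "text", then compute the score.
tf_idf = tf.join(idf)
tf_idf["tf_idf"] = tf_idf["tf"] * tf_idf["idf"]

# emp_name and text live in the index; reset_index() turns them into columns.
scores = tf_idf.reset_index()

# Left-merge onto the original rows so every original row and column is kept.
result = df.merge(scores, on=["emp_name", "text"], how="left")
print(result)
```

With the real data, the same reset_index() and merge steps should carry along the other original columns as well, since the merge only matches on emp_name and text.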