Merge CountVectorizer output from 4 text columns back into one dataset

Asked: 2016-10-20 18:49:23

Tags: python pandas scikit-learn nlp

I have a collection of ~100,000 documents in a dataset with a unique doc_id and four columns containing text (like below).

[image: original dataset]

I want to vectorize each of the four text columns individually and then combine all of those features back into one large dataset for building a prediction model. I approached the vectorization of each text column with code like the following:

import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

stopwords = nltk.corpus.stopwords.words("english")

# One vectorizer per text column, each fitting its own vocabulary
subject_transformer = CountVectorizer(stop_words=stopwords)
subject_vectorized = subject_transformer.fit_transform(full_docs['subject'])

body_transformer = CountVectorizer(stop_words=stopwords)
body_vectorized = body_transformer.fit_transform(full_docs['body'])

excerpt_transformer = CountVectorizer(stop_words=stopwords)
excerpt_vectorized = excerpt_transformer.fit_transform(full_docs['excerpt'])

regex_transformer = CountVectorizer(stop_words=stopwords)
regex_vectorized = regex_transformer.fit_transform(full_docs['regex'])
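Since the four blocks are identical except for the column name, the same logic can also be written as a loop (a sketch; the dict names are illustrative):

text_cols = ['subject', 'body', 'excerpt', 'regex']
transformers = {}
vectorized = {}
for col in text_cols:
    # Fit a separate vocabulary for each text column
    transformers[col] = CountVectorizer(stop_words=stopwords)
    vectorized[col] = transformers[col].fit_transform(full_docs[col])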

Each vectorization yields a sparse matrix like below, where the first value is the document index, the second is the term index (one for each word in that column's vocabulary), and the last is the actual count.

[image: sparse matrix]
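Printing one of the matrices shows those nonzero entries as (document index, term index) count triplets; the indices below are illustrative, not from the real data:

print(subject_vectorized.shape)  # (n_documents, vocabulary_size)
print(subject_vectorized)
#   (0, 1043)   2    <- document 0 contains term 1043 twice
#   (0, 217)    1
#   (1, 992)    3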

What I want to do is the following:

  1. Convert each sparse matrix to a dense dataframe of dimensions n x p, where n = number of documents and p = number of words in that column's corpus
  2. Merge these matrices/dataframes back together into one dataset for building the prediction model

I initially tried the following:

regex_vectorized_df = pd.DataFrame(regex_vectorized.toarray())

Then I could merge the four individual dataframes back together. This doesn't work because toarray() is too memory intensive at this scale. What is the best way to merge these four sparse matrices into one dataset with one row per document?
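For reference, a minimal sketch of one way to avoid toarray() entirely, assuming a single sparse feature matrix is an acceptable final format: scipy.sparse.hstack concatenates the four matrices column-wise while keeping them sparse, and the rows stay aligned because every matrix was built from the same full_docs in the same order.

from scipy.sparse import hstack

# Column-wise concatenation of the four sparse matrices; the result has
# one row per document and p_subject + p_body + p_excerpt + p_regex columns.
combined = hstack([subject_vectorized, body_vectorized,
                   excerpt_vectorized, regex_vectorized]).tocsr()
print(combined.shape)

Most scikit-learn estimators accept sparse input directly, so the combined matrix can be passed to the model without ever converting to dense.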

0 Answers:

No answers