Python from pandas to PySpark: how to tokenize, remove stop words, and build trigrams in PySpark

Asked: 2018-12-27 19:33:09

Tags: python pandas pyspark nlp nltk

I have the sample DataFrame below. I am running the Python pandas code in a Jupyter Notebook.

No  category    problem_definition
175 2521       ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438       ['galley', 'work', 'table', 'stuck']
912 2698       ['cloth', 'stuck']
572 2521       ['stuck', 'coffee']

I tokenized my text column using the code below:

from nltk.tokenize import word_tokenize 
import pandas as pd 

# Tokenize each problem_definition value into a list of word tokens
df['problem_definition_tokenized'] = df['problem_definition'].apply(word_tokenize)
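
For reference, this tokenization step has a built-in counterpart in Spark MLlib. The sketch below is illustrative only; it assumes an active SparkSession and a Spark DataFrame named sdf with a string column problem_definition (sdf is a hypothetical name, not from the post):

from pyspark.ml.feature import RegexTokenizer

# Split each problem_definition string on non-word characters,
# producing an array column of lowercased tokens
tokenizer = RegexTokenizer(inputCol='problem_definition',
                           outputCol='problem_definition_tokenized',
                           pattern='\\W')
sdf = tokenizer.transform(sdf)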

I removed stop words using the code below:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not English stop words
df['problem_definition_stopwords'] = df['problem_definition_tokenized'].apply(lambda x: [i for i in x if i not in stop_words]) 
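
The stop-word filtering maps onto MLlib's StopWordsRemover, which applies an English stop-word list by default; a minimal sketch, continuing the hypothetical sdf from above:

from pyspark.ml.feature import StopWordsRemover

# Drop English stop words from the token arrays
remover = StopWordsRemover(inputCol='problem_definition_tokenized',
                           outputCol='problem_definition_stopwords')
sdf = remover.transform(sdf)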

Next, I computed trigrams using the collocations package.

import nltk
from nltk.collocations import TrigramCollocationFinder

trigram_measures = nltk.collocations.TrigramAssocMeasures()

# Build a trigram finder over the filtered token lists (a trigram finder is
# needed here; a BigramCollocationFinder cannot be scored with trigram measures)
finder = TrigramCollocationFinder.from_documents(df['problem_definition_stopwords'])

finder.apply_freq_filter(8)  # drop trigrams that occur fewer than 8 times

finder.nbest(trigram_measures.pmi, 100)  # top 100 trigrams by PMI
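
MLlib has no direct equivalent of NLTK's collocation measures such as PMI, but the trigrams themselves can be generated with the NGram transformer; a sketch under the same assumptions (note that NGram emits space-joined strings such as 'coffee maker brewing' rather than tuples):

from pyspark.ml.feature import NGram

# Turn each filtered token array into an array of 3-grams
ngram = NGram(n=3, inputCol='problem_definition_stopwords',
              outputCol='trigrams')
sdf = ngram.transform(sdf)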

Then, back in pandas, I counted the trigrams directly:

from nltk import ngrams
from collections import Counter

# Flatten all rows into a single list of 3-gram tuples
ngram_list = [gram for row in df['problem_definition_stopwords'] for gram in ngrams(row, 3)]

counts = Counter(ngram_list).most_common()

# Use a new name so the original df is not overwritten
trigram_counts_df = pd.DataFrame.from_records(counts, columns=['gram', 'count'])

trigram_counts_df
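
In Spark, the Counter step becomes an explode-and-aggregate over the hypothetical trigrams column from the sketch above:

from pyspark.sql import functions as F

# One row per trigram, then count and sort by frequency
trigram_counts = (sdf
                  .select(F.explode('trigrams').alias('gram'))
                  .groupBy('gram')
                  .count()
                  .orderBy(F.desc('count')))
trigram_counts.show()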

The results look like this, where "xxx" stands in for a single word:

gram               count 
(xxx, xxx, xxx)    23
(xxx, xxx, xxx)    14
(xxx, xxx, xxx)    63
(xxx, xxx, xxx)    28

I can get all of the above code to run in pandas, but when I try to run it in a PySpark environment it just keeps spinning.

Is there a way to convert the code I wrote into PySpark code? I have searched everywhere but could not find anything definitive.
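
One plausible path, sketched here only as an assumption about the setup: convert the pandas DataFrame into a Spark DataFrame first, then replace the row-wise .apply calls with the MLlib transformers sketched above, since plain pandas/NLTK calls do not parallelize across a Spark cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical hand-off of the original pandas DataFrame to Spark
sdf = spark.createDataFrame(df[['No', 'category', 'problem_definition']])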

0 Answers:

No answers yet.