I have the sample dataframe below. I'm running Python Pandas code in a Jupyter Notebook.
No category problem_definition
175 2521 ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438 ['galley', 'work', 'table', 'stuck']
912 2698 ['cloth', 'stuck']
572 2521 ['stuck', 'coffee']
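(For reproducibility, the frame can be rebuilt roughly as follows. This is a hypothetical reconstruction: since word_tokenize below expects raw strings, I'm assuming problem_definition starts out as plain text and the bracketed lists in the printout above are just the already-tokenized display.)

import pandas as pd

# Hypothetical reconstruction of the sample frame shown above
df = pd.DataFrame({
    'No': [175, 211, 912, 572],
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        'coffee maker brewing properly 2 420 420 420',
        'galley work table stuck',
        'cloth stuck',
        'stuck coffee',
    ],
})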
I tokenized my text column with the code below:
import pandas as pd
from nltk.tokenize import word_tokenize

# Tokenize each problem_definition string into a list of words
df['problem_definition_tokenized'] = df['problem_definition'].apply(word_tokenize)
I removed stop words with the code below:
from nltk.corpus import stopwords

# Keep only tokens that are not in NLTK's English stop-word list
stop_words = set(stopwords.words('english'))
df['problem_definition_stopwords'] = df['problem_definition_tokenized'].apply(
    lambda x: [w for w in x if w not in stop_words])
Next, I scored trigrams with the collocations package.
import nltk
from nltk.collocations import TrigramCollocationFinder

trigram_measures = nltk.collocations.TrigramAssocMeasures()

# Score trigrams by PMI, keeping only those occurring at least 8 times
finder = TrigramCollocationFinder.from_documents(df['problem_definition_stopwords'])
finder.apply_freq_filter(8)
finder.nbest(trigram_measures.pmi, 100)
from collections import Counter
from nltk import ngrams

# Flatten every row's token list into trigram tuples and count them
s = pd.Series(df['problem_definition_stopwords'])
ngram_list = [gram for row in s for gram in ngrams(row, 3)]
counts = Counter(ngram_list).most_common()

# Note: this overwrites the original df with the trigram count table
df = pd.DataFrame.from_records(counts, columns=['gram', 'count'])
df
The results look like this, where "xxx" stands for a word:
gram count
(xxx, xxx, xxx) 23
(xxx, xxx, xxx) 14
(xxx, xxx, xxx) 63
(xxx, xxx, xxx) 28
I can get all of the code above to run in Pandas/Python, but when I try to run it in a PySpark environment it just spins indefinitely.

Is there a way to convert the code I've written to PySpark? I've searched everywhere but can't find anything definitive.
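From what I've read, the pyspark.ml.feature transformers seem like the natural mapping for the tokenize / stop-word / trigram-count steps, so here is a minimal sketch of what I think the equivalent might be. Untested assumptions: a SparkSession named spark, a Spark DataFrame sdf with problem_definition as a plain string column, and Spark's built-in English stop-word list (which differs slightly from NLTK's); the column names tokens, tokens_clean, and trigrams are my own. I don't know of a direct Spark equivalent for the PMI collocation scoring.

from pyspark.sql import functions as F
from pyspark.ml.feature import Tokenizer, StopWordsRemover, NGram

# Lowercase whitespace tokenization; RegexTokenizer would be an
# alternative if behavior closer to word_tokenize is needed
tokenizer = Tokenizer(inputCol='problem_definition', outputCol='tokens')
sdf = tokenizer.transform(sdf)

# Remove stop words using Spark's default English list
remover = StopWordsRemover(inputCol='tokens', outputCol='tokens_clean')
sdf = remover.transform(sdf)

# Build trigrams; note NGram emits space-joined strings, not tuples
ngram = NGram(n=3, inputCol='tokens_clean', outputCol='trigrams')
sdf = ngram.transform(sdf)

# Explode and count: the distributed analogue of Counter(...).most_common()
counts = (sdf
          .select(F.explode('trigrams').alias('gram'))
          .groupBy('gram')
          .count()
          .orderBy(F.desc('count')))
counts.show(truncate=False)

Does something along these lines look like the right approach, or is there a better way to express this pipeline in PySpark?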