dask - AttributeError: 'Series' object has no attribute 'split'

Asked: 2019-03-26 02:30:21

Tags: python dask

I have more than 8 million rows of text in which I want to remove all stop words and lemmatize the text using dask.map_partitions(), but I get the following error:

AttributeError: 'Series' object has no attribute 'split'

Is there a way to apply the function to the dataset?

Thanks for your help.

import pandas as pd
import dask.dataframe as dd
from spacy.lang.en import stop_words

# spaCy's built-in English stop-word list
cachedStopWords = list(stop_words.STOP_WORDS)

def stopwords_lemmatizing(text):
    # keep only the words that are not stop words
    return [word for word in text.split() if word not in cachedStopWords]

text = 'any length of text'
data = [{'content': text}]
df = pd.DataFrame(data, index=[0])
ddf = dd.from_pandas(df, npartitions=1)

# this is the call that leads to the AttributeError
ddf['content'] = ddf['content'].map_partitions(stopwords_lemmatizing, meta='f8')
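For reference, a minimal way to surface the error (an assumption, since the question does not show the triggering call): the result above is lazy, so the function only runs once something is actually computed, for example:

# evaluating the first partition runs stopwords_lemmatizing on a whole
# pandas Series, which has no .split(), hence the AttributeError
print(ddf['content'].head())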

1 Answer:

Answer 0 (score: 1)

map_partitions, as the name suggests, works on each partition of the whole dask dataframe, and each partition is a pandas DataFrame (http://docs.dask.org/en/latest/dataframe.html#design). Your function works on a series value-by-value, so what you actually want is the simple map:

ddf['content'] = ddf['content'].map(stopwords_lemmatizing)

(If you want to provide meta here, it should be a zero-length Series rather than a DataFrame, e.g. meta=pd.Series(dtype='O').)
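For completeness, a rough sketch (not part of the original answer; the function name is illustrative) of how map_partitions could still be used here, by handling the whole partition Series inside the mapped function and supplying the zero-length-Series meta mentioned above:

import pandas as pd

def drop_stopwords_partition(series):
    # receives one whole partition as a pandas Series and returns a Series
    return series.map(stopwords_lemmatizing)

ddf['content'] = ddf['content'].map_partitions(
    drop_stopwords_partition,
    meta=pd.Series(dtype='O'),  # empty object-dtype Series describing the output
)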