I'm running into this error when using Dask, and I can't figure out how to fix it, since nothing in my code looks obviously wrong. What I'm doing is reading a dataframe, tagging the text column with stanfordnlp, and then extracting the nouns. It works fine with pandas alone, but with dask I get this error. I'm on Ubuntu, with Python 3.7.3 and Dask 2.6.0.
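For context, the pandas-only version of this pattern is straightforward: Series.apply forwards extra positional arguments through args. A minimal sketch, with a hypothetical tag function standing in for the stanfordnlp pipeline:

```python
import pandas as pd

# Hypothetical stand-in for the stanfordnlp pipeline call; any plain
# callable taking the text plus extra positional arguments works here.
def tag(text, wanted_pos):
    # Dummy tagging rule: keep every purely alphabetic token.
    return ','.join(w for w in text.split() if w.isalpha())

s = pd.Series(['hello world 1', 'foo bar'])
# Extra arguments after the text are supplied via args=(...)
tagged = s.apply(tag, args=(['NOUN'],))
print(tagged.tolist())
```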
Here is my error:
Traceback (most recent call last):
  File "main.py", line 56, in <module>
    main(df=data, nlp=nlp, lang=lang, wanted_pos=wp)
  File "main.py", line 13, in main
    df.persist()
  File "/home/bertil/Envs/datascience/lib/python3.7/site-packages/dask/base.py", line 138, in persist
    (result,) = persist(self, traverse=False, **kwargs)
  File "/home/bertil/Envs/datascience/lib/python3.7/site-packages/dask/base.py", line 629, in persist
    results = schedule(dsk, keys, **kwargs)
  File "/home/bertil/Envs/datascience/lib/python3.7/site-packages/dask/threaded.py", line 80, in get
    **kwargs
  File "/home/bertil/Envs/datascience/lib/python3.7/site-packages/dask/local.py", line 486, in get_async
    raise_exception(exc, tb)
  File "/home/bertil/Envs/datascience/lib/python3.7/site-packages/dask/local.py", line 316, in reraise
    raise exc
  File "/home/bertil/Envs/datascience/lib/python3.7/site-packages/dask/local.py", line 222, in execute_task
    result = _execute_task(task, data)
  File "/home/bertil/Envs/datascience/lib/python3.7/site-packages/dask/core.py", line 118, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/home/bertil/Envs/datascience/lib/python3.7/site-packages/dask/core.py", line 118, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
  File "/home/bertil/Envs/datascience/lib/python3.7/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
TypeError: __call__() takes 2 positional arguments but 3 were given
Here is my code:
#!/usr/bin/env python
from pathlib import Path
import dask.dataframe as dd
import stanfordnlp
import string
def main(df, nlp, lang, wanted_pos):
    df['tagged'] = df['Message'].apply(process,
                                       args=(nlp, lang, wanted_pos),
                                       meta=('Message', 'object'))
    df = df.persist()  # persist returns a new collection; keep the reference
    df.to_csv('output.csv')
def process(text, nlp, lang, wanted_pos):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    doc = nlp(text)
    words = {word for sent in doc.sentences for word in sent.words}
    # Keep the text of words whose part-of-speech tag is in wanted_pos
    wanted_words = {word.text for word in words if word.upos in wanted_pos}
    return ','.join(word for word in wanted_words if word)
if __name__ == '__main__':
    # Choose language
    lang = 'da'
    # Choose wanted_pos
    wp = ['NOUN']
    # Read data in chunks
    data = dd.read_csv('sample.csv', quoting=3, error_bad_lines=False,
                       dtype={'Message': 'object',
                              'Action Time': 'object',
                              'ClientQueues': 'object',
                              'Country': 'object',
                              'Custom Tags': 'object',
                              'Favorites': 'object',
                              'Geo Target': 'object',
                              'Location': 'object',
                              'State': 'object'})
    # Download the model for nlp if it is not already present
    stanford_path = Path.home() / 'stanfordnlp_resources' / f'{lang}_ddt_models'
    if not stanford_path.exists():
        stanfordnlp.download(lang)
    # Set up nlp pipeline
    nlp = stanfordnlp.Pipeline(processors='tokenize,lemma,pos', lang=lang)
    main(df=data, nlp=nlp, lang=lang, wanted_pos=wp)
Update: I thought I had fixed it, but I hadn't, so I deleted my answer again. I'm still stuck on this.