I have a pandas DataFrame, `urls_df`, whose columns hold components that, when combined, form a URL. There are basically 2 ways to combine these components and retrieve the URL's content. `URLObject` takes care of that part and returns the web content and the server content from the resulting URL as a `dict`. It works like this:
```python
list(df.apply(lambda row: getattr(URLObject(row[discriminant_col],
                                            row['col_1'],
                                            row['col_2']),
                                  attribute),
              axis=1))  # returns a list of dicts
```
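Since `URLObject` itself isn't shown, here is a minimal sketch of the row-wise pattern above using a hypothetical stand-in class (the class name, its attributes, and the column names are assumptions for illustration only, not the real API):

```python
import pandas as pd

class FakeURLObject:
    """Hypothetical stand-in for URLObject: builds a URL from row components
    and exposes the retrieved data as attributes."""
    def __init__(self, discriminant, part1, part2):
        url = f'http://{discriminant}/{part1}/{part2}'
        self.web_data = {'url': url, 'kind': 'web'}
        self.server_data = {'url': url, 'kind': 'server'}

df = pd.DataFrame({'disc': ['a', 'b'], 'col_1': ['x', 'y'], 'col_2': ['1', '2']})
attribute = 'web_data'

# Same shape as the snippet above: one object per row, one attribute picked
# via getattr, collected into a list of dicts.
result = list(df.apply(lambda row: getattr(FakeURLObject(row['disc'],
                                                         row['col_1'],
                                                         row['col_2']),
                                           attribute),
                       axis=1))
# → [{'url': 'http://a/x/1', 'kind': 'web'}, {'url': 'http://b/y/2', 'kind': 'web'}]
```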
`urls_df` contains 2 discriminant columns, `['URL_DISCRIMINANT_COLUMN_1', 'URL_DISCRIMINANT_COLUMN_2']`, which slightly change the URL so that it points at a specific folder. So for each row I have to run the `getattr(URLObject...` call above 4 times per source: 2 folders on the web side and 2 on the server side. The challenge is running these in parallel, as shown below; I'm using 4 threads, one per column.

The problem I'm running into is that, seemingly because of a threading issue, the data sometimes comes out fine and sometimes the result contains the names of the `urls_df` columns instead of their contents. Can someone explain what I'm doing wrong and what I can do to fix it?
```python
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool

def do_threaded(tasks):
    with ThreadPool(len(tasks)) as pool:
        results = [pool.apply_async(*t) for t in tasks]
        results = [res.get() for res in results]
    return results

def func(df, discriminant_col, arribute):
    return list(df.apply(lambda row: getattr(URLObject(row[discriminant_col],
                                                       row['col_1'],
                                                       row['col_2']),
                                             attribute),
                         axis=1))

urls_df = pd.DataFrame()  # this is a pandas DataFrame with data

tasks = [(func, (urls_df, discriminant, atr))
         for atr in [server_data, web_data]
         for discriminant in ['URL_DISCRIMINANT_COLUMN_1', 'URL_DISCRIMINANT_COLUMN_2']]
table = do_threaded(tasks)

headers = [header.pop(0) for header in table]
content_df = pd.DataFrame(table)
content_df.columns = headers
```
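For what it's worth, the `apply_async`/`get` pattern used in `do_threaded` does return results in submission order when the submitted function is pure (no shared mutable state). A toy check of just that helper, with a trivial function standing in for `func` (names here are illustrative, not from the real code), might look like:

```python
from multiprocessing.dummy import Pool as ThreadPool

def do_threaded(tasks):
    # Submit every task to the thread pool, then collect the results
    # in the same order the tasks were submitted.
    with ThreadPool(len(tasks)) as pool:
        results = [pool.apply_async(*t) for t in tasks]
        return [res.get() for res in results]

def label(tag, n):
    # Pure function: its output depends only on its arguments.
    return [tag] * n

tasks = [(label, (tag, 2)) for tag in ['w1', 'w2', 's1', 's2']]
table = do_threaded(tasks)
# → [['w1', 'w1'], ['w2', 'w2'], ['s1', 's1'], ['s2', 's2']]
```

Results come back in task order every run, which suggests the nondeterminism in the question comes from what the tasks touch (shared objects, module-level names read inside the worker), not from `apply_async` itself.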