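Processing a large number of tasks in parallel

Time: 2015-05-07 19:11:03

Tags: python multiprocessing

I have 10,000 csv files that I need to open in Pandas, manipulate/transform with some of Pandas' functions, and save the new output to csv. Can I use parallel processes (on Windows) to speed up the work? I tried the following, but with no luck: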

import pandas as pd
import multiprocessing

def proc_file(file):
    df = pd.read_csv(file)
    df = df.resample('1S').sum()
    df.to_csv('C:\\newfile.csv')

if __name__ == '__main__':
    files = ['C:\\file1.csv', ... 'C:\\file2.csv']

    for i in files:
        p = multiprocessing.Process(target=proc_file(i))
    p.start()

I don't think I have a good understanding of multiprocessing in Python.

2 answers:

Answer 0: (score: 1)

Maybe something like this:

p = multiprocessing.Pool()
p.map(proc_file, files)

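For this size, you really need a process pool, so that the cost of launching a process is offset by the work it does. multiprocessing.Pool does exactly that: it turns task parallelism (which is what you were doing, one process per file) into data parallelism, where a fixed set of worker processes share the list of files.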
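A minimal sketch of the pool approach, assuming each input file gets its own output file. The names output_path and run_all and the "_out" suffix are hypothetical, and the body of proc_file is a plain-text stand-in for the pandas calls in the question so that the sketch runs without pandas. Note that the function object itself is handed to the pool; writing proc_file(i), as in target=proc_file(i), calls the function immediately in the parent process instead.

```python
import multiprocessing
import os
import sys


def output_path(path):
    # Hypothetical naming rule: 'file1.csv' -> 'file1_out.csv'.
    root, ext = os.path.splitext(path)
    return root + "_out" + ext


def proc_file(path):
    # Stand-in for the pandas work: copy the file, upper-casing each line.
    out = output_path(path)
    with open(path) as src, open(out, "w") as dst:
        for line in src:
            dst.write(line.upper())
    return out


def run_all(files):
    # Pool() starts one worker per CPU by default and reuses the workers.
    with multiprocessing.Pool() as pool:
        # Pass proc_file itself; the pool calls it once per path.
        return pool.map(proc_file, files)


if __name__ == "__main__":
    # Required on Windows, where each worker re-imports this module.
    files = sys.argv[1:]  # csv paths given on the command line
    if files:
        for name in run_all(files):
            print(name)
```

Returning the output name from proc_file lets the parent see which files were written, and a distinct output path per input keeps the workers from overwriting one another.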

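Answer 1: (score: 1)

Be sure to close the pool afterwards, calling close() and then join() on it once all the work has been submitted.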
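A minimal sketch of that shutdown sequence, reusing the names func and list_files from this answer. The body of func and the "_new" suffix are hypothetical placeholders, and the paths are read from the command line purely for illustration.

```python
import multiprocessing
import sys


def func(path):
    # Placeholder for the real work; returns the name of the csv it would write.
    return path.replace(".csv", "_new.csv")


def run(list_files):
    pool = multiprocessing.Pool()
    try:
        # map() blocks until every file is done and returns the results in order.
        return pool.map(func, list_files)
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker processes to exit


if __name__ == "__main__":
    list_files = sys.argv[1:]
    if list_files:
        print(run(list_files))
```

close() has to come before join(); putting both in a finally block means the workers are cleaned up even if one of the files raises an error.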


list_files can hold a list of file names, and func() can, for example, return the name of each csv it changed.