Pandas: how to make the algorithm faster

Date: 2016-07-06 15:44:06

Tags: python pandas

I have a task: I need to find certain data in a large file and write that data out to another file. The file I search in has 22 million rows, and I read it in chunks using chunksize. The other file contains a column of 600 user ids, and I look up information about each of those users in the large file. So first I split the data into intervals (chunks), and then I search for every user's information in each of them. Using a timer I found that, for a chunk of 1 million rows, finding one user's information and writing it to the file takes about 1.7 sec on average. Adding up the whole program gives roughly 6 hours (1.5 sec * 600 ids * 22 chunks). I would like to do this faster, but apart from chunksize I don't know what else to try. I have added my code below.

import time
import dateutil.relativedelta
import pandas as pd
from pandas import DataFrame

el = pd.read_csv('df2.csv', iterator=True, chunksize=1000000)
buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
dates1 = buys['date']
ids1 = buys['id']
for i in el:
    i['used_at'] = pd.to_datetime(i['used_at'])
    df = i.sort_values(['ID', 'used_at'])
    dates = df['used_at']
    ids = df['ID']
    urls = df['url']
    for i, (id, date, url, id1, date1) in enumerate(zip(ids, dates, urls, ids1, dates1)):
        start = time.time()
        df1 = df[(df['ID'] == ids1[i]) & (df['used_at'] < (dates1[i] + dateutil.relativedelta.relativedelta(days=5)).replace(hour=0, minute=0, second=0)) & (df['used_at'] > (dates1[i] - dateutil.relativedelta.relativedelta(months=1)).replace(day=1, hour=0, minute=0, second=0))]
        df1 = DataFrame(df1)
        if df1.empty:
            continue
        else:
            with open('3.csv', 'a') as f:
                df1.to_csv(f, header=False)
                end = time.time()
                print(end - start)

1 Answer:

Answer 0: (score: 1)

There are a few problems in your code:

  1. zip takes arguments that may differ in length, and it silently stops at the shortest one. Here ids, dates and urls come from the chunk while ids1 and dates1 come from buys, so their lengths almost certainly differ and rows get dropped without any warning.
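
    (A minimal illustration of the truncation, not part of the original answer:)

    list(zip([1, 2, 3, 4], ['a', 'b']))
    # -> [(1, 'a'), (2, 'b')]  -- items 3 and 4 are silently dropped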

  2. dateutil.relativedelta is probably not compatible with a pandas Timestamp. With pandas 0.18.1 and python 3.5 I get this:

    now = pd.Timestamp.now()
    now
    Out[46]: Timestamp('2016-07-06 15:32:44.266720')
    now + dateutil.relativedelta.relativedelta(day=5)
    Out[47]: Timestamp('2016-07-05 15:32:44.266720')
    

    So it is better to use pd.Timedelta:

    now + pd.Timedelta(5, 'D')
    Out[48]: Timestamp('2016-07-11 15:32:44.266720')
    

    But it is a bit imprecise with months:

    now - pd.Timedelta(1, 'M')
    Out[49]: Timestamp('2016-06-06 05:03:38.266720')
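
    (A side note, my addition rather than part of the original answer: if an exact calendar-month shift is needed, pandas' own DateOffset respects month lengths.)

    now - pd.DateOffset(months=1)
    # Timestamp('2016-06-06 15:32:44.266720') -- exactly one calendar month back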
    
  3. Here is a sketch of the code. I have not tested it, and I may have got wrong what you want. The key part is to merge the two data frames instead of iterating row by row.

    # 1) convert to datetime here 
    # 2) optionally, you can select only relevant cols with e.g. usecols=['ID', 'used_at', 'url']
    # 3) iterator is prob. superfluous
    el = pd.read_csv('df2.csv', chunksize=1000000, parse_dates=['used_at'])
    
    buys = pd.read_excel('smartphone.xlsx')
    buys['date'] = pd.to_datetime(buys['date'])
    # consider loading only relevant columns to buys
    
    # compute time intervals here (not in a loop!)
    buys['date_min'] = buys['date'] - pd.Timedelta(1, unit='M')
    buys['date_max'] = buys['date'] + pd.Timedelta(5, unit='D')
    
    # now replace (probably it needs to be done row by row)
    buys['date_min'] = buys['date_min'].apply(lambda x: x.replace(day=1, hour=0, minute=0, second=0))
    buys['date_max'] = buys['date_max'].apply(lambda x: x.replace(hour=0, minute=0, second=0))
    
    # not necessary
    # dates1 = buys['date']
    # ids1 = buys['id']
    
    for chunk in el:
        # already converted to datetime
        # i['used_at'] = pd.to_datetime(i['used_at'])
    
        # defer sorting until later
        # df = i.sort_values(['ID', 'used_at'])
    
        # merge!
        # (option how='inner' selects only rows that have the same id in both data frames; it's default)
        merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')
        bool_idx = (merged['used_at'] < merged['date_max']) & (merged['used_at'] > merged['date_min'])
        selected = merged.loc[bool_idx]
    
        # probably don't need additional columns from buys, 
        # so either drop them or select the ones from chunk (beware of possible duplicates in names)
        selected = selected[chunk.columns]
    
        # sort now (possibly a smaller frame)
        selected = selected.sort_values(['ID', 'used_at'])
    
        if selected.empty:
            continue
        with open('3.csv', 'a') as f:
            selected.to_csv(f, header=False)
    

    Hope this helps. Please check the code carefully and adapt it to your needs.

    Please check the docs for the options of merge.
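
    (A toy illustration of the merge step, my addition with made-up data, just to show what the inner merge plus the boolean filter produce:)

    import pandas as pd

    # 'chunk' mimics one chunk of the big log, 'buys' the per-user date windows
    chunk = pd.DataFrame({'ID': [1, 1, 2],
                          'used_at': pd.to_datetime(['2016-06-02', '2016-06-20', '2016-06-05']),
                          'url': ['a', 'b', 'c']})
    buys = pd.DataFrame({'id': [1],
                         'date_min': pd.to_datetime(['2016-06-01']),
                         'date_max': pd.to_datetime(['2016-06-10'])})

    merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')   # user 2 disappears here
    selected = merged[(merged['used_at'] < merged['date_max']) & (merged['used_at'] > merged['date_min'])]
    print(selected[chunk.columns])   # only the 2016-06-02 visit of user 1 survives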