根据字符串中的匹配日期将两个列表压缩在一起

时间:2017-09-19 00:50:18

标签: python

我有两个文件列表,我使用以下方法从FTP文件夹中提取:

sFiles = ftp.nlst(date+'sales.csv')
oFiles = ftp.nlst(date+'orders.csv')

这导致两个列表看起来像:

sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']

oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']

使用我的真实数据集,类似......

for sales, orders in zip(sorted(sFiles),sorted(oFiles)): 
     df = pd.concat(...)

获取我想要的结果,但有时会出现问题并且两个文件都没有进入正确的FTP文件夹,所以我想要一些代码来创建一个可迭代的对象,我可以根据日期提取匹配的订单和销售文件名称。

以下作品......我不确定" pythonic"得分我给它。可读性差,但这是理解,所以我想象有性能提升?

[(sales, orders) for sales in sFiles for orders in oFiles if re.search(r'\d+',sales).group(0) == re.search(r'\d+',orders).group(0)]

4 个答案:

答案 0 :(得分:3)

利用pandas DataFrame的索引:

OpenFilesEvent

所以:

import pandas as pd
sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']
oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']

s_dates = [pd.Timestamp.strptime(file[:8], '%Y%m%d') for file in sFiles]
s_df = pd.DataFrame({'sFiles': sFiles}, index=s_dates)

o_dates = [pd.Timestamp.strptime(file[:8], '%Y%m%d') for file in oFiles]
o_df = pd.DataFrame({'oFiles': oFiles}, index=o_dates)

df = s_df.join(o_df, how='outer')

答案 1 :(得分:2)

您可以使用字典:

import collections
d = collections.defaultdict(dict)

sFiles = ftp.nlst(date+'sales.csv')
oFiles = ftp.nlst(date+'orders.csv')
for sale, order in zip(sFiles, oFiles):
    a, b = sale.split("_")
    a1, b2 = order.split("_")
    d[a]["sales"] = sale
    d[a1]["orders"] = order
print(dict(d))

现在,您的数据格式为:{"date":{"sales":"sales filename", "orders":"orders filename"}}

输出:

{'20170828': {'sales': '20170828_sales.csv'}, '20170822': {'sales': '20170822_sales.csv', 'orders': '20170822_orders.csv'}, '20170823': {'orders': '20170823_orders.csv'}, '20170824': {'sales': '20170824_sales.csv', 'orders': '20170824_orders.csv'}, '20170825': {'sales': '20170825_sales.csv', 'orders': '20170825_orders.csv'}, '20170826': {'sales': '20170826_sales.csv', 'orders': '20170826_orders.csv'}, '20170827': {'sales': '20170827_sales.csv', 'orders': '20170827_orders.csv'}}

编辑:

通过字典理解并建立你提出的列表理解解决方案,你可以试试这个:

import re
final_data = [{"sold":sold, "order":order} for sold in sFiles for order in oFiles if re.findall("\d+", sold)[0] == re.findall("\d+", order)[0]]

输出:

[{'sold': '20170822_sales.csv', 'order': '20170822_orders.csv'}, {'sold': '20170824_sales.csv', 'order': '20170824_orders.csv'}, {'sold': '20170825_sales.csv', 'order': '20170825_orders.csv'}, {'sold': '20170826_sales.csv', 'order': '20170826_orders.csv'}, {'sold': '20170827_sales.csv', 'order': '20170827_orders.csv'}]

答案 2 :(得分:1)

仅仅因为理解存在并不意味着你应该将它们用于一切。这很好用:

date = re.compile(r'\d+')
for sales in sFiles:
    salesDate = date.search(sales).group(0)
    for orders in oFiles:
        orderDate = date.search(orders).group(0)
        if salesDate == orderDate:
            print sales, orders

是否可以加快速度?是。但是你不需要强迫它进入列表理解只是因为你可以。有时编写更多代码会更好,只是因为它会将复杂性分散开来。

这是一个增量改进,使得算法O(n):

date = re.compile(r'\d+')
orders_dict = dict((date.search(file).group(0), file) for file in oFiles)

for sales in sFiles:
    salesDate = date.search(sales).group(0)
    if salesDate in orders_dict:
        orders = orders_dict[salesDate]
        print sales, orders
    else:
        # what do you do if it doesn't exist? You can't put handling code
        # here if you try to write this as a comprehension.

答案 3 :(得分:1)

这将创建一个生成器,以日期顺序返回匹配的对:

from collections import defaultdict

def match(sales,orders):
    # When a key is referenced for the first time, the value
    # will default to the result of the lambda.
    d = collections.defaultdict(lambda:[None,None])

    # set sales files on the first entry in the value.
    for sale in sFiles:
        d[sale[:8]][0] = sale
    # set orders files on the second entry.
    for order in oFiles:
        d[order[:8]][1] = order

    for k in sorted(d):
        # Both values need to exist.
        # If you want the singles remove the if.
        if all(v for v in d[k]):
            yield d[k]

sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']
oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']

for s,o in match(sFiles,oFiles):
    print(s,o)

输出:

20170822_sales.csv 20170822_orders.csv
20170824_sales.csv 20170824_orders.csv
20170825_sales.csv 20170825_orders.csv
20170826_sales.csv 20170826_orders.csv
20170827_sales.csv 20170827_orders.csv