I have two lists of files that I pull from an FTP folder using:
sFiles = ftp.nlst(date+'sales.csv')
oFiles = ftp.nlst(date+'orders.csv')
This gives two lists that look like:
sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']
oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']
With my real dataset, something like...
for sales, orders in zip(sorted(sFiles),sorted(oFiles)):
    df = pd.concat(...)
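gets me the result I want, but sometimes something goes wrong and the two files don't both make it into the proper FTP folder. So I'd like some code that creates an iterable object from which I can pull the matching orders and sales file names based on date.
The following works... I'm just not sure what "pythonic" score I'd give it. Readability is poor, but it is a comprehension, so I imagine there is a performance gain?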
[(sales, orders) for sales in sFiles for orders in oFiles if re.search(r'\d+',sales).group(0) == re.search(r'\d+',orders).group(0)]
Answer 0 (score: 3)
Take advantage of the pandas DataFrame index: build one frame per list, indexed by date, then outer-join the two frames:
import pandas as pd
sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']
oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']
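# Parse the leading YYYYMMDD of each file name; pd.to_datetime is used
# because recent pandas versions do not implement Timestamp.strptime.
s_dates = [pd.to_datetime(file[:8], format='%Y%m%d') for file in sFiles]
s_df = pd.DataFrame({'sFiles': sFiles}, index=s_dates)
o_dates = [pd.to_datetime(file[:8], format='%Y%m%d') for file in oFiles]
o_df = pd.DataFrame({'oFiles': oFiles}, index=o_dates)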
df = s_df.join(o_df, how='outer')
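Answer 1 (score: 2)
You can use a dictionary:
import collections
d = collections.defaultdict(dict)
sFiles = ftp.nlst(date+'sales.csv')
oFiles = ftp.nlst(date+'orders.csv')
# Loop over each list separately: zip() stops at the shorter list,
# so it would silently drop files when the lists differ in length.
for sale in sFiles:
    d[sale.split("_")[0]]["sales"] = sale
for order in oFiles:
    d[order.split("_")[0]]["orders"] = order
print(dict(d))
Now your data is in the format: {"date":{"sales":"sales filename", "orders":"orders filename"}}. A date that has only one of the two files gets only that key.
Output (key order may vary):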
{'20170828': {'sales': '20170828_sales.csv'}, '20170822': {'sales': '20170822_sales.csv', 'orders': '20170822_orders.csv'}, '20170823': {'orders': '20170823_orders.csv'}, '20170824': {'sales': '20170824_sales.csv', 'orders': '20170824_orders.csv'}, '20170825': {'sales': '20170825_sales.csv', 'orders': '20170825_orders.csv'}, '20170826': {'sales': '20170826_sales.csv', 'orders': '20170826_orders.csv'}, '20170827': {'sales': '20170827_sales.csv', 'orders': '20170827_orders.csv'}}
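As a rough sketch of how that dictionary could then be consumed, the snippet below separates dates that have both files from dates that are missing one. The helper name `split_complete` and the three-entry sample dictionary are assumptions made for illustration; they are not part of the original answer.

```python
# Hypothetical sample shaped like the dictionary printed above.
by_date = {
    '20170822': {'sales': '20170822_sales.csv', 'orders': '20170822_orders.csv'},
    '20170823': {'orders': '20170823_orders.csv'},
    '20170828': {'sales': '20170828_sales.csv'},
}

def split_complete(by_date, required=("sales", "orders")):
    """Split entries into those holding every required key and the rest."""
    complete, incomplete = {}, {}
    for day, files in by_date.items():
        target = complete if all(kind in files for kind in required) else incomplete
        target[day] = files
    return complete, incomplete

complete, incomplete = split_complete(by_date)
for day in sorted(complete):
    print(day, complete[day]["sales"], complete[day]["orders"])
print("incomplete dates:", sorted(incomplete))
```

Keeping the incomplete entries around, rather than discarding them, makes it easy to log which dates are missing a file on the FTP server.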
Edit:
Building on the list comprehension you proposed, you could also produce a list of dictionaries:
import re
final_data = [{"sold":sold, "order":order} for sold in sFiles for order in oFiles if re.findall(r"\d+", sold)[0] == re.findall(r"\d+", order)[0]]
Output:
[{'sold': '20170822_sales.csv', 'order': '20170822_orders.csv'}, {'sold': '20170824_sales.csv', 'order': '20170824_orders.csv'}, {'sold': '20170825_sales.csv', 'order': '20170825_orders.csv'}, {'sold': '20170826_sales.csv', 'order': '20170826_orders.csv'}, {'sold': '20170827_sales.csv', 'order': '20170827_orders.csv'}]
Answer 2 (score: 1)
Just because comprehensions exist doesn't mean you should use them for everything. This works fine:
import re
date = re.compile(r'\d+')
for sales in sFiles:
    salesDate = date.search(sales).group(0)
    for orders in oFiles:
        orderDate = date.search(orders).group(0)
        if salesDate == orderDate:
            print(sales, orders)
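Can it be sped up? Yes. But you don't need to force it into a list comprehension just because you can. Sometimes writing more code is better, simply because it spreads the complexity out.
Here is an incremental improvement that makes the algorithm O(n):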
import re
date = re.compile(r'\d+')
# Index the orders files by date so that each lookup takes constant time.
orders_dict = dict((date.search(file).group(0), file) for file in oFiles)
for sales in sFiles:
    salesDate = date.search(sales).group(0)
    if salesDate in orders_dict:
        orders = orders_dict[salesDate]
        print(sales, orders)
    else:
        # What do you do if it doesn't exist? You can't put handling code
        # here if you try to write this as a comprehension.
        pass
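One possible way to fill in that else branch, shown only as a sketch, is to collect the sales files that have no matching orders file so they can be reported afterwards. The function name `pair_by_date` and the choice to return two lists are assumptions for illustration, not something the answer prescribes.

```python
import re

DATE_RE = re.compile(r'\d+')

def pair_by_date(sales_files, order_files):
    """Return (pairs, unmatched_sales) using a dict lookup keyed by date."""
    orders_by_date = {DATE_RE.search(name).group(0): name for name in order_files}
    pairs, unmatched = [], []
    for sales in sales_files:
        sales_date = DATE_RE.search(sales).group(0)
        if sales_date in orders_by_date:
            pairs.append((sales, orders_by_date[sales_date]))
        else:
            # No orders file for this date: remember it instead of dropping it.
            unmatched.append(sales)
    return pairs, unmatched

sFiles = ['20170822_sales.csv', '20170824_sales.csv', '20170825_sales.csv',
          '20170826_sales.csv', '20170827_sales.csv', '20170828_sales.csv']
oFiles = ['20170822_orders.csv', '20170823_orders.csv', '20170824_orders.csv',
          '20170825_orders.csv', '20170826_orders.csv', '20170827_orders.csv']

pairs, unmatched = pair_by_date(sFiles, oFiles)
print(len(pairs))   # 5
print(unmatched)    # ['20170828_sales.csv']
```

Note that this only reports sales files without orders. Orders files without a sales file (20170823 in the sample) would need a second pass in the other direction.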
Answer 3 (score: 1)
This creates a generator that returns the matching pairs in date order:
from collections import defaultdict
def match(sales, orders):
    # When a key is referenced for the first time, the value
    # defaults to the result of the lambda.
    d = defaultdict(lambda: [None, None])
    # Store each sales file in the first slot of its date's entry.
    for sale in sales:
        d[sale[:8]][0] = sale
    # Store each orders file in the second slot.
    for order in orders:
        d[order[:8]][1] = order
    for k in sorted(d):
        # Both values need to exist.
        # If you want the singles too, remove the if.
        if all(d[k]):
            yield d[k]
sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']
oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']
for s,o in match(sFiles,oFiles):
    print(s,o)
Output:
20170822_sales.csv 20170822_orders.csv
20170824_sales.csv 20170824_orders.csv
20170825_sales.csv 20170825_orders.csv
20170826_sales.csv 20170826_orders.csv
20170827_sales.csv 20170827_orders.csv
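The comment about "singles" could also be turned into a switch instead of a code edit. The variant below is a sketch under that assumption: the name `match_all` and the `include_singles` parameter are hypothetical, and it yields the date alongside the two names so that a missing file (`None`) can still be attributed to a day.

```python
from collections import defaultdict

def match_all(sales_files, order_files, include_singles=False):
    """Yield (date, sales_name, orders_name) in date order.

    With include_singles=True, dates missing one file are yielded too,
    with None in the missing position.
    """
    by_date = defaultdict(lambda: [None, None])
    for name in sales_files:
        by_date[name[:8]][0] = name
    for name in order_files:
        by_date[name[:8]][1] = name
    for day in sorted(by_date):
        sale, order = by_date[day]
        if include_singles or (sale and order):
            yield day, sale, order

sFiles = ['20170822_sales.csv', '20170824_sales.csv', '20170825_sales.csv',
          '20170826_sales.csv', '20170827_sales.csv', '20170828_sales.csv']
oFiles = ['20170822_orders.csv', '20170823_orders.csv', '20170824_orders.csv',
          '20170825_orders.csv', '20170826_orders.csv', '20170827_orders.csv']

matched = list(match_all(sFiles, oFiles))
everything = list(match_all(sFiles, oFiles, include_singles=True))
for day, sale, order in everything:
    print(day, sale, order)
```

Like the original, this relies on every file name starting with an eight-digit date.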