Question

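I have two CSV files, both of which contain date and time columns. For EACH row of CSV 1, I need to match both the date and the time against CSV 2 and extract the weather from CSV 2.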

CSV 1:

    Date           Time    Value
    2017/04/20     12:00   100
    2017/03/20     12:00   250
    2017/03/20     12:00   300
    2017/02/20     12:00   80
    2017/02/20     12:00   500

CSV 2:

    Date           Time    Weather
    2017/02/20     12:00   Sunny
    2017/02/20     12:00   Sunny
    2017/03/20     12:00   Sunny
    2017/03/20     12:00   Sunny
    2017/04/20     12:00   Sunny

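I don't know whether this is efficient, but the first thing I did was append the rows of both CSV files to two Python lists:

    # CSV1 and CSV2 are iterables of rows (for example csv.reader objects)
    list1 = []
    list2 = []
    for row in CSV1:
        list1.append(row)
    for row in CSV2:
        list2.append(row)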

Then, for each row in list1, I take the date and time and immediately loop over every row in list2 until both elements match.

    for row in list1:
        published_date = row[0]
        published_time = row[1]
        for rows in list2:
            if published_date == rows[0] and published_time == rows[1]:
                weather = rows[2]  # do something with the weather
                break

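This works, but CSV1 has 1,700 rows and CSV2 has 1,000,000 rows, so the process takes 150 seconds. Is there a significantly faster way?

I know there are solutions for matching on a single element, but here there are two, and I could not adapt the single-element solutions to work.

I'm new to Stack Overflow, so please let me know if I did something wrong in this post.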

Answer 1

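I suggest you check out the pandas library for Python. It can help with your efficiency problem. I was curious and implemented the problem in pandas, and it completed in 373 ms on some dummy data.

You can use the following code to evaluate the library.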

    # Generate some random sample data
    import numpy as np
    import pandas as pd

    date_range = pd.date_range(start='2017-04-20', periods=1700)
    # '%H:%M' is hours:minutes ('%m' would be the month); an explicit Timedelta
    # avoids the 'H'/'h' frequency alias change between pandas versions
    time_range = pd.to_datetime(
        pd.date_range('12:00', freq=pd.Timedelta(hours=1), periods=1700).strftime('%H:%M'))
    values = np.arange(0, 1700)
    weather = np.random.choice(['rain', 'sunny', 'windy'], size=1700, replace=True)

    # Put the random data into DataFrames
    df1 = pd.DataFrame({'Date': date_range,
                        'Time': time_range,
                        'Value': values})

    df2 = pd.DataFrame({'Date': np.random.choice(date_range, size=1000000, replace=True),
                        'Time': np.random.choice(time_range, size=1000000, replace=True),
                        'Weather': np.random.choice(weather, size=1000000, replace=True)})

    # Merge the data together on the Date and Time columns
    df3 = pd.merge(df1, df2, on=['Date', 'Time'], how='inner')
    df3
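
One thing to watch with the merge above: how='inner' returns one output row per matching pair, so repeated (Date, Time) keys in CSV 2 multiply the rows of CSV 1. Here is a minimal sketch on the question's sample data, assuming the files are comma-separated with a header row and that keeping the first weather entry per key is acceptable (this mirrors the `break` in the question's loop). The inline strings and the names `csv1_text`, `csv2_text`, `df2_unique` and `merged` are only illustrative stand-ins.

```python
import io

import pandas as pd

# Inline stand-ins for the two files (assumed comma-separated with a header row).
csv1_text = """Date,Time,Value
2017/04/20,12:00,100
2017/03/20,12:00,250
2017/03/20,12:00,300
2017/02/20,12:00,80
2017/02/20,12:00,500
"""
csv2_text = """Date,Time,Weather
2017/02/20,12:00,Sunny
2017/02/20,12:00,Sunny
2017/03/20,12:00,Sunny
2017/03/20,12:00,Sunny
2017/04/20,12:00,Sunny
"""

df1 = pd.read_csv(io.StringIO(csv1_text))
df2 = pd.read_csv(io.StringIO(csv2_text))

# Keep only the first weather entry per (Date, Time) so the merge cannot
# multiply rows of df1.
df2_unique = df2.drop_duplicates(subset=["Date", "Time"], keep="first")

# how="left" keeps every row of df1; Weather is NaN where nothing matches.
merged = df1.merge(df2_unique, on=["Date", "Time"], how="left")
print(merged)
```

On this sample, `merged` has the same 5 rows as CSV 1, whereas an inner merge against the undeduplicated `df2` would produce 9 rows.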

Answer 2

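Lists are inefficient for membership lookups: each check scans the list, so it costs O(n). Use a dict instead, which gives O(1) lookups on average and also lets you map the date and time to the weather. You can read CSV2 into a dict keyed by (date, time) tuples:

    weather_history = {tuple(row[:2]): row[2] for row in CSV2}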

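This gives a dict like the following (keys are unique, so if a (date, time) pair appears more than once in CSV2, the last row wins):

    {('2017/02/20', '12:00'): 'Sunny', ('2017/03/20', '12:00'): 'Sunny', ... }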

This lets you perform the lookups much more efficiently:

    for row in list1:
        published_date, published_time = row[:2]
        if (published_date, published_time) in weather_history:
            # the weather recorded for that date and time
            weather = weather_history[(published_date, published_time)]
            # do something with weather
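
For reference, the dict lookup can be put together end to end roughly as follows. This is only a sketch: it assumes comma-separated files with a header row, uses inline strings in place of the real files, and the helper `build_weather_index` is a hypothetical name. It uses `setdefault` so that the first weather entry per (date, time) is kept, which matches the `break` in the question's loop; the dict comprehension would keep the last one instead.

```python
import csv
import io

# Inline stand-ins for the two files (assumed comma-separated with a header row).
csv1_text = """Date,Time,Value
2017/04/20,12:00,100
2017/03/20,12:00,250
2017/03/20,12:00,300
2017/02/20,12:00,80
2017/02/20,12:00,500
"""
csv2_text = """Date,Time,Weather
2017/02/20,12:00,Sunny
2017/02/20,12:00,Sunny
2017/03/20,12:00,Sunny
2017/03/20,12:00,Sunny
2017/04/20,12:00,Sunny
"""


def build_weather_index(rows):
    """Map (date, time) -> weather, keeping the first entry seen per key."""
    index = {}
    for date, time, weather in rows:
        index.setdefault((date, time), weather)
    return index


csv2_reader = csv.reader(io.StringIO(csv2_text))
next(csv2_reader)  # skip the header row
weather_history = build_weather_index(csv2_reader)

csv1_reader = csv.reader(io.StringIO(csv1_text))
next(csv1_reader)  # skip the header row
results = []
for date, time, value in csv1_reader:
    # dict.get returns None when the (date, time) pair is absent from CSV 2.
    results.append((date, time, value, weather_history.get((date, time))))

for row in results:
    print(row)
```

The large file is read once to build the index, and each of the 1,700 lookups is then a single hash probe rather than a scan over up to 1,000,000 rows.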

Answer 3

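You need to check whether the tuple (row[0], row[1]) has already been seen in the other file.

The most natural data structure for this is a set.

First loop over the smaller file to build the set, then loop over the larger file and check its contents against the saved data.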

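    # Build the set of (date, time) pairs from the smaller file
    dates_times = {(items[0], items[1]) for items in (line.split() for line in CSV1)}
    # Scan the larger file once and test each row against the set
    for line in CSV2:
        items = line.split()
        if (items[0], items[1]) in dates_times:
            do_something_with(items[2])  # placeholder for your own handling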
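
A self-contained version of the set approach might look like the sketch below. It assumes whitespace-separated lines with a header row, as `line.split()` implies. The sample rows are made up purely for illustration, and `read_rows`, `wanted` and `matches` are hypothetical names.

```python
import io

# Illustrative stand-ins for the files; whitespace-separated with a header row.
csv1_file = io.StringIO(
    "Date Time Value\n"
    "2017/04/20 12:00 100\n"
    "2017/02/20 12:00 80\n"
)
csv2_file = io.StringIO(
    "Date Time Weather\n"
    "2017/02/20 12:00 Sunny\n"
    "2017/03/20 12:00 Rainy\n"
    "2017/04/20 12:00 Windy\n"
    "2017/04/20 12:00 Cloudy\n"
)


def read_rows(lines):
    """Yield whitespace-split fields for each non-empty line after the header."""
    next(lines, None)  # skip the header row
    for line in lines:
        if line.strip():
            yield line.split()


# Build the set from the smaller file.
wanted = {(date, time) for date, time, _value in read_rows(csv1_file)}

# Stream the larger file once; each membership test is a hash lookup.
matches = []
for date, time, weather in read_rows(csv2_file):
    if (date, time) in wanted:
        matches.append((date, time, weather))

print(matches)
```

A set only records that a (date, time) pair exists, so every matching row of the larger file is reported, including repeated keys, and the Value column from CSV 1 is not attached to the result. If each CSV 1 row needs its own weather, a dict keyed by (date, time) is the better fit.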
