Question

I have a pandas dataframe containing a record of lightning strikes with timestamps and global positions, in the following format:

           Index     Date      Time              Lat      Lon         Good fix?
0          1         20160101  00:00:00.9962692  -7.1961  -60.7604    1
1          2         20160101  00:00:01.0646207  -7.0518  -60.6911    1
2          3         20160101  00:00:01.1102066 -25.3913  -57.2922    1
3          4         20160101  00:00:01.2018573  -7.4842  -60.5129    1
4          5         20160101  00:00:01.2942750  -7.3939  -60.4992    1
5          6         20160101  00:00:01.4431493  -9.6386  -62.8448    1
6          8         20160101  00:00:01.5226157 -23.7089  -58.8888    1
7          9         20160101  00:00:01.5932412  -6.3513  -55.6545    1
8          10        20160101  00:00:01.6736350 -23.8019  -58.9382    1
9          11        20160101  00:00:01.6957858 -24.5724  -57.7229    1

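The actual dataframe contains millions of rows. I want to separate out the events that happen far away in space and time from other events, and store them in a new dataframe, isolated_fixes. I have written code to calculate the separation of any two events, as follows: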

import math as m
import datetime as dt

def are_strikes_space_close(strike1,strike2,defclose=100,latpos=3,lonpos=4): #Uses haversine formula to calculate distance between points, returning a tuple with Boolean closeness statement, and numerical distance
    radlat1 = m.radians(strike1[1][latpos])
    radlon1 = m.radians(strike1[1][lonpos])
    radlat2 = m.radians(strike2[1][latpos])
    radlon2 = m.radians(strike2[1][lonpos])

    a=(m.sin((radlat1-radlat2)/2)**2) + m.cos(radlat1)*m.cos(radlat2)*(m.sin((radlon1-radlon2)/2)**2)
    c=2*m.atan2((a**0.5),((1-a)**0.5))
    R=6371 #earth radius in km
    d=R*c #distance between points in km
    if d <= defclose:
        return (True,d)
    else:
        return (False,d)

And for time:

def getdatetime(series,timelabel=2,datelabel=1,timeformat="%X.%f",dateformat="%Y%m%d"):
    time = dt.datetime.strptime(series[1][timelabel][:15], timeformat)
    date = dt.datetime.strptime(str(series[1][datelabel]), dateformat)
    datetime = dt.datetime.combine(date.date(),time.time())
    return datetime


def are_strikes_time_close(strike1,strike2,defclose=dt.timedelta(0,7200,0)):
    dt1=getdatetime(strike1)
    dt2=getdatetime(strike2)
    timediff=abs(dt1-dt2)
    if timediff<=defclose:
        return(True, timediff)
    else:
        return(False, timediff)

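The real question is how to efficiently compare all events to all other events, to determine how many of them are space_close and time_close.

Note that not all events need to be checked: they are ordered by datetime, so if there were a way to check events and 'break out' as soon as events are no longer close in time, that would save a lot of operations, but I don't know how to do this.

At the moment, my (non-functional) attempt looks like this: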

def extrisolfixes(data,filtereddata,defisol=4): 
    for strike1 in data.iterrows():
        near_strikes=-1 #-1 to account for self counting once on each loop
        for strike2 in data.iterrows():
            if are_strikes_space_close(strike1,strike2)[0]==True and are_strikes_time_close(strike1,strike2)[0]==True:
                near_strikes+=1
        if near_strikes<=defisol:
            filtereddata=filtereddata.append(strike1)

Thank you for any help! I'm happy to provide clarification if needed.

Answer 1

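This answer might not be very efficient. I'm facing a very similar problem and am currently looking for something more efficient than what I do, because it still takes an hour to compute on my dataframe (600k rows).

I first suggest you don't even think about using for loops the way you do. You might not be able to avoid one of them (I do that one with apply), but the second can (and must) be vectorized.

The idea of this technique is to create a new column in the dataframe that stores whether there is another strike nearby (temporally and spatially).

First let's create a function that calculates (with the numpy package) the distances between one strike (reference) and all the others: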

import numpy as np

def get_distance(reference,other_strikes):

    radius = 6371.00085 #radius of the earth in km
    # Get lats and longs in radians, then compute deltas:
    lat1 = np.radians(other_strikes.Lat)
    lat2 = np.radians(reference[0])
    dLat = lat2-lat1
    dLon = np.radians(reference[1]) - np.radians(other_strikes.Lon)
    # And compute the distance (in km)
    a = np.sin(dLat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dLon / 2.0) ** 2
    return 2 * np.arcsin(np.minimum(1, np.sqrt(a))) * radius

Then create a function that checks whether, for one given strike, there is at least one other strike nearby:

def is_there_a_strike_nearby(date_ref, lat_ref, long_ref, delta_t, delta_d, other_strikes):
    #delta_t is in days, delta_d in km; other_strikes must be sorted by its datetime 'Date' column
    dmin = date_ref - np.timedelta64(delta_t,'D')
    dmax = date_ref + np.timedelta64(delta_t,'D')

    #Let's first find all strikes within a temporal range
    ind = other_strikes.Date.searchsorted([dmin,dmax])
    nearby_strikes = other_strikes.iloc[ind[0]:ind[1]].copy()

    if len(nearby_strikes) == 0:
        return False

    #Let's compute spatial distance now:
    nearby_strikes['distance'] = get_distance([lat_ref,long_ref], nearby_strikes[['Lat','Lon']])

    nearby_strikes = nearby_strikes[nearby_strikes['distance']<=delta_d]

    #When other_strikes contains the reference strike (as in the apply call),
    #it matches itself at distance 0, so require at least one more
    return (len(nearby_strikes)>1)

Now that all your functions are ready, you can use apply on your dataframe:

data['presence of nearby strike'] = data[['Date','Lat','Lon']].apply(lambda x: is_there_a_strike_nearby(x['Date'],x['Lat'],x['Lon'], delta_t, delta_d,data), axis=1)

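That's it: you have now created a new column in your dataframe that indicates whether each strike is isolated (False) or not (True), and creating your new dataframe from there is easy.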
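As a minimal sketch of that last step (the three-row dataframe and its flag values are made up for illustration; only the column name comes from the answer), the isolated strikes can be pulled out with a boolean mask:

```python
import pandas as pd

# Toy data: the flag values here are invented, not computed
data = pd.DataFrame({
    'Lat': [-7.1961, -7.0518, -25.3913],
    'Lon': [-60.7604, -60.6911, -57.2922],
    'presence of nearby strike': [True, True, False],
})

# Keep only the rows flagged as having no nearby strike
isolated_fixes = data[~data['presence of nearby strike']].copy()
```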

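The problem with this method is that it is still slow. There are ways to make it faster, for instance changing is_there_a_strike_nearby so that it also takes your data sorted by Lat and Lon as extra arguments, and using further searchsorted calls to filter on Lat and Lon before computing the distance (for instance, if you want strikes within a 10 km range, you can filter with a delta_Lat of 0.09).

Any feedback on this method is more than welcome!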
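The latitude pre-filter could look roughly like the sketch below. It is not the answerer's code: the helper name lat_window and the conversion constant (about 111.19 km per degree of latitude, derived from a 6371 km radius) are assumptions made for illustration, and the data is assumed to be sorted by Lat.

```python
import numpy as np
import pandas as pd

KM_PER_DEG_LAT = 6371.0 * np.pi / 180.0  # about 111.19 km per degree of latitude

def lat_window(sorted_by_lat, lat_ref, delta_d):
    """Rows whose latitude is within delta_d km of lat_ref (hypothetical helper).

    sorted_by_lat must be sorted by its 'Lat' column.
    """
    delta_lat = delta_d / KM_PER_DEG_LAT  # 10 km gives roughly 0.09 degrees
    lo, hi = sorted_by_lat['Lat'].searchsorted([lat_ref - delta_lat,
                                                lat_ref + delta_lat])
    return sorted_by_lat.iloc[lo:hi]

strikes = pd.DataFrame({'Lat': [-7.1961, -7.0518, -25.3913, -7.4842],
                        'Lon': [-60.7604, -60.6911, -57.2922, -60.5129]})
by_lat = strikes.sort_values('Lat').reset_index(drop=True)
candidates = lat_window(by_lat, lat_ref=-7.1961, delta_d=10)
```

The same idea does not carry over to longitude with a fixed constant, because the length of a degree of longitude shrinks with the cosine of the latitude; the exact haversine distance is still needed on the surviving candidates.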

Answer 2

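Depending on your data, this might or might not be useful. Some strikes may be "isolated" in time, i.e. further away from both the strike before and the strike after than the time threshold. You can use these strikes to split your data into groups, and then process those groups with searchsorted along the lines suggested by ysearka. If your data ends up split into hundreds of groups, this may save time.

Here is how the code would look: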

# first of all, convert to timestamp
df['DateTime'] = pd.to_datetime(df['Date'].astype(str) + 'T' + df['Time'])

# calculate the time difference with previous and following strike
df['time_separation'] = np.minimum( df['DateTime'].diff().values, 
                                   -df['DateTime'].diff(-1).values)
# using a specific threshold for illustration
df['is_isolated'] = df['time_separation'] > pd.Timedelta("00:00:00.08")
# define groups
df['group'] = (df['is_isolated'] != df['is_isolated'].shift()).cumsum()
# put isolated strikes into a separate group so they can be skipped
df.loc[df['is_isolated'], 'group'] = -1

This is the output, with the specific threshold I used:

       Lat      Lon                      DateTime is_isolated  group
0  -7.1961 -60.7604 2016-01-01 00:00:00.996269200       False      1
1  -7.0518 -60.6911 2016-01-01 00:00:01.064620700       False      1
2 -25.3913 -57.2922 2016-01-01 00:00:01.110206600       False      1
3  -7.4842 -60.5129 2016-01-01 00:00:01.201857300        True     -1
4  -7.3939 -60.4992 2016-01-01 00:00:01.294275000        True     -1
5  -9.6386 -62.8448 2016-01-01 00:00:01.443149300       False      3
6 -23.7089 -58.8888 2016-01-01 00:00:01.522615700       False      3
7  -6.3513 -55.6545 2016-01-01 00:00:01.593241200       False      3
8 -23.8019 -58.9382 2016-01-01 00:00:01.673635000       False      3
9 -24.5724 -57.7229 2016-01-01 00:00:01.695785800       False      3
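One possible way to process the groups afterwards is sketched below. It is an assumption about how the grouping might be used, not part of the answer: the helpers haversine_matrix and count_neighbours are hypothetical, the 80 ms and 100 km thresholds are illustrative, and the full pairwise matrix is only practical when each group is small.

```python
import numpy as np
import pandas as pd

def haversine_matrix(lat, lon, radius=6371.0):
    """Pairwise great-circle distances in km between all points."""
    lat = np.radians(np.asarray(lat))[:, None]
    lon = np.radians(np.asarray(lon))[:, None]
    a = (np.sin((lat - lat.T) / 2) ** 2
         + np.cos(lat) * np.cos(lat.T) * np.sin((lon - lon.T) / 2) ** 2)
    return 2 * radius * np.arcsin(np.minimum(1, np.sqrt(a)))

def count_neighbours(group, max_dt, max_km):
    """Number of other strikes in the group close in both time and space."""
    t = group['DateTime'].values
    close_t = np.abs(t[:, None] - t[None, :]) <= max_dt
    close_d = haversine_matrix(group['Lat'], group['Lon']) <= max_km
    return (close_t & close_d).sum(axis=1) - 1  # minus 1 for the strike itself

df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2016-01-01 00:00:00.996', '2016-01-01 00:00:01.064',
                                '2016-01-01 00:00:01.110', '2016-01-01 00:00:01.201']),
    'Lat': [-7.1961, -7.0518, -25.3913, -7.4842],
    'Lon': [-60.7604, -60.6911, -57.2922, -60.5129],
    'group': [1, 1, 1, -1],  # -1 marks strikes already isolated in time
})

df['neighbours'] = 0  # time-isolated strikes keep a count of zero
for label, group in df[df['group'] != -1].groupby('group'):
    df.loc[group.index, 'neighbours'] = count_neighbours(
        group, np.timedelta64(80, 'ms'), 100)
```

Strikes in group -1 are never compared with anything, which is where the saving comes from; for large groups the searchsorted approach from the first answer would replace the pairwise matrix.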

Answer 3

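This is one of those problems that seems easy at first, but the more you think about it the more your head melts! We essentially have a three-dimensional (Lat, Lon, Time) clustering problem, followed by filtering based on cluster size. There are a number of questions a bit like this (though more abstract), and the responses tend to involve scipy. Check out this one. I would also look into fuzzy c-means clustering. Here is the skfuzzy example.

In your case, though, the geodesic distance might be key, in which case you might not want to skip computing the distances. The high-math examples kind of miss that point.

If accuracy isn't important, there may be more basic ways of doing it, such as creating arbitrary time 'bins' using pandas.cut or similar. There is an optimal bin size between speed and accuracy. For instance, if you cut into t/4 bins (1800 seconds) and treat a gap of 4 bins as far apart in time, then the actual time difference could be anywhere from 5401 to 8999 seconds. An example of cutting. Applying something similar to the lon and lat coordinates, and doing the calculations on the approximate values, will be faster.

Hope that helps.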
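The binning arithmetic can be checked with a small sketch. It uses integer floor division rather than a cut function, which is a substitution made here for brevity; the helper name time_bin is hypothetical.

```python
import pandas as pd

BIN_SECONDS = 1800  # t/4 for the 7200-second threshold used in the question

def time_bin(datetimes, bin_seconds=BIN_SECONDS):
    """Integer bin number for each timestamp (hypothetical helper)."""
    epoch_seconds = (datetimes - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
    return epoch_seconds // bin_seconds

times = pd.Series(pd.to_datetime(['2016-01-01 00:00:00', '2016-01-01 00:29:59',
                                  '2016-01-01 00:30:00', '2016-01-01 02:30:00']))
bins = time_bin(times)
bin_gap = bins - bins.iloc[0]  # bin distance from the first strike
```

Two strikes whose bins differ by k are more than (k - 1) * 1800 and less than (k + 1) * 1800 seconds apart, which for k = 4 and whole seconds gives the 5401 to 8999 range quoted above.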

Answer 4

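You can use some unsupervised ML algorithms to improve speed. Some data transformations are needed before using an ML algorithm. For example:

Convert "Date" and "Time" into a single feature column, "Timestamp".
You can use the raw "Lat" and "Lon", but it may help to merge them into one feature. A common approach is to compute the distance from some arbitrary point (it could be the center of the region); sometimes, to increase the importance of geolocation, you can use several points and measure the distance to each of them. For the distance calculation you can use get_distance from ysearka's answer.
Data scaling (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler). Try with and without it.

After data preprocessing, you can simply use the scikit-learn clustering algorithms (http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster) to arrange the data into clusters. KMeans is a good starting point.

Also, take a look at NearestNeighbors (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html), which searches for a specified number of objects in order of similarity.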
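As a rough sketch of the NearestNeighbors idea, the block below counts spatial neighbours within 100 km. It uses a radius query with the haversine metric instead of a fixed number of neighbours, which is a choice made here rather than something the answer specifies, and it handles the spatial part only; the time condition would still have to be applied separately.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

EARTH_RADIUS_KM = 6371.0

coords_deg = np.array([[-7.1961, -60.7604],
                       [-7.0518, -60.6911],
                       [-25.3913, -57.2922]])
coords_rad = np.radians(coords_deg)  # the haversine metric expects [lat, lon] in radians

# Radius is given as an angle: 100 km divided by the Earth's radius
nn = NearestNeighbors(radius=100 / EARTH_RADIUS_KM, metric='haversine',
                      algorithm='ball_tree')
nn.fit(coords_rad)
dist, idx = nn.radius_neighbors(coords_rad)

# Each query point finds itself, so subtract one
n_neighbours = np.array([len(i) - 1 for i in idx])
```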

How to efficiently compare rows in a pandas DataFrame?