Question

I have a function that returns a pandas Series into 2 columns of a DataFrame. Currently my code looks like this:

def firstSite(coords, lat, long, date):
    # Keep only sites dated on or before the given date
    df1 = coords[coords['Date2'] <= date].copy()
    # Distance from (lat, long) to each remaining site
    df1['distance'] = df1.apply(
        lambda row: distance(lat, long, row['lat2'], row['long2']),
        axis = 1)

    # Earliest site within a distance of 2
    b2 = df1.loc[df1.distance <= 2].nsmallest(1, 'Date2')[['Site Name','distance']]

    return pd.Series([b2['Site Name'],b2['distance']])

df[['A','B']] = df.apply(
    lambda row: firstSite(coords, row['lat'], row['lng'], row['Date']),
    axis = 1)

Currently, it returns a pandas Series built from the selected row. However, when I look at the output outside of the function, it looks like this:

ID Date pc_lat pc_long A                                            B

A  2016 51.5   -1.0    Series([], Name: Site Name, dtype: object)   Series([], Name: distance, dtype: float64)
B  2016 51.6   -1.2    Series([], Name: Site Name, dtype: object)   Series([], Name: distance, dtype: float64)
C  2016 51.6   -1.2    Series([], Name: Site Name, dtype: object)   Series([], Name: distance, dtype: float64)
D  2016 51.6   -1.2    20    Drax Biomass Power Station - Unit 1 Name: Site Name, dtype: object 20    1.921752 Name: distance, dtype: float64
E  2016 51.5   -1.1    Series([], Name: Site Name, dtype: object)   Series([], Name: distance, dtype: float64)

I'm clearly returning a pandas Series rather than the values of the Series, but if I change the code to:

return pd.Series([b2['Site Name'],b2['distance']]).values

I get an error. How can I modify my code to return the 'Site Name' & 'distance' values from b2?

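Also, I've muddled some of the column headers here, so some of them don't really make sense, but I'm just looking for a solution where I can return either empty lists/NaN or the values.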

An example of the values in my mock CSV is "Drax Biomass Power Station - Unit 1" for the Site Name & "1.921752" for the distance. I don't want all the other information about the Series.
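One reading of the output above is that each cell holds a whole one-row Series because `b2['Site Name']` is itself a Series. Below is a minimal sketch of pulling out plain scalars instead, assuming `b2` has at most one row. The helper name `first_row_values` and the sample frame are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd

def first_row_values(b2, columns=("Site Name", "distance")):
    """Return the first row's values as plain scalars, or NaN if b2 is empty."""
    if b2.empty:
        # No matching site: one NaN per requested column
        return pd.Series([np.nan] * len(columns))
    first = b2.iloc[0]  # the single selected row, by position
    return pd.Series([first[c] for c in columns])

# Hypothetical one-row result, shaped like the row shown for ID "D" above
b2 = pd.DataFrame(
    {"Site Name": ["Drax Biomass Power Station - Unit 1"], "distance": [1.921752]},
    index=[20],
)
result = first_row_values(b2)
empty_result = first_row_values(b2.iloc[0:0])
```

With something like this as the return value of firstSite, the two target columns would presumably receive the site name and the distance themselves rather than nested Series.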

Edit:

OK, so I've used the Haversine formula that I've linked to here. This is my distance function:

import math

def distanceBetweenCm(lat1, lon1, lat2, lon2):
    """
    https://stackoverflow.com/questions/44910530/
    how-to-find-the-distance-between-2-points-in-2-different-dataframes-in-pandas/44910693#44910693
    Haversine Formula: https://en.wikipedia.org/wiki/Haversine_formula
    """
    # Differences in latitude and longitude, in radians
    dLat = math.radians(lat2-lat1)
    dLon = math.radians(lon2-lon1)

    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)

    a = math.sin(dLat/2)**2 + math.sin(dLon/2)**2 * math.cos(lat1) * math.cos(lat2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    return c * 6371  # Earth radius in km, so the result is in km (multiply by 100,000 for cm)
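As a rough sketch, the same Haversine formula can also be written with NumPy so that it accepts whole arrays of coordinates at once. The function name `haversine_km` and the constant `EARTH_RADIUS_KM` are made up for illustration, and the clipping step is an added safeguard against rounding, not something from the linked answer.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # same mean Earth radius as in the function above

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, scalars or arrays."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    d_lat = lat2 - lat1
    d_lon = lon2 - lon1
    a = np.sin(d_lat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(d_lon / 2) ** 2
    a = np.clip(a, 0.0, 1.0)  # guard against tiny floating-point overshoot
    return 2 * EARTH_RADIUS_KM * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# One point against two candidate points (coordinates are illustrative)
d = haversine_km(51.5, -1.0, np.array([51.5, 51.6]), np.array([-1.0, -1.0]))
```

Passing a whole column of site coordinates in one call like this avoids calling the scalar function once per row.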

My code tries to find the first site built within a certain radius (2 km) of a transaction in the CSV. This is the firstSite function:

def firstSite(biomass, lat, long, date):

    # Keep only biomass sites whose Date of Operation is on or before the transaction date
    b1 = biomass[biomass['Date of Operation'] <= date].copy()

    # New distance column: distance (km) between the transaction and each site
    b1['distance'] = b1.apply(
        lambda row: distanceBetweenCm(lat, long, row['Lat'], row['Lng']),
        axis=1)

    # Select the earliest-built biomass site within 2 km
    b2 = b1.loc[b1.distance <= 2].nsmallest(1, 'Date of Operation')[['Site Name','distance']]
    # If nothing is within 2 km, add a placeholder NaN row
    if b2.empty:
        b2.loc[0] = [np.nan, np.nan]
    return pd.Series([b2['Site Name'],b2['distance']])

I've experimented with removing the code below, since that makes it faster:

    if b2.empty:                        
        b2.loc[0] = [np.nan, np.nan]

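I have another function that reads in the transactions CSV and the full biomass sites CSV. I then restrict the biomass CSV to sites built before the transaction (although I may need to do this for sites built both before and after the transaction), and then I run the firstSite function on the transactions DataFrame (df1) and write to an output CSV.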

def addBioData(csv1, csv2, year):
    df1 = pd.read_csv(csv1)
    # Raw string so the backslashes in the Windows path are not treated as escapes
    bio = r"Biomass\PytAny\BiomassOp.csv"
    biomass = pd.read_csv(bio)
    print("Input Bio CSV: "+str(bio))

    # Keep only sites operating before 1 January of the following year;
    # a pd.Timestamp compares cleanly with the datetime64 column
    dt = pd.Timestamp(year + 1, 1, 1)
    biomass['Date of Operation']  = pd.to_datetime(biomass['Date of Operation'])
    biomassyr = biomass[biomass['Date of Operation'] < dt]
    df1[['FS2km', 'FS2kmDist']] = df1.apply(
        lambda row: firstSite(biomassyr, row['pc_lat'], row['pc_long'], row['Date']),
        axis = 1)
    print(df1)

    df1.to_csv(csv2,index=False,encoding='utf-8')

If there's a faster way than using .apply, I'd be very interested! I'll edit in sample CSVs on pastebin in a second.
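One possible direction for avoiding the nested .apply calls is to compute every transaction-to-site distance in a single NumPy broadcast and then pick the earliest qualifying site per transaction. The sketch below assumes both date columns are already datetime64, that the biomass frame is non-empty, and that a transactions-by-sites matrix fits in memory. The function name `first_sites_within` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def first_sites_within(transactions, biomass, radius_km=2.0):
    """For each transaction, find the earliest-built site within radius_km."""
    # Sort by build date so the first match along each row is the earliest site
    sites = biomass.sort_values("Date of Operation", kind="stable").reset_index(drop=True)

    # Column vectors for transactions, row vectors for sites (radians)
    t_lat = np.radians(transactions["pc_lat"].to_numpy(dtype=float))[:, None]
    t_lon = np.radians(transactions["pc_long"].to_numpy(dtype=float))[:, None]
    s_lat = np.radians(sites["Lat"].to_numpy(dtype=float))[None, :]
    s_lon = np.radians(sites["Lng"].to_numpy(dtype=float))[None, :]

    # Haversine distance matrix in km, shape (n_transactions, n_sites)
    a = (np.sin((s_lat - t_lat) / 2) ** 2
         + np.cos(t_lat) * np.cos(s_lat) * np.sin((s_lon - t_lon) / 2) ** 2)
    dist = 2 * 6371.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

    # Site must be operating on or before the transaction date and within range
    built_before = (sites["Date of Operation"].to_numpy()[None, :]
                    <= transactions["Date"].to_numpy()[:, None])
    ok = (dist <= radius_km) & built_before

    first = ok.argmax(axis=1)      # index of first True per row (0 if none)
    has_match = ok.any(axis=1)     # distinguishes "no match" from "match at 0"
    rows = np.arange(len(transactions))
    names = sites["Site Name"].to_numpy(dtype=object)[first]

    return pd.DataFrame(
        {"FS2km": np.where(has_match, names, np.nan),
         "FS2kmDist": np.where(has_match, dist[rows, first], np.nan)},
        index=transactions.index,
    )

# Hypothetical mock data
transactions = pd.DataFrame({
    "ID": ["A", "D"],
    "Date": pd.to_datetime(["2016-06-01", "2016-06-01"]),
    "pc_lat": [51.5, 51.6],
    "pc_long": [-1.0, -1.2],
})
biomass = pd.DataFrame({
    "Site Name": ["Site X", "Site Y"],
    "Date of Operation": pd.to_datetime(["2015-01-01", "2010-01-01"]),
    "Lat": [51.61, 51.605],
    "Lng": [-1.2, -1.2],
})
result = first_sites_within(transactions, biomass)
```

The result shares the index of the transactions frame, so it could be attached with something like `df1.join(result)`. For very large inputs the full distance matrix may be too big, in which case the transactions could be processed in chunks.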

biomass CSV

Transaction Price CSV

Output CSV

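I've made a mock-up of what I want to accomplish. Basically, I want the Site Name of the first site built (by date) that lies within 2 km of the transaction coordinates. If there are no biomass sites within 2 km, the value should be "Null" or NaN.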

Return a pandas Series, but only the values rather than the actual Series

0 answers: