I have a function that returns a pandas Series, which I assign to two columns of a DataFrame. My code currently looks like this:
    import pandas as pd

    def firstSite(coords, lat, long, date):
        # Keep only sites dated on or before the given date
        df1 = coords[coords['Date2'] <= date].copy()
        df1['distance'] = df1.apply(
            lambda row: distance(lat, long, row['lat2'], row['long2']),
            axis=1)
        # Earliest site within 2 km, as a one-row (or empty) DataFrame
        b2 = df1.loc[df1.distance <= 2].nsmallest(1, 'Date2')[['Site Name', 'distance']]
        return pd.Series([b2['Site Name'], b2['distance']])

    df[['A', 'B']] = df.apply(
        lambda row: firstSite(coords, row['lat'], row['lng'], row['Date']),
        axis=1)
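At the moment it returns a pandas Series holding the result of that final selection. However, when I look at the output outside the function, it looks like this: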
    ID  Date  pc_lat  pc_long  A                                                                      B
    A   2016  51.5    -1.0     Series([], Name: Site Name, dtype: object)                             Series([], Name: distance, dtype: float64)
    B   2016  51.6    -1.2     Series([], Name: Site Name, dtype: object)                             Series([], Name: distance, dtype: float64)
    C   2016  51.6    -1.2     Series([], Name: Site Name, dtype: object)                             Series([], Name: distance, dtype: float64)
    D   2016  51.6    -1.2     20 Drax Biomass Power Station - Unit 1 Name: Site Name, dtype: object  20 1.921752 Name: distance, dtype: float64
    E   2016  51.5    -1.1     Series([], Name: Site Name, dtype: object)                             Series([], Name: distance, dtype: float64)
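I have evidently returned pandas Series objects rather than the values they contain. But if I change the code to:
    return pd.Series([b2['Site Name'],b2['distance']]).values
I get an error. How can I modify my code to return the 'Site Name' & 'distance' values from b2?
Also, I have messed around with some of the column headers here, so some of them do not quite make sense, but I am just looking for a solution in which I can return either empty lists/NaN or the values.
An example of the values in my mock CSV is "Drax Biomass Power Station - Unit 1" for 'Site Name' & "1.921752" for distance. I don't want all the other information about the Series.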
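For illustration, one possible way to get plain values rather than nested Series is to pull the scalars out of the first row of b2 and to return NaN when b2 is empty. This is only a minimal sketch under stated assumptions: the helper name `first_match` and the one-row mock frame are hypothetical, and the mock values are the ones quoted above.

```python
import numpy as np
import pandas as pd

def first_match(b2):
    """Return scalar 'Site Name' and 'distance' from a one-row (or empty) frame.

    Hypothetical helper: b2 is assumed to have exactly these two columns.
    """
    labels = ['Site Name', 'distance']
    if b2.empty:
        # No site matched: return NaN for both values
        return pd.Series([np.nan, np.nan], index=labels)
    row = b2.iloc[0]  # first (and only) row, giving scalars instead of Series
    return pd.Series([row['Site Name'], row['distance']], index=labels)

# Mock one-row frame shaped like b2 in the question (index label 20 as in the output)
b2 = pd.DataFrame(
    {'Site Name': ['Drax Biomass Power Station - Unit 1'], 'distance': [1.921752]},
    index=[20])

print(first_match(b2).tolist())            # plain values, no Series metadata
print(first_match(b2.iloc[0:0]).tolist())  # empty input gives NaN for both
```

Taking `.iloc[0]` selects a single row, so each element of the returned Series is a scalar; the empty check replaces the step of writing a NaN row into b2.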
Edit:
OK, so I used the Haversine formula, which I have linked here. This is my distance function:
    import math

    def distanceBetweenCm(lat1, lon1, lat2, lon2):
        """
        https://stackoverflow.com/questions/44910530/
        how-to-find-the-distance-between-2-points-in-2-different-dataframes-in-pandas/44910693#44910693
        Haversine Formula: https://en.wikipedia.org/wiki/Haversine_formula
        """
        dLat = math.radians(lat2 - lat1)
        dLon = math.radians(lon2 - lon1)
        lat1 = math.radians(lat1)
        lat2 = math.radians(lat2)
        a = math.sin(dLat/2)**2 + math.sin(dLon/2)**2 * math.cos(lat1) * math.cos(lat2)
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
        # 6371 is the Earth's mean radius in km, so the result is in km
        # (multiply by 100,000 to get the distance in cm)
        return c * 6371
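Since the function name mentions cm while the 2 km radius check relies on kilometres, a quick sanity check of the units may help. The sketch below repeats the same haversine steps under the hypothetical name `haversine_km` and measures one degree of latitude, which should come out at roughly 111 km; the coordinates are arbitrary.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Same haversine steps as the function above; the name is hypothetical."""
    d_lat = math.radians(lat2 - lat1)
    d_lon = math.radians(lon2 - lon1)
    lat1, lat2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(d_lat / 2) ** 2
         + math.sin(d_lon / 2) ** 2 * math.cos(lat1) * math.cos(lat2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return c * 6371  # Earth's mean radius in km, so the result is in km

# Two points one degree of latitude apart on the same meridian
km = haversine_km(51.0, -1.0, 52.0, -1.0)
print(round(km, 2))   # roughly 111 km
print(km * 100_000)   # the same distance expressed in cm
```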
My code tries to find the first site that was built within a given radius (2 km) of a transaction in the CSV. This is the function firstSite:
    import numpy as np
    import pandas as pd

    def firstSite(biomass, lat, long, date):
        # Keep only biomass sites whose Date of Operation is on or before the transaction date
        # (.copy() avoids a SettingWithCopyWarning when the column is added below)
        b1 = biomass[biomass['Date of Operation'] <= date].copy()
        # New distance column: distance in km between the transaction and each site
        b1['distance'] = b1.apply(
            lambda row: distanceBetweenCm(lat, long, row['Lat'], row['Lng']),
            axis=1)
        # One-row DataFrame holding the earliest-built biomass site within 2 km
        b2 = b1.loc[b1.distance <= 2].nsmallest(1, 'Date of Operation')[['Site Name', 'distance']]
        if b2.empty:
            b2.loc[0] = [np.nan, np.nan]
        return pd.Series([b2['Site Name'], b2['distance']])
I have played around with removing the code below, because doing so makes it faster:
    if b2.empty:
        b2.loc[0] = [np.nan, np.nan]
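I have another function in which I read the transactions CSV and the full CSV of biomass sites. I then limit the biomass CSV to sites built before the transaction (although I may later need to handle sites built both before and after the transaction), run the firstSite function on the transaction dataframe (df1), and write the result to an output CSV.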
    import pandas as pd

    def addBioData(csv1, csv2, year):
        df1 = pd.read_csv(csv1)
        # Raw string so the backslashes are not treated as escape sequences
        bio = r"Biomass\PytAny\BiomassOp.csv"
        biomass = pd.read_csv(bio)
        print("Input Bio CSV: " + str(bio))
        # Cut-off of 1 January of the following year, as a Timestamp so it
        # compares cleanly with the datetime64 column below
        dt = pd.Timestamp(year=year + 1, month=1, day=1)
        biomass['Date of Operation'] = pd.to_datetime(biomass['Date of Operation'])
        biomassyr = biomass[biomass['Date of Operation'] < dt]
        df1[['FS2km', 'FS2kmDist']] = df1.apply(
            lambda row: firstSite(biomassyr, row['pc_lat'], row['pc_long'], row['Date']),
            axis=1)
        print(df1)
        df1.to_csv(csv2, index=False, encoding='utf-8')
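If there is a faster way than using .apply, I would be very interested! I will edit in a sample CSV on pastebin in a second.
I have made a mock version of what I want to accomplish. Basically, I want the Site Name of the first site built (by date) that lies within 2 km of the transaction coordinates. If there are no biomass sites within 2 km, the value should be "Null" or NaN.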
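To make the goal concrete, here is a rough end-to-end sketch on made-up data. It computes the distances from one transaction to every site in a single NumPy call rather than with a per-site .apply, which should generally be quicker, although it has not been benchmarked on the real CSVs. Everything not in the code above is an assumption: the names `haversine_km` and `first_site_within`, the mock sites and coordinates, and a biomass frame with a unique index.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km; lat2/lon2 may be NumPy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def first_site_within(biomass, lat, lon, date, radius_km=2.0):
    """Name and distance of the earliest-built site within radius_km
    operating on or before date, or (NaN, NaN) if there is none."""
    dist = pd.Series(
        haversine_km(lat, lon, biomass['Lat'].to_numpy(), biomass['Lng'].to_numpy()),
        index=biomass.index)
    mask = (biomass['Date of Operation'] <= date) & (dist <= radius_km)
    if not mask.any():
        return np.nan, np.nan
    first = biomass.loc[mask, 'Date of Operation'].idxmin()  # earliest-built match
    return biomass.at[first, 'Site Name'], dist.at[first]

# Hypothetical mock data, not taken from the real CSVs
biomass = pd.DataFrame({
    'Site Name': ['Site X', 'Site Y', 'Site Z'],
    'Lat': [51.610, 51.605, 53.000],
    'Lng': [-1.200, -1.205, -2.000],
    'Date of Operation': pd.to_datetime(['2015-06-01', '2014-03-01', '2010-01-01']),
})
df = pd.DataFrame({
    'ID': ['A', 'D', 'E'],
    'pc_lat': [51.5, 51.6, 51.6],
    'pc_long': [-1.0, -1.2, -1.2],
    'Date': pd.to_datetime(['2016-01-01', '2016-01-01', '2013-01-01']),
})

# One call per transaction; the per-site work inside is array-based
results = [first_site_within(biomass, r.pc_lat, r.pc_long, r.Date)
           for r in df.itertuples(index=False)]
names, dists = zip(*results)
df['FS2km'] = list(names)
df['FS2kmDist'] = list(dists)
print(df)
```

In this mock, transaction D has two sites within 2 km and gets the one built first (Site Y); A has no site nearby and E predates both nearby sites, so both get NaN.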