I have a large pandas DataFrame, 'pc':
pc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754851 entries, 0 to 1754850
Data columns (total 33 columns):
# Column Dtype
--- ------ -----
0 Latitude float64
1 Longitude float64
2 Easting Int64
3 Northing Int64
4 Grid Ref string
5 County string
6 District string
7 Ward string
8 Country string
9 Constituency string
10 Parish string
11 National Park string
12 Population Int64
13 Households Int64
14 Built up area string
15 Built up sub-division string
16 Lower layer super output area string
17 Rural/urban string
18 Region string
19 Altitude Int64
20 London zone string
21 Local authority string
22 Middle layer super output area string
23 Index of Multiple Deprivation string
24 Quality Int64
25 User Type Int64
26 Last updated string
27 Nearest station string
28 Distance to station float64
29 Police force string
30 Water company string
31 Plus Code string
32 Average Income Int64
dtypes: Int64(8), float64(3), string(22)
memory usage: 455.2 MB
It has a column named 'Latitude' and another named 'Longitude', and I tried to build a GeoPandas GeoDataFrame from them like this:
gdfpc = geopandas.GeoDataFrame(pc, geometry=geopandas.points_from_xy(pc.Longitude, df.Latitude))
This raised the following error:
ValueError: x and y arrays must be equal length.
Calling pc.head() and pc.tail() doesn't help:
pc.head()
Latitude Longitude Easting ... Water company Plus Code Average Income
0 57.149606 -2.096916 394235 ... Scottish Water 9C9V4WX3+R6 <NA>
1 57.148707 -2.097806 394181 ... Scottish Water 9C9V4WX2+FV <NA>
2 57.149051 -2.097004 394230 ... Scottish Water 9C9V4WX3+J5 <NA>
3 57.148080 -2.094664 394371 ... Scottish Water 9C9V4WX4+64 <NA>
4 57.150058 -2.095916 394296 ... Scottish Water 9C9V5W23+2J <NA>
[5 rows x 33 columns]
pc.tail()
Latitude Longitude ... Plus Code Average Income
1754846 59.889544 -1.307206 ... 9CFWVMQV+R4 <NA>
1754847 59.873651 -1.305697 ... 9CFWVMFV+FP <NA>
1754848 59.875286 -1.307502 ... 9CFWVMGR+4X <NA>
1754849 59.891572 -1.313847 ... 9CFWVMRP+JF <NA>
1754850 59.892392 -1.310899 ... 9CFWVMRQ+XJ <NA>
[5 rows x 33 columns]
Looking up the largest and smallest latitudes and longitudes did not reveal any missing values that might offer a clue:
pc.nlargest(1, columns='Latitude')
Latitude Longitude ... Plus Code Average Income
1754598 60.800694 -0.869518 ... 9CGXR42J+75 <NA>
[1 rows x 33 columns]
pc.nlargest(1, columns='Longitude')
Latitude Longitude Easting ... Water company Plus Code Average Income
111540 4.610106 114.331172 <NA> ... <NA> <NA> <NA>
[1 rows x 33 columns]
pc.nsmallest(1, columns='Latitude')
Latitude Longitude Easting ... Water company Plus Code Average Income
111552 -51.796253 -59.523613 <NA> ... <NA> <NA> <NA>
[1 rows x 33 columns]
pc.nsmallest(1, columns='Longitude')
Latitude Longitude Easting ... Water company Plus Code Average Income
111544 34.924031 -117.891208 <NA> ... <NA> <NA> <NA>
[1 rows x 33 columns]
Converting the two columns to separate pandas Series, and then to NumPy arrays for further analysis, still did not reveal any identifiable difference:
>>> La = pc['Latitude']
>>> Lo = pc['Longitude']
>>> npLa = La.to_numpy(copy=True)
>>> npLo = Lo.to_numpy(copy=True)
>>> np.asarray(npLo).shape
(1754851,)
>>> np.asarray(npLa).shape
(1754851,)
>>> npLa.size
1754851
>>> npLo.size
1754851
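The arrays above come from fresh column lookups, not from the arguments that were actually passed to points_from_xy. A minimal sketch of a check that takes the two arguments directly might look like the following. Here describe_xy is a hypothetical helper (not part of pandas or GeoPandas), and the short lists are toy stand-ins for the real columns:

```python
def describe_xy(x, y, x_name="x", y_name="y"):
    """Summarise two coordinate sequences and report whether their lengths match."""
    nx, ny = len(x), len(y)
    status = "OK" if nx == ny else "MISMATCH"
    return f"{x_name}: {nx} values, {y_name}: {ny} values -> {status}"


# Toy data standing in for the real columns (assumption: any sized sequence works,
# including a pandas Series, because only len() is used).
lon = [-2.096916, -2.097806, -2.097004]
lat = [57.149606, 57.148707]

print(describe_xy(lon, lat, "lon", "lat"))
```

With the real data, the idea would be to pass the same two expressions that appear in the failing call, copied verbatim rather than re-typed, so that the check sees exactly what points_from_xy saw.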
Any ideas before I resign myself to using the Haversine formula everywhere?