Question

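Background:

I have been given four catalogues of data. The first (let's call it Cat1) gives the coordinates (right ascension and declination, RA and Dec) of radio sources in fields 1 and 2. The second (Cat2) gives the RA and Dec of radio sources and infrared (IR) sources in field 1. The third (Cat3) gives the RA and Dec of radio and IR sources in field 2. The fourth (Cat4) gives the RA and Dec of optical sources in fields 1 and 2.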

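Cat1 has approximately 2000 sources for field 2, keeping in mind that some sources are actually measured multiple times across their extent, e.g. source 1, source 2, source 3a, source 3b, source 3c, source 4, ... Cat1 has approximately 3000 sources for field 1, again with some sources split into parts. Cat1 is a .dat file, which I open in TextEdit and convert to .txt for use with np.genfromtxt.

Cat2 has approximately 1700 sources for field 1. Cat3 has approximately 1700 sources for field 2. Cat2 and Cat3 are .csv files, which I open in Numbers.

Cat4 has approximately 1200 sources for field 1 and approximately 700 sources for field 2. Cat4 is a .dat file, which I open in TextEdit and convert to .txt for use with np.genfromtxt.

I have also figured out how to convert Cat1 and Cat4 into .csv files.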

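Aim:

The goal is to combine these four catalogues into a single catalogue that gives the RA and Dec from Cat2, Cat1 and Cat4 (field 1), then the RA and Dec from Cat3, Cat1 and Cat4 (field 2), such that the RA and Dec from Cat1 and Cat4 are the closest to the RA and Dec from Cat2 or Cat3, so that they are highly likely to be the same source. The level of overlap will vary, but I have produced scatter plots of the data showing that every Cat2 and Cat3 source has a corresponding Cat1 and Cat4 source, to within the accuracy of the plot marker size. Of course, many sources are left over in Cat1 and Cat4, which contain more sources than Cat2 and Cat3.

The trick is that, because the coordinates do not match exactly, I need to look at the RA first and find the best matching value, then look at the corresponding Dec and check that it is also the best matching value.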

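For example, for a source in Cat2: RA = 53.13360595, Dec = -28.0530758

Cat1: RA = 53.133496, Dec = -27.553401 or RA = 53.133873, Dec = -28.054600

Here, 53.1336 lies equally between 53.1334 and 53.1338, but -28.053 is clearly closer to -28.054 than to -27.553, so the second option in Cat1 is the winner.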
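
One way to express this comparison in code is to treat RA and Dec together rather than one after the other. The following is only a minimal sketch using the two Cat1 candidates from the example; it assumes a small-angle approximation, and the variable names are illustrative rather than taken from any existing script.

```python
import numpy as np

# Cat2 source from the example above (degrees).
ra2, dec2 = 53.13360595, -28.0530758

# The two Cat1 candidates from the example above, as (RA, Dec) rows.
cat1 = np.array([[53.133496, -27.553401],
                 [53.133873, -28.054600]])

# Scale the RA offset by cos(Dec) so both offsets are angles on the sky.
d_ra = (cat1[:, 0] - ra2) * np.cos(np.radians(dec2))
d_dec = cat1[:, 1] - dec2

# Small-angle approximation of the separation, in degrees.
separation = np.hypot(d_ra, d_dec)

# Index of the closest Cat1 row.
best = int(np.argmin(separation))
```

Combining both offsets into a single separation avoids having to decide by eye which of RA or Dec should break a tie.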

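Progress:

So far, I have matched the first 15 values in Cat2 to values in Cat1 purely by eye (Command+F to as many decimal places as possible, then using my best judgement), but this is clearly far too inefficient for all 3400 sources across Cat2 and Cat3. I just wanted to see for myself what accuracy to expect in the matching. Unfortunately, some sources match only to the second or third decimal place, while others match to the fourth or fifth.

To produce my scatter plots, I used the code: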

import numpy as np

cat1 = np.genfromtxt('filepath/cat1.txt', delimiter='   ')
RA_cat1 = cat1[:, 0]   # first column: right ascension
Dec_cat1 = cat1[:, 1]  # second column: declination

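I then simply plotted RA_cat1 against Dec_cat1, and repeated this for all my catalogues.

My problem now is that, while searching for answers on how to write code that can match my coordinates, I have seen many answers that involve converting the arrays to lists, but when I try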

cat1list = np.array([RA_cat1, Dec_cat1])
cat1list.tolist()

I end up with a list of the form:

[RA1, RA2, RA3, ..., RA3000], [Dec1, Dec2, Dec3, ..., Dec3000]

instead of what I assume would be more helpful:

[RA1, Dec1], [RA2, Dec2], ..., [RA3000, Dec3000].
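
For reference, the paired layout can be obtained by stacking the two columns side by side before converting. This is a small sketch with made-up stand-in values for RA_cat1 and Dec_cat1; note also that tolist() returns a new list rather than changing the array in place, so its result has to be assigned.

```python
import numpy as np

# Hypothetical stand-ins for the RA_cat1 and Dec_cat1 columns.
RA_cat1 = np.array([53.133496, 53.133873, 53.140000])
Dec_cat1 = np.array([-27.553401, -28.054600, -27.900000])

# column_stack pairs the i-th RA with the i-th Dec, giving shape (N, 2).
pairs = np.column_stack([RA_cat1, Dec_cat1])

# tolist() returns a new nested list; keep the returned value.
cat1list = pairs.tolist()
```

Transposing the original array with np.array([RA_cat1, Dec_cat1]).T gives the same (N, 2) layout.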

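Furthermore, for similar questions, the most useful answers seem to use dictionaries once the list conversion succeeds, but I am unclear on how to use a dictionary to perform the kind of comparison I described above.

I should also mention that once I have completed this task, I have been asked to repeat the process for a much larger data set. I am not sure exactly how large, but I assume it could be tens of thousands of coordinate sets.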

Answer 1

For the amount of data you have, you can calculate a distance metric between each pair of points. Something like:

import itertools

def close_enough(p1, p2):
    # You may need to scale the RA difference with dec.
    return ((p1.RA - p2.RA)**2 + (p1.Dec - p2.Dec)**2) < 0.01

# points is the list of observations, each with .RA and .Dec attributes.
candidates = [(p1, p2) for p1, p2 in itertools.combinations(points, 2)
              if close_enough(p1, p2)]
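
For the cross-catalogue case in the question (matching each Cat2 source to its nearest Cat1 source, rather than finding pairs within one list), the same brute-force idea can be vectorised with NumPy. The following is a minimal sketch, not a tested pipeline: cross_match, max_sep_deg and the 0.005 degree tolerance are hypothetical, it uses a small-angle approximation, and it ignores RA wrap-around at 0/360 degrees.

```python
import numpy as np

def cross_match(ref, other, max_sep_deg):
    """For each (RA, Dec) row in ref, return the index of the nearest row
    in other, or -1 if nothing lies within max_sep_deg (degrees)."""
    ref = np.asarray(ref, dtype=float)
    other = np.asarray(other, dtype=float)
    # Offsets between every ref/other pair: shape (len(ref), len(other)).
    d_dec = other[None, :, 1] - ref[:, None, 1]
    mean_dec = np.radians(0.5 * (other[None, :, 1] + ref[:, None, 1]))
    # Scale the RA offset by cos(Dec) so it is an angle on the sky.
    d_ra = (other[None, :, 0] - ref[:, None, 0]) * np.cos(mean_dec)
    sep = np.hypot(d_ra, d_dec)
    nearest = sep.argmin(axis=1)
    nearest_sep = sep[np.arange(len(ref)), nearest]
    return np.where(nearest_sep <= max_sep_deg, nearest, -1), nearest_sep

# First row is the Cat2 example from the question; second row is made up.
cat2 = [[53.13360595, -28.0530758], [53.5, -27.0]]
cat1 = [[53.133496, -27.553401], [53.133873, -28.054600]]
idx, sep = cross_match(cat2, cat1, max_sep_deg=0.005)
```

The full separation matrix has len(ref) * len(other) entries, which is fine for a few thousand sources per catalogue but becomes memory-hungry for much larger ones.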

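For a large data set, you may want to use a line sweep algorithm so that the metric is only calculated for points in the same neighborhood. Like this: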

import itertools as it
import operator as op
import sortedcontainers     # handy library on Pypi
import time

from collections import namedtuple
from math import cos, degrees, pi, radians, sqrt
from random import sample, uniform

Observation = namedtuple("Observation", "dec ra other")

Generate some test data:

number_of_observations = 5000
field1 = [Observation(uniform(-35.0, -25.0),     # dec
                      uniform(45.0, 55.0),       # ra
                      uniform(0, 10))            # other data
          for _ in range(number_of_observations)]

# add in near duplicates
number_of_dups = 1000
dups = []
for obs in sample(field1, number_of_dups):
    dDec = uniform(-0.0001, 0.0001)
    dRA  = uniform(-0.0001, 0.0001)
    dups.append(Observation(obs.dec + dDec, obs.ra + dRA, obs.other))

data = field1 + dups

Here is the algorithm:

# Note: dec is first in Observation, so data is sorted by .dec then .ra.
data.sort()

# Parameter that determines the size of a sliding declination window
# and therefore how close two observations need to be to be considered
# observations of the same object.
dec_span = 0.0001

# Result. A list of observation pairs close enough to be considered 
# observations of the same object.
candidates = []

# Sliding declination window.  Within the window, observations are
# ordered by .ra.
window = sortedcontainers.SortedListWithKey(key=op.attrgetter('ra'))

# lag_obs is the 'southernmost' observation within the sliding declination window.
observation = iter(data)
lag_obs = next(observation)

# lead_obs is the 'northernmost' observation in the sliding declination window.
for lead_obs in data:

    # Dec of lead_obs represents the leading edge of window.
    window.add(lead_obs)

    # Remove observations further than the trailing edge of window.
    while lead_obs.dec - lag_obs.dec > dec_span:
        window.discard(lag_obs)
        lag_obs = next(observation)

    # Calculate 'east-west' width of window_size at dec of lead_obs
    ra_span = dec_span / cos(radians(lead_obs.dec))
    east_ra = lead_obs.ra + ra_span
    west_ra = lead_obs.ra - ra_span

    # Check all observations in the sliding window within
    # ra_span of lead_obs.
    for other_obs in window.irange_key(west_ra, east_ra):

        if other_obs != lead_obs:
            # lead_obs is at the top center of a box 2 * ra_span wide by
            # 1 * dec_span tall.  other_obs is in that box. If desired,
            # put additional fine-grained 'closeness' tests here.
            # For example:
            #    average_dec = (other_obs.dec + lead_obs.dec) / 2
            #    delta_dec = other_obs.dec - lead_obs.dec
            #    delta_ra  = (other_obs.ra - lead_obs.ra) * cos(radians(average_dec))
            # e.g. if delta_dec**2 + delta_ra**2 < threshold:
            candidates.append((lead_obs, other_obs))

On my laptop, it finds the candidate pairs in under a tenth of a second.
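
The commented-out closeness test in the loop uses a flat-sky approximation. If an exact great-circle separation is preferred for the fine-grained check, one option is the haversine formula. This is a sketch only; angular_separation is a hypothetical helper name, not part of the code above.

```python
from math import asin, cos, degrees, radians, sin, sqrt

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) points,
    all given in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(radians, (ra1, dec1, ra2, dec2))
    h = (sin((dec2 - dec1) / 2) ** 2
         + cos(dec1) * cos(dec2) * sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error before asin.
    return degrees(2 * asin(min(1.0, sqrt(h))))
```

Inside the loop it could be called as angular_separation(lead_obs.ra, lead_obs.dec, other_obs.ra, other_obs.dec) and compared against a chosen threshold in degrees.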

How can I match similar coordinates using Python?