Question

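stu.csv contains 850,000 rows and 3 columns. The second column is the longitude of each ID and the third column is its latitude. The data in stu.csv looks like this: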

   ID    longitude    latitude  
  156   41.88367183 12.48777756
  187   41.92854333 12.46903667
  297   41.89106861 12.49270456
  89    41.79317669 12.43212196
  79    41.90027472 12.46274618
  ...       ...         ...

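The pseudocode is below. Its purpose is to compute the distance on the Earth's surface between two IDs from their longitude and latitude, and to output the cumulative sum over any two IDs: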

  dlon = lon2 - lon1
  dlat = lat2 - lat1
  a = (sin(dlat/2))^2 + cos(lat1) * cos(lat2) * (sin(dlon/2))^2
  c = 2 * atan2( sqrt(a), sqrt(1-a) )
  distance = 6371000 * c (where 6371000 is the radius of the Earth)

My code is below, but it runs too slowly. How can I speed it up and rewrite it? Thanks.

    stu<-read.table("stu.csv",header=T,sep=",");

    ## stu has 850,000 rows and 3 columns.

    m<-nrow(stu);

    distance<-0;

    for (i in 1:(m-1))
    {
      for (j in (i+1))
      {     
        dlon = stu[j,2] - stu[i,2];
        dlat = stu[j,3] - stu[i,3];
        a = (sin(dlat/2))^2 + cos(stu[i,3]) * cos(stu[j,3]) * (sin(dlon/2))^2;
        c = 2 * atan2( sqrt(a), sqrt(1-a) );
        distance <-distance+6371000 * c;
       }
    }

    distance

Answer 1

For your case, if what you want is the cumulative distance, we can vectorize:

    x <- read.table(text = "ID    longitude    latitude  
    156   41.88367183 12.48777756
    187   41.92854333 12.46903667
    297   41.89106861 12.49270456
    89    41.79317669 12.43212196
    79    41.90027472 12.46274618", header= TRUE)

    x$laglon <- dplyr::lead(x$longitude, 1)
    x$laglat <- dplyr::lead(x$latitude, 1)

    distfunc <- function(long, lat, newlong, newlat){
      dlon = newlong - long
      dlat = newlat - lat
      a = (sin(dlat/2))^2 + cos(lat) * cos(newlat) * (sin(dlon/2))^2
      c = 2 * atan2( sqrt(a), sqrt(1-a) )
      6371000 * c 
    }

    distfunc(x$longitude, x$latitude, x$laglon, x$laglat)
    # 308784.6 281639.6 730496.0 705004.2       NA

Take the cumsum of the output to get the total distance.

On one million rows, this takes about 0.4 seconds on my system.
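The last element of the output above is NA, so a plain cumsum would end in NA as well. The following is a minimal base R sketch of one way around that, pairing row i with row i + 1 by dropping the last and the first element instead of using dplyr::lead. The function name haversine_m and the degree-to-radian conversion are my own additions and are not part of the answer above: R's sin() and cos() expect radians, so if the columns hold degrees they need converting first, and the results will then differ from the numbers shown above.

```r
# Haversine distance in metres between paired points (vectorized).
# ASSUMPTION: inputs are in degrees. They are converted to radians here
# because R's sin() and cos() work in radians. This conversion is an
# addition and is not part of the original answer.
haversine_m <- function(lon1, lat1, lon2, lat2, radius = 6371000) {
  to_rad <- pi / 180
  lon1 <- lon1 * to_rad; lat1 <- lat1 * to_rad
  lon2 <- lon2 * to_rad; lat2 <- lat2 * to_rad
  dlon <- lon2 - lon1
  dlat <- lat2 - lat1
  a <- sin(dlat / 2)^2 + cos(lat1) * cos(lat2) * sin(dlon / 2)^2
  radius * 2 * atan2(sqrt(a), sqrt(1 - a))
}

# Sample rows from the question, with the column labels kept as given.
stu <- data.frame(
  ID        = c(156, 187, 297, 89, 79),
  longitude = c(41.88367183, 41.92854333, 41.89106861, 41.79317669, 41.90027472),
  latitude  = c(12.48777756, 12.46903667, 12.49270456, 12.43212196, 12.46274618)
)

n <- nrow(stu)
# x[-n] drops the last element and x[-1] drops the first, so position k
# of each vector pairs row k with row k + 1 and no NA is produced.
step_dist <- haversine_m(stu$longitude[-n], stu$latitude[-n],
                         stu$longitude[-1], stu$latitude[-1])

running_total <- cumsum(step_dist)  # cumulative distance after each step
total <- sum(step_dist)             # equals the last running total
```

With the dplyr::lead version, sum(d, na.rm = TRUE) on the output vector d would give the same total without changing the indexing.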

Answer 2

What you are looking for is called "vectorizing a loop." See this related question.

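The basic idea is that in a loop, the CPU has to finish processing the first cell before it moves on to the second, unless it has some guarantee that the way the first cell is processed will not affect the state of the second cell. If the work is expressed as a vector computation, that guarantee exists, and it can process as many elements at once as it is able to, which makes it faster. (There are other reasons this works better, but that is the basic motivation.)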
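As a rough illustration of the difference, the sketch below computes the same sum once with an explicit loop and once with whole-vector operations. The functions loop_sum and vec_sum and the random test data are hypothetical and exist only for this comparison. Timings depend on the machine, so none are quoted here.

```r
set.seed(1)
n <- 20000
x <- runif(n)
y <- runif(n)

# Loop version: R evaluates one pair of elements per iteration.
loop_sum <- function(x, y) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + sin(x[i]) * cos(y[i])
  }
  total
}

# Vectorized version: sin(), cos(), `*` and sum() each operate on the
# whole vector in a single call.
vec_sum <- function(x, y) sum(sin(x) * cos(y))

loop_time <- system.time(res_loop <- loop_sum(x, y))
vec_time  <- system.time(res_vec  <- vec_sum(x, y))

# The two results agree up to floating-point rounding.
all.equal(res_loop, res_vec)
```

Comparing loop_time and vec_time on your own system shows how much the vectorized form helps for a given n.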

See this introduction to apply in R for how to rewrite your code without a loop. (You should be able to keep most of the calculation.)
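One possible apply-style rewrite, as a sketch only: mapply calls the distance function once per pair of consecutive rows. The function distfunc is copied from Answer 1; the vectors lon and lat are just the sample values from the question. Note that mapply still makes one R-level function call per pair, so for arithmetic like this, passing whole vectors to the function directly is usually the faster option.

```r
# Distance function as given in Answer 1 (no unit conversion applied).
distfunc <- function(long, lat, newlong, newlat) {
  dlon <- newlong - long
  dlat <- newlat - lat
  a <- (sin(dlat / 2))^2 + cos(lat) * cos(newlat) * (sin(dlon / 2))^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  6371000 * c
}

lon <- c(41.88367183, 41.92854333, 41.89106861, 41.79317669, 41.90027472)
lat <- c(12.48777756, 12.46903667, 12.49270456, 12.43212196, 12.46274618)
n <- length(lon)

# apply-style: one call of distfunc per pair (row i, row i + 1).
d_apply <- mapply(distfunc, lon[-n], lat[-n], lon[-1], lat[-1])

# Fully vectorized: a single call on whole vectors.
d_vec <- distfunc(lon[-n], lat[-n], lon[-1], lat[-1])

# Both give the same distances.
all.equal(d_apply, d_vec)
```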
