
# Distance calculation optimization in R

<h3>Question</h3>

I would like to know if there is any way to optimize the distance calculation below. I have included a small example, but my real data is a spreadsheet with more than 6000 rows, and computing the variable `d` takes considerable time. Is it possible to adjust this so that it produces the same results in a more efficient way?

```
library(rdist)
library(tictoc)
library(geosphere)

time <- tic()

df <- structure(list(
  Industries = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19),
  Latitude = c(-23.8, -23.8, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9,
               -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9),
  Longitude = c(-49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6,
                -49.7, -49.7, -49.7, -49.7, -49.7, -49.6, -49.6, -49.6, -49.6)),
  class = "data.frame", row.names = c(NA, -19L))

k = 3 # clusters

coordinates <- df[c("Latitude", "Longitude")]
d <- as.dist(distm(coordinates[, 2:1]))
fit.average <- hclust(d, method = "average")
clusters <- cutree(fit.average, k)
nclusters <- matrix(table(clusters))
df$cluster <- clusters

time <- toc()
1.54 sec elapsed

d
          1        2        3        4        5        6        7        8
2      0.00
3  11075.61 11075.61
4  11075.61 11075.61     0.00
5  11075.61 11075.61     0.00     0.00
6  11075.61 11075.61     0.00     0.00     0.00
7  11075.61 11075.61     0.00     0.00     0.00     0.00
8  11075.61 11075.61     0.00     0.00     0.00     0.00     0.00
9  11075.61 11075.61     0.00     0.00     0.00     0.00     0.00     0.00
10 11075.61 11075.61     0.00     0.00     0.00     0.00     0.00     0.00
11 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
12 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
13 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
14 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
15 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
16 11075.61 11075.61     0.00     0.00     0.00     0.00     0.00     0.00
17 11075.61 11075.61     0.00     0.00     0.00     0.00     0.00     0.00
18 11075.61 11075.61     0.00     0.00     0.00     0.00     0.00     0.00
19 11075.61 11075.61     0.00     0.00     0.00     0.00     0.00     0.00
          9       10       11       12       13       14       15       16
2
3
4
5
6
7
8
9
10     0.00
11 10183.02 10183.02
12 10183.02 10183.02     0.00
13 10183.02 10183.02     0.00     0.00
14 10183.02 10183.02     0.00     0.00     0.00
15 10183.02 10183.02     0.00     0.00     0.00     0.00
16     0.00     0.00 10183.02 10183.02 10183.02 10183.02 10183.02
17     0.00     0.00 10183.02 10183.02 10183.02 10183.02 10183.02     0.00
18     0.00     0.00 10183.02 10183.02 10183.02 10183.02 10183.02     0.00
19     0.00     0.00 10183.02 10183.02 10183.02 10183.02 10183.02     0.00
         17       18
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18     0.00
19     0.00     0.00
```

<h3>Comparison</h3>

```
> df$cluster <- clusters
> df
   Industries Latitude Longitude cluster
1           1    -23.8     -49.6       1
2           2    -23.8     -49.6       1
3           3    -23.9     -49.6       2
4           4    -23.9     -49.6       2
5           5    -23.9     -49.6       2
6           6    -23.9     -49.6       2
7           7    -23.9     -49.6       2
8           8    -23.9     -49.6       2
9           9    -23.9     -49.6       2
10         10    -23.9     -49.6       2
11         11    -23.9     -49.7       3
12         12    -23.9     -49.7       3
13         13    -23.9     -49.7       3
14         14    -23.9     -49.7       3
15         15    -23.9     -49.7       3
16         16    -23.9     -49.6       2
17         17    -23.9     -49.6       2
18         18    -23.9     -49.6       2
19         19    -23.9     -49.6       2
> clustered_df
   Industries Latitude Longitude cluster     Dist Cluster
1          11    -23.9     -49.7       3     0.00       1
2          12    -23.9     -49.7       3     0.00       1
3          13    -23.9     -49.7       3     0.00       1
4          14    -23.9     -49.7       3     0.00       1
5          15    -23.9     -49.7       3     0.00       1
6           3    -23.9     -49.6       2 10183.02       2
7           4    -23.9     -49.6       2     0.00       2
8           5    -23.9     -49.6       2     0.00       2
9           6    -23.9     -49.6       2     0.00       2
10          7    -23.9     -49.6       2     0.00       2
11          8    -23.9     -49.6       2     0.00       2
12          9    -23.9     -49.6       2     0.00       2
13         10    -23.9     -49.6       2     0.00       2
14         16    -23.9     -49.6       2     0.00       2
15         17    -23.9     -49.6       2     0.00       2
16         18    -23.9     -49.6       2     0.00       2
17         19    -23.9     -49.6       2     0.00       2
18          1    -23.8     -49.6       1 11075.61       3
19          2    -23.8     -49.6       1     0.00       3
```
```
# Order the dataframe by Lon and Lat: ordered_df => data.frame
ordered_df <- df %>%
  arrange(., Longitude, Latitude)

# Scalar valued at how many clusters we are expecting => integer vector
k = 3

# Matrix of co-ordinates: coordinates => matrix
coordinates <- ordered_df %>%
  select(Longitude, Latitude) %>%
  as.matrix()

# Generate great circle distances between points and Long-Lat Matrix: d => data.frame
d <- data.frame(Dist = c(0, distVincentyEllipsoid(coordinates)))

# Segment the distances into groups: cluster => factor
d$Cluster <- factor(cumsum(d$Dist > (quantile(d$Dist, 1/k))) + 1)

# Merge with base data: clustered_df => data.frame
clustered_df <- cbind(ordered_df, d)
```
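The segmentation step in the code above turns consecutive distances into group labels with `cumsum()`. A minimal base-R sketch of just that idea follows; the vector `gaps` holds made-up consecutive distances, and `gaps` and `threshold` are illustrative names that do not appear in the original code.

```r
# Hypothetical consecutive distances (metres) between points that are already
# sorted by coordinate; the first entry is 0 because the first point has no
# predecessor.
gaps <- c(0, 0, 0, 10183.02, 0, 0, 11075.61, 0)

# Number of groups we hope to obtain (same role as `k` in the code above).
k <- 3

# Same threshold rule as above: the 1/k quantile of the consecutive distances.
threshold <- quantile(gaps, 1 / k)

# Each gap above the threshold marks the start of a new group; cumsum()
# converts those TRUE/FALSE flags into running group ids.
cluster <- factor(cumsum(gaps > threshold) + 1)

print(data.frame(gaps, cluster))
```

Note that the number of groups depends on how many gaps exceed the threshold, so it equals `k` only when the sorted data contain exactly `k - 1` large jumps. The labels are also assigned in sort order, which is why they need not match the numbering produced by `cutree()`.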
```
library(geosphere)
library(dplyr)

df <- structure(list(
  Industries = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19),
  Latitude = c(-23.8, -23.8, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9,
               -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9),
  Longitude = c(-49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6,
                -49.7, -49.7, -49.7, -49.7, -49.7, -49.6, -49.6, -49.6, -49.6)),
  class = "data.frame", row.names = c(NA, -19L))

start_time <- Sys.time()
```
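The snippet above stops right after recording `start_time`. As a rough sketch of how such a timer is usually closed in base R, the example below uses `Sys.sleep()` as a stand-in workload; `end_time`, `elapsed`, and `timing` are assumed names, and the real clustering code would take the place of the sleep call.

```r
# Record the start time, as in the snippet above.
start_time <- Sys.time()

# Placeholder workload (an assumption); in practice this would be the
# distance and clustering code being measured.
Sys.sleep(0.1)

# Elapsed wall-clock time, returned as a difftime object.
end_time <- Sys.time()
elapsed <- end_time - start_time
print(elapsed)

# system.time() is a base-R alternative that times a single expression.
timing <- system.time(Sys.sleep(0.1))
print(timing[["elapsed"]])
```

Timing a single run on 19 rows is noisy, so for a comparison on the full 6000-row data it may be worth repeating each approach several times and comparing the typical elapsed time.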