Distance calculation optimization in R

<h3>Question</h3>

Is there a way to optimize the distance calculation below? I have included a small example, but my real data is a spreadsheet with more than 6000 rows, and computing the variable d takes considerable time. Would it be possible to adjust this code so that it gives the same results but runs faster?

    library(rdist)
    library(tictoc)
    library(geosphere)

    time<-tic()

    df<-structure(list(Industries=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19),
    Latitude = c(-23.8, -23.8, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9,
    -23.9, -23.9, -23.9, -23.9, -23.9),
    Longitude = c(-49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.7,
    -49.7, -49.7, -49.7, -49.7, -49.6, -49.6, -49.6, -49.6)),
    class = "data.frame", row.names = c(NA, -19L))

    k=3 #clusters

    coordinates<-df[c("Latitude","Longitude")]
    d<-as.dist(distm(coordinates[,2:1]))
    fit.average<-hclust(d,method="average")
    clusters<-cutree(fit.average, k)
    nclusters<-matrix(table(clusters))
    df$cluster <- clusters

    time<-toc()
    1.54 sec elapsed

    > d
              1        2        3        4        5        6        7        8
    2      0.00
    3  11075.61 11075.61
    4  11075.61 11075.61     0.00
    5  11075.61 11075.61     0.00     0.00
    6  11075.61 11075.61     0.00     0.00     0.00
    7  11075.61 11075.61     0.00     0.00     0.00     0.00
    8  11075.61 11075.61     0.00     0.00     0.00     0.00     0.00
    9  11075.61 11075.61     0.00     0.00     0.00     0.00     0.00     0.00
    10 11075.61 11075.61     0.00     0.00     0.00     0.00     0.00     0.00
    11 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
    12 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
    13 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
    14 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
    15 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
    16 11075.61 11075.61     0.00     0.00     0.00     0.00     0.00     0.00
    17 11075.61 11075.61     0.00     0.00     0.00     0.00     0.00     0.00
    18 11075.61 11075.61     0.00     0.00     0.00     0.00     0.00     0.00
    19 11075.61 11075.61     0.00     0.00     0.00     0.00     0.00     0.00
              9       10       11       12       13       14       15       16
    2
    3
    4
    5
    6
    7
    8
    9
    10     0.00
    11 10183.02 10183.02
    12 10183.02 10183.02     0.00
    13 10183.02 10183.02     0.00     0.00
    14 10183.02 10183.02     0.00     0.00     0.00
    15 10183.02 10183.02     0.00     0.00     0.00     0.00
    16     0.00     0.00 10183.02 10183.02 10183.02 10183.02 10183.02
    17     0.00     0.00 10183.02 10183.02 10183.02 10183.02 10183.02     0.00
    18     0.00     0.00 10183.02 10183.02 10183.02 10183.02 10183.02     0.00
    19     0.00     0.00 10183.02 10183.02 10183.02 10183.02 10183.02     0.00
             17       18
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18     0.00
    19     0.00     0.00

<h3>Comparison</h3>

    > df$cluster <- clusters
    > df
       Industries Latitude Longitude cluster
    1           1    -23.8     -49.6       1
    2           2    -23.8     -49.6       1
    3           3    -23.9     -49.6       2
    4           4    -23.9     -49.6       2
    5           5    -23.9     -49.6       2
    6           6    -23.9     -49.6       2
    7           7    -23.9     -49.6       2
    8           8    -23.9     -49.6       2
    9           9    -23.9     -49.6       2
    10         10    -23.9     -49.6       2
    11         11    -23.9     -49.7       3
    12         12    -23.9     -49.7       3
    13         13    -23.9     -49.7       3
    14         14    -23.9     -49.7       3
    15         15    -23.9     -49.7       3
    16         16    -23.9     -49.6       2
    17         17    -23.9     -49.6       2
    18         18    -23.9     -49.6       2
    19         19    -23.9     -49.6       2

    > clustered_df
       Industries Latitude Longitude cluster     Dist Cluster
    1          11    -23.9     -49.7       3     0.00       1
    2          12    -23.9     -49.7       3     0.00       1
    3          13    -23.9     -49.7       3     0.00       1
    4          14    -23.9     -49.7       3     0.00       1
    5          15    -23.9     -49.7       3     0.00       1
    6           3    -23.9     -49.6       2 10183.02       2
    7           4    -23.9     -49.6       2     0.00       2
    8           5    -23.9     -49.6       2     0.00       2
    9           6    -23.9     -49.6       2     0.00       2
    10          7    -23.9     -49.6       2     0.00       2
    11          8    -23.9     -49.6       2     0.00       2
    12          9    -23.9     -49.6       2     0.00       2
    13         10    -23.9     -49.6       2     0.00       2
    14         16    -23.9     -49.6       2     0.00       2
    15         17    -23.9     -49.6       2     0.00       2
    16         18    -23.9     -49.6       2     0.00       2
    17         19    -23.9     -49.6       2     0.00       2
    18          1    -23.8     -49.6       1 11075.61       3
    19          2    -23.8     -49.6       1     0.00       3

<h3>Answer1:</h3>

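@Jose This is perhaps not as sound mathematically (in terms of the clustering), but it uses what is generally a better measure of great-circle distance (Vincenty's formulae). It sorts the points by longitude and latitude and computes only the distance from each point to the next one, instead of the full pairwise matrix. On your sample data it is about 8 times faster and gives what I think is your desired result.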

    # Order the dataframe by Lon and Lat: ordered_df => data.frame
    ordered_df <- df %>%
      arrange(., Longitude, Latitude)

    # Scalar valued at how many clusters we are expecting => integer vector
    k = 3

    # Matrix of co-ordinates: coordinates => matrix
    coordinates <- ordered_df %>%
      select(Longitude, Latitude) %>%
      as.matrix()

    # Generate great circle distances between points and Long-Lat Matrix: d => data.frame
    d <- data.frame(Dist = c(0, distVincentyEllipsoid(coordinates)))

    # Segment the distances into groups: cluster => factor
    d$Cluster <- factor(cumsum(d$Dist > (quantile(d$Dist, 1/k))) + 1)

    # Merge with base data: clustered_df => data.frame
    clustered_df <- cbind(ordered_df, d)
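A minimal sketch of why this does less work, assuming the geosphere package is installed. The four points and the names `full_matrix` and `sequential` are made up for illustration. When called with a single matrix, `distVincentyEllipsoid` returns the distance between each row and the next, while `distm` returns the whole matrix of pairwise distances.

```r
library(geosphere)

# Four made-up (Longitude, Latitude) points, already sorted as in the answer
pts <- cbind(Longitude = c(-49.7, -49.7, -49.6, -49.6),
             Latitude  = c(-23.9, -23.9, -23.9, -23.8))

# Question's approach: every pair of points, an n x n matrix
full_matrix <- distm(pts)

# Answer's approach: only each point to the next one, n - 1 values
sequential <- distVincentyEllipsoid(pts)

dim(full_matrix)    # 4 4
length(sequential)  # 3

# Number of distinct pairs versus sequential steps for 6000 rows
choose(6000, 2)     # 17997000
6000 - 1            # 5999
```

The sequential distances only say how far each point is from its neighbour in the sorted order, so the grouping built from them is not guaranteed to match hierarchical clustering on the full matrix.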

Libraries and sample data for the clustering snippet above:

    library(geosphere)
    library(dplyr)

    df <- structure(list(Industries=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19),
    Latitude = c(-23.8, -23.8, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9,
    -23.9, -23.9, -23.9, -23.9, -23.9),
    Longitude = c(-49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.7,
    -49.7, -49.7, -49.7, -49.7, -49.6, -49.6, -49.6, -49.6)),
    class = "data.frame", row.names = c(NA, -19L))

    start_time <- Sys.time()
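If the original `hclust` result has to be kept, one possible alternative (not part of the answer above) is to compute distances only between distinct locations and pass the number of rows at each location to `hclust` through its `members` argument. The sketch below is a hedged illustration on the question's sample data. It assumes duplicates are exact coordinate matches and that `k` does not exceed the number of distinct locations; names such as `locations`, `loc_index` and `loc_count` are hypothetical. With average linkage this should reproduce the same partition, although ties between non-zero distances could in principle be broken differently.

```r
library(geosphere)

# Sample data from the question, written compactly with rep()
df <- data.frame(
  Industries = 1:19,
  Latitude   = c(rep(-23.8, 2), rep(-23.9, 17)),
  Longitude  = c(rep(-49.6, 10), rep(-49.7, 5), rep(-49.6, 4))
)
k <- 3

# One key per row; rows with identical coordinates share a key
key <- paste(df$Longitude, df$Latitude)
first_rows <- !duplicated(key)

# Distinct locations, and for each row the index of its location
locations <- as.matrix(df[first_rows, c("Longitude", "Latitude")])
loc_index <- match(key, key[first_rows])

# How many original rows sit at each distinct location
loc_count <- tabulate(loc_index, nbins = nrow(locations))

# Distances between distinct locations only (3 x 3 here instead of 19 x 19)
d_small <- as.dist(distm(locations))

# 'members' tells hclust that each starting item already stands for several rows
fit <- hclust(d_small, method = "average", members = loc_count)

# Cut the tree and copy each location's label back to its rows
df$cluster <- unname(cutree(fit, k))[loc_index]
```

On this sample the distance matrix shrinks from 19 x 19 to 3 x 3. How much a 6000-row spreadsheet would gain depends entirely on how many coordinates repeat.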

Source: https://stackoverflow.com/questions/62055976/distance-calculation-optimization-in-r
