Hazelcast - OperationTimeoutException

I am using Hazelcast version 3.3.1. I have a 9-node cluster running on AWS on c3.2xlarge servers. I am using a distributed executor service and a distributed map. The distributed executor service uses a single thread. The distributed map is configured with no replication and no near cache, and stores about 1 million objects of 1-2 KB each, serialized with Kryo. My use case is as follows:

  • All 9 nodes constantly execute a synchronous remote operation on the distributed executor service and generate about 20k hits per second (about 2k per node).
  • Invocations are executed using the Hazelcast API com.hazelcast.core.IExecutorService#executeOnKeyOwner.
  • Each operation accesses the distributed map on the node owning the partition, does some calculation using the stored object, and stores the object back into the map (for that I use the get and set methods of the IMap object).

Every once in a while Hazelcast throws a timeout exception such as:

    com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:mapService', op=GetOperation{}, partitionId=212, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[172.31.44.2]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false} No response has been received! backups-expected:0 backups-completed: 0

In some cases I see map partitions start to migrate, which makes things even worse: nodes constantly leave and rejoin the cluster, and the only way I can recover is by restarting the entire cluster.

I am wondering what may cause Hazelcast to block a map get operation for 120 seconds. I am pretty sure it's not network related, since other services on the same servers operate just fine. Also note that the servers are mostly idle (~70%).

Any feedback on my use case will be highly appreciated.

Answer 1:

Why don't you make use of an entry processor? It is also sent to the machine owning the partition, and the load, modify, store sequence is done there automatically and atomically, so there are no race problems. It will probably outperform the current approach significantly since there is less remoting involved.
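To make the "load, modify, store as one atomic step" idea concrete, here is a minimal single-JVM sketch. It deliberately does not use Hazelcast: `ConcurrentHashMap.compute` stands in for an entry processor running on the partition owner, and `calculate`, `updateAtomically`, `runConcurrentUpdates` and the key `"object-1"` are hypothetical names made up for illustration. In Hazelcast the corresponding pieces would be an `EntryProcessor` passed to `IMap.executeOnKey`; check the signatures against the version you run.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Single-JVM analogy only: ConcurrentHashMap.compute stands in for an entry
// processor executed on the partition owner. No Hazelcast classes are used.
class AtomicUpdateSketch {

    // Hypothetical stand-in for the "calculation" mentioned in the question.
    static long calculate(long stored) {
        return stored + 1;
    }

    // Load, modify and store for one key as a single atomic step.
    static long updateAtomically(ConcurrentHashMap<String, Long> map, String key) {
        return map.compute(key, (k, v) -> calculate(v == null ? 0L : v));
    }

    // Fires `updates` concurrent updates at one key and returns the final value.
    static long runConcurrentUpdates(int threads, int updates) {
        ConcurrentHashMap<String, Long> map = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < updates; i++) {
            pool.execute(() -> updateAtomically(map, "object-1"));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("updates did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return map.get("object-1");
    }

    public static void main(String[] args) {
        // No update is lost, because each one is applied atomically per key.
        System.out.println(runConcurrentUpdates(4, 10_000));
    }
}
```

With a separate get followed by a set, two callers can read the same value and one update gets overwritten; pushing the whole function to where the entry lives avoids that and saves one round trip per operation.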

The fact that map.get does not return for 120 seconds is indeed very confusing. If you switch to Hazelcast 3.5, we added some logging and debugging support for this: the slow operation detector (executing side) and the slow invocation detector (caller side) should give you some insight into what is happening.
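As a rough sketch of how the slow operation detector might be tuned once you are on 3.5, the helper below only builds a `java.util.Properties` object. The property names are written as I recall them from the 3.5 system properties list, so treat them as assumptions and verify them against the reference manual for your version; `SlowOperationDetectorProps` and `build` are hypothetical names. The resulting properties would then be handed to the Hazelcast configuration (or set as JVM system properties) before the instance starts.

```java
import java.util.Properties;

// Builds detector settings only; it does not start or touch Hazelcast.
class SlowOperationDetectorProps {

    // Property names are assumptions recalled from the 3.5 documentation.
    static Properties build(long thresholdMillis) {
        Properties props = new Properties();
        // Turn the detector on explicitly.
        props.setProperty("hazelcast.slow.operation.detector.enabled", "true");
        // Operations running longer than this are reported as slow.
        props.setProperty("hazelcast.slow.operation.detector.threshold.millis",
                Long.toString(thresholdMillis));
        // Log the stack trace of the slow operation, not just a summary.
        props.setProperty("hazelcast.slow.operation.detector.stacktrace.logging.enabled", "true");
        return props;
    }

    public static void main(String[] args) {
        build(1000L).forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

A threshold well below the 60000 ms call timeout seen in the exception makes it more likely that the blocking operation is logged before the invocation is aborted.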

Do you see any health monitor log entries being printed?
