
Distributed TensorFlow [Async, Between-Graph Replication]: what exactly is the interaction between workers and parameter servers?

Question:

I've read the <a href="https://www.tensorflow.org/deploy/distributed" rel="nofollow">Distributed TensorFlow Doc</a> and <a href="https://stackoverflow.com/q/43147435/9099269" rel="nofollow">this question on StackOverflow</a>, but I still have some doubts about the dynamics behind the distributed training that can be done with TensorFlow and its Parameter Server architecture. This is a snippet of code from the Distributed TensorFlow Doc:

if FLAGS.job_name == "ps":
  server.join()
elif FLAGS.job_name == "worker":
  # Assigns ops to the local worker by default.
  with tf.device(tf.train.replica_device_setter(
      worker_device="/job:worker/task:%d" % FLAGS.task_index,
      cluster=cluster)):
    # Build model...
    loss = ...
    global_step = tf.contrib.framework.get_or_create_global_step()
    train_op = tf.train.AdagradOptimizer(0.01).minimize(
        loss, global_step=global_step)

And here part of the answer of the StackOverflow question that I read:

<blockquote>

The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).

The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.

The worker sends the gradients for each variable to the appropriate PS task, and applies the gradients to their respective variable, using an update rule that is determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently on the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates received.

</blockquote>

I have to reproduce this kind of parameter server architecture in another environment, and I need to deeply understand how workers and PS tasks interact with each other inside the TensorFlow framework. My question is: does the PS task do some kind of merging or updating operation after receiving the values from the workers, or does it just store the newest value? Can just storing the newest value be reasonable? Looking at the code from the TensorFlow documentation, I see that the PS task just does a join(), and I wonder what the complete behaviour of the PS task is behind this method call.

One more question: what is the difference between computing a gradient and applying a gradient?

Answer1:

Let's go in reverse order and start from your last question: <strong>what is the difference between computing a gradient and applying a gradient?</strong>

<em>Computing</em> the gradients means running the backward pass on the network, after having computed the loss. For gradient descent, this means estimating the value of gradients in the formula below (note: this is a <em>huge</em> simplification of what computing gradients actually entails; look up backpropagation and gradient descent for a proper explanation of how this works). <em>Applying</em> the gradients means updating the parameters according to the gradients you just computed. For gradient descent, this (roughly) means executing the following:

weights = weights - (learning_step * gradients)

Note that, depending on the value of learning_step, the new value of weights depends on both its previous value and the computed gradients.
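In TensorFlow 1.x this split is visible in the optimizer API itself: minimize() is essentially compute_gradients() followed by apply_gradients(). A minimal sketch of the two calls, assuming loss and global_step already exist as in the snippet from the question:

import tensorflow as tf

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)

# Step 1: compute the gradients (the backward pass) for each trainable variable.
grads_and_vars = optimizer.compute_gradients(loss)

# Step 2: apply the gradients, i.e. run the update rule on the variables.
# In the PS architecture, this is the part that acts on the variables stored on the PS.
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)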

With this in mind, it's easier to understand the PS/worker architecture. Let's make the simplifying assumption that there is only one PS (we'll see later how to extend this to multiple PSs).

A PS (<em>parameter server</em>) keeps in memory the weights (i.e. the parameters) and receives gradients, running the update step I wrote in the code above. It does this every time it receives gradients from a worker.

A worker, on the other hand, looks up the <em>current</em> value of weights on the PS, makes a local copy of it, runs a forward and a backward pass of the network on a batch of data to get new gradients, which it then sends back to the PS.

Note the emphasis on "current": <strong>there is no locking or inter-process synchronization between workers and the PS</strong>. If a worker reads weights in the middle of an update (and, for example, half already have the new value and half are still being updated), those are the weights it will use for the next iteration. This keeps things fast.
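To make the asynchronous interaction concrete, here is a minimal, framework-free sketch of what one worker iteration and the PS update amount to. Everything here is illustrative (a toy linear model in plain NumPy), not TensorFlow internals:

import numpy as np

# --- state conceptually held by the (single) parameter server ---
weights = np.zeros(3)

def ps_apply(gradients, learning_step=0.1):
    # Run by the PS each time it receives gradients from a worker:
    # no locking, no merging of workers' values, just the update rule.
    global weights
    weights -= learning_step * gradients

# --- work conceptually done by each worker on every iteration ---
def worker_step(batch_x, batch_y):
    local_copy = weights.copy()                    # uncoordinated read of the current weights
    predictions = batch_x @ local_copy             # forward pass (toy linear model)
    errors = predictions - batch_y
    gradients = batch_x.T @ errors / len(batch_y)  # backward pass (least-squares gradient)
    ps_apply(gradients)                            # "send" the gradients back to the PS

# one asynchronous iteration on a toy batch
worker_step(np.array([[1.0, 2.0, 3.0]]), np.array([2.0]))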

<strong>What if there are more PSs?</strong> No problem! The parameters of the network are <em>partitioned</em> among the PSs; the worker simply contacts all of them to fetch the current values of each chunk of the parameters, and sends back to each PS only the gradients for the variables it holds.
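In the TensorFlow snippet from the question, this partitioning is exactly what tf.train.replica_device_setter takes care of: by default it places variables on the PS tasks in round-robin order, so each worker automatically reads from, and sends gradients to, the right PS. A hedged sketch (the host names are placeholders):

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],  # two PS tasks
    "worker": ["worker0.example.com:2222"],
})

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    # Variables go round-robin to /job:ps/task:0 and /job:ps/task:1,
    # while the ops stay on the worker device.
    w1 = tf.get_variable("w1", shape=[100, 10])  # placed on ps task 0
    w2 = tf.get_variable("w2", shape=[10])       # placed on ps task 1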
