I've read <a href="https://www.tensorflow.org/deploy/distributed" rel="nofollow">Distributed TensorFlow Doc</a> and <a href="https://stackoverflow.com/q/43147435/9099269" rel="nofollow">this question on StackOverflow</a> but I still have some doubt about the dynamics behind the distributed training that can be done with TensorFlow and its Parameter Server Architecture. This is a snipped of code from the Distributed TensorFlow Doc:
if FLAGS.job_name == "ps": server.join() elif FLAGS.job_name == "worker": # Assigns ops to the local worker by default. with tf.device(tf.train.replica_device_setter( worker_device="/job:worker/task:%d" % FLAGS.task_index, cluster=cluster)): # Build model... loss = ... global_step = tf.contrib.framework.get_or_create_global_step() train_op = tf.train.AdagradOptimizer(0.01).minimize( loss, global_step=global_step)
And here part of the answer of the StackOverflow question that I read:<blockquote>
The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).
The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.
The worker sends the gradients for each variable to the appropriate PS task, and applies the gradients to their respective variable, using an update rule that is determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently on the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates received.</blockquote>
I have to reproduce this kind of parameter server architecture in another environment and I need to deeply understand how workers and PS tasks interact with each other inside the TensorFlow framework. My question is, does the PS task do some kind of merging or updating operation after receiving the value from the workers or it just store the newest value ? Can be something reasonable just storing the newest value ? Looking at the code from the TensorFlow documentation I see that the PS task just do a join() and I wonder behind this method call which are the complete behaviour of the PS task.
One more question, what is the difference between compute a gradient and apply a gradient ?Answer1:
Let's go in reverse order and start from your last question: <strong>what is the difference between compute a gradient and apply a gradient?</strong>
<em>Computing</em> the gradients means running the backward pass on the network, after having computed the loss. For gradient descent, this means estimating the
gradients value in the formula beneath (note: this is a <em>huge</em> simplification of what computing gradients actually entails, look up more about backpropagation and gradient descent fora proper explanation of how this works). <em>Applying</em> the gradients means updating the parameters according to the gradients you just computed. For gradient descent, this (roughly) means executing the following:
weights = weights - (learning_step * gradients)
Note that, depending on the value of
learning_step, the new value of
weights depends on both the previous value and the computed weights.
With this in mind, it's easier to understand the PS/worker architecture. Let's make the simplifying assumption that there is only one PS (we'll see later how to extend to multi-PS)
A PS (<em>parameter server</em>) keeps in memory the
weights (i.e. the parameters) and receives
gradients, running the update step I wrote in the code above. It does this every time it receives gradients from a worker.
A worker, on the other hand, looks up what's the <em>current</em> value of
weights in the PS, makes a copy of it locally, runs a forward and a backward pass of the network on a batch of data and gets new
gradients, which then sends back to the PS.
Note the emphasis on "current": <strong>there is no locking or inter-process synchronization between workers and the PS</strong>. If a worker reads
weights in the middle of an update (and, for example, half already have the new value and half are still being updated), that's the weights he'll use for the next iteration. This keeps things fast.
<strong>What if there's more PSs?</strong> No problem! The parameters of the network are <em>partitioned</em> among the PSs, the worker simply contacts all of them to get the new values for each chunk of the parameters and sends back only the gradients relevant to each specific PS.