I am working with flume to ingest a ton of data into hdfs (about petabytes of data). I would like to know how is flume making use of its distributed architecture? I have over 200 servers and I have installed flume in one of them from where I would get the data from (aka data source) and the sink is the hdfs. (hadoop is running over serengeti in these servers). I am not sure whether flume distributes itself over the cluster or I have installed it incorrectly. I followed apache's user guide for flume installation and this post of SO.
I am a newbie to flume and trying to understand more about it..Any help would be greatly appreciated. Thanks!!
I'm not going to speak to Cloudera's specific recommendations but instead to Apache Flume itself.
It's distributed however you decide to distribute it. Decide on your own topology and implement it.
You should think of Flume as a durable pipe. It has a source (you can choose from a number), a channel (you can choose from a number) and a sink (again, you can choose from a number). It is pretty typical to use an Avro sink in one agent to connect to an Avro source in another.
Assume you are installing Flume to gather Apache webserver logs. The common architecture would be to install Flume on each Apache webserver machine. You would probably use the Spooling Directory Source to get the Apache logs and the Syslog Source to get syslog. You would use the memory channel for speed and so as not to affect the server (at the cost of durability) and use the Avro sink.
That Avro sink would be connected, via Flume load balancing, to 2 or more collectors. The collectors would be Avro source, File channel and whatever you wanted (elasticsearch?, hdfs?) as your sink. You may even add another tier of agents to handle the final output.
In the latest version, Apache Flume no longer follows master-slave architecture. It is deprecated after Flume 1.x.
There is no longer a Master, and no Zookeeper dependency. Flume now runs with a simple file-based configuration system.
If we want it to scale, we need to install it in multiple physical nodes and run our own topology. As far as single node is considered. Say we hook to a JMS server that gives 2000 XML events per second, and I need two Fulme agents to get that data, I have two distributed options:<ol> <li>Two Flume agents started and running to get JMS data in same physical node.</li> <li>Two Flume agents started and running to get JMS data in two physical nodes.</li> </ol>