81798

how is flume distributed?

I am working with flume to ingest a ton of data into hdfs (about petabytes of data). I would like to know how is flume making use of its distributed architecture? I have over 200 servers and I have installed flume in one of them from where I would get the data from (aka data source) and the sink is the hdfs. (hadoop is running over serengeti in these servers). I am not sure whether flume distributes itself over the cluster or I have installed it incorrectly. I followed apache's user guide for flume installation and this post of SO.

How to install and configure apache flume?

http://flume.apache.org/FlumeUserGuide.html#setup

I am a newbie to flume and trying to understand more about it..Any help would be greatly appreciated. Thanks!!

Answer1:

I'm not going to speak to Cloudera's specific recommendations but instead to Apache Flume itself.

It's distributed however you decide to distribute it. Decide on your own topology and implement it.

You should think of Flume as a durable pipe. It has a source (you can choose from a number), a channel (you can choose from a number) and a sink (again, you can choose from a number). It is pretty typical to use an Avro sink in one agent to connect to an Avro source in another.

Assume you are installing Flume to gather Apache webserver logs. The common architecture would be to install Flume on each Apache webserver machine. You would probably use the Spooling Directory Source to get the Apache logs and the Syslog Source to get syslog. You would use the memory channel for speed and so as not to affect the server (at the cost of durability) and use the Avro sink.

That Avro sink would be connected, via Flume load balancing, to 2 or more collectors. The collectors would be Avro source, File channel and whatever you wanted (elasticsearch?, hdfs?) as your sink. You may even add another tier of agents to handle the final output.

Answer2:

In the latest version, Apache Flume no longer follows master-slave architecture. It is deprecated after Flume 1.x.

There is no longer a Master, and no Zookeeper dependency. Flume now runs with a simple file-based configuration system.

If we want it to scale, we need to install it in multiple physical nodes and run our own topology. As far as single node is considered. Say we hook to a JMS server that gives 2000 XML events per second, and I need two Fulme agents to get that data, I have two distributed options:

<ol> <li>Two Flume agents started and running to get JMS data in same physical node.</li> <li>Two Flume agents started and running to get JMS data in two physical nodes.</li> </ol>

Recommend

  • Confusion with the external tables in hive
  • pandas dataframe sum with groupby
  • Integrate ngrx into my code
  • Adding jars to the classpath of the code that launches map reduce job
  • Freeing a double pointer from a struct
  • Hive not detecting timestamp format
  • SML/NJ: How to use HashTable?
  • Rabin Karp Algorithm - How is the worst case O(m*n) for the given input?
  • Compressed file ingestion using Flume
  • Not able to open a portlet in liferay dialog
  • Hadoop streaming with python on Windows
  • can HBase , MapReduce and HDFS can work on a single machine having Hadoop installed and running on i
  • Best practice to handle default server and ip forwarding in nginx
  • RewriteCond match for certain Query param/value pair
  • Copy Access database query into Excel spreadsheet
  • Adding dependencies to a custom gradle plugin
  • How do I resolve an “invalid quantifier” error with regexp in javascript?
  • (XCode 4.0.2) Archive build (build for distribution on app-store) armv6 warning
  • psql.frame_query
  • iOS client server approach
  • Question regarding the ExtJS License [closed]
  • Posting to Facebook Page through API as the page (with admin rights)
  • How to use HTTP Authentication with PHP and then run the entered data against a database?
  • How can I encode a filename according to RFC 2231?
  • HikariPool-1 - Unusual system clock change detected, soft-evicting connections from pool
  • How do you SELECT several columns with one distinct column
  • C# 4 and CLR Compatibility
  • How to select table rows/complete table?
  • Multicolor tooltip in Qt
  • File extension of zlib zipped html page?
  • iOS Cordova first plugin - plugin.xml to inject a feature
  • How to estimate the Kalman Filter with 'KFAS' R package, with an AR(1) transition equation
  • Android activity accessing service's static reference before the service is ready
  • Problem deserializing objects from cache on MyBatis 3/Java
  • Cannot connect to cassandra from Spark
  • Matrix multiplication with MKL
  • Python: how to group similar lists together in a list of lists?
  • CSS Applying specific rule for a specific monitor resolution with only CSS is posible?
  • What are the advantages and disadvantages of reading an entire file into a single String as opposed
  • Converting MP3 duration time