
Selectively loading IIS log files into Hive

Question:

I am just getting started with Hadoop/Pig/Hive on the Cloudera platform and have questions about how to effectively load data for querying.

I currently have ~50GB of IIS logs loaded into HDFS with the following directory structure:

<blockquote>

/user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793/
/user/oi/raw_iis/Webserver2/Org/SubOrg/W3SVC1888303555/
/user/oi/raw_iis/Webserver3/Org/SubOrg/W3SVC1056245683/

etc

</blockquote>

I would like to load all the logs into a Hive table.

I have two issues/questions:

1.

My first issue is that some of the webservers may not have been configured correctly and will have IIS logs without all columns. These incorrect logs need additional processing to map the available columns in the log to the schema that contains all columns.

The data is space-delimited; the issue is that when not all columns are enabled, the log includes only the columns that are enabled. Hive can't automatically insert nulls, since the data does not indicate which columns are empty. I need to be able to map the available columns in the log to the full schema.

Example good log:

#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 GET /common/viewFile/1232 Mozilla/5.0+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/27.0.1453.116+Safari/537.36

Example log with missing columns (cs-method and useragent):

#Fields: date time s-ip cs-uri-stem
2013-07-16 00:00:00 10.1.15.8 /common/viewFile/1232

The log with missing columns needs to be mapped to the full schema like this:

#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 null /common/viewFile/1232 null

How can I map these enabled fields to a schema that includes all possible columns, inserting a blank/null/- token for fields that are missing? Is this something I could handle with a Pig script?

2.

How can I define my Hive tables to include information from the HDFS path, namely Org and SubOrg in my directory structure example, so that it is queryable in Hive? I am also unsure how to properly import data from the many directories into a single Hive table.

Answer1:

First of all, providing sample data would make it easier to help.

<em><strong>How can I map these enabled fields to a schema that includes all possible columns, inserting blank/null/- token for fields that were missing?</strong></em>

If your file has a consistent delimiter you can use Hive, and Hive automatically inserts nulls wherever data is missing, provided the delimiter does not appear inside your data.

<em><strong>Is this something I could handle with a Pig script?</strong></em>

If you have a delimiter between the fields then you can use Hive; otherwise you can go for MapReduce or Pig.
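As a rough illustration of the remapping such a MapReduce/Pig job would perform, here is a minimal sketch in plain Java (hypothetical class and method names, not actual Pig or MapReduce code): given the columns enabled in a log's #Fields: header, it emits every column of the full schema, substituting null for columns the log did not record.

```java
import java.util.*;

public class FieldMapper {
    // Full schema every output row must follow (shortened to the
    // columns from the question; a real schema would list all IIS fields).
    static final List<String> FULL_SCHEMA = Arrays.asList(
        "date", "time", "s-ip", "cs-method", "cs-uri-stem", "useragent");

    // Map one data row onto the full schema, inserting null for
    // columns that were not enabled in this log's #Fields: header.
    public static List<String> remap(List<String> enabledFields, List<String> row) {
        List<String> out = new ArrayList<>();
        for (String col : FULL_SCHEMA) {
            int idx = enabledFields.indexOf(col);
            out.add(idx >= 0 ? row.get(idx) : null);
        }
        return out;
    }

    public static void main(String[] args) {
        // Log with cs-method and useragent missing, as in the question.
        List<String> enabled = Arrays.asList("date", "time", "s-ip", "cs-uri-stem");
        List<String> row = Arrays.asList(
            "2013-07-16", "00:00:00", "10.1.15.8", "/common/viewFile/1232");
        System.out.println(remap(enabled, row));
        // prints [2013-07-16, 00:00:00, 10.1.15.8, null, /common/viewFile/1232, null]
    }
}
```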

<em><strong>How can I include information from the hdfs path, namely Org and SubOrg in my dir structure example so that it is query-able in Hive?</strong></em>

It seems you are new to Hive. Before querying, you have to <a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable" rel="nofollow">create a table</a> that includes information such as the path, delimiter, and schema.
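A minimal sketch of such a table definition, assuming space-delimited files; the table name and column list are made up for illustration (the real schema would list all IIS fields):

```sql
-- Hypothetical external table over the raw logs; adjust columns to your full schema.
CREATE EXTERNAL TABLE iis_logs (
  log_date    STRING,
  log_time    STRING,
  s_ip        STRING,
  cs_method   STRING,
  cs_uri_stem STRING,
  useragent   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/user/oi/raw_iis/';
```

Note that Hive does not read nested subdirectories under LOCATION by default; registering each leaf directory as a partition is one way around that.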

<strong>Is this a good candidate for partitioning?</strong>

You can partition on date if you wish.
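Partitioning is also one way to make Org and SubOrg from the directory structure queryable: declare them as partition columns and register each directory as a partition. A hedged sketch with a hypothetical table name (a date partition column could be added the same way):

```sql
-- Hypothetical partitioned table; org/suborg become queryable virtual columns.
CREATE EXTERNAL TABLE iis_logs_part (
  log_date    STRING,
  log_time    STRING,
  s_ip        STRING,
  cs_uri_stem STRING
)
PARTITIONED BY (org STRING, suborg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE;

-- Point each partition at the matching directory.
ALTER TABLE iis_logs_part ADD PARTITION (org='Org', suborg='SubOrg')
LOCATION '/user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793/';
```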

Answer2:

I was able to solve both of my issues with Pig UDFs (user-defined functions).

<ol><li>Mapping columns to proper schema: See this <a href="https://stackoverflow.com/questions/18295382/extract-first-line-of-csv-file-in-pig/18345721#18345721" rel="nofollow">answer</a> and this <a href="https://stackoverflow.com/a/18346066/445291" rel="nofollow">one</a>. </li> </ol>

All I really had to do was add some logic to handle the IIS headers that start with #. Below are the snippets from getNext() that I used; everything else is the same as mr2ert's example code.

See the values[0].equals("#Fields:") parts.

@Override
public Tuple getNext() throws IOException {
    ...
    Tuple t = mTupleFactory.newTuple(1);

    // ignore header lines except the field definitions
    if (values[0].startsWith("#") && !values[0].equals("#Fields:")) {
        return t;
    }

    ArrayList<String> tf = new ArrayList<String>();
    int pos = 0;

    for (int i = 0; i < values.length; i++) {
        if (fieldHeaders == null || values[0].equals("#Fields:")) {
            // grab field headers ignoring the #Fields: token at values[0]
            if (i > 0) {
                tf.add(values[i]);
            }
            fieldHeaders = tf;
        } else {
            readField(values[i], pos);
            pos = pos + 1;
        }
    }
    ...
}

<ol start="2"><li>

To include information from the file path, I added the following to my LoadFunc UDF that I used to solve 1. In the prepareToRead override, grab the filepath and store it in a member variable.

public class IISLoader extends LoadFunc {
    ...
    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        in = reader;
        filePath = ((FileSplit)split.getWrappedSplit()).getPath().toString();
    }

</li> </ol>

Then within getNext() I could add the path to the output tuple.
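Once the path is stored, pulling the Org and SubOrg components out of it is plain string work. A small self-contained sketch, assuming the directory layout from the question (/user/oi/raw_iis/&lt;webserver&gt;/&lt;Org&gt;/&lt;SubOrg&gt;/...); the class and method names are made up for illustration:

```java
import java.util.Arrays;

public class PathInfo {
    // Extract Org and SubOrg from an HDFS path laid out as
    // .../raw_iis/<webserver>/<Org>/<SubOrg>/<site>/<file>.
    // Anchoring on the "raw_iis" component also copes with a
    // leading hdfs://namenode prefix from getPath().toString().
    public static String[] orgAndSubOrg(String filePath) {
        String[] parts = filePath.split("/");
        int i = Arrays.asList(parts).indexOf("raw_iis");
        if (i < 0 || i + 3 >= parts.length) {
            throw new IllegalArgumentException("unexpected path: " + filePath);
        }
        return new String[] { parts[i + 2], parts[i + 3] };
    }

    public static void main(String[] args) {
        String[] info = orgAndSubOrg(
            "/user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793/file.log");
        System.out.println(info[0] + " " + info[1]); // prints "Org SubOrg"
    }
}
```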
