Processing a WARC File using Logstash, Elasticsearch, and Kibana

Question:

I would like to parse a WARC file using Logstash and feed the result into Elasticsearch, so that I can visualize it in Kibana. I have tried this:

    input {
      file {
        path => "/tmp/access_log"
        start_position => "beginning"
      }
    }

    filter {
      if [path] =~ "access" {
        mutate { replace => { "type" => "apache_access" } }
        grok {
          match => { "message" => "%{COMBINEDAPACHELOG}" }
        }
      }
      date {
        match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
      }
    }

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
      }
      stdout { codec => rubydebug }
    }

This takes an Apache log and displays it. I would like to know how I can use a WARC file instead and visualize it in Kibana. This is a sample of the WARC file that I would like to use as input:

    WARC/0.17
    WARC-Type: metadata
    WARC-Target-URI: http://www.archive.org/robots.txt
    WARC-Date: 2008-04-30T20:48:25Z
    WARC-Concurrent-To: <urn:uuid:e7c9eff8-f5bc-4aeb-b3d2-9d3df99afb30>
    WARC-Record-ID: <urn:uuid:545709ad-90c5-4c08-9eed-092bdf2e33a7>
    Content-Type: text/anvl
    Content-Length: 66

    via: http://www.archive.org/
    hopsFromSeed: P
    fetchTimeMs: 47

    WARC/0.17
    WARC-Type: response
    WARC-Target-URI: http://www.archive.org/
    WARC-Date: 2008-04-30T20:48:26Z
    WARC-Payload-Digest: sha1:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV
    WARC-IP-Address: 207.241.229.39
    WARC-Record-ID: <urn:uuid:4042c21b-d898-43f0-9c95-b50da2d1aa42>
    Content-Type: application/http; msgtype=response
    Content-Length: 680

    HTTP/1.1 200 OK
    Date: Wed, 30 Apr 2008 20:48:25 GMT
    Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g
    Last-Modified: Wed, 09 Jan 2008 23:18:29 GMT
    ETag: "47ac-16e-4f9e5b40"
    Accept-Ranges: bytes
    Content-Length: 366
    Connection: close
    Content-Type: text/html; charset=UTF-8

    <html>
    <head>
    <meta http-equiv="Refresh" content="0;URL=http://www.archive.org/index.php"/>
    <script>
    document.location="http://www.archive.org/index.php";
    </script>
    </head>
    <body>
    <img width="70" height="56" src="http://www.archive.org/images/logoc.jpg"/><br/>
    Please visit our website at: <a href="http://www.archive.org">http://www.archive.org</a>
    </body>
    </html>

Here is the full sample file: Sample WARC Text in Text File Format (https://drive.google.com/file/d/0BzQ6rtO2VN95WUV3YmhUX1JuYkk/view?usp=sharing)

Hope to hear from you soon. I will be glad to get this resolved.

Answer1:

This filter keeps only the lines that start with "WARC-Target-URI", "HTTP/1.1", or "Date: " and drops everything else. It then extracts the URL, the HTTP response code, and the date from the lines it kept.

    input {
      file {
        path => "/tmp/access_log"
        start_position => "beginning"
      }
    }

    filter {
      if [message] !~ "^WARC-Target-URI" and [message] !~ "^HTTP\/1.1" and [message] !~ "^Date: " {
        drop {}
      }
      grok {
        match => {
          "message" => [
            "Date: %{GREEDYDATA:date}",
            "WARC-Target-URI: %{GREEDYDATA:url}",
            "HTTP/1.1 %{NUMBER:response}"
          ]
        }
      }
      # For "Wed, 30 Apr 2008 20:48:25 GMT"
      date {
        match => ["date", "EEE, dd MMM YYYY HH:mm:ss ZZZ"]
        target => "date"
        locale => "en"
      }
    }

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "webinfo"
      }
    }

From the sample file, it inserts the following JSON documents into Elasticsearch:

    {"message":"WARC-Target-URI: http://www.archive.org/robots.txt","@version":"1","@timestamp":"2016-11-22T12:55:48.151Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/robots.txt"}
    {"message":"WARC-Target-URI: http://www.archive.org/","@version":"1","@timestamp":"2016-11-22T12:55:48.151Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/"}
    {"message":"HTTP/1.1 200 OK","@version":"1","@timestamp":"2016-11-22T12:55:48.167Z","path":"D:\\better.txt","host":"FREIFDKT0021127","response":"200"}
    {"message":"Date: Wed, 30 Apr 2008 20:48:25 GMT","@version":"1","@timestamp":"2016-11-22T12:55:48.167Z","path":"D:\\better.txt","host":"FREIFDKT0021127","date":"2008-04-30T20:48:25.000Z"}
    {"message":"WARC-Target-URI: http://www.archive.org/","@version":"1","@timestamp":"2016-11-22T12:55:48.183Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/"}
    {"message":"WARC-Target-URI: http://www.archive.org/","@version":"1","@timestamp":"2016-11-22T12:55:48.183Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/"}
    {"message":"WARC-Target-URI: http://www.archive.org/index.php","@version":"1","@timestamp":"2016-11-22T12:55:48.183Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/index.php"}
    {"message":"HTTP/1.1 200 OK","@version":"1","@timestamp":"2016-11-22T12:55:48.198Z","path":"D:\\better.txt","host":"FREIFDKT0021127","response":"200"}
    {"message":"Date: Wed, 30 Apr 2008 20:48:25 GMT","@version":"1","@timestamp":"2016-11-22T12:55:48.198Z","path":"D:\\better.txt","host":"FREIFDKT0021127","date":"2008-04-30T20:48:25.000Z"}
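
To check the patterns against a few lines before running Logstash, the drop/grok/date logic of the filter can be approximated in plain Python. This is only a rough sketch, not Logstash itself: the regular expressions are assumed stand-ins for GREEDYDATA and NUMBER, the names KEEP, PATTERNS and parse_line are made up for illustration, and the date is printed in Python's ISO format rather than the "2008-04-30T20:48:25.000Z" form that Logstash writes.

```python
import re
from datetime import datetime, timezone

# Lines that survive the drop {} conditional in the filter.
KEEP = re.compile(r"^(WARC-Target-URI|HTTP/1\.1|Date: )")

# Assumed stand-ins for the three grok patterns, tried in the same order.
PATTERNS = [
    re.compile(r"Date: (?P<date>.*)"),
    re.compile(r"WARC-Target-URI: (?P<url>.*)"),
    re.compile(r"HTTP/1\.1 (?P<response>\d+)"),
]


def parse_line(line):
    """Return the extracted fields, or None if the line would be dropped."""
    line = line.rstrip("\r\n")
    if not KEEP.search(line):
        return None
    doc = {"message": line}
    for pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            doc.update(match.groupdict())
            break  # grok stops at the first matching pattern by default
    if "date" in doc:
        # Handles values shaped like "Wed, 30 Apr 2008 20:48:25 GMT"
        parsed = datetime.strptime(doc["date"], "%a, %d %b %Y %H:%M:%S GMT")
        doc["date"] = parsed.replace(tzinfo=timezone.utc).isoformat()
    return doc


# A few lines taken from the sample WARC record in the question.
sample = [
    "WARC-Type: response",
    "WARC-Target-URI: http://www.archive.org/",
    "WARC-Date: 2008-04-30T20:48:26Z",
    "HTTP/1.1 200 OK",
    "Date: Wed, 30 Apr 2008 20:48:25 GMT",
]

docs = [doc for doc in map(parse_line, sample) if doc is not None]
for doc in docs:
    print(doc)
```

As in the Elasticsearch output above, every kept line becomes its own document, so the URL, the response code, and the date of one WARC record end up in three separate documents.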
