Nutch - Getting Error: JAVA_HOME is not set. when trying to crawl

Question:

First and foremost, I'm a Nutch/Hadoop newbie. I have installed Cassandra, and I have installed Nutch on the master node of my EMR cluster. When I attempt to run a crawl with the following command:

sudo bin/crawl crawl urls -dir crawl -depth 3 -topN 5

I get

Error: JAVA_HOME is not set.

If I run the command without 'sudo' I get:

Injector: starting at 2014-07-16 02:12:24
Injector: crawlDb: urls/crawldb
Injector: urlDir: crawl
Injector: Converting injected urls to crawl db entries.
Injector: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hadoop/apache-nutch-1.8/runtime/local/crawl
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:279)
    at org.apache.nutch.crawl.Injector.run(Injector.java:316)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:306)

I cannot figure this out. I have read a similar question here: <a href="https://stackoverflow.com/questions/17374028/cant-get-apache-nutch-to-crawl-permissions-and-java-home-suspected" rel="nofollow">Similar Topic</a>

and followed its advice, to no avail. I have added

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

and

export PATH=$PATH:${JAVA_HOME}/bin

to my ~/.bashrc. I am using Linux.

Any help will be appreciated!!

Answer1:

The problem was that I was running

sudo bin/crawl crawl urls -dir crawl -depth 3 -topN 5

Instead, I used

bin/crawl ./urls/seed.txt TestCrawl http://localhost:8983/solr/ 5

and all is well. It was just a malformed command: the 'crawl' form I was using is deprecated, as stated here: <a href="http://wiki.apache.org/nutch/NutchTutorial" rel="nofollow">Apache Nutch Tutorial</a>
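
To make the two failures in the question easier to see, here is a small shell sketch. It rests on two assumptions that the thread does not state outright. First, sudo usually starts a command with a reset environment, so a JAVA_HOME exported in ~/.bashrc is typically not visible to it, which would explain why the error only appeared with sudo. Second, the bin/crawl script is assumed to read its arguments positionally as <seedDir> <crawlDir> <solrURL> <numberOfRounds>, which matches the Injector log above (urlDir: crawl, crawlDb: urls/crawldb). The helper names check_java_home and describe_crawl_args are hypothetical and are not part of Nutch.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not part of Nutch.

# Report whether JAVA_HOME is visible in the current environment.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  echo "JAVA_HOME=$JAVA_HOME"
}

# Show how the first four positional arguments would be read, assuming
# the order <seedDir> <crawlDir> <solrURL> <numberOfRounds>.
describe_crawl_args() {
  if [ "$#" -lt 4 ]; then
    echo "expected: <seedDir> <crawlDir> <solrURL> <numberOfRounds>"
    return 1
  fi
  printf 'seedDir=%s crawlDir=%s solrURL=%s rounds=%s\n' "$1" "$2" "$3" "$4"
}

# Check the variable in the shell that will launch the crawl.
check_java_home || echo "export JAVA_HOME before running bin/crawl"

# The failing command: "crawl" is read as the seed directory and
# "urls" as the crawl directory, hence "Input path does not exist".
describe_crawl_args crawl urls -dir crawl -depth 3 -topN 5

# The working command from the answer.
describe_crawl_args ./urls/seed.txt TestCrawl http://localhost:8983/solr/ 5
```

To see the effect of a reset environment without involving sudo, something like env -i sh -c 'echo "${JAVA_HOME:-JAVA_HOME is not set}"' can be run. If a crawl really does need root, passing the variable explicitly (for example sudo env "JAVA_HOME=$JAVA_HOME" bin/crawl ...) is one option, though here running without sudo was enough.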
