Nutch problems executing crawl

Question:

I am trying to get Nutch 1.11 to execute a crawl. I am running the commands from Cygwin on Windows 7.

Nutch itself runs, and I get output from bin/nutch, but I keep getting error messages when I try to run a crawl.

This is the error I get when I try to run a crawl with Nutch:

Error running: /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl/crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seed.txt

Failed with exit value 127.

I have my JAVA_HOME variable set, and I have edited the hosts file to map 127.0.0.1 to localhost.

I am wondering whether I am passing the right directory, and whether that might be the problem.

The full printout looks like:

User5@User5-PC /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local
$ bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/ TestCrawl/ 2
Injecting seed URLs
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl//crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/
Injector: starting at 2015-12-23 17:48:21
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
    at org.apache.hadoop.util.Shell.run(Shell.java:418)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Error running: /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl//crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/
Failed with exit value 127.

The Hadoop log entries that I think may be related to the error are:

2016-01-07 12:24:40,360 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
    at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:432)
    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:478)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
2016-01-07 12:24:40,450 ERROR crawl.Injector - Injector: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://localhost:8983/solr
    at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    at org.apache.hadoop.fs.Path.<init>(Path.java:172)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://localhost:8983/solr
    at java.net.URI$Parser.fail(URI.java:2848)
    at java.net.URI$Parser.checkChars(URI.java:3021)
    at java.net.URI$Parser.parse(URI.java:3048)
    at java.net.URI.<init>(URI.java:746)
    at org.apache.hadoop.fs.Path.initialize(Path.java:203)
    ... 4 more

Answer1:

You are running Linux commands from Cygwin, and Linux-style paths have no C:\ drive prefix. The correct command should be something like:

/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl/crawldb /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seed.txt
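
Cygwin ships a `cygpath` utility, and `cygpath -u` performs this kind of conversion. As a rough sketch of what the conversion amounts to, the helper below rewrites a drive-letter path into the /cygdrive form. `to_cygdrive` is a hypothetical name used only for illustration, and it assumes the default `/cygdrive` prefix:

```shell
# Hypothetical helper (not part of Cygwin or Nutch): convert a Windows
# drive-letter path such as C:/foo or C:\foo to Cygwin's /cygdrive form.
# Assumes the default /cygdrive prefix; other paths pass through unchanged.
to_cygdrive() {
    case "$1" in
        [A-Za-z]:[\\/]*)
            # First character is the drive letter, lowercased.
            drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')
            # Everything after "X:" with backslashes turned into slashes.
            rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
            printf '/cygdrive/%s%s\n' "$drive" "$rest"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}
```

For example, `to_cygdrive 'C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seed.txt'` yields the seed path used in the command above. On a real Cygwin installation, `cygpath -u` is the safer choice because it respects a non-default mount prefix.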

Answer2:

The answer to your problem is in this message:

<blockquote>

2016-01-07 12:24:40,360 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

</blockquote>

This happens because the Hadoop version included with Nutch 1.11 is designed to work on Linux out of the box, not on Windows.

I had the same situation and ended up running Nutch 1.11 in an Ubuntu VirtualBox virtual machine.
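
The `null` in `null\bin\winutils.exe` suggests that Hadoop found no home directory to search; Hadoop takes it from the `hadoop.home.dir` system property or the `HADOOP_HOME` environment variable. A minimal sketch for checking this from the Cygwin shell, assuming that is the cause (`check_winutils` is a hypothetical helper, not part of Nutch or Hadoop):

```shell
# Hypothetical helper: report whether <dir>/bin/winutils.exe exists.
# $1 is the directory Hadoop would treat as its home (for example $HADOOP_HOME).
check_winutils() {
    hadoop_home=$1
    if [ -z "$hadoop_home" ]; then
        echo "HADOOP_HOME is not set"
        return 1
    fi
    if [ -f "$hadoop_home/bin/winutils.exe" ]; then
        echo "found $hadoop_home/bin/winutils.exe"
        return 0
    fi
    echo "missing $hadoop_home/bin/winutils.exe"
    return 1
}
```

Calling `check_winutils "${HADOOP_HOME:-}"` prints which of the three cases applies. Whether supplying a winutils.exe is enough to make Nutch 1.11 run on Windows is not confirmed here; the check only shows whether the condition behind the logged error is present.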

Answer3:

The hadoop-core jar file is needed when you are working with Nutch.

<ul>
<li>With Nutch 1.11, the compatible hadoop-core jar is 0.20.0.</li>
<li>Download the jar from this link: <a href="http://www.java2s.com/Code/Jar/h/Downloadhadoop0200corejar.htm" rel="nofollow">http://www.java2s.com/Code/Jar/h/Downloadhadoop0200corejar.htm</a></li>
<li>Paste that jar into the "C:\cygwin64\home\apache-nutch-1.11\lib" folder and it will run successfully.</li>
</ul>
