I am new to crawling. I am following https://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website to run a crawl with Nutch 1.12. I set it up on Windows using Cygwin.

The bin/nutch command runs fine, but to do the crawl I made the following changes -

  • This is my conf/nutch-site.xml file
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
	 <name>http.agent.name</name>
	 <value>My Nutch Spider</value>
	</property>
</configuration>
  • These are the contents of the urls/seed.txt file I created

https://www.drugs.com/

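For completeness, the seed directory was created as the tutorial describes; a minimal sketch (directory and file names taken from the commands above):

```shell
# create the seed directory and seed file as in the Nutch tutorial
mkdir -p urls
echo "https://www.drugs.com/" > urls/seed.txt
```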
Now when I run the command bin/nutch inject crawl/crawldb urls, I get a NullPointerException, as shown below:

MithL@DESKTOP-K3INBH0 /home/apache-nutch-1.12
$ bin/nutch inject crawl/crawldb urls
Injector: starting at 2017-02-21 14:03:51
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
        at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
        at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
        at org.apache.nutch.crawl.Injector.run(Injector.java:467)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:441)

Please suggest what I should do. Thanks.

!UPDATE!

I added hadoop-core-1.2.1.jar to the apache-nutch-1.12/lib folder and set the HADOOP_HOME environment variable to C:\winutils\bin\winutils.exe
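For reference, the environment change above can be sketched in Cygwin as follows (paths are assumptions based on the update; note that Hadoop conventionally expects HADOOP_HOME to point at the directory that contains bin\winutils.exe, not at the executable itself):

```shell
# sketch of the environment setup described above (paths are assumptions);
# Hadoop conventionally reads HADOOP_HOME as the install directory that
# contains bin\winutils.exe, not the full path to the executable
export HADOOP_HOME='C:\winutils'
# make winutils.exe reachable from Cygwin's PATH as well
export PATH="$PATH:/cygdrive/c/winutils/bin"
```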

Now it throws an UnsupportedOperationException, as shown below:

$ bin/nutch inject crawl/crawldb urls
Injector: starting at 2017-02-21 21:37:32
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
        at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
        at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:347)
        at org.apache.nutch.crawl.Injector.run(Injector.java:467)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:441)

Please suggest what I should do. Thanks.