1. InjectorJob
(1) Basic command
$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]
$ bin/nutch inject urls
InjectorJob: starting at 2014-12-20 22:32:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:32:15, elapsed: 00:00:14
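For completeness, a minimal sketch of preparing the seed directory before the run above (illustrative commands; -crawlId 334 matches the later steps, while the run above omitted it, in which case the id comes from the storage.crawl.id property):

$ mkdir -p urls
$ echo "http://stackoverflow.com/" > urls/seed.txt
$ bin/nutch inject urls -crawlId 334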
The contents of urls/seed.txt used here are:

http://stackoverflow.com/

(2) Inspecting the injected URL
The step above creates a new table in HBase named <crawlId>_webpage (here 334_webpage), and a row for the injected URL is written into it:

hbase(main):002:0> scan '334_webpage'
ROW                       COLUMN+CELL
 com.stackoverflow:http/  column=f:fi, timestamp=1408953100271, value=\x00'\x8D\x00
 com.stackoverflow:http/  column=f:ts, timestamp=1408953100271, value=\x00\x00\x01H\x0C&\x11\x8D
 com.stackoverflow:http/  column=mk:_injmrk_, timestamp=1408953100271, value=y
 com.stackoverflow:http/  column=mk:dist, timestamp=1408953100271, value=0
 com.stackoverflow:http/  column=mtdt:_csh_, timestamp=1408953100271, value=?\x80\x00\x00
 com.stackoverflow:http/  column=s:s, timestamp=1408953100271, value=?\x80\x00\x00
1 row(s) in 0.3020 seconds

(3) About the *_webpage table
Every crawl job gets its own table named crawlId_webpage, and information about every URL, fetched or not, is stored in it. A URL that has not yet been fetched has only a few columns in its row; once it has been fetched, the fetched content (the page itself, response headers, and so on) is stored in the row as well.
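Instead of scanning the whole table, a single row can also be inspected with the HBase shell's get command — a sketch using the table and row key from the scan above; the optional third argument restricts the output to one column family:

hbase(main):003:0> get '334_webpage', 'com.stackoverflow:http/'
hbase(main):004:0> get '334_webpage', 'com.stackoverflow:http/', 'mk'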
2. GeneratorJob
(1) Basic command
[jediael@jediael local]$ bin/nutch generate -crawlId 334
GeneratorJob: starting at 2014-08-25 15:57:12
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: finished at 2014-08-25 15:57:18, time elapsed: 00:00:06
GeneratorJob: generated batch id: 1408953432-1171377744

(2) Command options
[root@jediael local]# bin/nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
   -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
   -crawlId <id>  - the id to prefix the schemas to operate on (default: storage.crawl.id)
   -noFilter      - do not activate the filter plugin to filter the url, default is true
   -noNorm        - do not activate the normalizer plugin to normalize the url, default is true
   -adddays       - Adds numDays to the current time to facilitate crawling urls already fetched sooner than db.fetch.interval.default. Default value is 0.
   -batchId       - the batch id
----------------------
Please set the params.

(3) Inspecting the database
hbase(main):005:0> scan '334_webpage'
ROW                       COLUMN+CELL
 com.stackoverflow:http/  column=f:bid, timestamp=1408953437910, value=1408953432-1171377744
 com.stackoverflow:http/  column=f:fi, timestamp=1408953100271, value=\x00'\x8D\x00
 com.stackoverflow:http/  column=f:ts, timestamp=1408953100271, value=\x00\x00\x01H\x0C&\x11\x8D
 com.stackoverflow:http/  column=mk:_gnmrk_, timestamp=1408953437910, value=1408953432-1171377744
 com.stackoverflow:http/  column=mk:_injmrk_, timestamp=1408953100271, value=y
 com.stackoverflow:http/  column=mk:dist, timestamp=1408953100271, value=0
 com.stackoverflow:http/  column=mtdt:_csh_, timestamp=1408953100271, value=?\x80\x00\x00
 com.stackoverflow:http/  column=s:s, timestamp=1408953100271, value=?\x80\x00\x00
1 row(s) in 0.0490 seconds

This step added two new columns: f:bid and mk:_gnmrk_.

3. FetcherJob
(1) Basic command
A batch is generated first, then fetched:

[jediael@jediael local]$ bin/nutch generate -crawlId 334
GeneratorJob: starting at 2014-08-25 15:57:12
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: finished at 2014-08-25 15:57:18, time elapsed: 00:00:06
GeneratorJob: generated batch id: 1408953432-1171377744
[jediael@jediael local]$ bin/nutch fetch -all -crawlId 334
FetcherJob: starting
FetcherJob: fetching all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://stackoverflow.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread1, activeThreads=8
-finishing thread FetcherThread7, activeThreads=7
-finishing thread FetcherThread6, activeThreads=6
-finishing thread FetcherThread5, activeThreads=5
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread3, activeThreads=3
-finishing thread FetcherThread2, activeThreads=2
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 102 102 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done

(2) Inspecting the database
See db1.txt. New columns include f:bas, f:cnt, f:prot, f:pts, f:st, f:ts, f:typ, h:Cache-Control, h:Connection, h:Content-Encoding, h:Content-Length, h:Content-Type, h:Date, h:Expires, h:Last-Modified, h:Set-Cookie, h:Vary, h:X-Frame-Options, and mk:_ftcmrk_.

4. ParserJob
(1) Basic command
[jediael@jediael local]$ bin/nutch parse -all -crawlId 334
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Parsing http://stackoverflow.com/
ParserJob: success

(2) Command options
[root@jediael local]# bin/nutch parse
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
   <batchId>     - symbolic batch ID created by Generator
   -crawlId <id> - the id to prefix the schemas to operate on (default: storage.crawl.id)
   -all          - consider pages from all crawl jobs
   -resume       - resume a previous incomplete job
   -force        - force re-parsing even if a page is already parsed

(3) Inspecting the database
See db_parse.txt. Many columns of the form ol:http://stackoverflow.com/help were added, one per outlink; in this example there are 115 of them.
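To check the outlinks without wading through the full row, the scan can be restricted to the ol column family — a sketch in the HBase shell:

hbase(main):006:0> scan '334_webpage', {COLUMNS => 'ol'}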
5. DbUpdaterJob
(1) Basic command
[jediael@jediael local]$ bin/nutch updatedb -crawlId 334
DbUpdaterJob: starting
DbUpdaterJob: done

(2) Inspecting the database
See db_updatedb.txt. This step processes the 115 ol: outlink columns above and creates 115 new rows, one per outlink URL. Two of them, for example (the HBase shell wraps long row keys across lines; they are joined here for readability, and note that mk:dist is now 1, i.e. one link away from the seed):

 com.stackoverflow:http/users/3944974/silviu-oncioiu  column=f:fi, timestamp=1408954979355, value=\x00'\x8D\x00
 com.stackoverflow:http/users/3944974/silviu-oncioiu  column=f:st, timestamp=1408954979355, value=\x00\x00\x00\x01
 com.stackoverflow:http/users/3944974/silviu-oncioiu  column=f:ts, timestamp=1408954979355, value=\x00\x00\x01H\x0CB\xD4\x09
 com.stackoverflow:http/users/3944974/silviu-oncioiu  column=mk:dist, timestamp=1408954979355, value=1
 com.stackoverflow:http/users/3944974/silviu-oncioiu  column=mtdt:_csh_, timestamp=1408954979355, value=<\x0Ex5
 com.stackoverflow:http/users/3944974/silviu-oncioiu  column=s:s, timestamp=1408954979355, value=<\x0Ex5
 com.stackoverflow:http/users/3974525/laosi           column=f:fi, timestamp=1408954979355, value=\x00'\x8D\x00
 com.stackoverflow:http/users/3974525/laosi           column=f:st, timestamp=1408954979355, value=\x00\x00\x00\x01
 com.stackoverflow:http/users/3974525/laosi           column=f:ts, timestamp=1408954979355, value=\x00\x00\x01H\x0CB\xD4\x09
 com.stackoverflow:http/users/3974525/laosi           column=mk:dist, timestamp=1408954979355, value=1
 com.stackoverflow:http/users/3974525/laosi           column=mtdt:_csh_, timestamp=1408954979355, value=<\x0Ex5
 com.stackoverflow:http/users/3974525/laosi           column=s:s, timestamp=1408954979355, value=<\x0Ex5

At this point the data is ready and waiting for the next round of crawling.

6. SolrIndexerJob
(1) Basic command
[jediael@jediael local]$ bin/nutch solrindex http://****/solr/ -all -crawlId 334
SolrIndexerJob: starting
Adding 1 documents
SolrIndexerJob: done.

(2) Command options
[root@jediael local]# bin/nutch solrindex
Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]

(3) Inspecting the database
No changes.
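Taken together, generate, fetch, parse, and updatedb make up one crawl round, with solrindex run afterwards. A sketch of scripting several rounds (the Solr URL, -topN value, and round count are placeholders, not from the original walkthrough):

#!/bin/bash
CRAWL_ID=334
SOLR_URL=http://localhost:8983/solr/   # placeholder; replace with your Solr instance

for round in 1 2 3; do
    bin/nutch generate -topN 1000 -crawlId $CRAWL_ID
    bin/nutch fetch -all -crawlId $CRAWL_ID
    bin/nutch parse -all -crawlId $CRAWL_ID
    bin/nutch updatedb -crawlId $CRAWL_ID
done

bin/nutch solrindex $SOLR_URL -all -crawlId $CRAWL_ID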
Reposted from: https://www.cnblogs.com/jinhong-lu/p/4559392.html