Norconex HTTP Collector
Developer(s) | Norconex Inc.
---|---
Stable release | 2.x
Development status | Active
Written in | Java
Operating system | Cross-platform
Type | Web crawler
License | Apache
Website | www
Norconex HTTP Collector is a web spider, or crawler, initially created for Enterprise Search integrators and developers. It began as a closed-source project developed by Norconex and was released as open source in 2013.[1][2][3][4][5]
Architecture
Norconex HTTP Collector is built entirely in Java. A single Collector installation is responsible for launching one or more crawler threads, each with its own configuration.
Each step of the crawler life-cycle is configurable and overridable. Developers can provide their own interface implementations for most steps undertaken by the crawler. The default implementations cover a wide range of crawling use cases and are built on stable products such as Apache Tika and Apache Derby. The following figure is a high-level representation of a URL life-cycle from the crawler's perspective.
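As an illustration, a default step implementation can be swapped for a custom one directly in the XML configuration through a `class` attribute. This is a sketch; `com.example.MyReferenceFilter` is a hypothetical class name, not part of the product:

```xml
<referenceFilters>
  <!-- "com.example.MyReferenceFilter" is a hypothetical custom
       implementation of the reference-filtering step, replacing
       the default filter classes shipped with the Collector. -->
  <filter class="com.example.MyReferenceFilter" onMatch="include" />
</referenceFilters>
```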
The Importer and Committer modules are separate Apache-licensed Java libraries distributed with the Collector.
The Importer module parses incoming documents from their raw form (HTML, PDF, Word, etc.) into a set of extracted metadata and plain-text content. In addition, it provides interfaces to manipulate a document's metadata, transform its content, or simply filter documents based on their new format. While the Collector depends heavily on the Importer module, the latter can be used on its own as a general-purpose document parser.
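For instance, an Importer configuration can chain handlers that run after parsing. The `KeepOnlyTagger` fragment below (taken from the sample configuration later in this article) reduces each document's extracted metadata to a few fields:

```xml
<importer>
  <postParseHandlers>
    <!-- Keep only a handful of metadata fields after parsing. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
            fields="title,keywords,description,document.reference"/>
  </postParseHandlers>
</importer>
```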
The Committer module is responsible for directing the parsed data to a target repository of choice. Developers can write custom implementations, allowing Norconex HTTP Collector to be used with any search engine or repository. Two committer implementations currently exist, for Apache Solr and Elasticsearch.
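A committer is selected in the crawler configuration by its class name. The `FileSystemCommitter` below ships with the Committer core library and simply writes crawled data to disk; a Solr or Elasticsearch committer would be referenced the same way:

```xml
<!-- Write parsed documents to a local directory instead of a
     search engine; useful for testing a crawl end to end. -->
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
  <directory>./examples-output/minimum/crawledFiles</directory>
</committer>
```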
Minimum Requirements
Java Standard Edition 7.0 or higher is required. The Collector runs on any platform supporting Java.
Configuration
While Norconex HTTP Collector can be configured programmatically, it also supports XML configuration files. Apache Velocity is used to parse the configuration files; Velocity directives permit configuration re-use across different Collector installations, as well as variable substitution.
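For example, a Velocity-style reference lets a value be defined once and reused throughout the file. This is a sketch; `$workdir` is an illustrative variable name, and the snippet is a Velocity template rather than plain XML:

```xml
#set($workdir = "./examples-output/minimum")
<httpcollector id="Variables Example">
  <!-- Velocity substitutes ${workdir} before the XML is parsed. -->
  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>
</httpcollector>
```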
<httpcollector id="Minimum Config HTTP Collector">

  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">
      <startURLs>
        <url>http://www.norconex.com/product/collector-http-test/minimum.php</url>
      </startURLs>
      <workDir>./examples-output/minimum</workDir>
      <maxDepth>0</maxDepth>
      <delay default="5000" />
      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
                onMatch="include">
          http://www\.norconex\.com/product/collector-http-test/.*
        </filter>
      </referenceFilters>
      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
                  fields="title,keywords,description,document.reference"/>
        </postParseHandlers>
      </importer>
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./examples-output/minimum/crawledFiles</directory>
      </committer>
    </crawler>
  </crawlers>

</httpcollector>
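The regular expression used by the `referenceFilters` entry above can be sanity-checked on its own with plain `java.util.regex`. This standalone sketch is independent of the Collector API and only demonstrates which URLs the pattern accepts:

```java
import java.util.regex.Pattern;

public class FilterCheck {

    // Same pattern as the <referenceFilters> entry in the sample configuration.
    static final Pattern FILTER = Pattern.compile(
            "http://www\\.norconex\\.com/product/collector-http-test/.*");

    // A URL is kept only if the whole reference matches the pattern.
    static boolean accept(String url) {
        return FILTER.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(accept(
                "http://www.norconex.com/product/collector-http-test/minimum.php")); // true
        System.out.println(accept("http://example.com/other.html")); // false
    }
}
```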