What is Nutch?
Nutch is an effort to build a Free and Open Source search engine. It uses Lucene for its search and index components. The fetcher (robot) was written from scratch solely for this project.
Nutch has a highly modular architecture allowing developers to create plug-ins for activities such as media-type parsing, data retrieval, querying and clustering.
Doug Cutting is the lead developer of Nutch.
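To make the plug-in idea concrete, here is a minimal sketch of how media-type parsing could be made pluggable. It is not Nutch's actual plug-in API: the names MediaTypeParser, ParserRegistry, HtmlParser and PlainTextParser are hypothetical and exist only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical names throughout; this is NOT Nutch's real plug-in API.
public class PluginSketch {

    /** A plug-in that turns raw content of one media type into plain text. */
    interface MediaTypeParser {
        String mediaType();
        String parse(String rawContent);
    }

    /** Registry that selects a parser plug-in by media type. */
    static class ParserRegistry {
        private final Map<String, MediaTypeParser> parsers = new LinkedHashMap<>();

        void register(MediaTypeParser parser) {
            parsers.put(parser.mediaType(), parser);
        }

        /** Returns the extracted text, or empty if no plug-in handles the type. */
        Optional<String> parse(String mediaType, String rawContent) {
            MediaTypeParser parser = parsers.get(mediaType);
            return parser == null ? Optional.empty() : Optional.of(parser.parse(rawContent));
        }
    }

    /** Toy HTML parser: strips tags with a regex (not robust, illustration only). */
    static class HtmlParser implements MediaTypeParser {
        public String mediaType() { return "text/html"; }
        public String parse(String rawContent) {
            return rawContent.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
        }
    }

    /** Plain text needs no conversion beyond trimming. */
    static class PlainTextParser implements MediaTypeParser {
        public String mediaType() { return "text/plain"; }
        public String parse(String rawContent) { return rawContent.trim(); }
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.register(new HtmlParser());
        registry.register(new PlainTextParser());

        System.out.println(registry.parse("text/html", "<p>Hello <b>Nutch</b></p>").orElse("(no parser)"));
        System.out.println(registry.parse("application/pdf", "%PDF-1.4").orElse("(no parser)"));
    }
}
```

Supporting a new format in this sketch means registering one more MediaTypeParser; the registry and its callers stay unchanged, which is the benefit a modular architecture is meant to give.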
What is Lucene?
Lucene is a Free and Open Source search and index API from the Apache Software Foundation. It is written in Java and released under the Apache License.
Lucene is just the core of a search engine. As such, it does not include things like a web spider or parsers for different document formats. Instead these things need to be added by a developer who uses Lucene.
Lucene does not care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information.
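The point that only extracted text matters can be illustrated with a toy inverted index. This is a simplified sketch of the general technique, not Lucene's API or implementation; the class name TinyIndex, its methods, and the example sources are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index; illustrative only and NOT Lucene's API.
public class TinyIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> where the text came from (URL, file path, ...)
    private final List<String> sources = new ArrayList<>();

    /** Indexes already-extracted text; the index never sees the original format. */
    public int add(String source, String text) {
        int id = sources.size();
        sources.add(source);
        // Split on anything that is not a letter or digit, in any script.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    /** Returns the sources of all documents containing the given single term. */
    public List<String> search(String term) {
        Set<Integer> ids = postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
        List<String> hits = new ArrayList<>();
        for (int id : ids) {
            hits.add(sources.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        // Text from a web page and from a local file is treated identically.
        index.add("http://example.com/page.html", "Nutch is a search engine");
        index.add("/home/user/notes.txt", "Lucene is a search and index API");

        System.out.println(index.search("search"));
        System.out.println(index.search("lucene"));
    }
}
```

In this sketch the caller is responsible for fetching the document and converting it to text before calling add, which mirrors the division of labour described above: the spider and format parsers live outside the index.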
Lucene has been ported or is in the process of being ported to various programming languages other than Java:
Lucene4c - C
CLucene - C++
MUTIS - Delphi
NLucene - .NET
DotLucene - .NET
Plucene - Perl
PyLucene - Python
Ferret and RubyLucene - Ruby
More ....
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine
Wednesday, December 26, 2007