Publication Date

Spring 2015

Document Type

Project Summary

Degree Name

Master of Science

Department

Computer Science

First Advisor

Soon-Ok Park, Ph.D.

Second Advisor

Stephen Hyzny, M.S.

Third Advisor

Michael Kelly, M.S.

Abstract

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.

Databases and Solr have complementary strengths and weaknesses. SQL supports very simple wildcard-based text search with some simple normalization like matching upper case to lower case. The problem is that these are full table scans. In Solr all searchable words are stored in an "inverse index", which searches orders of magnitude faster.

Solr is a standalone/cloud enterprise search server with a REST-like API. You put documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP. You query it via HTTP GET and receive XML, JSON, CSV or binary results. The project will be implemented using Amazon/Azure cloud, Apache Solr, Windows/Linux, MS-SQL server and open source tools.

Comments

Student ID number redacted by OPUS staff.

Share

COinS