Apache Solr


DAPI. Information Description, Storage and Retrieval Course
MIEIC, 2020/21 Edition
Sérgio Nunes
DEI, FEUP, U.Porto
Work in progress

Plan for Today
- Questions?
- Groups Presentations (90 min)
- Break
- Milestone #2 Overview
- Solr overview

Milestone #2: Information Retrieval

Milestone #2
Goal: index the dataset to support free-text querying.
Use open-source tools (e.g. Solr); decide on the document granularity; decide on the search filters.
Expected actions:
- Choose the information retrieval tool (Solr, Lucene, Terrier, Elasticsearch, ...);
- Analyze the documents and identify their indexable components;
- Identify search parameters that will be offered to the users;
- Use the tool API to generate indexes;
- Use the tool API to configure the answer to queries;
- Demonstrate the indexing and retrieval processes;
- Evaluate the results, (ideally) comparing different ranking formulas.
More information at https://web.fe.up.pt/~ssn/dokuwiki/teach/dapi/202021/delivery2/index
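One way to approach the last action, comparing ranking formulas, is to score each ranking against a set of documents judged relevant for a query. The sketch below uses precision at k and average precision, two common measures. The slides do not prescribe them, and the document ids and relevance judgments are made up for illustration.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    top = ranking[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values taken at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical judgments and two hypothetical rankings for the same query.
relevant = {"d1", "d3"}
ranking_a = ["d1", "d3", "d2", "d4"]  # e.g. produced by one ranking formula
ranking_b = ["d2", "d1", "d4", "d3"]  # e.g. produced by another

for name, ranking in (("A", ranking_a), ("B", ranking_b)):
    print(name, precision_at_k(ranking, relevant, 2), average_precision(ranking, relevant))
```

Averaging these values over several queries gives a single number per ranking formula, which makes the comparison easier to report.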

Search Engine Overview

Architecture of a Search Engine
Two primary goals of a search engine:
- effectiveness (quality): retrieve the most relevant set of documents;
- efficiency (speed): present the results as quickly as possible.
Search engines are architected to support two major functions:
- indexing process: build the structures that enable search;
- querying process: use the structures to produce a ranking.

The Indexing Process

Blocks of the Indexing Process
- Text Acquisition: crawler; conversion; document data store.
- Text Transformation: parser; stopping; stemming; link extraction; information extraction; classifier.
- Index Creation: document statistics; weighting; inversion; index distribution.
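The first three text transformation steps (parser, stopping, stemming) can be illustrated with a minimal sketch. The stopword list and the suffix-stripping rule are simplifications assumed for this example. A real engine would use a full stopword list and a proper stemmer such as Porter's.

```python
import re

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "for"}


def parse(text):
    """Parser: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stop(tokens):
    """Stopping: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Stemming: naive suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def transform(text):
    """Run the three steps in order and return the index terms."""
    return [stem(t) for t in stop(parse(text))]


print(transform("The indexing of documents"))
```

The resulting terms are what the index creation block would then count, weight, and invert.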

The Querying Process

Blocks of the Querying Process
- User Interaction: query input; query transformation; results output.
- Ranking: scoring; performance optimization; distribution.
- Evaluation: logging; ranking analysis; performance analysis.
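As a rough illustration of the scoring step, the sketch below ranks a handful of documents against a query using a plain TF-IDF sum. This formula is an assumption chosen for brevity, not the scoring function of any particular engine, and the documents are made up.

```python
import math
from collections import Counter


def rank(docs, query):
    """Return (doc_id, score) pairs sorted by decreasing TF-IDF score."""
    n = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    query_terms = query.lower().split()
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = sum(tf[t] * math.log(n / df[t]) for t in query_terms if t in tf)
        if score > 0:
            scores[doc_id] = score

    # Highest score first; ties broken by document id for a stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical mini collection.
docs = {
    "d1": "solr search server",
    "d2": "search search engine",
    "d3": "java library",
}
print(rank(docs, "search engine"))
```

Documents that match no query term are left out of the result, and a term that occurs in every document contributes nothing because its logarithm is zero.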

Apache Solr

Apache Solr
Solr is a search server built on top of Apache Lucene, an open source, Java-based information retrieval library.
Standard steps:
- Define the schema, to tell Solr about the contents of documents it will be indexing;
- Feed Solr documents for which your users will search;
- Expose search functionality in your application.
Solr offers support for the simplest keyword searching through to complex queries on multiple fields and faceted search results.
Because Solr is based on open standards, it is highly extensible. Solr queries are simple HTTP request URLs and the response is a structured document: mainly JSON, but it could also be XML, CSV, or other formats.
From: Apache Solr Reference Guide. https://lucene.apache.org/solr/guide
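To make the "HTTP request URL in, JSON document out" idea concrete, here is a minimal sketch that builds a query URL for Solr's select handler and reads the document list out of a JSON response. The host and port assume a default local installation. The core name `courses`, the field names, and the sample response are hypothetical, and nothing is sent over the network.

```python
import json
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumption: default local install
CORE = "courses"  # hypothetical core name


def select_url(query, fields=None, rows=10):
    """Build a URL for the select handler of the assumed core."""
    params = {"q": query, "rows": rows, "wt": "json"}
    if fields:
        params["fl"] = ",".join(fields)  # fl limits the fields returned
    return f"{SOLR_BASE}/{CORE}/select?{urlencode(params)}"


def titles(response_text):
    """Extract the 'title' field of each document in a JSON response."""
    body = json.loads(response_text)
    return [doc.get("title") for doc in body["response"]["docs"]]


# Hypothetical response, shaped like Solr's standard JSON output.
sample_response = json.dumps({
    "responseHeader": {"status": 0, "QTime": 1},
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"id": "1", "title": "Solr in Action"},
            {"id": "2", "title": "Search Engines"},
        ],
    },
})

print(select_url("title:solr", fields=["id", "title"]))
print(titles(sample_response))
```

Against a running Solr instance, the URL could be fetched with `urllib.request.urlopen` and the response body passed to `titles`.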

Example Solr Integration

Solr Features
From: Trey Grainger and Timothy Potter. Solr in Action, Manning Publications, 2014.

The Inverted Index (again)
From: Trey Grainger and Timothy Potter. Solr in Action, Manning Publications, 2014.
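A minimal sketch of the idea: an inverted index maps each term to the set of documents that contain it, so a query is answered by looking up terms rather than scanning documents. The three documents are made up, and whitespace tokenization is a simplification. This is not how Lucene stores its index on disk.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical mini collection.
docs = {
    1: "new home sales",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_inverted_index(docs)
print(search_all(index, "home july"))
```

Replacing the intersection with a union would give boolean OR semantics instead.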

Solr Command Line Tool

Solr Admin Console

Tasks
- Finish and submit Milestone #1 report.
- Review goals and organize work for Milestone #2.
- Experiment with full-text indexing tools.
- Apache Solr Tutorial: .html
- Experiment with other collections (e.g. project, personal documents, etc).
- Anticipate indexing and search tasks on the working dataset.
- Next week: finish and submit Milestone #1 report.

References
- Apache. Solr Tutorial. .html
- Apache Solr Reference Guide. https://lucene.apache.org/solr/guide/
- Trey Grainger and Timothy Potter. Solr in Action, Manning Publications, 2014.
- W. Bruce Croft, Donald Metzler, Trevor Strohman. Search Engines: Information Retrieval in Practice, Pearson, 2009. http://ciir.cs.umass.edu/downloads/SEIRiP.pdf
