Architecture

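How a Solr application is deployed and run

http://wiki.apache.org/solr/SolrInstall#Setup According to this page, there seem to be two things to deploy: solr.war, and the Solr application that solr.home points to. How to deploy something like this in a Maven + SVN environment needs some careful thought.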

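Solr is not a jar library but a Java web app

Solr is not a jar library but a Java web application. Typically your system does not pull in a library; instead it communicates remotely with a separately deployed Solr webapp. Running inside a servlet container, Solr is used much like a database: it is a standalone service. If you insist on using Solr as a jar library, that is possible too. There is an officially provided "EmbeddedSolr"; see whether it suits you: http://wiki.apache.org/solr/EmbeddedSolr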
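To make the "remote communication" point concrete, here is a minimal sketch of how a client might build the HTTP URL for a search request against a standalone Solr webapp. The base URL (host, port, context path) and the class and method names are assumptions for illustration only; the sketch just builds the URL string and does not contact a server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SolrSelectUrlSketch {
    /**
     * Builds a URL for Solr's /select handler with the query passed in the
     * "q" parameter. baseUrl is an assumed deployment location, for example
     * "http://localhost:8983/solr".
     */
    static String selectUrl(String baseUrl, String query) {
        try {
            // URL-encode the query so characters such as ':' and spaces are safe
            return baseUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8983/solr", "title:lucene AND solr"));
    }
}
```

The application would then send an ordinary HTTP GET to that URL and parse the response, much as it would talk to any other remote service.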

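[Lucene] Payloads are generally used only for filtering, scoring, sorting, and the like

I had assumed that at search time I could pull out a specific payload directly and print it, but after a lot of googling there seems to be no direct API for that. Payloads may simply not be meant for this use case. Lucene in Action says: "… use it during search, either to decide which documents are included in the search results or to alter how matched documents are scored or sorted"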

Lucene: Snowball does not work well at all

package player.kent.chen.temp.lucene.stemming;

import java.io.IOException;

public class MyLuceneStemmingDemo {
    private final static String allText = "The companies organized an better activity than the individuals.";

    public static void main(String[] args) throws Exception {
        Analyzer snowball = new SnowballAnalyzer(Version.LUCENE_30, "English");
        doSearch(snowball, "company");  // no hit
        doSearch(snowball, "compani");  // hit
        doSearch(snowball, "organize"); // no hit
        doSearch(snowball, "organiz");  // no hit
        doSearch(snowball, "organ");    // hit
        doSearch(snowball, "good");     // no hit
        doSearch(snowball, "act");      // no hit
        doSearch(snowball, …


Code example: Lucene Highlighter

This uses FastVectorHighlighter, which handles large files efficiently.

<!--pom.xml-->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-fast-vector-highlighter</artifactId>
    <version>3.0.0</version>
</dependency>

package player.kent.chen.temp.lucene.highlight;

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MyHighlightIndexer {
    public static void main(String[] args) throws Exception {
        String rootDir = "/home/kent/diskD/home-kent-dev/workspace/kent-temp/data/lucene-sanguo";
        File contentDir = new File(rootDir, "content");
        …


Lucene: Query vs Filter

Query: How well does this document match the search condition? A question of score.

Filter: Does the document match the search condition or not? A question of true or false.

Filters can be used for exact matching, range queries, etc. Filtering is faster than querying because it doesn't care about scoring.
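The distinction can be illustrated with a toy model in plain Java. This is not Lucene's API: the class, the method names, and the "score = raw term count" rule are assumptions made purely for illustration. The filter-style check answers yes or no, while the query-style check also says how well the document matches.

```java
import java.util.Arrays;

class QueryVsFilterSketch {
    /** Filter-style: does the document contain the (lowercase) term at all? */
    static boolean matches(String doc, String term) {
        return Arrays.asList(doc.toLowerCase().split("\\s+")).contains(term);
    }

    /** Query-style: a toy score, here simply how often the (lowercase) term occurs. */
    static int score(String doc, String term) {
        int count = 0;
        for (String token : doc.toLowerCase().split("\\s+")) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String doc = "Lucene in Action covers Lucene";
        System.out.println(matches(doc, "lucene")); // yes/no answer
        System.out.println(score(doc, "lucene"));   // graded answer
    }
}
```

In this toy, `matches` can stop as soon as it finds the term, whereas `score` has to look at every token; this mirrors, in a very simplified way, why skipping scoring can be cheaper.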

Lucene code example: using SpanQuery to find the first occurrence of a keyword in a document

Nothing insightful here, just code to copy.

The location-info class:

package player.kent.chen.temp.lucene.span;

import org.apache.commons.lang.builder.ToStringBuilder;

public class KeywordLocation {
    private String file;

    /**
     * position in the token stream
     */
    private int position;

    private KeywordLocation() {
    }

    public static final KeywordLocation createInstance(String file, int position) {
        KeywordLocation instance = new KeywordLocation();
        instance.file = file;
        instance.position = position;
        return instance;
    }

    public String getFile() { …


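Position Increment in a Lucene Analyzer

Loosely speaking, the Position Increment is the "gap" between consecutive tokens. Normally this value is 1. For example, "Obama is a politician" is tokenized as:

    Obama      - position1
    is         - position2
    a          - position3
    politician - position4

The positions 1, 2, 3, 4 advance by 1 each time.

A Position Increment greater than 1 means that some words were omitted:

    Obama      - position1
    politician - position4

The position jumps straight from 1 to 4.

A Position Increment of 0 usually means the Analyzer has been configured with synonyms:

    Obama      - position1
    politician - position4
    statesman  - position4

politician and statesman are synonyms, so both sit at position 4.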
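The arithmetic behind these examples is just a running sum: each token's position is the previous position plus its increment. The following plain-Java sketch (not Lucene code; the class and method names are hypothetical) reproduces the last example, assuming the 1-based numbering used in this post rather than whatever numbering Lucene uses internally.

```java
import java.util.ArrayList;
import java.util.List;

class PositionIncrementSketch {
    /** Turns per-token position increments into absolute positions. */
    static List<Integer> positions(int[] increments) {
        List<Integer> result = new ArrayList<>();
        int position = 0; // start at 0 so a first increment of 1 yields position 1, as in the post
        for (int inc : increments) {
            position += inc; // an increment of 0 keeps the token at the previous position
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"Obama", "politician", "statesman"};
        int[] increments = {1, 3, 0}; // two omitted words before "politician", then a synonym
        List<Integer> pos = positions(increments);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " - position" + pos.get(i));
        }
    }
}
```

Here the increment of 3 accounts for the two dropped words "is" and "a", and the increment of 0 stacks "statesman" on the same position as "politician".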