Lucene Search Text Search Engine

Posted on September 16, 2016 by Somen Sarkar, in Javascript

Introduction to Lucene API

Lucene is a text search engine library written in Java.
Its two main jobs are indexing and searching.
When searching, it calculates a relevance score for each matching document and returns the results ranked by that score, highest first.
Documents can be searched with a phrase, with wildcard characters, or with a range query.
Lucene is designed for high performance and scalability.
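
To make the idea of score-based ranking concrete, here is a minimal plain-Java sketch of an inverted index that ranks documents by how often a term occurs in them. It does not use Lucene at all: the class name ToyIndex and its methods are hypothetical, and the raw term count stands in for Lucene's real scoring, which is considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy inverted index (illustration only, not Lucene): term -> documents containing it. */
class ToyIndex {
    // term -> (document id -> number of times the term occurs in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Indexes one document and returns the id assigned to it. */
    int add(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue; // split() yields an empty first element if the text starts with punctuation
            }
            postings.computeIfAbsent(term, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    /** Returns the ids of documents containing the term, highest term count first. */
    List<Integer> search(String term) {
        Map<Integer, Integer> hits =
                postings.getOrDefault(term.toLowerCase(Locale.ROOT), new HashMap<>());
        List<Integer> ranked = new ArrayList<>(hits.keySet());
        // Sort by descending count; break ties by ascending document id
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(hits.get(b), hits.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("Lucene is a search library");     // document 0
        index.add("Search, search and more search"); // document 1
        index.add("Nothing relevant here");          // document 2
        System.out.println(index.search("search"));  // prints [1, 0]
    }
}
```

Document 1 is listed first because the term occurs in it three times, against once in document 0. Document 2 does not contain the term and is not returned.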

Some of the important classes in the Lucene API are:

Analyzer: The main task of this abstract class is to extract tokens from text for indexing.
StandardAnalyzer and SimpleAnalyzer are two of its implementations.

IndexWriter: The IndexWriter creates the index and writes it to files inside the directory that is passed to it as a parameter.

Field: A document is broken down into multiple fields. Each field is a named attribute that holds a value.
Document: This class contains a collection of Fields and represents a virtual document. When looping through a directory whose files have to be indexed, we construct the Field objects for a single file and add them to a Document object. This Document is then passed to the IndexWriter, which creates the index entries for that file.
IndexReader: This class opens an existing index from the index directory and gives the IndexSearcher access to it when searching for a user-provided search phrase.
QueryParser: The query parser takes the user's search text as input and parses it against a default field (the file contents in the example code).

Query: QueryParser returns a Query object after parsing the query string.

To execute the query and find the actual document contents, we pass the Query object to the IndexSearcher.search() method.

TopDocs: This object is returned from the IndexSearcher.search() method. It contains the collection of matching documents (scoreDocs).
ScoreDoc: Each ScoreDoc holds the id and the score of one matching document. The id can be passed to IndexSearcher.doc() to load the stored Document.
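
Of the classes above, the Analyzer is the one whose effect is hardest to picture, so here is a rough plain-Java approximation of what a letter-based, lower-casing analyzer does to a piece of text. It is only a sketch: ToyAnalyzer is a hypothetical name, no Lucene code is involved, and real analyzers work on token streams and may, depending on the implementation, also keep digits or drop stop words.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java approximation of letter-based, lower-casing tokenization (not Lucene code). */
class ToyAnalyzer {

    /** Splits the text into maximal runs of letters and lower-cases each run. */
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token
                result.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            result.add(current.toString()); // flush the last token
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Lucene 6.1.0 indexes E:/samples/documents!"));
        // prints [lucene, indexes, e, samples, documents]
    }
}
```

Because the tokens stored in the index are whatever the analyzer produced, the search side has to tokenize the query text in a compatible way, otherwise query terms may not match the indexed terms.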
The pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.helical.lucene</groupId>
    <artifactId>helical.lucene.search</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>6.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queries</artifactId>
            <version>6.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>6.1.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>6.1.0</version>
        </dependency>
    </dependencies>


</project>

LuceneIndexingAndSearchingExamples.java

import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.*;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Scanner;

public class LuceneIndexingAndSearchingExamples {

    private static final String indexStoreDirectory = "E:/search/indexes";
    private static final String directoryToIndex = "E:/samples/documents";

    public static void main(String[] args) {
        // main is static, so the static helpers are called directly rather than through "this"
        indexDirectory();
        Scanner scanner = new Scanner(System.in);
        System.out.print("Enter your search phrase\n");
        String searchText = scanner.nextLine();
        searchFromIndexes(searchText);
    }

    private static void indexDirectory() {
        try {
            Directory directory = FSDirectory.open(Paths.get(indexStoreDirectory));
            // Index with the same analyzer that is used for searching, so query terms match indexed terms
            IndexWriterConfig configuration = new IndexWriterConfig(new StandardAnalyzer());
            IndexWriter indexWriter = new IndexWriter(directory, configuration);

            // Delete any existing index entries before re-indexing
            indexWriter.deleteAll();
            File directoryToIndexFile = new File(directoryToIndex);
            processEachFile(indexWriter, directoryToIndexFile);
            indexWriter.close();
            directory.close();
        } catch (FileNotFoundException fne) {
            System.out.println("The sample folder is not found");
        } catch (IOException ioe) {
            System.out.println("There was some problem in i/o operation");
        }
    }

    private static void processEachFile(IndexWriter indexWriter, File directoryToIndexFile) throws IOException {

        File[] files = directoryToIndexFile.listFiles();
        if (files == null) {
            // listFiles() returns null when the path is not a directory or cannot be read
            throw new RuntimeException("Directory " + directoryToIndex + " is not a directory or could not be read");
        }
        for (File eachFile : files) {
            // Skip sub-directories; only regular files can be read and indexed
            if (!eachFile.isFile()) {
                continue;
            }
            Document virtualDocument = new Document();
            virtualDocument.add(new TextField("path", eachFile.getName(), Store.YES));
            StringBuilder builder = getContents(eachFile);
            virtualDocument.add(new TextField("contents", builder.toString(), Store.YES));
            indexWriter.addDocument(virtualDocument);
        }
    }

    private static StringBuilder getContents(File eachFile) throws IOException {
        StringBuilder builder = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(eachFile)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                builder.append(line).append("\n");
            }
        }
        return builder;
    }

    private static void searchFromIndexes(String text) {
        Path path = Paths.get(indexStoreDirectory);
        // try-with-resources closes the reader and the directory once the search is done
        try (Directory directory = FSDirectory.open(path);
             IndexReader indexReader = DirectoryReader.open(directory)) {
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer());
            Query query = queryParser.parse(text);
            // Ask for the 10 highest-scoring documents
            TopDocs topDocs = indexSearcher.search(query, 10);
            printOutput(indexSearcher, topDocs);
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

    private static void printOutput(IndexSearcher indexSearcher, TopDocs topDocs) throws IOException {
        System.out.println("Total Matches found " + topDocs.totalHits);
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            Document virtualDocument = indexSearcher.doc(scoreDoc.doc);
            System.out.println("Path of document : " + virtualDocument.get("path"));
            System.out.println("\n=================\nContents \n=================\n" + virtualDocument.get("contents"));
        }
    }
}

ref:
Apache Lucene 5.1.0 indexing and searching java example
