Searching File Contents with Lucene 7.1.0

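Using Lucene comes down to two steps:

  1. Create the index: use IndexWriter to index the files and save the result in the index storage location.

  2. Search the index for documents related to a keyword.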

First, define an analyzer.

Analyzer analyzer = new IKAnalyzer(true);

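Note the differences among the available analyzers; see http://blog.csdn.net/silentmuh/article/details/78451786 for details.

Take the sentence “我爱我们的中国！” (“I love our China!”). To index it, the text has to be split into terms, the stop word “的” dropped, and keywords such as “我”, “我们” and “中国” extracted. That is the job of the Analyzer. The snippet above uses IKAnalyzer, which targets Chinese; StandardAnalyzer is Lucene's general-purpose analyzer, and paoding is another option aimed specifically at Chinese.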
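To make the splitting and stop-word steps concrete, here is a minimal sketch of a dictionary-based segmenter. It is a toy forward-maximum-matching loop, not how IKAnalyzer works internally; the class name `ToySegmenter`, the tiny dictionary and the stop-word list are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy segmenter for illustration only; real analyzers are far more elaborate. */
final class ToySegmenter {
    // Hypothetical mini dictionary and stop-word list.
    private static final Set<String> DICTIONARY =
            new HashSet<>(Arrays.asList("我", "爱", "我们", "的", "中国"));
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("的"));
    // Longest dictionary entry, so the matcher knows how far to look ahead.
    private static final int MAX_WORD_LENGTH =
            DICTIONARY.stream().mapToInt(String::length).max().orElse(1);

    /** Splits text into dictionary words (longest match first), then drops stop words. */
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int step = 1; // when nothing matches (e.g. punctuation), skip one character
            for (int len = Math.min(MAX_WORD_LENGTH, text.length() - pos); len >= 1; len--) {
                String candidate = text.substring(pos, pos + len);
                if (DICTIONARY.contains(candidate)) {
                    tokens.add(candidate);
                    step = len;
                    break;
                }
            }
            pos += step;
        }
        tokens.removeAll(STOP_WORDS); // stop-word filtering happens after tokenization
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(segment("我爱我们的中国！")); // prints [我, 爱, 我们, 中国]
    }
}
```

The two stages (tokenize, then filter) mirror the way an analyzer chains a tokenizer with token filters, which is why swapping the analyzer changes which keywords end up in the index.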

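Step 2: choose where the index is stored. Lucene offers two options, on local disk (FSDirectory) or in memory (RAMDirectory). This article uses local disk storage:

Directory directory = FSDirectory.open(FileSystems.getDefault().getPath(INDEX_DIR));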

Step 3: create the IndexWriter that writes the index files.

IndexWriterConfig config = new IndexWriterConfig(analyzer);

IndexWriter indexWriter = new IndexWriter(directory, config);

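Step 4: extract the content and add it to the index.

Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
indexWriter.addDocument(doc);
indexWriter.close();

  Line 1 creates a Document object, which is comparable to a row in a database table.

  Line 2 is the string to be indexed.

  Line 3 adds the string as a field named "fieldname" (comparable to a column name). Because TextField.TYPE_STORED is used, the original text is stored as well as indexed; to index without storing, use a different field type (see the official documentation).

  Line 4 adds the document to the index being built.

  Line 5 closes the IndexWriter, which commits the changes.

That completes index creation.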
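Conceptually, what the indexing steps build is an inverted index: a map from each term to the documents that contain it. The sketch below is a toy in-memory version of that idea, not Lucene's actual data structures or file format. The class name `ToyInvertedIndex`, the lowercasing and the split on non-word characters are assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index: term -> ids of the documents containing it. */
final class ToyInvertedIndex {
    private final List<String> storedTexts = new ArrayList<>();
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Stores the text, indexes its terms and returns the new document id. */
    int addDocument(String text) {
        int docId = storedTexts.size();
        storedTexts.add(text); // the "stored" part: keeps the original text retrievable
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Returns the ids of all documents containing the term (empty if none). */
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySortedSet());
    }

    /** Returns the original text of a document. */
    String stored(int docId) {
        return storedTexts.get(docId);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("This is the text to be indexed.");
        for (int docId : index.search("text")) {
            System.out.println(docId + ": " + index.stored(docId));
        }
    }
}
```

Looking up a term is a single map access rather than a scan over every document, which is the reason for building the index up front.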

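Searching the index for documents related to a keyword:

  Step 1: open the index location.

DirectoryReader ireader = DirectoryReader.open(directory);

  Step 2: create the searcher.

IndexSearcher isearcher = new IndexSearcher(ireader);

  Step 3: run a keyword query, much like a SQL query.

QueryParser parser = new QueryParser("fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
assertEquals(1, hits.length);
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}

  Here we create a query parser and give it the analyzer and the field to search, "fieldname" (comparable to a column name). In Lucene 7 the QueryParser constructor no longer takes a Version argument, and search takes the query and the maximum number of hits. The search returns a collection of hits, similar to a SQL ResultSet, from which the stored content can be read.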
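A search returns a ranked top-N list, so the number of hits handed back can be smaller than the number of documents that matched. The sketch below illustrates that distinction with a toy ranking (score = number of keyword occurrences). It is not Lucene's scoring model, and `ToyTopDocs`, `Hit` and `countOccurrences` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy top-N search result: total match count plus the n best hits. */
final class ToyTopDocs {
    /** One hit: document position plus a score (here: keyword occurrences). */
    static final class Hit {
        final int doc;
        final int score;

        Hit(int doc, int score) {
            this.doc = doc;
            this.score = score;
        }
    }

    final int totalHits;   // every document that matched
    final List<Hit> hits;  // at most n of them, best first

    private ToyTopDocs(int totalHits, List<Hit> hits) {
        this.totalHits = totalHits;
        this.hits = hits;
    }

    static ToyTopDocs search(List<String> docs, String keyword, int n) {
        List<Hit> all = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            int count = countOccurrences(docs.get(i), keyword);
            if (count > 0) {
                all.add(new Hit(i, count));
            }
        }
        // Highest score first; ties are broken by document order.
        all.sort(Comparator.comparingInt((Hit h) -> h.score).reversed()
                .thenComparingInt(h -> h.doc));
        List<Hit> top = new ArrayList<>(all.subList(0, Math.min(n, all.size())));
        return new ToyTopDocs(all.size(), top);
    }

    private static int countOccurrences(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = text.indexOf(keyword);
        while (from >= 0) {
            count++;
            from = text.indexOf(keyword, from + keyword.length());
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "text text text", "no match", "some text", "text and text");
        ToyTopDocs top = search(docs, "text", 2);
        // prints "3 matches, 2 returned"
        System.out.println(top.totalHits + " matches, " + top.hits.size() + " returned");
    }
}
```

In the same way, asking the searcher for at most 1000 hits bounds the length of the returned array, not the number of documents that match the query.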

  For the other query types, see the official manual or the recommended slides.

  Step 4: close the reader and the directory.

ireader.close();

directory.close();

Finally, here is a simple example I wrote. It indexes the contents of a folder, finds the files that match a keyword, and prints the matching content.

 package muh.test;

 import java.io.BufferedReader;

 import java.io.File;

 import java.io.FileInputStream;

 import java.io.FileNotFoundException;

 import java.io.FileReader;

 import java.io.FilenameFilter;

 import java.io.IOException;

 import java.io.InputStreamReader;

 import java.nio.file.FileSystems;

 import java.util.ArrayList;

 import java.util.Date;

 import java.util.List;

 import org.apache.lucene.analysis.Analyzer;

 import org.apache.lucene.analysis.core.SimpleAnalyzer;

 import org.apache.lucene.analysis.standard.StandardAnalyzer;

 import org.apache.lucene.document.Document;

 import org.apache.lucene.document.Field;

 import org.apache.lucene.document.TextField;

 import org.apache.lucene.index.DirectoryReader;

 import org.apache.lucene.index.IndexReader;

 import org.apache.lucene.index.IndexWriter;

 import org.apache.lucene.index.IndexWriterConfig;

 import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;

 import org.apache.lucene.queryparser.classic.ParseException;

 import org.apache.lucene.queryparser.classic.QueryParser;

 import org.apache.lucene.search.BooleanClause;

 import org.apache.lucene.search.IndexSearcher;

 import org.apache.lucene.search.Query;

 import org.apache.lucene.search.ScoreDoc;

 import org.apache.lucene.search.TopDocs;

 import org.apache.lucene.search.TopScoreDocCollector;

 import org.apache.lucene.search.highlight.Highlighter;

 import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;

 import org.apache.lucene.search.highlight.QueryScorer;

 import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

 import org.apache.lucene.search.highlight.SimpleSpanFragmenter;

 import org.apache.lucene.store.Directory;

 import org.apache.lucene.store.FSDirectory;

 import org.apache.lucene.util.Version;

 import org.apache.poi.hssf.usermodel.HSSFCell;

 import org.apache.poi.hssf.usermodel.HSSFRow;

 import org.apache.poi.hssf.usermodel.HSSFSheet;

 import org.apache.poi.hssf.usermodel.HSSFWorkbook;

 import org.apache.poi.hwpf.HWPFDocument;

 import org.apache.poi.hwpf.usermodel.Range;

 import org.apache.poi.xssf.usermodel.XSSFCell;

 import org.apache.poi.xssf.usermodel.XSSFRow;

 import org.apache.poi.xssf.usermodel.XSSFSheet;

 import org.apache.poi.xssf.usermodel.XSSFWorkbook;

 import org.apache.poi.xwpf.extractor.XWPFWordExtractor;

 import org.apache.poi.xwpf.usermodel.XWPFDocument;

 import org.pdfbox.pdfparser.PDFParser;

 import org.pdfbox.pdmodel.PDDocument;

 import org.pdfbox.util.PDFTextStripper;

 import org.wltea.analyzer.lucene.IKAnalyzer;

 public class LuceneTest {

     private static String INDEX_DIR = "E:\\luceneIndex";

     private static String Source_DIR = "E:\\luceneSource";

     /**
      * Lists all files under a path, including subfolders; if the path is itself a file, returns just that file.
      * @Title: listAllFiles
      * @author hegg
      * @date 2017-11-06 20:28:54
      * @param filePath the path to traverse
      * @param fileNameFilter the file name filter
      * @return the files found
      */

     public static List<File> listAllFiles(String filePath, FilenameFilter fileNameFilter) {

         List<File> files = new ArrayList<File>();

         try {

             File root = new File(filePath);

             if (!root.exists())

                 return files;

             if (root.isFile())

                 files.add(root);

             else {

                 for (File file : root.listFiles(fileNameFilter)) {

                     if (file.isFile())

                         files.add(file);

                     else if (file.isDirectory()) {

                         files.addAll(listAllFiles(file.getAbsolutePath(), fileNameFilter));

                     }

                 }

             }

         } catch (Exception e) {

             e.printStackTrace();

         }

         return files;

     }

     /**
      * Deletes a directory and everything under it.
      * @Title: deleteDir
      * @author hegg
      * @date 2017-11-06 20:29:16
      * @param file the file or directory to delete
      * @return always true
      */

     public static boolean deleteDir(File file) {

         if (file.isDirectory()) {

             File[] files = file.listFiles();

             for (int i = 0; i < files.length; i++) {

                 deleteDir(files[i]);

             }

         }

         file.delete();

         return true;

     }

     /**
      * Reads the content of a txt file.
      * @Title: readTxt
      * @author hegg
      * @date 2017-11-06 20:15:49
      * @param file the file to read
      * @return the file content
      */

     public static String readTxt(File file) {

         String result = "";

         try {

             BufferedReader br = new BufferedReader(new FileReader(file)); // wrap the file in a BufferedReader
             String s = null;
             while ((s = br.readLine()) != null) { // read one line at a time

                 result = result + "\n" + s;

             }

             br.close();

         } catch (Exception e) {

             e.printStackTrace();

         }

         return result;

     }

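     /**
      * Reads the content of a Word file, in either the 2003 (doc) or the 2007 (docx) format.
      * @Title: readWord
      * @author hegg
      * @date 2017-11-06 20:15:14
      * @param file the file to read
      * @param type "doc" or "docx"
      * @return the document text
      */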

     public static String readWord(File file, String type) {

         String result = "";

         try {

             FileInputStream fis = new FileInputStream(file);

             if ("doc".equals(type)) {

                 HWPFDocument doc = new HWPFDocument(fis);

                 Range rang = doc.getRange();

                 result += rang.text();

             }

             if ("docx".equals(type)) {

                 XWPFDocument doc = new XWPFDocument(fis);

                 XWPFWordExtractor extractor = new XWPFWordExtractor(doc);

                 result = extractor.getText();

             }

             fis.close();

         } catch (Exception e) {

             e.printStackTrace();

         }

         return result;

     }

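     /**
      * Reads the content of an Excel file, in either the 2003 (xls) or the 2007 (xlsx) format.
      * @Title: readExcel
      * @author hegg
      * @date 2017-11-06 20:14:04
      * @param file the file to read
      * @param type "xls" or "xlsx"
      * @return the concatenated cell text
      */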

     public static String readExcel(File file, String type) {

         String result = "";

         try {

             FileInputStream fis = new FileInputStream(file);

             StringBuilder sb = new StringBuilder();

             if ("xlsx".equals(type)) {

                 XSSFWorkbook xwb = new XSSFWorkbook(fis);

                 for (int i = 0; i < xwb.getNumberOfSheets(); i++) {

                     XSSFSheet sheet = xwb.getSheetAt(i);

                     for (int j = 0; j < sheet.getPhysicalNumberOfRows(); j++) {

                         XSSFRow row = sheet.getRow(j);

                         for (int k = 0; k < row.getPhysicalNumberOfCells(); k++) {

                             XSSFCell cell = row.getCell(k);

                             sb.append(cell.getRichStringCellValue());

                         }

                     }

                 }

             }

             if ("xls".equals(type)) {

                 // Open the Excel workbook

                 HSSFWorkbook hwb = new HSSFWorkbook(fis);

                 for (int i = 0; i < hwb.getNumberOfSheets(); i++) {

                     HSSFSheet sheet = hwb.getSheetAt(i);

                     for (int j = 0; j < sheet.getPhysicalNumberOfRows(); j++) {

                         HSSFRow row = sheet.getRow(j);

                         for (int k = 0; k < row.getPhysicalNumberOfCells(); k++) {

                             HSSFCell cell = row.getCell(k);

                             sb.append(cell.getRichStringCellValue());

                         }

                     }

                 }

             }

             fis.close();

             result += sb.toString();

         } catch (Exception e) {

             e.printStackTrace();

         }

         return result;

     }

     /**
      * Reads the content of a PDF file.
      * @Title: readPDF
      * @author hegg
      * @date 2017-11-06 20:13:41
      * @param file the file to read
      * @return the extracted text
      */

     public static String readPDF(File file) {

         String result = null;

         FileInputStream is = null;

         PDDocument document = null;

         try {

             is = new FileInputStream(file);

             PDFParser parser = new PDFParser(is);

             parser.parse();

             document = parser.getPDDocument();

             PDFTextStripper stripper = new PDFTextStripper();

             result = stripper.getText(document);

         } catch (FileNotFoundException e) {

             e.printStackTrace();

         } catch (IOException e) {

             e.printStackTrace();

         } finally {

             if (is != null) {

                 try {

                     is.close();

                 } catch (IOException e) {

                     e.printStackTrace();

                 }

             }

             if (document != null) {

                 try {

                     document.close();

                 } catch (IOException e) {

                     e.printStackTrace();

                 }

             }

         }

         return result;

     }

     /**
      * Reads the content of an HTML file.
      * @Title: readHtml
      * @author hegg
      * @date 2017-11-06 20:13:08
      * @param file the file to read
      * @return the file content
      */

     public static String readHtml(File file) {

         StringBuffer content = new StringBuffer("");

         FileInputStream fis = null;

         try {

             fis = new FileInputStream(file);

             // Read the page. The charset must match the one declared in the HTML header, otherwise the text comes out garbled.
             BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "utf-8"));

             String line = null;

             while ((line = reader.readLine()) != null) {

                 content.append(line + "\n");

             }

             reader.close();

         } catch (Exception e) {

             e.printStackTrace();

         }

         String contentString = content.toString();

         return contentString;

     }

     /**
      * Creates the index.
      * @Title: creatIndex
      * @author hegg
      * @date 2017-11-06 20:29:37
      */

     public static void creatIndex() {

         Date begin = new Date();

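         // 1. Create the Analyzer; note the difference between SimpleAnalyzer and StandardAnalyzer
         Analyzer analyzer  = null;
         // 2. Create the Directory that holds the index; it can live in memory or on disk
         Directory directory = null;
         // 3. Create the IndexWriter that builds the index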

         IndexWriter indexWriter = null;

         try {

 //            analyzer = new StandardAnalyzer();

 //            analyzer = new SimpleAnalyzer();

             analyzer = new IKAnalyzer(true);

 //            directory = FSDirectory.open(new File(INDEX_DIR));

             directory = FSDirectory.open(FileSystems.getDefault().getPath(INDEX_DIR));

             // 4. Create the IndexWriterConfig with the analyzer (no Version argument in Lucene 7)
             IndexWriterConfig config = new IndexWriterConfig(analyzer);
             // 5. Create the IndexWriter from the Directory and the IndexWriterConfig

             indexWriter = new IndexWriter(directory, config);

             indexWriter.deleteAll();

             File docDirectory = new File(Source_DIR);

             for (File file : docDirectory.listFiles()) {

                 String content = "";

                 // Get the file extension

                 String type = file.getName().substring(file.getName().lastIndexOf(".")+1);

                 if("txt".equalsIgnoreCase(type)){

                     content += readTxt(file);

                 }else if("doc".equalsIgnoreCase(type)){

                     content += readWord(file,"doc");

                 }else if("docx".equalsIgnoreCase(type)){

                     content += readWord(file,"docx");

                 }else if("xls".equalsIgnoreCase(type)){

                     content += readExcel(file,"xls");

                 }else if("xlsx".equalsIgnoreCase(type)){

                     content += readExcel(file,"xlsx");

                 }else if("pdf".equalsIgnoreCase(type)){

                     content += readPDF(file);

                 }else if("html".equalsIgnoreCase(type)){

                     content += readHtml(file);

                 }

                 // 6. Create the Document

                 Document document = new Document();

                 document.add(new Field("content", content, TextField.TYPE_STORED));

                 document.add(new Field("fileName", file.getName(), TextField.TYPE_STORED));

                 document.add(new Field("filePath", file.getAbsolutePath(), TextField.TYPE_STORED));

                 document.add(new Field("updateTime", file.lastModified() + "", TextField.TYPE_STORED));

                 indexWriter.addDocument(document);

             }

             indexWriter.commit();

         } catch (Exception e) {

             e.printStackTrace();

         } finally {

             try {

                 if (analyzer != null) analyzer.close();

                 if (indexWriter != null) indexWriter.close();

                 if (directory != null) directory.close();

             } catch (IOException e) {

                 e.printStackTrace();

             }

         }

         Date end = new Date();

         System.out.println("Index creation took: " + (end.getTime() - begin.getTime()) + "ms\n");

     }

     /**
      * Searches the index and prints the files that match.
      * @Title: searchIndex
      * @author hegg
      * @date 2017-11-06 20:29:31
      * @param keyWord the keyword to search for
      */

     public static void searchIndex(String keyWord) {

         Date begin = new Date();

         // 1. Create the Analyzer; note the difference between SimpleAnalyzer and StandardAnalyzer
         Analyzer analyzer  = null;
         // 2. Open the folder that holds the index
         Directory indexDirectory = null;
         // 3. Create the DirectoryReader

         DirectoryReader directoryReader = null;

         try {

 //            analyzer = new StandardAnalyzer();

 //            analyzer = new SimpleAnalyzer();

             analyzer = new IKAnalyzer(true);

 //            indexDirectory = FSDirectory.open(new File(INDEX_DIR));

             indexDirectory = FSDirectory.open(FileSystems.getDefault().getPath(INDEX_DIR));

             directoryReader = DirectoryReader.open(indexDirectory);

             // 4. Create the IndexSearcher from the DirectoryReader
             IndexSearcher indexSearcher = new IndexSearcher(directoryReader);
             // 5. Build the query and specify the fields to search

 //            QueryParser parser = new QueryParser("content", analyzer);
 //            Query query1 = parser.parse(keyWord);
 //            ScoreDoc[] hits = indexSearcher.search(query1, 1000).scoreDocs;

 //            for (int i = 0; i < hits.length; i++) {

 //                Document hitDoc = indexSearcher.doc(hits[i].doc);

 //                System.out.println("____________________________");

 //                System.out.println(hitDoc.get("content"));

 //                System.out.println(hitDoc.get("fileName"));

 //                System.out.println(hitDoc.get("filePath"));

 //                System.out.println(hitDoc.get("updateTime"));

 //                System.out.println("____________________________");

 //            }

             String[] fields = { "fileName", "content" }; // fields to search; a search rarely targets a single field
             // Boolean relation per field: MUST means AND, MUST_NOT means NOT, SHOULD means OR; there must be one clause per field

             BooleanClause.Occur[] clauses = { BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD };

             Query query2 = MultiFieldQueryParser.parse(keyWord, fields, clauses, analyzer);

             // 6. Run the search and get the TopDocs
             TopDocs topDocs = indexSearcher.search(query2, 100); // return at most the top 100 hits
             // totalHits counts every matching document; scoreDocs holds at most the 100 requested above
             System.out.println("Total matching documents: " + topDocs.totalHits);
             // 7. Get the ScoreDoc array from the TopDocs
             ScoreDoc[] scoreDocs = topDocs.scoreDocs;
             System.out.println("Documents returned: " + scoreDocs.length);
             QueryScorer scorer = new QueryScorer(query2, "content");
             // 8. Custom highlighting markup
             SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("<span style=\"background-color:black;color:red\">", "</span>");

             Highlighter highlighter = new Highlighter(htmlFormatter, scorer);

             highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer));

             for (ScoreDoc scoreDoc : scoreDocs) {

                 // 9. Load the actual Document through the searcher, using the ScoreDoc's document id

                 Document document = indexSearcher.doc(scoreDoc.doc);

                 System.out.println("-----------------------------------------");

                 System.out.println(document.get("fileName") + ":" + document.get("filePath"));

                 System.out.println(highlighter.getBestFragment(analyzer, "content", document.get("content")));

                 System.out.println("-----------------------------------------");

             }

         } catch (IOException e) {

             e.printStackTrace();

         } catch (ParseException e) {

             e.printStackTrace();

         } catch (InvalidTokenOffsetsException e) {

             e.printStackTrace();

         } finally {

             try {

                 if (analyzer != null) analyzer.close();

                 if (directoryReader != null) directoryReader.close();

                 if (indexDirectory != null) indexDirectory.close();

             } catch (Exception e) {

                 e.printStackTrace();

             }

         }

         Date end = new Date();

         System.out.println("Keyword search took: " + (end.getTime() - begin.getTime()) + "ms\n");

     }

     public static void main(String[] args) throws Exception {

         File fileIndex = new File(INDEX_DIR);

         if (deleteDir(fileIndex)) {

             fileIndex.mkdir();

         } else {

             fileIndex.mkdir();

         }

         creatIndex();

         searchIndex("天安门");

     }

 }
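One detail of the listing worth knowing: the extension is taken with substring(lastIndexOf(".") + 1), so a file name without a dot yields the whole name as its "type". That matches no branch, and the file is then indexed with empty content. A possible way to make the extension check explicit is sketched below. `FileTypes`, `extensionOf` and `isSupported` are hypothetical helpers, not part of the original example, and treating dot-files such as ".gitignore" as having no extension is a design choice of this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for deciding which files the indexer can read. */
final class FileTypes {
    // The extensions handled by the reader methods in the listing above.
    private static final Set<String> SUPPORTED = new HashSet<>(
            Arrays.asList("txt", "doc", "docx", "xls", "xlsx", "pdf", "html"));

    /** Returns the lower-cased extension, or "" when the name has none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No dot, a leading dot only, or a trailing dot: no usable extension.
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** True when one of the reader methods can handle this file name. */
    static boolean isSupported(String fileName) {
        return SUPPORTED.contains(extensionOf(fileName));
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("report.PDF")); // prints pdf
        System.out.println(isSupported("photo.png"));  // prints false
    }
}
```

With a check like this, the indexing loop could skip unsupported files instead of adding documents whose content field is empty.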

The JARs used in this example can be downloaded from http://pan.baidu.com/s/1jI26UgQ (password: qix6).
