Exploring Elasticsearch 7 (Part 4): Analyzers

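The previous post covered what an inverted index is and how it works. This post covers analyzers: what an analyzer is, what a tokenizer is, which analyzers Elasticsearch ships with, and finally how Chinese word segmentation is handled.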

1. Analysis and Analyzer

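Analysis (text analysis) is the process of converting full text into a series of terms (tokens); it is also called tokenization. Analysis is carried out by an analyzer. Elasticsearch ships with several built-in analyzers, and you can define a custom analyzer when none of them fits your needs. Analysis does not only happen when data is written: when a query is matched, the query string must also be analyzed with the same analyzer, so that the query terms take the same form as the terms stored in the index.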
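
To see why the query has to go through the same analysis as the documents, here is a minimal sketch in plain Python. It does not use Elasticsearch; `analyze`, `build_index`, and `search` are hypothetical helpers, and the "analyzer" is just whitespace splitting plus lowercasing.

```python
from collections import defaultdict

def analyze(text):
    """Toy analyzer: split on whitespace and lowercase each term."""
    return [term.lower() for term in text.split()]

def build_index(docs):
    """Map each analyzed term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query, analyzer):
    """Return ids of documents that contain every analyzed query term."""
    terms = analyzer(query)
    if not terms:
        return set()
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits)

docs = {1: "Quick brown fox", 2: "Lazy dog"}
index = build_index(docs)

# Query analyzed the same way as the documents: "QUICK" becomes "quick".
same = search(index, "QUICK Fox", analyze)
# Query only split, not lowercased: "QUICK" is not a term in the index.
raw = search(index, "QUICK Fox", str.split)

print(same, raw)
```

With the shared analyzer the query finds document 1; with the mismatched one it finds nothing, because the index only contains lowercased terms.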

2. Components of an Analyzer

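Character Filters: pre-process the raw text. For example, a character filter can convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789).
Tokenizer: splits the text into terms according to a rule. For example, a whitespace tokenizer turns the text "Quick brown fox!" into the terms [Quick, brown, fox!]. The tokenizer also records each term's position and its character offsets in the text.
Token Filters: post-process the terms produced by the tokenizer, for example lowercasing them, removing stop words, or adding synonyms.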
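
The three stages run in that order: character filters, then the tokenizer, then token filters. A rough sketch of the pipeline in plain Python follows. Every name here (`char_filter`, `tokenizer`, `token_filter`, `STOP_WORDS`) is hypothetical, and the stop word list is only illustrative; the field names of each token loosely follow what `_analyze` returns.

```python
import re

# Character filter: map Arabic-Indic digits (U+0660..U+0669) to ASCII digits.
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}

def char_filter(text):
    return text.translate(DIGIT_MAP)

def tokenizer(text):
    """Whitespace tokenizer that records position and character offsets."""
    return [
        {
            "token": match.group(),
            "position": position,
            "start_offset": match.start(),
            "end_offset": match.end(),
        }
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]

STOP_WORDS = {"the", "a", "is"}  # illustrative, not a real default list

def token_filter(tokens):
    """Lowercase every term, then drop stop words."""
    kept = []
    for tok in tokens:
        term = tok["token"].lower()
        if term not in STOP_WORDS:
            kept.append({**tok, "token": term})
    return kept

def analyze(text):
    return token_filter(tokenizer(char_filter(text)))

result = analyze("The Quick brown fox! \u0664\u0662")
terms = [tok["token"] for tok in result]
print(terms)
```

Note that removing "The" does not renumber the remaining tokens: "quick" keeps position 1 and its original offsets, which is what phrase queries and highlighting rely on.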

3. Built-in Analyzers

Example text: The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.

Standard Analyzer

The default analyzer
Splits text on word boundaries
Lowercases terms

# standard
GET _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[the,2,quick,brown,foxes,jumped,over,the,lazy,dog's,bone]

Simple Analyzer

Splits on any non-letter character; the non-letter characters are discarded
Lowercases terms

# simple
GET _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[the,quick,brown,foxes,jumped,over,the,lazy,dog,s,bone]
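
The rule "split on non-letters, then lowercase" can be imitated with one regular expression. This is only a sketch of the behavior described above, not how Elasticsearch implements it; `simple_analyze` is a hypothetical name.

```python
import re

def simple_analyze(text):
    """Keep runs of letters only, lowercased; digits and punctuation are dropped."""
    # [^\W\d_]+ matches one or more Unicode letters (word chars minus digits and "_").
    return [run.lower() for run in re.findall(r"[^\W\d_]+", text)]

text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
simple_terms = simple_analyze(text)
print(simple_terms)
```

The digit 2 disappears entirely, and the apostrophe splits "dog's" into "dog" and "s", matching the output above.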

Stop Analyzer

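Splits on non-letter characters, like the Simple Analyzer
Lowercases terms
Removes stop words (the, a, is, and so on)

# stop
GET _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[quick,brown,foxes,jumped,over,lazy,dog,s,bone]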
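
Stop word removal is a set lookup applied after tokenization. The sketch below reuses the letters-only splitting rule and filters against a stop word set. The list is meant to mirror the default English stop words, but treat its exact contents as an assumption and check the documentation for your version; `stop_analyze` and `ENGLISH_STOP_WORDS` are hypothetical names.

```python
import re

# Assumed to mirror the default English stop word list; verify before relying on it.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def stop_analyze(text, stop_words=ENGLISH_STOP_WORDS):
    """Split on non-letters, lowercase, then drop stop words."""
    terms = [run.lower() for run in re.findall(r"[^\W\d_]+", text)]
    return [term for term in terms if term not in stop_words]

text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
stop_terms = stop_analyze(text)
print(stop_terms)
```

Both occurrences of "the" are removed, while "over" stays because it is not in the set.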

Whitespace Analyzer

Splits on whitespace

# whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[The,2,QUICK,Brown-Foxes,jumped,over,the,lazy,dog's,bone.]
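
In Python terms this behaves much like `str.split()` with no arguments: nothing is lowercased and punctuation stays attached to its word. A small sketch, with `whitespace_analyze` as a hypothetical name:

```python
def whitespace_analyze(text):
    """Split on runs of whitespace; keep case and punctuation untouched."""
    return text.split()

text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
whitespace_terms = whitespace_analyze(text)
print(whitespace_terms)
```

Because "bone." keeps its trailing period and "QUICK" keeps its case, a search for "bone" or "quick" would not match these terms exactly.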

Keyword Analyzer

Does no tokenization; the entire input is emitted as a single term

# keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.]

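Pattern Analyzer

Splits text using a regular expression
The default pattern is \W+ (split on runs of non-word characters); the resulting terms are lowercased by default

# pattern
GET _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[the,2,quick,brown,foxes,jumped,over,the,lazy,dog,s,bone]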
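
The same splitting can be reproduced with Python's `re.split`. This is a sketch under assumptions: `pattern_analyze` and its parameters are hypothetical, and Python's `\W` is Unicode-aware by default while Java's is not, so the two can differ on non-ASCII text.

```python
import re

def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Split on the separator pattern, drop empty pieces, optionally lowercase."""
    terms = [piece for piece in re.split(pattern, text) if piece]
    return [term.lower() for term in terms] if lowercase else terms

text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
pattern_terms = pattern_analyze(text)
csv_terms = pattern_analyze("Red,Green;Blue", pattern=r"[,;]")
print(pattern_terms)
print(csv_terms)
```

Unlike the letters-only rule, `\W+` keeps the digit 2, since digits count as word characters. Passing a different pattern, as in the second call, changes what counts as a separator.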

Language Analyzer

Supported languages: arabic, armenian, basque, bengali, bulgarian, catalan, czech, dutch, english, finnish, french, galician, german, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, portuguese, romanian, russian, sorani, spanish, swedish, turkish.

