Combining Chinese word segmentation with pinyin search in Elasticsearch

elasticsearch-analysis-combo


To combine the pinyin search index with the IK segmentation index, you need one more plugin: elasticsearch-analysis-combo

Version 2.0.0 needs a different branch: https://github.com/antonha/elasticsearch-analysis-combo/tree/6cd1cd24eb53e2dc968c7eff618885d7b89d0a22

(For versions below 2.0 the master branch is enough: https://github.com/antonha/elasticsearch-analysis-combo (build it the same way as above), or install it with bin/plugin -install com.yakaz.elasticsearch.plugins/elasticsearch-analysis-combo/1.5.1)

Then add the following to the configuration file elasticsearch.yml:

index:
  analysis:
    analyzer:
      ik_max_word:
        type: ik
        use_smart: false
      ik_smart:
        type: ik
        use_smart: true
      pinyin:
        type: custom
        tokenizer: pinyin
        filter:
          - word_delimiter
          - pinyin_filter
          - lowercase
      combo:
        type: combo
        sub_analyzers:
          - ik
          - pinyin
    filter:
      pinyin_filter:
        type: pinyin
        first_letter: none
        padding_char: ' '

The mapping script that sets the field types for the type:

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
    "folks": {
        "properties": {
            "name": {
                "type": "string",
                "analyzer": "combo",
                "searchAnalyzer": "combo"
            }
        }
    }
}'

Restart Elasticsearch, run the mapping script, and import the data.

Then open http://localhost:9200/soubu/_analyze?analyzer=combo&pretty=true&text=%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD

{
  "tokens" : [ {
    "token" : "中华人民共和国",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "zhong",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "中华人民",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "hua",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "中华",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "ren",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "min",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "华人",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "gong",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "人民共和国",
    "start_offset" : 2,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "he",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "人民",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "guo",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "共和国",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 6
  }, {
    "token" : "共和",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 7
  }, {
    "token" : "国",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_CHAR",
    "position" : 8
  } ]
}

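If the response looks like the one above, with the IK tokens (CN_WORD, CN_CHAR) and the pinyin tokens (word) interleaved, the combo analyzer is working.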
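The same check can be scripted. The sketch below is only an illustration: the helper names analyze_url and tokens_by_type are made up for this example, and the sample dictionary is just the first two tokens of the response above. It shows how the percent-encoded URL is built from the Chinese text and how the tokens can be grouped by type to confirm that both sub-analyzers contributed.

```python
from urllib.parse import urlencode


def analyze_url(index, analyzer, text, host="http://localhost:9200"):
    """Build the query-string form of the _analyze request used in this post."""
    # urlencode percent-encodes the UTF-8 bytes of the Chinese text.
    query = urlencode({"analyzer": analyzer, "pretty": "true", "text": text})
    return f"{host}/{index}/_analyze?{query}"


def tokens_by_type(response):
    """Group the token strings of an _analyze response by their "type" field."""
    grouped = {}
    for tok in response["tokens"]:
        grouped.setdefault(tok["type"], []).append(tok["token"])
    return grouped


# First two tokens of the response shown above, used as sample input.
sample = {
    "tokens": [
        {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7,
         "type": "CN_WORD", "position": 0},
        {"token": "zhong", "start_offset": 0, "end_offset": 7,
         "type": "word", "position": 0},
    ]
}

if __name__ == "__main__":
    print(analyze_url("soubu", "combo", "中华人民共和国"))
    print(tokens_by_type(sample))
```

If the grouped result has entries under both CN_WORD and word, the IK and pinyin analyzers are both feeding the combo analyzer.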

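Of course, you may also want a pinyin query such as http://localhost:9200/soubu/_analyze?analyzer=combo&pretty=true&text=zhonghuarenmingongheguo
to find 中华人民共和国. I have not found a reasonable way to do that inside Elasticsearch so far, so I fall back on a regular expression in PHP: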

<?php
// ... (the beginning of formatKeyword() is cut off here) ...
 $value){
                if(!empty($value)){
                    $string .= $value . ' ';
                }else{
                    continue;
                }
            }
            $string = substr($string,0,strlen($string)-1);
        }
    }
    $string = empty($string) ? $keyword : $string;
    return $string;
}

$keyword = $this->formatKeyword($keyword);
array_push($must, array(
    'multi_match' => array(
        'query'    => $keyword,
        'fields'   => array('name^3', 'detail^3', 'shop_name^0.5'),
        'operator' => 'and',
    ),
));
$params['body']['query'] = $must;
$Client = new \Elasticsearch\Client();
$results = $Client->search($params);
?>

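The function first checks whether the keyword is made of Chinese characters. If it is, the keyword is returned unchanged; if not, it is matched against a regular expression and reformatted into space-separated pinyin syllables.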
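To make that formatting step concrete, here is a minimal Python sketch. It is not the PHP function above: instead of a single regular expression it uses a syllable table with longest-match backtracking, because a greedy split would read "mingong" as "ming" + "ong". The names split_pinyin and format_keyword are made up for this example, and the syllable table is an over-permissive assumption (every initial combined with every final), so it also accepts a few combinations that are not real pinyin.

```python
import re
from functools import lru_cache

# Assumption: a permissive syllable table built as initial + final.
_INITIALS = ("b p m f d t n l g k h j q x zh ch sh r z c s y w").split()
_FINALS = ("a o e i u v ai ei ao ou an en ang eng ong ia ie iao iu ian in "
           "iang ing iong ua uo uai ui uan un uang ue ve").split()
_STANDALONE = "a o e ai ei ao ou an en ang eng er".split()

SYLLABLES = frozenset(i + f for i in _INITIALS for f in _FINALS) | frozenset(_STANDALONE)
MAX_LEN = max(len(s) for s in SYLLABLES)

_HAN = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs range


def split_pinyin(text):
    """Return a list of syllables covering all of text, or None if impossible."""

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(text):
            return ()
        # Try the longest candidate first, backtrack when the rest cannot be split.
        for end in range(min(len(text), start + MAX_LEN), start, -1):
            piece = text[start:end]
            if piece in SYLLABLES:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)


def format_keyword(keyword):
    """Leave Chinese keywords alone, split pinyin keywords into syllables."""
    if _HAN.search(keyword):
        return keyword
    syllables = split_pinyin(keyword.lower())
    # Fall back to the raw keyword when it is not a pinyin string.
    return " ".join(syllables) if syllables else keyword


if __name__ == "__main__":
    print(format_keyword("zhonghuarenmingongheguo"))
```

Longest-match splitting still cannot resolve true ambiguities: "xian" comes out as one syllable, never as "xi an".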

The formatted keyword is then searched with Elasticsearch's multi_match query.

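Individual fields in fields can be boosted with the caret syntax (^): append ^boost to the field name, where boost is a floating-point number. Here name and detail are weighted 3 and shop_name is weighted 0.5.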
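For reference, the request body that such a query amounts to can be written out as a plain dictionary. This is a sketch under assumptions: build_query is a made-up helper, and wrapping the clause in bool/must is my guess based on the $must variable name in the PHP above. The field names and boosts are the ones from the PHP snippet.

```python
import json


def build_query(keyword):
    """Build a search body with a boosted multi_match clause."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": keyword,
                            # ^3 and ^0.5 are per-field boosts (caret syntax).
                            "fields": ["name^3", "detail^3", "shop_name^0.5"],
                            # "and" requires every term to match.
                            "operator": "and",
                        }
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    body = build_query("zhong hua ren min gong he guo")
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

With operator set to and, every syllable of the formatted keyword has to match, which keeps a long pinyin query from matching documents that share only one syllable.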

The results we get back are the search hits for the formatted pinyin string.

To be continued.


