January 17 (Wed) TIL

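While writing queries in Elasticsearch, I found that the results of a wildcard query were not what I had expected, so I am writing down what I learned. I expected a wildcard query to behave like LIKE '%keyword%' in an RDBMS, but when I checked the actual results, that was not what I got.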

The original text was *“여러개의 물건들”* (roughly, “several items”), and the query I tried was the following.

{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "title": "*여러개*"
          }
        }
      ]
    }
  },
  "size": 100
}

However, the result came back with a hits count of 0, as shown below.

{
  "took": 243,
  "timed_out": false,
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

The cause is that the wildcard query is a term-level query.

Let's check the term-level query documentation. (I really should have read it sooner..)

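"Term-level" means the query looks for results against the terms stored in the inverted index. The wildcard pattern is not tokenized; it is compared against each individual term the analyzer produced at index time. In other words, if a string was not emitted as a token when the document was indexed, it is not in the inverted index, so it is never even considered for comparison.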

The custom analyzer I was testing with tokenized the text as follows:

curl -XPOST 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_custom_analyzer",
  "text": "여러개의 물건들"
}'


{
  "tokens" : [
    {
      "token" : "여러",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "MM",
      "position" : 0
    },
    {
      "token" : "개",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "NNB",
      "position" : 1
    },
    {
      "token" : "물건",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "NNG",
      "position" : 2
    },
    {
      "token" : "물건들",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "NNG",
      "position" : 2
    }
  ]
}

The tokenized output contains no term that includes “여러개” (it was split into “여러” and “개”), so the wildcard query I first ran could only return 0 hits.
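
To make the behaviour concrete, here is a minimal Python sketch that mimics matching a wildcard pattern against the terms from the _analyze output above, one term at a time. It is only an illustration under assumptions: the helper name `wildcard_matches` is made up for this post, and Python's `fnmatch.fnmatchcase` stands in for Elasticsearch's own wildcard matching (it also understands `[...]` character classes, which is not relevant here).

```python
from fnmatch import fnmatchcase

# Terms copied from the _analyze output above. In a real index these would
# be the entries of the inverted index for the "title" field.
indexed_terms = ["여러", "개", "물건", "물건들"]


def wildcard_matches(pattern: str, terms: list[str]) -> list[str]:
    """Return the terms that match the pattern, checking one whole term at a time.

    Hypothetical helper: the pattern is never tokenized, and it is never
    compared against the original text "여러개의 물건들".
    """
    return [term for term in terms if fnmatchcase(term, pattern)]


# No single term contains "여러개", so nothing matches.
print(wildcard_matches("*여러개*", indexed_terms))

# Patterns that fit inside one indexed term do match.
print(wildcard_matches("*여러*", indexed_terms))
print(wildcard_matches("*물건*", indexed_terms))
```

The point of the sketch is that the comparison unit is a single indexed term, not the source string, which is why a pattern spanning two tokens finds nothing.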

I need to read the docs more carefully.. Today's lesson: “RTFM”
