fix: slow hnsw by caching num docs #1706

jupyterjazz · 2023-07-18T07:46:25Z

As the user reported, the num_docs() operation is expensive and slows down search. This PR addresses the issue by caching the value in _num_docs and updating it after index and delete operations.

I tested with the simple code snippet below. Index time increases slightly (from 6.13 to 6.14 seconds), while search time drops from 0.0238 to 0.0018 seconds, roughly a 13x speedup.

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
import numpy as np
import time

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]


docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for _ in range(20000)]
index = HnswDocumentIndex[MyDoc](work_dir='tst', index_name='index')

index_start = time.time()
index.index(docs=DocList[MyDoc](docs))
index_time = time.time() - index_start

query = docs[0]

find_start = time.time()
matches, _ = index.find(query, search_field='embedding', limit=10)
find_time = time.time() - find_start

assert len(matches) == 10
assert query.id == matches[0].id

Signed-off-by: jupyterjazz <[email protected]>

JoanFM · 2023-07-18T07:50:11Z

docarray/index/backends/hnswlib.py

@@ -403,7 +406,7 @@ def num_docs(self) -> int:
        """
        Get the number of documents.
        """
-        return self._get_num_docs_sqlite()
+        return self._num_docs


How do we update this value?

I think it may be better to handle it entirely here. Do:

self._num_docs = self._num_docs or self._get_num_docs_sqlite()
return self._num_docs

And then in every update, delete, and index operation, simply set self._num_docs to 0.
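The suggestion above amounts to a lazy cache that uses 0 as the "stale" marker. Here is a minimal sketch of that pattern, assuming a toy class backed by an in-memory SQLite table. The class `LazyCountIndex`, the `docs` table, and the `count_queries` counter are hypothetical and only illustrate the idea; this is not the docarray implementation.

```python
import sqlite3


class LazyCountIndex:
    """Hypothetical stand-in for an index whose document count lives in SQLite."""

    def __init__(self):
        self._conn = sqlite3.connect(':memory:')
        self._conn.execute('CREATE TABLE docs (doc_id INTEGER PRIMARY KEY)')
        self._num_docs = 0  # 0 means "stale, recount on the next read"
        self.count_queries = 0  # illustration only: how often we hit SQLite

    def _get_num_docs_sqlite(self):
        self.count_queries += 1
        return self._conn.execute('SELECT COUNT(*) FROM docs').fetchone()[0]

    def index(self, doc_ids):
        self._conn.executemany(
            'INSERT INTO docs (doc_id) VALUES (?)', [(i,) for i in doc_ids]
        )
        self._num_docs = 0  # invalidate the cached count

    def delete(self, doc_ids):
        self._conn.executemany(
            'DELETE FROM docs WHERE doc_id = ?', [(i,) for i in doc_ids]
        )
        self._num_docs = 0  # invalidate the cached count

    def num_docs(self):
        # Recount only when the cached value is stale (falsy).
        self._num_docs = self._num_docs or self._get_num_docs_sqlite()
        return self._num_docs


idx = LazyCountIndex()
idx.index(range(5))
first = idx.num_docs()    # triggers one COUNT query
second = idx.num_docs()   # served from the cache
idx.delete([0])
after_delete = idx.num_docs()  # cache was invalidated, so one more COUNT query
```

One property of this pattern is worth keeping in mind: because 0 doubles as the stale marker, an empty index recounts on every call to num_docs().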

We update the value at the end of index and del. I think this is more intuitive than setting it to 0.

Okay, and does it have an update operation?

Yes, there are sufficient tests for this and they all pass. I modified it to the way you suggested.
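For comparison, the other approach discussed in this thread, refreshing the cached count at the end of index and del so that reads never touch SQLite, can be sketched as follows. How the PR computes the refreshed value is not shown in this conversation, so the recount via `_get_num_docs_sqlite()` is an assumption, and the class `EagerCountIndex`, the `docs` table, and the `count_queries` counter are hypothetical.

```python
import sqlite3


class EagerCountIndex:
    """Hypothetical stand-in: the count is refreshed on writes, never on reads."""

    def __init__(self):
        self._conn = sqlite3.connect(':memory:')
        self._conn.execute('CREATE TABLE docs (doc_id INTEGER PRIMARY KEY)')
        self.count_queries = 0  # illustration only: how often we hit SQLite
        self._num_docs = self._get_num_docs_sqlite()

    def _get_num_docs_sqlite(self):
        self.count_queries += 1
        return self._conn.execute('SELECT COUNT(*) FROM docs').fetchone()[0]

    def index(self, doc_ids):
        self._conn.executemany(
            'INSERT INTO docs (doc_id) VALUES (?)', [(i,) for i in doc_ids]
        )
        # Assumption: refresh the cached count once per write.
        self._num_docs = self._get_num_docs_sqlite()

    def delete(self, doc_ids):
        self._conn.executemany(
            'DELETE FROM docs WHERE doc_id = ?', [(i,) for i in doc_ids]
        )
        self._num_docs = self._get_num_docs_sqlite()

    def num_docs(self):
        # Reads are a plain attribute lookup.
        return self._num_docs


eager = EagerCountIndex()
eager.index(range(5))
reads = [eager.num_docs() for _ in range(100)]  # no SQLite queries here
queries_after_reads = eager.count_queries
eager.delete([0, 1])
remaining = eager.num_docs()
```

In this variant the cost of counting is paid once per write, which fits a workload where searches far outnumber index and delete calls.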

JoanFM

small comment

codecov · 2023-07-18T07:51:53Z

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (b306c80) 85.51% compared to head (ace5e1a) 85.51%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1706   +/-   ##
=======================================
  Coverage   85.51%   85.51%           
=======================================
  Files         132      132           
  Lines        8303     8306    +3     
=======================================
+ Hits         7100     7103    +3     
  Misses       1203     1203

Flag	Coverage Δ
docarray	`85.51% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
docarray/index/backends/hnswlib.py	`95.33% <100.00%> (+0.05%)`	⬆️


Signed-off-by: jupyterjazz <[email protected]>

github-actions · 2023-07-18T08:17:37Z

📝 Docs are deployed on https://ft-fix-hnsw-performance--jina-docs.netlify.app 🎉

jupyterjazz added 2 commits July 18, 2023 09:45

fix: slow hnsw by caching num docs

6d31526

Signed-off-by: jupyterjazz <[email protected]>

chore: remove unused line

232f6be

Signed-off-by: jupyterjazz <[email protected]>

jupyterjazz linked an issue Jul 18, 2023 that may be closed by this pull request

HNSWlib wrapper, very slow due to a simple recomputation bug #1703

Closed

6 tasks

github-actions bot added size/xs area/core labels Jul 18, 2023

JoanFM reviewed Jul 18, 2023


JoanFM requested changes Jul 18, 2023


refactor: set num docs to 0

a411c5e

Signed-off-by: jupyterjazz <[email protected]>

jupyterjazz requested a review from JoanFM July 18, 2023 08:01

jupyterjazz marked this pull request as ready for review July 18, 2023 08:02

jupyterjazz and others added 2 commits July 18, 2023 10:12

refactor: go back to the initial solution

916ed27

Signed-off-by: jupyterjazz <[email protected]>

Merge branch 'main' into fix-hnsw-performance

ace5e1a

JoanFM approved these changes Jul 18, 2023


jupyterjazz merged commit c566401 into main Jul 18, 2023

jupyterjazz deleted the fix-hnsw-performance branch July 18, 2023 08:29

jupyterjazz mentioned this pull request Jul 18, 2023

Release Notes v0.36.0 #1707

Closed
