fix: slow hnsw by caching num docs #1706

Merged: 5 commits, Jul 18, 2023

@jupyterjazz (Contributor) commented Jul 18, 2023

As the user reported, the num_docs() operation is expensive and slows down search. This PR addresses the issue by caching num_docs and updating it after index and delete operations.

I tested with the simple code snippet below. Index time increases slightly (from 6.13 to 6.14 seconds), while search time drops from 0.0238 to 0.0018 seconds, a 13x speedup.

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
import numpy as np
import time

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]


docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for _ in range(20000)]
index = HnswDocumentIndex[MyDoc](work_dir='tst', index_name='index')

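# Time the indexing of all 20000 documents.
index_start = time.time()
index.index(docs=DocList[MyDoc](docs))
index_time = time.time() - index_start

query = docs[0]

# Time a single nearest-neighbor search.
find_start = time.time()
matches, _ = index.find(query, search_field='embedding', limit=10)
find_time = time.time() - find_start

# The query document is in the index, so it should be its own best match.
assert len(matches) == 10
assert query.id == matches[0].id

print(f'index: {index_time:.2f}s, find: {find_time:.4f}s')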

@@ -403,7 +406,7 @@ def num_docs(self) -> int:
         """
         Get the number of documents.
         """
-        return self._get_num_docs_sqlite()
+        return self._num_docs
@JoanFM (Member) commented Jul 18, 2023

How do we update this value?

I think it may be better to handle it entirely here:

self._num_docs = self._num_docs or self._get_num_docs_sqlite()
return self._num_docs

Then, in every update, delete, and index operation, you simply set self._num_docs to 0.
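
The suggestion above amounts to a lazily populated cache that is invalidated on every write. A minimal, self-contained sketch of the idea follows; it is not the actual HnswDocumentIndex code. The `CountedStore` class, its `docs` table, and the `sql_count_calls` counter are hypothetical names used only for illustration, and an in-memory SQLite database stands in for the index's storage.

```python
import sqlite3
from typing import List


class CountedStore:
    """Hypothetical stand-in for a SQLite-backed index (illustration only)."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(':memory:')
        self._conn.execute('CREATE TABLE docs (id TEXT PRIMARY KEY)')
        self._num_docs = 0  # 0 doubles as "cache is empty, recount on next read"
        self.sql_count_calls = 0  # instrumentation for this example only

    def _get_num_docs_sqlite(self) -> int:
        # The expensive path: a COUNT(*) query against the table.
        self.sql_count_calls += 1
        return self._conn.execute('SELECT COUNT(*) FROM docs').fetchone()[0]

    def num_docs(self) -> int:
        # Recount only when the cached value is falsy (0); otherwise reuse it.
        self._num_docs = self._num_docs or self._get_num_docs_sqlite()
        return self._num_docs

    def index(self, ids: List[str]) -> None:
        self._conn.executemany('INSERT INTO docs (id) VALUES (?)', [(i,) for i in ids])
        self._num_docs = 0  # invalidate the cache

    def delete(self, ids: List[str]) -> None:
        self._conn.executemany('DELETE FROM docs WHERE id = ?', [(i,) for i in ids])
        self._num_docs = 0  # invalidate the cache


store = CountedStore()
store.index(['a', 'b', 'c'])
first = store.num_docs()         # triggers one COUNT(*) query
second = store.num_docs()        # served from the cache
store.delete(['a'])
after_delete = store.num_docs()  # cache was invalidated, so one more query
```

In this sketch, using 0 as the "empty" marker means a genuinely empty store recounts on every call; that should be cheap, since the table has no rows to count.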

@jupyterjazz (Contributor, Author):

We update the value at the end of index and del. I think this is more intuitive than setting it to 0.
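
For comparison, the eager variant described in this reply can be sketched as follows. This is an illustration under assumptions, not the PR's code: `EagerCountStore`, its `docs` table, and `sql_count_calls` are hypothetical names, and an in-memory SQLite database stands in for the index's storage.

```python
import sqlite3
from typing import List


class EagerCountStore:
    """Hypothetical store that refreshes its cached count after every write."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(':memory:')
        self._conn.execute('CREATE TABLE docs (id TEXT PRIMARY KEY)')
        self._num_docs = 0
        self.sql_count_calls = 0  # instrumentation for this example only

    def _get_num_docs_sqlite(self) -> int:
        self.sql_count_calls += 1
        return self._conn.execute('SELECT COUNT(*) FROM docs').fetchone()[0]

    def num_docs(self) -> int:
        # Reads never touch SQLite; they return the cached value.
        return self._num_docs

    def index(self, ids: List[str]) -> None:
        self._conn.executemany('INSERT INTO docs (id) VALUES (?)', [(i,) for i in ids])
        self._num_docs = self._get_num_docs_sqlite()  # refresh at the end of the write

    def delete(self, ids: List[str]) -> None:
        self._conn.executemany('DELETE FROM docs WHERE id = ?', [(i,) for i in ids])
        self._num_docs = self._get_num_docs_sqlite()  # refresh at the end of the write


eager = EagerCountStore()
eager.index(['a', 'b', 'c'])
count_after_index = eager.num_docs()
eager.num_docs()  # repeated reads issue no further queries
eager.delete(['a'])
count_after_delete = eager.num_docs()
```

The trade-off in this sketch is that every write pays for one count query even if nobody reads the count afterwards, whereas resetting to 0 defers that cost to the next read.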

@JoanFM (Member):

Okay, and does it have an update?

@jupyterjazz (Contributor, Author):

Yes, there are sufficient tests for this and they all pass. I modified it to work the way you suggested.

@JoanFM (Member) left a comment

small comment

codecov bot commented Jul 18, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (b306c80) 85.51% compared to head (ace5e1a) 85.51%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1706   +/-   ##
=======================================
  Coverage   85.51%   85.51%           
=======================================
  Files         132      132           
  Lines        8303     8306    +3     
=======================================
+ Hits         7100     7103    +3     
  Misses       1203     1203           
Flag        Coverage Δ
docarray    85.51% <100.00%> (+<0.01%) ⬆️

Impacted Files                        Coverage Δ
docarray/index/backends/hnswlib.py    95.33% <100.00%> (+0.05%) ⬆️


Signed-off-by: jupyterjazz <[email protected]>
@jupyterjazz jupyterjazz requested a review from JoanFM July 18, 2023 08:01
@jupyterjazz jupyterjazz marked this pull request as ready for review July 18, 2023 08:02
@github-actions
📝 Docs are deployed on https://ft-fix-hnsw-performance--jina-docs.netlify.app 🎉

@jupyterjazz jupyterjazz merged commit c566401 into main Jul 18, 2023
@jupyterjazz jupyterjazz deleted the fix-hnsw-performance branch July 18, 2023 08:29
Successfully merging this pull request may close these issues.

HNSWlib wrapper, very slow due to a simple recomputation bug