-
Notifications
You must be signed in to change notification settings - Fork 233
fix: slow hnsw by caching num docs #1706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: jupyterjazz <[email protected]>
Signed-off-by: jupyterjazz <[email protected]>
@@ -403,7 +406,7 @@ def num_docs(self) -> int: | |||
""" | |||
Get the number of documents. | |||
""" | |||
return self._get_num_docs_sqlite() | |||
return self._num_docs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we update this value?
I think it may be better to handle it here totally: here do:
self._num_docs = self._num_docs or self._get_num_docs_sqlite()
return self._num_docs
And simply in all updates, deletes, index u set self._num_docs to 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we update the value in the end of index and del, I think this way is more intuitive rather than setting it to 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okey, and does it have update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes there are sufficient tests for this and they all pass. I modified it to the way you suggested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comment
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #1706 +/- ##
=======================================
Coverage 85.51% 85.51%
=======================================
Files 132 132
Lines 8303 8306 +3
=======================================
+ Hits 7100 7103 +3
Misses 1203 1203
Flags with carried forward coverage won't be shown. Click here to find out more.
☔ View full report in Codecov by Sentry. |
Signed-off-by: jupyterjazz <[email protected]>
Signed-off-by: jupyterjazz <[email protected]>
📝 Docs are deployed on https://ft-fix-hnsw-performance--jina-docs.netlify.app 🎉 |
As the user reported, num_docs() operation is expensive and slows down the search. This PR addresses the issue by storing/caching num_docs and updating it after index/del
I tested on this simple code snippet and while index time increases slightly (from 6.13 to 6.14 seconds) we can see the speedup in the search time (from 0.0238 to 0.0018, 13x speedup)