Skip to content

refactor: hnswlib performance #1727

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jul 31, 2023
Merged

refactor: hnswlib performance #1727

merged 15 commits into from
Jul 31, 2023

Conversation

JoanFM
Copy link
Member

@JoanFM JoanFM commented Jul 26, 2023

No description provided.

Joan Fontanals Martinez added 2 commits July 26, 2023 12:44
@codecov
Copy link

codecov bot commented Jul 26, 2023

Codecov Report

Patch coverage: 76.59% and project coverage change: -0.01% ⚠️

Comparison is base (896c20b) 84.46% compared to head (c062c8c) 84.45%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1727      +/-   ##
==========================================
- Coverage   84.46%   84.45%   -0.01%     
==========================================
  Files         134      134              
  Lines        8798     8834      +36     
==========================================
+ Hits         7431     7461      +30     
- Misses       1367     1373       +6     
Flag Coverage Δ
docarray 84.45% <76.59%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
docarray/index/backends/hnswlib.py 74.28% <76.59%> (+0.58%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Joan Fontanals Martinez <[email protected]>
@JoanFM JoanFM force-pushed the refactor-hnswlib-performance branch from f7bd4d0 to 3f56d44 Compare July 27, 2023 01:11
Signed-off-by: Joan Fontanals Martinez <[email protected]>
@JoanFM JoanFM force-pushed the refactor-hnswlib-performance branch from 4f6c270 to f8bb5f9 Compare July 27, 2023 03:17
@github-actions github-actions bot added size/m and removed size/s labels Jul 27, 2023
…ay/docarray into refactor-hnswlib-performance

Signed-off-by: Joan Fontanals Martinez <[email protected]>
@JoanFM JoanFM force-pushed the refactor-hnswlib-performance branch from 3b0922a to 79b0fea Compare July 27, 2023 05:33
@JoanFM JoanFM force-pushed the refactor-hnswlib-performance branch 2 times, most recently from fdfcb56 to 74a187e Compare July 28, 2023 02:27
Signed-off-by: Joan Fontanals Martinez <[email protected]>
@JoanFM JoanFM force-pushed the refactor-hnswlib-performance branch from 74a187e to 8c0bf6c Compare July 28, 2023 02:32
@JoanFM JoanFM marked this pull request as ready for review July 28, 2023 18:03
@github-actions
Copy link

📝 Docs are deployed on https://ft-refactor-hnswlib-performance--jina-docs.netlify.app 🎉

@JoanFM JoanFM requested a review from jupyterjazz July 30, 2023 13:16
@JoanFM JoanFM requested review from JohannesMessner, samsja and jupyterjazz and removed request for jupyterjazz, JohannesMessner and samsja July 30, 2023 13:16
@@ -332,7 +340,19 @@ def _filter(
limit: int,
) -> DocList:
rows = self._execute_filter(filter_query=filter_query, limit=limit)
return DocList[self.out_schema](self._doc_from_bytes(blob) for _, blob in rows) # type: ignore[name-defined]
hashed_ids = [doc_id for doc_id, _ in rows]
embeddings: OrderedDict[str, list] = OrderedDict()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why specifically OrderedDict? I think normal dict in python will already be ordered by default

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still like to be specific about the requirement for order

Copy link
Contributor

@jupyterjazz jupyterjazz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have any measure on how much faster hnsw works now?

@@ -127,7 +132,12 @@ def __init__(self, db_config=None, **kwargs):
else:
self._hnsw_indices[col_name] = self._create_index(col_name, col)
self._logger.info(f'Created a new index for column `{col_name}`')
if self._hnsw_indices[col_name].space == 'cosine':
cosine_metric_index_exist = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we care about this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because for cosine, HNSWLib normalizes the vectors, and then if we retrieve, they have chanded, so no consistent API can be provided

@JoanFM
Copy link
Member Author

JoanFM commented Jul 31, 2023

do we have any measure on how much faster hnsw works now?

We have observed around 10% improvement

@JoanFM JoanFM requested review from jupyterjazz and samsja July 31, 2023 08:36
@JoanFM JoanFM merged commit a643f6a into main Jul 31, 2023
@JoanFM JoanFM deleted the refactor-hnswlib-performance branch July 31, 2023 08:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants