feat: filtering in hnsw #1718

Merged
merged 6 commits into from
Jul 26, 2023

Conversation

Contributor

@jupyterjazz jupyterjazz commented Jul 20, 2023

Support filtering for HnswDocumentIndex, both as a standalone operation and inside the query builder.

Main changes:

  • Add filterable fields as separate columns in the SQLite table (so we now store id, doc_blob, and the fields that can be filtered)
  • Support a filter function that does not touch hnsw and executes the filtering on the SQLite table
  • Refactor the query builder (hybrid search) to enable pre- and post-filtering. Post-filtering is done in the same way; pre-filtering first queries the SQL table, gets the hashed_ids of the filtered documents, and passes them to hnsw's new feature that can filter vectors based on their ids during search.
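
To make the pre- versus post-filtering flow above concrete, here is a minimal sketch. It is not the code from this PR: the table layout, the `price` column, and the helper names (`filter_ids`, `search`, `pre_filter_search`, `post_filter_search`) are all hypothetical, and `search` is a brute-force stand-in for an hnsw query that accepts an id predicate, not the hnswlib API.

```python
import sqlite3
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy data: (hashed_id, price, vector). Not taken from the PR.
DOCS = [
    (1, 5, (0.0, 0.0)),
    (2, 50, (0.1, 0.0)),
    (3, 7, (1.0, 1.0)),
    (4, 9, (0.2, 0.1)),
]

conn = sqlite3.connect(":memory:")
# Filterable fields get their own column next to the id and the serialized doc.
conn.execute(
    "CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, doc_blob BLOB, price INTEGER)"
)
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [(doc_id, b"", price) for doc_id, price, _ in DOCS],
)
VECTORS = {doc_id: vec for doc_id, _, vec in DOCS}


def filter_ids(where: str, params: Sequence = ()) -> List[int]:
    """Run the filter on SQLite only; the vector index is not touched.

    `where` is trusted here for brevity; real code should build it safely.
    """
    rows = conn.execute(f"SELECT doc_id FROM docs WHERE {where}", params)
    return [row[0] for row in rows]


def search(
    query: Tuple[float, ...],
    k: int,
    allow: Callable[[int], bool] = lambda _id: True,
) -> List[int]:
    """Brute-force stand-in for a vector search that accepts an id predicate."""
    scored = [
        (sum((a - b) ** 2 for a, b in zip(query, vec)), doc_id)
        for doc_id, vec in VECTORS.items()
        if allow(doc_id)
    ]
    return [doc_id for _, doc_id in sorted(scored)[:k]]


def pre_filter_search(query, k, where, params=()) -> List[int]:
    # Filter in SQL first, then restrict the vector search to those ids.
    allowed = set(filter_ids(where, params))
    return search(query, k, allow=allowed.__contains__)


def post_filter_search(query, k, where, params=()) -> List[int]:
    # Search all vectors first, then drop hits that fail the filter.
    allowed = set(filter_ids(where, params))
    return [doc_id for doc_id in search(query, k) if doc_id in allowed]
```

With this toy data, `pre_filter_search((0.0, 0.0), 2, "price < ?", (10,))` returns two matching ids, while `post_filter_search` with the same arguments returns only one, because one of the two nearest vectors is discarded after the search. In general, post-filtering can return fewer than `k` results, whereas restricting the search by id up front does not have that problem as long as enough documents pass the filter.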

Notes:
The query builder became quite complex, even though I tried to decompose it as much as possible.
I think codecov has some problems?

Signed-off-by: jupyterjazz <[email protected]>
@jupyterjazz jupyterjazz linked an issue Jul 20, 2023 that may be closed by this pull request
2 tasks
@codecov

codecov bot commented Jul 20, 2023

Codecov Report

Patch coverage: 39.56% and project coverage change: -1.06% ⚠️

Comparison is base (7ad70bf) 85.54% compared to head (e67dfba) 84.49%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1718      +/-   ##
==========================================
- Coverage   85.54%   84.49%   -1.06%     
==========================================
  Files         133      133              
  Lines        8608     8733     +125     
==========================================
+ Hits         7364     7379      +15     
- Misses       1244     1354     +110     
Flag Coverage Δ
docarray 84.49% <39.56%> (-1.06%) ⬇️

Files Changed Coverage Δ
docarray/index/backends/hnswlib.py 74.34% <39.56%> (-20.99%) ⬇️

... and 1 file with indirect coverage changes

Signed-off-by: jupyterjazz <[email protected]>
Signed-off-by: jupyterjazz <[email protected]>
@jupyterjazz jupyterjazz marked this pull request as ready for review July 21, 2023 10:15
Signed-off-by: jupyterjazz <[email protected]>
Member

@JoanFM JoanFM left a comment

I like it in general. Do we have an idea of how much extra space it takes to index all these fields in SQLite? Should we have a DBConfig entry for the fields that are to be filterable?

},
# `None` is not a Type, but we allow it here anyway
None: {}, # type: ignore
Member

I don't really know why it was here, but why remove it?

Contributor Author

I think it's because the python_type_to_db_type function was returning None for types other than vectors, but I replaced that with a defaultdict, which returns an empty dict by default.
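
As a small illustration of why the explicit `None: {}` entry becomes redundant, here is a sketch of the `defaultdict` behaviour. The `column_config` name and its contents are hypothetical and are not the actual docarray configuration.

```python
from collections import defaultdict
from typing import Any, Dict

# Hypothetical config: only the vector type carries explicit settings.
column_config: Dict[Any, Dict[str, Any]] = defaultdict(
    dict, {"vector": {"dim": 128}}
)

# Keys that were configured explicitly behave like a normal dict.
assert column_config["vector"] == {"dim": 128}

# Any other key, including None, falls back to a fresh empty dict,
# so a hand-written `None: {}` entry is no longer needed.
assert column_config[None] == {}
assert column_config[int] == {}
```

Note that looking up a missing key on a `defaultdict` also inserts it, so the mapping grows as new types are queried.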

@@ -206,6 +217,15 @@ def python_type_to_db_type(self, python_type: Type) -> Any:
if safe_issubclass(python_type, allowed_type):
return np.ndarray

type_map = {
Member

So we must be clear about which items are filterable and which are not. Something to be mindful of when documenting.

Contributor Author

I'll update the docs here: #1678

@jupyterjazz jupyterjazz requested a review from JoanFM July 25, 2023 11:34
@github-actions

📝 Docs are deployed on https://ft-feat-hnsw-prefiltering--jina-docs.netlify.app 🎉

@JoanFM JoanFM merged commit 00e980d into main Jul 26, 2023
@JoanFM JoanFM deleted the feat-hnsw-prefiltering branch July 26, 2023 02:47
Development

Successfully merging this pull request may close these issues.

feat: filtering in HNSW
2 participants