-
Notifications
You must be signed in to change notification settings - Fork 233
feat: InMemoryExactNNIndex pre filtering #1713
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: jupyterjazz <[email protected]>
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #1713 +/- ##
==========================================
- Coverage 85.50% 85.36% -0.15%
==========================================
Files 132 132
Lines 8308 8323 +15
==========================================
+ Hits 7104 7105 +1
- Misses 1204 1218 +14
Flags with carried forward coverage won't be shown. Click here to find out more.
☔ View full report in Codecov by Sentry. |
Signed-off-by: jupyterjazz <[email protected]>
docarray/index/backends/in_memory.py
Outdated
return find_res | ||
|
||
def _hybrid_search(self, query: List[Tuple[str, Dict]]) -> FindResult: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hybrid search is not that. This is find and filter, let's change the name of this method for clarity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well technically it is but ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, hybrid search is about mixing bm25 scores with embedding scores. Not about filtering + search
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think hybrid search means combining more than one search methods. bm25 + vector search is just one case.
Even we use it like that 😄
https://docs.docarray.org/user_guide/storing/docindex/#hybrid-search-through-the-query-builder
Signed-off-by: jupyterjazz <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also test name to be changed
Signed-off-by: jupyterjazz <[email protected]>
docarray/index/backends/in_memory.py
Outdated
|
||
def _find_and_filter(self, query: List[Tuple[str, Dict]]) -> FindResult: | ||
""" | ||
Executes a hybrid search on documents based on the provided query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change docstring as well pls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep this term, we (and not only us) use it in docs and even though most ppl think of bm25 when talking about hybrid search, this is not wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use find and filter which is what it is
docarray/index/backends/in_memory.py
Outdated
""" | ||
out_docs = self._docs | ||
doc_to_score: Dict[BaseDoc, Any] = {} | ||
limit = sys.maxsize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't u just do limit=10
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored limit logic, uses whatever is passed, and if nothing's passed goes with len(out_docs)
docarray/index/backends/in_memory.py
Outdated
index=out_docs, | ||
query=op_kwargs['query'], | ||
search_field=op_kwargs['search_field'], | ||
limit=len(out_docs), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think limit should be the limit obtained or the ln(out_docs) if no limit present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, I think I made it too complicated
Signed-off-by: jupyterjazz <[email protected]>
docarray/index/backends/in_memory.py
Outdated
metric=self._column_infos[op_kwargs['search_field']].config[ | ||
'space' | ||
], | ||
) | ||
doc_to_score.update(zip(out_docs.id, scores)) | ||
elif op == 'filter': | ||
out_docs = filter_docs(out_docs, op_kwargs['filter_query']) | ||
out_docs = out_docs[: op_kwargs.get('limit', len(out_docs))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check if limit is there before doing this, just for optimization
docarray/index/backends/in_memory.py
Outdated
|
||
def _find_and_filter(self, query: List[Tuple[str, Dict]]) -> FindResult: | ||
""" | ||
Executes a hybrid search on documents based on the provided query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use find and filter which is what it is
Signed-off-by: jupyterjazz <[email protected]>
📝 Docs are deployed on https://ft-feat-inmemory-pre-filtering--jina-docs.netlify.app 🎉 |
Support arbitrary number of find/filter operations for
InMemoryExactNNIndex
, enabling pre+post filteringFrom now on, you can build queries like this:
Note:
how limits workSince developers could provide
limit
in any component of the query, and we also had an internal default value forfind
, the results were confusing. What I'm doing is the following: I'm not applying any limits during the operations, but I'm remembering the lowest one provided, and apply that limit in the end.