Skip to content

feat: InMemoryExactNNIndex pre filtering #1713

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 19, 2023
Merged

Conversation

jupyterjazz
Copy link
Contributor

@jupyterjazz jupyterjazz commented Jul 19, 2023

Support arbitrary number of find/filter operations for InMemoryExactNNIndex, enabling pre+post filtering

From now on, you can build queries like this:

query = (
    doc_index.build_query()
    .filter(filter_query={'price': {'$lte': 3}})
    .find(query=np.ones(10), search_field='tensor')
    .filter(filter_query={'text': {'$eq': 'hello 1'}})
    .build()
)

Note:
how limits work
Since developers could provide limit in any component of the query, and we also had an internal default value for find, the results were confusing. What I'm doing is the following: I'm not applying any limits during the operations, but I'm remembering the lowest one provided, and apply that limit in the end.

Signed-off-by: jupyterjazz <[email protected]>
@codecov
Copy link

codecov bot commented Jul 19, 2023

Codecov Report

Patch coverage: 11.11% and project coverage change: -0.15 ⚠️

Comparison is base (68b0c5b) 85.50% compared to head (47a8d9b) 85.36%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1713      +/-   ##
==========================================
- Coverage   85.50%   85.36%   -0.15%     
==========================================
  Files         132      132              
  Lines        8308     8323      +15     
==========================================
+ Hits         7104     7105       +1     
- Misses       1204     1218      +14     
Flag Coverage Δ
docarray 85.36% <11.11%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
docarray/index/backends/in_memory.py 43.52% <11.11%> (-3.11%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

return find_res

def _hybrid_search(self, query: List[Tuple[str, Dict]]) -> FindResult:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hybrid search is not that. This is find and filter, let's change the name of this method for clarity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well technically it is but ok

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, hybrid search is about mixing bm25 scores with embedding scores. Not about filtering + search

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think hybrid search means combining more than one search methods. bm25 + vector search is just one case.

Even we use it like that 😄
https://docs.docarray.org/user_guide/storing/docindex/#hybrid-search-through-the-query-builder

Signed-off-by: jupyterjazz <[email protected]>
Copy link
Member

@JoanFM JoanFM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also test name to be changed

Signed-off-by: jupyterjazz <[email protected]>
@github-actions github-actions bot added size/m and removed size/s labels Jul 19, 2023

def _find_and_filter(self, query: List[Tuple[str, Dict]]) -> FindResult:
"""
Executes a hybrid search on documents based on the provided query.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change docstring as well pls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep this term, we (and not only us) use it in docs and even though most ppl think of bm25 when talking about hybrid search, this is not wrong

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use find and filter which is what it is

"""
out_docs = self._docs
doc_to_score: Dict[BaseDoc, Any] = {}
limit = sys.maxsize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't u just do limit=10 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored limit logic, uses whatever is passed, and if nothing's passed goes with len(out_docs)

index=out_docs,
query=op_kwargs['query'],
search_field=op_kwargs['search_field'],
limit=len(out_docs),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think limit should be the limit obtained or the ln(out_docs) if no limit present

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I think I made it too complicated

Signed-off-by: jupyterjazz <[email protected]>
metric=self._column_infos[op_kwargs['search_field']].config[
'space'
],
)
doc_to_score.update(zip(out_docs.id, scores))
elif op == 'filter':
out_docs = filter_docs(out_docs, op_kwargs['filter_query'])
out_docs = out_docs[: op_kwargs.get('limit', len(out_docs))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if limit is there before doing this, just for optimization


def _find_and_filter(self, query: List[Tuple[str, Dict]]) -> FindResult:
"""
Executes a hybrid search on documents based on the provided query.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use find and filter which is what it is

@github-actions github-actions bot added size/s and removed size/m labels Jul 19, 2023
@github-actions
Copy link

📝 Docs are deployed on https://ft-feat-inmemory-pre-filtering--jina-docs.netlify.app 🎉

@JoanFM JoanFM merged commit c96707a into main Jul 19, 2023
@JoanFM JoanFM deleted the feat-inmemory-pre-filtering branch July 19, 2023 11:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable pre + post filtering in ExactInMemoryNNIndex with QueryBuilder
2 participants