feat: InMemoryExactNNIndex pre filtering #1713

jupyterjazz · 2023-07-19T07:52:13Z

Support an arbitrary number of find/filter operations for InMemoryExactNNIndex, enabling pre- and post-filtering.

From now on, you can build queries like this:

query = (
    doc_index.build_query()
    .filter(filter_query={'price': {'$lte': 3}})
    .find(query=np.ones(10), search_field='tensor')
    .filter(filter_query={'text': {'$eq': 'hello 1'}})
    .build()
)

Note: how limits work
Since developers could provide a limit in any component of the query, and we also had an internal default value for find, the results were confusing. What I'm doing now is the following: I don't apply any limit during the individual operations, but I remember the lowest one provided and apply that limit once at the end.
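The limit handling described in this note can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from this PR: `run_query`, the `predicate` and `score` keys, and the list-of-dicts documents are hypothetical stand-ins for the index internals.

```python
import sys
from typing import Any, Callable, Dict, List, Tuple

Doc = Dict[str, Any]


def run_query(docs: List[Doc], ops: List[Tuple[str, Dict[str, Any]]]) -> List[Doc]:
    """Run find/filter operations in order, applying the lowest limit once at the end."""
    limit = sys.maxsize  # no limit seen yet
    out = list(docs)
    for op, kwargs in ops:
        # Remember the smallest limit seen so far, but do not truncate yet.
        if 'limit' in kwargs:
            limit = min(limit, kwargs['limit'])
        if op == 'filter':
            predicate: Callable[[Doc], bool] = kwargs['predicate']
            out = [d for d in out if predicate(d)]
        elif op == 'find':
            # Hypothetical scoring: a higher score means a better match.
            score: Callable[[Doc], float] = kwargs['score']
            out = sorted(out, key=score, reverse=True)
        else:
            raise ValueError(f'unknown operation: {op}')
    # Apply the lowest limit to the final result only.
    return out[:limit]


docs = [{'price': p} for p in range(6)]
ops = [
    ('filter', {'predicate': lambda d: d['price'] <= 4, 'limit': 3}),
    ('find', {'score': lambda d: d['price'], 'limit': 2}),
]
result = run_query(docs, ops)
```

Because truncation happens only at the end, the filter's `limit` of 3 cannot discard documents that the later find would have ranked highest.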

Signed-off-by: jupyterjazz <[email protected]>

codecov · 2023-07-19T07:56:59Z

Codecov Report

Patch coverage: 11.11% and project coverage change: -0.15 ⚠️

Comparison is base (68b0c5b) 85.50% compared to head (47a8d9b) 85.36%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1713      +/-   ##
==========================================
- Coverage   85.50%   85.36%   -0.15%     
==========================================
  Files         132      132              
  Lines        8308     8323      +15     
==========================================
+ Hits         7104     7105       +1     
- Misses       1204     1218      +14

Flag	Coverage Δ
docarray	`85.36% <11.11%> (-0.15%)`	⬇️

Impacted Files	Coverage Δ
docarray/index/backends/in_memory.py	`43.52% <11.11%> (-3.11%)`	⬇️


JoanFM · 2023-07-19T08:24:51Z

docarray/index/backends/in_memory.py

        return find_res

+    def _hybrid_search(self, query: List[Tuple[str, Dict]]) -> FindResult:


hybrid search is not that. This is find and filter, let's change the name of this method for clarity

well technically it is but ok

no, hybrid search is about mixing bm25 scores with embedding scores. Not about filtering + search

I think hybrid search means combining more than one search method; bm25 + vector search is just one case.

We even use it like that 😄
https://docs.docarray.org/user_guide/storing/docindex/#hybrid-search-through-the-query-builder

JoanFM

also test name to be changed

JoanFM · 2023-07-19T09:28:29Z

docarray/index/backends/in_memory.py

+
+    def _find_and_filter(self, query: List[Tuple[str, Dict]]) -> FindResult:
+        """
+        Executes a hybrid search on documents based on the provided query.


change docstring as well pls

I think we should keep this term, we (and not only us) use it in docs and even though most ppl think of bm25 when talking about hybrid search, this is not wrong

we use find and filter which is what it is

JoanFM · 2023-07-19T09:29:22Z

docarray/index/backends/in_memory.py

+        """
+        out_docs = self._docs
+        doc_to_score: Dict[BaseDoc, Any] = {}
+        limit = sys.maxsize


why don't u just do limit=10 here?

I refactored the limit logic: it uses whatever limit is passed, and if nothing is passed it goes with len(out_docs)

JoanFM · 2023-07-19T09:30:34Z

docarray/index/backends/in_memory.py

+                    index=out_docs,
+                    query=op_kwargs['query'],
+                    search_field=op_kwargs['search_field'],
+                    limit=len(out_docs),


I think limit should be the limit obtained, or len(out_docs) if no limit is present

good point, I think I made it too complicated
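A minimal sketch of the suggestion above, assuming the scores are already computed: take the limit from the operation's kwargs when one was passed, otherwise fall back to the number of candidates. `top_k` is a hypothetical helper for illustration, not a docarray function.

```python
from typing import Any, Dict, List, Sequence


def top_k(scores: Sequence[float], op_kwargs: Dict[str, Any]) -> List[int]:
    """Return candidate indices ranked by descending score, cut to the effective limit."""
    # Use the caller's limit if present, else keep every candidate.
    limit = op_kwargs.get('limit', len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:limit]
```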

JoanFM · 2023-07-19T09:50:04Z

docarray/index/backends/in_memory.py

                    metric=self._column_infos[op_kwargs['search_field']].config[
                        'space'
                    ],
                )
                doc_to_score.update(zip(out_docs.id, scores))
            elif op == 'filter':
                out_docs = filter_docs(out_docs, op_kwargs['filter_query'])
+                out_docs = out_docs[: op_kwargs.get('limit', len(out_docs))]


check if limit is there before doing this, just for optimization
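One way to read this suggestion, sketched in plain Python: slice only when a limit was actually passed, so the no-limit path returns the filtered documents untouched. `apply_filter_limit` is a hypothetical name, and a plain list stands in for the document collection.

```python
from typing import Any, Dict, List


def apply_filter_limit(docs: List[Any], op_kwargs: Dict[str, Any]) -> List[Any]:
    """Truncate the filtered docs only if the filter operation carried a limit."""
    limit = op_kwargs.get('limit')
    if limit is None:
        # No limit given: skip the slice (and the copy it would create).
        return docs
    return docs[:limit]
```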

JoanFM · 2023-07-19T09:57:11Z

docarray/index/backends/in_memory.py

+
+    def _find_and_filter(self, query: List[Tuple[str, Dict]]) -> FindResult:
+        """
+        Executes a hybrid search on documents based on the provided query.


we use find and filter which is what it is

github-actions · 2023-07-19T10:03:33Z

📝 Docs are deployed on https://ft-feat-inmemory-pre-filtering--jina-docs.netlify.app 🎉

feat: inmemory pre filtering

273e94e

Signed-off-by: jupyterjazz <[email protected]>

jupyterjazz linked an issue Jul 19, 2023 that may be closed by this pull request

Enable pre + post filtering in ExactInMemoryNNIndex with QueryBuilder #1554

Closed

github-actions bot added size/s area/core area/testing labels Jul 19, 2023

refactor: remove unused lines

c90b1ff

Signed-off-by: jupyterjazz <[email protected]>

JoanFM requested changes Jul 19, 2023

refactor: fn name

8df5bc6

Signed-off-by: jupyterjazz <[email protected]>

JoanFM requested changes Jul 19, 2023

refactor: fn namez

e4dab39

Signed-off-by: jupyterjazz <[email protected]>

github-actions bot added size/m and removed size/s labels Jul 19, 2023

JoanFM requested changes Jul 19, 2023

refactor: limit logic

e96a72d

Signed-off-by: jupyterjazz <[email protected]>

JoanFM requested changes Jul 19, 2023

refactor: avoid slicing if not needed

47a8d9b

Signed-off-by: jupyterjazz <[email protected]>

github-actions bot added size/s and removed size/m labels Jul 19, 2023

JoanFM merged commit c96707a into main Jul 19, 2023

JoanFM deleted the feat-inmemory-pre-filtering branch July 19, 2023 11:20

jupyterjazz mentioned this pull request Aug 1, 2023

Release Notes v0.37.0 #1740

Closed
