DocArray v2
This issue outlines the roadmap for DocArray v2 (this is an internal name, the actual version will still be 0.x.y).
If you want to get a general overview of why we are doing this rewrite, and what our general vision is, check out the project's readme and this blog post.
But in a nutshell: We are building a library for representing, sending, and storing multimodal data, with a deep integration with pydantic, and multimodal ML and Neural Search as flagship use cases.
Roadmap
Below you can find the rough roadmap for this rewrite.
We plan to release alpha versions and dev update blogs for every milestone that we reach, with smaller updates along the way.
As we are at the beginning of this effort, the later stages of this roadmap are not fully fleshed out yet, so take this issue as a living document!
alpha-v0.1.0
Target timeline: Before end of year 2022
What's inside:
DocArray is a library that lets you represent, send, and work on multimodal data.
The first alpha version will tackle the basic aspects of all three of these, but with a limited feature set.
We consider the problems that are tackled in the first alpha version as essential to the future of DocArray.
The implementation will be divided into three different phases:
- Data representation (Target timeline: Done)
  i. Support basic data types for image and text data
    - str
    - Tensor for numpy tensors
    - ImageURI (Type: ImageURI #784)
    - TextURI (Type: TextURI #785)
    - Embedding (Type: Embedding #786)
  ii. Provide pre-built Documents
    - Image (Pre-built: Image #787)
    - Text (Pre-built: Text #788)
  iii. Basic implementation of DocumentArray
- Use case: Vector search system (alpha-v0.0.1) (Target timeline: Dec 15 2022)
  - Ensure compatibility with FastAPI (Make compatible with fastapi #838)
  - Implement find() on DocumentArray level. Basic implementation that can perform search on root-level embeddings (no support for search on nested levels; this will come later) (feat: find function #931)
- Use case: Machine learning, training (alpha-v0.0.n+1) (Target timeline: Dec 15 2022)
  - Support for PyTorch tensor data type (Type: PyTorch tensor #783)
  - Support for torch tensors and numpy in column-wise mode ("stacked mode") (other frameworks will follow later) (feat(v2): tensor column mode #886)
  - Ensure compatibility with PyTorch modules and PyTorch Lightning
- Nested data (alpha-v0.0.n+1) (Target timeline: Dec 31 2022)
  - Nested access on DocumentArray ("access paths") (Investigate nested access from DocumentArray ("access paths") #957)
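As a rough illustration of the vector search and "stacked mode" items above, the following sketch gathers root-level embeddings into one column and runs a brute-force search over it. It is not DocArray's implementation: the `Doc` dataclass, `stack_embeddings`, `cosine`, and this `find` signature are hypothetical stand-ins written in plain Python, and cosine similarity is an assumed metric.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Doc:
    # Hypothetical stand-in for a Document; not the DocArray class.
    text: str
    embedding: List[float]  # root-level embedding


def stack_embeddings(docs: List[Doc]) -> List[List[float]]:
    """Gather per-document embeddings into one column (one row per document)."""
    return [doc.embedding for doc in docs]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find(docs: List[Doc], query: List[float], limit: int = 2) -> List[Tuple[Doc, float]]:
    """Brute-force search over root-level embeddings, best match first."""
    column = stack_embeddings(docs)
    scores = [cosine(row, query) for row in column]
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]


docs = [
    Doc("cat", [1.0, 0.0]),
    Doc("dog", [0.8, 0.6]),
    Doc("car", [0.0, 1.0]),
]
results = find(docs, query=[1.0, 0.1], limit=2)
top_texts = [doc.text for doc, _ in results]
```

Only the embedding stored directly on each document is searched here; embeddings on nested documents would need the access-path support listed under "Nested data".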
alpha-v0.2.0
Target timeline: Feb 15 2023
The plan for the second alpha version (and following) is to iterate on the basic ideas introduced in alpha-v0.1.0.
For now this means:
- Util methods (Target timeline: Feb 15 2023)
  - filter with query language (feat: add filter capability to DocumentArray #1051)
  - reduce (feat: reduce and update methods for DocumentArray and BaseDocument #1076)
  - array-like access with the getitem call (feat: advanced indexing #1074)
- Support for more data types (Target timeline: Jan 15 2023)
  - Video (feat(v2): add video support #972)
  - Audio (feat(v2): add audio url and predefined document #940)
  - 3D meshes (feat(v2): add 3d data handling #925)
  - support bytes field in current type
- TensorFlow support (Target timeline: Feb 15 2023)
  - TensorFlow tensor type (feat(v2): add tensorflow support #1064)
  - TensorFlow embedding type (feat(v2): add tensorflow embedding, audio, video #1098)
- Support for LegacyDocument
  - Provide legacy Document (feat: add v1 equivalent Document #1090)
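To make the "filter with query language" item more concrete, here is a minimal sketch of what filtering by a declarative query could look like. The roadmap does not specify the query syntax, so the MongoDB-style operator dictionary (`$lt`, `$gte`, ...) and the helper names `matches` and `filter_docs` are assumptions for illustration only, and plain dicts stand in for Documents.

```python
import operator
from typing import Any, Dict, List

# Hypothetical operator table; the roadmap does not define the query syntax.
OPS = {
    "$eq": operator.eq,
    "$neq": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(doc: Dict[str, Any], query: Dict[str, Dict[str, Any]]) -> bool:
    """Return True if every field condition in the query holds for doc."""
    for field, conditions in query.items():
        if field not in doc:
            return False
        for op_name, expected in conditions.items():
            if not OPS[op_name](doc[field], expected):
                return False
    return True


def filter_docs(docs: List[Dict[str, Any]], query: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the documents that satisfy the query, preserving order."""
    return [doc for doc in docs if matches(doc, query)]


docs = [
    {"title": "a", "price": 5},
    {"title": "b", "price": 15},
    {"title": "c", "price": 25},
]
cheap = filter_docs(docs, {"price": {"$lt": 20}})
mid_range = filter_docs(docs, {"price": {"$gte": 10, "$lt": 20}})
```

Expressing the filter as data rather than as a Python callable is what would allow the same query to be passed to a storage backend later on.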
alpha-v0.3.0
Target timeline: Feb 28 2023
- Data visualization (Target timeline: Feb 28 2023)
  - Pretty print and summary of Document and DocumentArray (feat(v2): rich display for doc and da #1043)
  - Plotting for
- More serialization options (Target timeline: Feb 28 2023)
  - base64
  - bytes
  - bytes in streaming mode (see https://docs.docarray.org/fundamentals/documentarray/serialization/#from-to-bytes)
  - json
- Support parallel processing and array-like access on DocumentArray (Target timeline: Feb 28 2023)
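The serialization formats listed above build on each other, which the following sketch tries to show: JSON text can be encoded to bytes, and bytes can be wrapped in base64 for text-only transports. This is a standard-library illustration only; the `Doc` dataclass and its `to_*`/`from_*` method names are hypothetical and do not describe the DocArray API or its wire format.

```python
import base64
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Doc:
    # Hypothetical stand-in for a Document.
    text: str
    embedding: List[float]

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, data: str) -> "Doc":
        return cls(**json.loads(data))

    def to_bytes(self) -> bytes:
        # Assumption: bytes are simply UTF-8 encoded JSON in this sketch.
        return self.to_json().encode("utf-8")

    @classmethod
    def from_bytes(cls, data: bytes) -> "Doc":
        return cls.from_json(data.decode("utf-8"))

    def to_base64(self) -> str:
        return base64.b64encode(self.to_bytes()).decode("ascii")

    @classmethod
    def from_base64(cls, data: str) -> "Doc":
        return cls.from_bytes(base64.b64decode(data))


doc = Doc(text="hello", embedding=[0.1, 0.2])
restored_json = Doc.from_json(doc.to_json())
restored_bytes = Doc.from_bytes(doc.to_bytes())
restored_b64 = Doc.from_base64(doc.to_base64())
```

A real binary format would likely be more compact than UTF-8 JSON, especially for tensors; the point here is only the round trip through each representation.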
alpha-v0.4.0
Target timeline: Mar 15 2023
This version will focus on introducing vector databases (and potentially other data storage options) into the library.
- Implement DocumentStore class (Target timeline: Feb 28 2023) (feat: hnswlib document index #1124)
- Support the following storage backends (already supported in legacy DocArray): (Target timeline: Mar 15 2023)
  - ElasticSearch
  - Qdrant
  - Weaviate
  - HNSW + SQLite
- Nested access on Document, DocumentArray, and DocumentStore
  - find on nested data/documents (feat: nested attribute access in find() #1176)
- Support for push()/pull() to hub
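To illustrate what nested access via "access paths" could mean, the sketch below resolves a path through nested documents and flattens the values it finds. The `__` separator, the function name `traverse_flat`, and the use of plain dicts in place of Documents are all assumptions made for this example; the roadmap itself does not define them.

```python
from typing import Any, Dict, List


def traverse_flat(docs: List[Dict[str, Any]], access_path: str, sep: str = "__") -> List[Any]:
    """Follow access_path through nested dicts and flatten the values found."""
    current: List[Any] = list(docs)
    for key in access_path.split(sep):
        next_level: List[Any] = []
        for item in current:
            value = item[key]
            if isinstance(value, list) and value and isinstance(value[0], dict):
                # A list of nested documents: descend into each of them.
                next_level.extend(value)
            else:
                # A leaf value (string, number, embedding, ...): keep as is.
                next_level.append(value)
        current = next_level
    return current


books = [
    {
        "title": "book1",
        "chapters": [
            {"text": "a", "embedding": [1.0, 0.0]},
            {"text": "b", "embedding": [0.0, 1.0]},
        ],
    },
    {
        "title": "book2",
        "chapters": [{"text": "c", "embedding": [1.0, 1.0]}],
    },
]
titles = traverse_flat(books, "title")
chapter_texts = traverse_flat(books, "chapters__text")
chapter_embeddings = traverse_flat(books, "chapters__embedding")
```

Once nested embeddings are flattened this way, a search over nested levels could in principle reuse the same logic as a search over root-level embeddings.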
Release version
Target timeline: third week of April 2023
- Support for reading from another data format:
- Better dev life experience:
  - Mypy plugin #1236
  - PyCharm plugin + Fix pycharm problem
  - Make ourselves mypy compatible #1237
Post-release
Here we just round off the stuff we will have started earlier.
- Support tensor types for more ML frameworks (Target timeline: Mar 30 2023)
  - HuggingFace safe tensor
  - Scipy
  - Jax (potentially)
  - Sparse tensors for all of the above
Potential features
There are a number of features and use cases that we are thinking about, but are not yet sure if and how they should find their way into the library.
Even if we decide to implement these features, they might not make it into one of the alpha versions. But since we are laying the foundation for everything else to come, we want to consider these from the start.
If you have input on these, please let us know!
- Support for MongoDB: This could be an interesting candidate for a Document Store backend; it does not have vector search capabilities, but the document-focused design seems like a good fit.
- Support for S3 storage: This is another option, but it might not fit into our Document Store concept, since it is usually more of a source of data rather than something you continually work with and modify. We are interested to know about different usage patterns and ideas about how to integrate this into DocArray.
- Native support for Jax: If you are a user of Jax, let us know! We are considering expanding our ML framework / tensor support to include it natively.
You can start a discussion on GitHub Discussions, or join our Discord server.
Changelog:
- Jan 11/23: Move nested access to storage backend section and add Jina support
- Jan 26/23: De-prioritize map/batch/reduce/... operations and adjust timeline accordingly
- Jan 27/23: Remove "Jina support" as it is a Jina concern, not a DocArray concern. Add access by id and move alpha-0.3.0 target date
- Feb 07/23: Rearrange ROADMAP. Remove access by id. Delay map, apply, etc.