Meta: DocArray v2 Roadmap #780

Closed
@JohannesMessner

DocArray v2

This issue outlines the roadmap for DocArray v2 (this is an internal name, the actual version will still be 0.x.y).

If you want a general overview of why we are doing this rewrite and what our vision is, check out the project's readme and this blog post.

In a nutshell: we are building a library for representing, sending, and storing multimodal data, with deep pydantic integration, and with multimodal ML and Neural Search as flagship use cases.

Roadmap

Below you can find the rough roadmap for this rewrite.

We plan to release alpha versions and dev update blogs for every milestone that we reach, with smaller updates along the way.

As we are at the beginning of this effort, the later stages of this roadmap are not fully fleshed out yet, so take this issue as a living document!

alpha-v0.1.0

Target timeline: Before end of year 2022

What's inside:

DocArray is a library that lets you represent, send, and work on multimodal data.
The first alpha version will tackle the basic aspects of all three of these, but with a limited feature set.
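As a rough illustration of the "represent" and "send" aspects, here is a minimal sketch of a typed multimodal document that round-trips through JSON. It is not DocArray's API: the `ImageTextDoc` schema and the `to_wire`/`from_wire` helpers are hypothetical, and standard-library dataclasses stand in for the pydantic-based models the rewrite is built on.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ImageTextDoc:
    """Hypothetical schema pairing a caption with an image reference."""

    text: str
    image_url: Optional[str] = None
    embedding: List[float] = field(default_factory=list)


def to_wire(doc: ImageTextDoc) -> str:
    """Serialize the document to JSON so it can be sent between services."""
    return json.dumps(asdict(doc))


def from_wire(payload: str) -> ImageTextDoc:
    """Rebuild the typed document on the receiving side."""
    return ImageTextDoc(**json.loads(payload))


doc = ImageTextDoc(text="a cat on a sofa", image_url="cat.png", embedding=[0.1, 0.2])
restored = from_wire(to_wire(doc))
```

The point of the sketch is only that one schema definition drives both the in-memory representation and the wire format.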

We consider the problems that are tackled in the first alpha version as essential to the future of DocArray.

The implementation will be divided into the following phases:

  1. Data representation (Target timeline: Done)
    i. Support basic data types for image and text data
  2. Use case: Vector search system (alpha-v0.0.1) (Target timeline: Dec 15 2022)

  3. Use case: Machine learning, training (alpha-v0.0.n+1) (Target timeline: Dec 15 2022)

  4. Nested data (alpha-v0.0.n+1) (Target timeline: Dec 31 2022)
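To give an idea of what "nested data" means here, the sketch below nests one document type inside another and reads a field through a path. Everything in it is an assumption made for illustration: the `Image` and `Article` schemas, the `get_nested` helper, and the `__` path separator are hypothetical and not taken from the library.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Image:
    """Hypothetical inner document."""

    url: str
    embedding: List[float] = field(default_factory=list)


@dataclass
class Article:
    """Hypothetical outer document that contains another document."""

    title: str
    cover: Image


def get_nested(doc: Any, path: str) -> Any:
    """Follow a '__'-separated attribute path (an assumed convention)."""
    value = doc
    for name in path.split("__"):
        value = getattr(value, name)
    return value


article = Article(title="Hello", cover=Image(url="cover.png", embedding=[1.0, 0.0]))
cover_url = get_nested(article, "cover__url")
```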

alpha-v0.2.0

Target timeline: Feb 15 2023

The plan for the second alpha version (and following) is to iterate on the basic ideas introduced in alpha-v0.1.0.
For now this means:

  1. Util methods (Target timeline: Feb 15 2023)

  2. Support for more data types (Target timeline: Jan 15 2023)

  3. Tensorflow support (Target timeline: Feb 15 2023)

  4. Support for LegacyDocument.

alpha-v0.3.0

Target timeline: Feb 28 2023

  1. Data visualization (Target timeline: Feb 28 2023)

  2. More serialization options (Target timeline: Feb 28 2023)

  3. Support parallel processing and array-like access on DocumentArray (Target timeline: Feb 28 2023)

alpha-v0.4.0

Target timeline: Mar 15 2023

This version will focus on introducing vector databases (and potentially other data storage options) into the library.

  1. Implement DocumentStore class (Target timeline: Feb 28 2023). Related PR: feat: hnswlib document index #1124

  2. Support the following storage backends (already supported in legacy DocArray): (Target timeline: Mar 15 2023)

    • ElasticSearch
    • Qdrant
    • Weaviate
    • HNSW + SQLite
  3. Nested access on Document, DocumentArray, and DocumentStore

  4. Support for push()/pull() to hub
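To make the storage goal concrete, here is a minimal sketch of what a vector-capable store boils down to: index embeddings under an id, then return the nearest ones for a query. It is a brute-force, in-memory stand-in written for illustration only. The `InMemoryVectorStore` class and its `index`/`find` methods are hypothetical and do not reflect the planned DocumentStore interface or any of the listed backends.

```python
import math
from typing import Dict, List, Tuple


def _cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a document store backend."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def index(self, doc_id: str, embedding: List[float]) -> None:
        """Store (or overwrite) the embedding for a document id."""
        self._vectors[doc_id] = embedding

    def find(self, query: List[float], limit: int = 1) -> List[Tuple[str, float]]:
        """Return up to `limit` (id, score) pairs, most similar first."""
        scored = [(doc_id, _cosine(query, vec)) for doc_id, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:limit]


store = InMemoryVectorStore()
store.index("cat", [1.0, 0.0])
store.index("dog", [0.0, 1.0])
best = store.find([0.9, 0.1], limit=1)
```

A real backend such as HNSW replaces the linear scan in `find` with an approximate nearest-neighbour index, but the index-then-query shape stays the same.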

Release version

Target timeline: third week of April 2023

  1. Support for reading from another data format:

  2. Better dev life experience:

Post-release

Here we just round off the stuff we will have started earlier.

  1. Support tensor types for more ML frameworks (Target timeline: Mar 30 2023)
    • HuggingFace safe tensor
    • Scipy
    • Jax (potentially)
    • Sparse tensors for all of the above

Potential features

There are a number of features and use cases that we are thinking about, but are not yet sure if and how they should find their way into the library.

Even if we decide to implement these features, they might not make it into one of the alpha versions. But since we are laying the foundation for everything else to come, we want to consider these from the start.

If you have input on these, please let us know!

  • Support for MongoDB: This could be an interesting candidate for a Document Store backend; it does not have vector search capabilities, but its document-focused design seems like a good fit.
  • Support for S3 storage: This is another option, but it might not fit into our Document Store concept, since it is usually more of a source of data rather than something you continually work with and modify. We are interested to know about different usage patterns and ideas about how to integrate this into DocArray.
  • Native support for Jax: If you are a user of Jax, let us know! We are considering expanding our ML framework / tensor support to include it natively.

You can start a discussion on GitHub Discussions, or join our Discord server.

Changelog:

  • Jan 11/23: Move nested access to storage backend section and add Jina support
  • Jan 26/23: De-prioritize map/batch/reduce/... operations and adjust timeline accordingly
  • Jan 27/23: Remove "Jina support" as it is a Jina concern, not a DocArray concern. Add access by id and move alpha-0.3.0 target date
  • Feb 07/23: Rearrange roadmap. Remove access by id. Delay map, apply, etc.

Labels: DocArray v2 (This issue is part of the rewrite; not to be merged into main)
