MVP goals #1

Merged: 17 commits, Apr 14, 2022
84 changes: 79 additions & 5 deletions README.md
@@ -1,6 +1,82 @@
## Postgres ML demo
## PostgresML

PostgresML aims to be the easiest way to gain value from machine learning. Anyone with a basic understanding of SQL should be able to build and deploy models to production, while receiving the benefits of a high-performance machine learning platform. PostgresML leverages state-of-the-art algorithms with built-in best practices, without having to set up additional infrastructure or learn additional programming languages.

Getting started is as easy as creating a `table` or `view` that holds the training data, and then registering that with PostgresML.

```sql
SELECT pgml.model_regression('Red Wine Quality', training_data_table_or_view_name, label_column_name);
```

And predict novel datapoints:

```sql
SELECT pgml.predict('Red Wine Quality', red_wines.*)
FROM pgml.red_wines
LIMIT 3;

quality
---------
0.896432
0.834822
0.954502
(3 rows)
```

PostgresML similarly supports classification to predict discrete classes rather than numeric scores for novel data.

```sql
SELECT pgml.create_classification('Handwritten Digit Classifier', pgml.mnist_training_data, label_column_name);
```

And predict novel datapoints:

```sql
SELECT pgml.predict('Handwritten Digit Classifier', pgml.mnist_test_data.*)
FROM pgml.mnist_test_data
LIMIT 1;

 digit | likelihood
-------+------------
     5 |   0.956432
(1 row)
```

Check out the [documentation](https://TODO) to view the full capabilities, including:
- [Creating Training Sets](https://TODO)
- [Classification](https://TODO)
- [Regression](https://TODO)
- [Supported Algorithms](https://TODO)
- [Scikit Learn](https://TODO)
- [XGBoost](https://TODO)
- [Tensorflow](https://TODO)
- [PyTorch](https://TODO)

### Planned features
- Model management dashboard
- Data explorer
- More algorithms and libraries, including custom algorithm support


### FAQ

*How well does this scale?*

Petabyte-sized Postgres deployments are [documented](https://www.computerworld.com/article/2535825/size-matters--yahoo-claims-2-petabyte-database-is-world-s-biggest--busiest.html) in production since at least 2008, and [recent patches](https://www.2ndquadrant.com/en/blog/postgresql-maximum-table-size/) have enabled working beyond exabyte up to the yottabyte scale. Machine learning models can be horizontally scaled using well-tested Postgres replication techniques on top of a mature storage and compute platform.

*How reliable is this system?*

Postgres is widely considered mission critical, and among the most [reliable](https://www.postgresql.org/docs/current/wal-reliability.html) technologies in any modern stack. PostgresML allows an infrastructure organization to leverage pre-existing best practices to deploy machine learning into production with less risk and effort than other systems. For example, model backup and recovery happens automatically alongside normal data backup procedures.

*How good are the models?*

Model quality is often a tradeoff between compute resources and incremental quality improvements. PostgresML allows stakeholders to choose algorithms from several libraries that will provide the most bang for the buck. In addition, PostgresML automatically applies best practices for data cleaning, such as imputing missing values by default and normalizing data, to prevent common problems in production. After quickly enabling 0 to 1 value creation, PostgresML enables further expert iteration with custom data preparation and algorithm implementations. Like most things in life, the ultimate in quality will be a concerted effort of experts working over time, but that shouldn't get in the way of a quick start.
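
As a rough illustration of the two cleaning steps named above, here is a minimal sketch in plain Python. It assumes mean imputation and z-score standardization, which this README does not specify, and `impute_and_normalize` is a hypothetical helper rather than part of PostgresML.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def impute_and_normalize(column: Sequence[Optional[float]]) -> List[float]:
    """Fill missing values with the column mean, then scale to zero mean and unit variance.

    Assumption: mean imputation and z-score scaling stand in for whatever
    defaults PostgresML actually applies.
    """
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")

    fill = mean(observed)
    filled = [fill if v is None else v for v in column]

    mu = mean(filled)
    sigma = pstdev(filled)
    if sigma == 0:
        # A constant column carries no signal; return zeros instead of dividing by zero.
        return [0.0 for _ in filled]
    return [(v - mu) / sigma for v in filled]


# Hypothetical feature column with one missing measurement.
alcohol = [9.4, None, 9.8, 11.2]
scaled = impute_and_normalize(alcohol)
print(scaled)
```

The imputed entry lands at (approximately) zero after scaling, because it was filled with the column mean.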

*Is PostgresML fast?*

Colocating the compute with the data inside the database removes one of the most common latency bottlenecks in the ML stack, which is the (de)serialization of data between stores and services across the wire. Modern versions of Postgres also support automatic query parallelization across multiple workers to further minimize latency in large batch workloads. Finally, PostgresML will utilize GPU compute if both the algorithm and hardware support it, although it is currently rare in practice for production databases to have GPUs. Check out our [benchmarks](https://todo).
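
To get a feel for the (de)serialization cost described above, the sketch below compares scoring rows in-process against the same scoring wrapped in a JSON encode/decode round trip. It is only an approximation: there is no real network hop, `score` is a stand-in for model inference, and all names are hypothetical.

```python
import json
import time


def score(rows):
    """Stand-in for model inference: sum each feature row."""
    return [sum(row) for row in rows]


def score_over_the_wire(rows):
    """Simulate a remote model service, minus the network itself."""
    request = json.dumps(rows).encode("utf-8")        # client serializes the features
    received = json.loads(request.decode("utf-8"))    # service deserializes them
    reply = json.dumps(score(received)).encode("utf-8")
    return json.loads(reply.decode("utf-8"))          # client deserializes the scores


def timed(fn, rows):
    """Return the function result and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(rows)
    return result, time.perf_counter() - start


rows = [[float(i), float(i) / 2, 1.0] for i in range(50_000)]
local_result, local_seconds = timed(score, rows)
wire_result, wire_seconds = timed(score_over_the_wire, rows)

print(f"in-process: {local_seconds * 1000:.1f} ms")
print(f"with (de)serialization: {wire_seconds * 1000:.1f} ms")
```

Both paths return identical scores; only the time spent encoding and decoding differs, and a real deployment would add network latency on top.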


Quick demo with Postgres, PL/Python, and Scikit.

### Installation in WSL or Ubuntu

@@ -29,11 +105,9 @@ Install Scikit globally (I didn't bother setup Postgres with a virtualenv, but i
sudo pip3 install sklearn
```

### Run the demo
### Run the example

```bash
sudo mkdir /app/models
sudo chown postgres:postgres /app/models
psql -f scikit_train_and_predict.sql
```

23 changes: 23 additions & 0 deletions benchmarks.sql
@@ -0,0 +1,23 @@
--
-- CREATE EXTENSION
--
CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION pg_call()
RETURNS INT
AS $$
BEGIN
RETURN 1;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION py_call()
RETURNS INT
AS $$
return 1;
$$ LANGUAGE plpython3u;

\timing on
SELECT generate_series(1, 50000), pg_call(); -- Time: 20.679 ms
SELECT generate_series(1, 50000), py_call(); -- Time: 67.355 ms
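
The two timings recorded in the comments above can be turned into a per-call figure. The following is a small worked calculation using only those two numbers. It assumes the difference between the runs is entirely PL/Python call overhead, which a single run of each query cannot really establish.

```python
ROWS = 50_000
PLPGSQL_MS = 20.679    # recorded for the pg_call() query
PLPYTHON_MS = 67.355   # recorded for the py_call() query

# Extra wall-clock time attributed to calling into PL/Python instead of PL/pgSQL.
extra_ms = PLPYTHON_MS - PLPGSQL_MS
extra_us_per_call = extra_ms * 1000 / ROWS
slowdown = PLPYTHON_MS / PLPGSQL_MS

print(f"extra cost per call: {extra_us_per_call:.2f} us")
print(f"relative slowdown: {slowdown:.2f}x")
```

On these numbers the PL/Python call costs a little under one microsecond more per row, roughly 3.3 times the PL/pgSQL total.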
