MVP goals #1

montanalow · 2022-04-12T00:43:18Z

Add some verbage to outline the target use case

levkk · 2022-04-12T14:12:39Z

I like this; it's not too technical yet describes what we are doing well. Depending on who we're targeting, I wouldn't worry about scaling too much, but it's good to know that we can potentially be exabyte scale.

README.md

… we will be more liberal with data types we accept

…r a project

pgml/pgml/model.py

pgml/tests/test_train.py

levkk · 2022-04-13T23:25:08Z

sql/install.sql

+    name_parts := string_to_array(tbl::TEXT, '.');
+    name := name_parts[array_upper(name_parts, 1)];
+
+    EXECUTE format('DROP TRIGGER IF EXISTS %s_auto_updated_at ON %s', name, tbl);


Why do you need this whole function?

The other function is the trigger itself. This function makes it easy to add that trigger to a table with a one liner.

I see. Why do you need the trigger? :)

sql/install.sql

levkk · 2022-04-13T23:29:12Z

sql/test.sql


-	-- features as variadic arguments
-	7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4) AS score;
+SELECT pgml.model_regression('Red Wine', 'wine_quality_red', 'quality');


We need versioning for the models. The most common use case is training a model once per week for example, and promoting it based on better scoring.

Yep, that will be multiple models per single project.

Co-authored-by: Lev Kokotov <[email protected]>

…o montana/readme

montanalow

Need to add more docs and tests

montanalow · 2022-04-14T03:27:48Z

pgml/tests/test_train.py

-    train(it, y_column="weight", name="test", save=False)
+class TestRegression(unittest.TestCase):
+    def test_init(self):
+        pgml.model.train("Test", "regression", "test", "test_y")


levkk · 2022-04-14T03:52:52Z

sql/test.sql


-SELECT * FROM pgml.model_versions;
+SELECT pgml.train('Red Wine Scores', 'regression', 'wine_quality_red', 'quality');
+SELECT pgml.predict('Red Wine Scores', 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);


How do you know which version of the model you're predicting against? If you train multiple.

For now, the "Best" model gets promoted at the end of the training runs. We can build a real promotion system to handle all the bells and whistles.

That's really cool!

levkk · 2022-04-14T05:49:23Z

pgml/pgml/model.py

+            y.append(y_)
+
+        # Split into training and test sets
+        if (self.test_sampling == 'random'):


Don't need parentheses in Python for if-statements.

levkk · 2022-04-14T05:50:16Z

pgml/pgml/model.py

+        if (self.test_sampling == 'random'):
+            return train_test_split(X, y, test_size=self.test_size, random_state=0)
+        else:
+            if (self.test_sampling == 'first'):


What does this technique do?

For something like time series data, you don't want to random sample for the test data. You want to use a chunk of data at the end after all the training data, to avoid leaking information from the future into the model.

pgml/pgml/model.py

levkk · 2022-04-14T06:04:18Z

pgml/pgml/model.py

+        """)
+        model = Model()
+        model.__dict__ = dict(result[0])
+        model.__init__()


You don't need to call __init__ explicitly I don't think? __init__ gets called when you create the object, e.g. Model() is effectively a call to __new__() and __init__() sequentially.

I understand now, when using __dict__, __init__() isn't called, nice.

levkk · 2022-04-14T06:14:05Z

Btw you can use black to auto-format/lint the code, e.g.

pip install black
black ./

levkk · 2022-04-14T06:38:25Z

pgml/pgml/model.py

+    def algorithm(self):
+        if self._algorithm is None:
+            if self.pickle is not None:
+                self._algorithm = pickle.loads(self.pickle)


I saw you pickling weights only in the previous iteration of this code; did you decide to pickle the whole object instead? I think in general they should be pretty small so this will work, until we may get to NNs with lots of layers?

Yeah, I even there larger NN's we used at IC are only a few megabytes. I think this will likely work up to the point someone wants to run something like GPT-3...

pgml/pgml/model.py

levkk · 2022-04-14T07:12:14Z

pgml/pgml/model.py

+    best_model = None
+    best_error = None
+    if objective == "regression":
+        algorithms = ["linear", "random_forest"]


You can put classes here, just like in Ruby:

algorithms = [LinearRegression, RandomForestRegressor]

although, maybe that's not a good idea...

I was figuring at some point we'll want the user to be able to pass algorithm names in from sql as an option as well.

benchmarks.sql

Co-authored-by: Lev Kokotov <[email protected]>

levkk · 2022-04-14T18:06:11Z

pgml/tests/test_model.py

+class TestModel(unittest.TestCase):
+    def test_the_world(self):
+        plpy.add_mock_result(
+            [{"id": 1, "name": "Test", "objective": "regression", "created_at": time.time(), "updated_at": time.time()}]


You can just wrap psycopg2 and get a more real test, e.g. test your schema, etc.

…o montana/readme

pgml is not compatible with plpython, if using both pgml and plpython in the same session, postgresql will crash. minimum reproducible code: ```sql SELECT pgml.embed('intfloat/e5-small', 'hi mom'); create or replace function pyudf() returns int as $$ return 0 $$ language 'plpython3u'; ``` the call stack: ``` Stack trace of thread 161970: #0 0x00007efc1429edb8 PyImport_Import (libpython3.9.so.1.0 + 0x9edb8) postgresml#1 0x00007efc1429f125 PyImport_ImportModule (libpython3.9.so.1.0 + 0x9f125) postgresml#2 0x00007efb04b0f496 n/a (plpython3.so + 0x10496) postgresml#3 0x00007efb04b1039d plpython3_validator (plpython3.so + 0x1139d) postgresml#4 0x0000559d0cdbc5c2 OidFunctionCall1Coll (postgres + 0x6465c2) postgresml#5 0x0000559d0c9d68bb ProcedureCreate (postgres + 0x2608bb) postgresml#6 0x0000559d0ca5030c CreateFunction (postgres + 0x2da30c) postgresml#7 0x0000559d0ce1c730 n/a (postgres + 0x6a6730) postgresml#8 0x0000559d0cc5a030 standard_ProcessUtility (postgres + 0x4e4030) postgresml#9 0x0000559d0cc545ed n/a (postgres + 0x4de5ed) postgresml#10 0x0000559d0cc546e7 n/a (postgres + 0x4de6e7) postgresml#11 0x0000559d0cc54beb PortalRun (postgres + 0x4debeb) postgresml#12 0x0000559d0cc55249 n/a (postgres + 0x4df249) postgresml#13 0x0000559d0cc576f0 PostgresMain (postgres + 0x4e16f0) postgresml#14 0x0000559d0cbc3e9c n/a (postgres + 0x44de9c) postgresml#15 0x0000559d0cbc50aa PostmasterMain (postgres + 0x44f0aa) postgresml#16 0x0000559d0c8ce7d2 main (postgres + 0x1587d2) postgresml#17 0x00007efc18427cd0 n/a (libc.so.6 + 0x27cd0) postgresml#18 0x00007efc18427d8a __libc_start_main (libc.so.6 + 0x27d8a) postgresml#19 0x0000559d0c8cee15 _start (postgres + 0x158e15) ``` this is because PostgreSQL is using dlopen(RTLD_GLOBAL). this will parse some of symbols into the previous opened .so file, but the others will use a relative offset in pgml.so, and will cause a null-pointer crash. this commit hide all symbols except the UDF symbols (ends with `_wrapper`) and the magic symbols (`_PG_init` `Pg_magic_func`). so dlopen(RTLD_GLOBAL) will parse the symbols to the correct position.

update instructions

MVP goals

montanalow added the in progress label Apr 12, 2022

montanalow self-assigned this Apr 12, 2022

levkk reviewed Apr 12, 2022

View reviewed changes

README.md Show resolved Hide resolved

levkk reviewed Apr 12, 2022

View reviewed changes

README.md Show resolved Hide resolved

Montana Low added 5 commits April 12, 2022 10:12

MVP goals

92b80f2

Use unittest as the test running harness

f18f276

remove validate because validation has a different meaning in ML, and…

3c66272

… we will be more liberal with data types we accept

keep model in memory to avoid going to disk

958cfba

use bytea directly for pl/python rather than hex/text conversion

14b1f61

montanalow force-pushed the montana/readme branch from 3be240e to 14b1f61 Compare April 12, 2022 21:39

Montana Low added 2 commits April 12, 2022 16:59

add a draft schema to support snapshots and multiple training runs fo…

829b62e

…r a project

sketch out the regression model training cycle

9907aaa

montanalow force-pushed the montana/readme branch from 8e885b9 to 9907aaa Compare April 13, 2022 03:07