-
Notifications
You must be signed in to change notification settings - Fork 325
MVP goals #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MVP goals #1
Conversation
I like this; it's not too technical yet describes what we are doing well. Depending on who we're targeting, I wouldn't worry about scaling too much, but it's good to know that we can potentially be exabyte scale. |
… we will be more liberal with data types we accept
3be240e
to
14b1f61
Compare
8e885b9
to
9907aaa
Compare
name_parts := string_to_array(tbl::TEXT, '.'); | ||
name := name_parts[array_upper(name_parts, 1)]; | ||
|
||
EXECUTE format('DROP TRIGGER IF EXISTS %s_auto_updated_at ON %s', name, tbl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this whole function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other function is the trigger itself. This function makes it easy to add that trigger to a table with a one liner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Why do you need the trigger? :)
sql/test.sql
Outdated
|
||
-- features as variadic arguments | ||
7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4) AS score; | ||
SELECT pgml.model_regression('Red Wine', 'wine_quality_red', 'quality'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need versioning for the models. The most common use case is training a model once per week for example, and promoting it based on better scoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that will be multiple models per single project.
Co-authored-by: Lev Kokotov <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add more docs and tests
pgml/tests/test_train.py
Outdated
train(it, y_column="weight", name="test", save=False) | ||
class TestRegression(unittest.TestCase): | ||
def test_init(self): | ||
pgml.model.train("Test", "regression", "test", "test_y") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO
|
||
SELECT * FROM pgml.model_versions; | ||
SELECT pgml.train('Red Wine Scores', 'regression', 'wine_quality_red', 'quality'); | ||
SELECT pgml.predict('Red Wine Scores', 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you know which version of the model you're predicting against? If you train multiple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, the "Best" model gets promoted at the end of the training runs. We can build a real promotion system to handle all the bells and whistles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's really cool!
pgml/pgml/model.py
Outdated
y.append(y_) | ||
|
||
# Split into training and test sets | ||
if (self.test_sampling == 'random'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need parentheses in Python for if
-statements.
pgml/pgml/model.py
Outdated
if (self.test_sampling == 'random'): | ||
return train_test_split(X, y, test_size=self.test_size, random_state=0) | ||
else: | ||
if (self.test_sampling == 'first'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this technique do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For something like time series data, you don't want to random sample for the test data. You want to use a chunk of data at the end after all the training data, to avoid leaking information from the future into the model.
""") | ||
model = Model() | ||
model.__dict__ = dict(result[0]) | ||
model.__init__() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to call __init__
explicitly I don't think? __init__
gets called when you create the object, e.g. Model()
is effectively a call to __new__()
and __init__()
sequentially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand now, when using __dict__
, __init__()
isn't called, nice.
Btw you can use pip install black
black ./ |
def algorithm(self): | ||
if self._algorithm is None: | ||
if self.pickle is not None: | ||
self._algorithm = pickle.loads(self.pickle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw you pickling weights only in the previous iteration of this code; did you decide to pickle the whole object instead? I think in general they should be pretty small so this will work, until we may get to NNs with lots of layers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I even there larger NN's we used at IC are only a few megabytes. I think this will likely work up to the point someone wants to run something like GPT-3...
best_model = None | ||
best_error = None | ||
if objective == "regression": | ||
algorithms = ["linear", "random_forest"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can put classes here, just like in Ruby:
algorithms = [LinearRegression, RandomForestRegressor]
although, maybe that's not a good idea...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was figuring at some point we'll want the user to be able to pass algorithm names in from sql as an option as well.
Co-authored-by: Lev Kokotov <[email protected]>
class TestModel(unittest.TestCase): | ||
def test_the_world(self): | ||
plpy.add_mock_result( | ||
[{"id": 1, "name": "Test", "objective": "regression", "created_at": time.time(), "updated_at": time.time()}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just wrap psycopg2
and get a more real test, e.g. test your schema, etc.
pgml is not compatible with plpython, if using both pgml and plpython in the same session, postgresql will crash. minimum reproducible code: ```sql SELECT pgml.embed('intfloat/e5-small', 'hi mom'); create or replace function pyudf() returns int as $$ return 0 $$ language 'plpython3u'; ``` the call stack: ``` Stack trace of thread 161970: #0 0x00007efc1429edb8 PyImport_Import (libpython3.9.so.1.0 + 0x9edb8) postgresml#1 0x00007efc1429f125 PyImport_ImportModule (libpython3.9.so.1.0 + 0x9f125) postgresml#2 0x00007efb04b0f496 n/a (plpython3.so + 0x10496) postgresml#3 0x00007efb04b1039d plpython3_validator (plpython3.so + 0x1139d) postgresml#4 0x0000559d0cdbc5c2 OidFunctionCall1Coll (postgres + 0x6465c2) postgresml#5 0x0000559d0c9d68bb ProcedureCreate (postgres + 0x2608bb) postgresml#6 0x0000559d0ca5030c CreateFunction (postgres + 0x2da30c) postgresml#7 0x0000559d0ce1c730 n/a (postgres + 0x6a6730) postgresml#8 0x0000559d0cc5a030 standard_ProcessUtility (postgres + 0x4e4030) postgresml#9 0x0000559d0cc545ed n/a (postgres + 0x4de5ed) postgresml#10 0x0000559d0cc546e7 n/a (postgres + 0x4de6e7) postgresml#11 0x0000559d0cc54beb PortalRun (postgres + 0x4debeb) postgresml#12 0x0000559d0cc55249 n/a (postgres + 0x4df249) postgresml#13 0x0000559d0cc576f0 PostgresMain (postgres + 0x4e16f0) postgresml#14 0x0000559d0cbc3e9c n/a (postgres + 0x44de9c) postgresml#15 0x0000559d0cbc50aa PostmasterMain (postgres + 0x44f0aa) postgresml#16 0x0000559d0c8ce7d2 main (postgres + 0x1587d2) postgresml#17 0x00007efc18427cd0 n/a (libc.so.6 + 0x27cd0) postgresml#18 0x00007efc18427d8a __libc_start_main (libc.so.6 + 0x27d8a) postgresml#19 0x0000559d0c8cee15 _start (postgres + 0x158e15) ``` this is because PostgreSQL is using dlopen(RTLD_GLOBAL). this will parse some of symbols into the previous opened .so file, but the others will use a relative offset in pgml.so, and will cause a null-pointer crash. this commit hide all symbols except the UDF symbols (ends with `_wrapper`) and the magic symbols (`_PG_init` `Pg_magic_func`). so dlopen(RTLD_GLOBAL) will parse the symbols to the correct position.
Add some verbage to outline the target use case