
MVP goals #1

Merged
merged 17 commits into from Apr 14, 2022

Conversation

montanalow
Contributor

Add some verbiage to outline the target use case

@montanalow montanalow self-assigned this Apr 12, 2022
@levkk
Contributor

levkk commented Apr 12, 2022

I like this; it's not too technical, yet it describes what we are doing well. Depending on who we're targeting, I wouldn't worry about scaling too much, but it's good to know that we can potentially scale to exabytes.

name_parts := string_to_array(tbl::TEXT, '.');
name := name_parts[array_upper(name_parts, 1)];

EXECUTE format('DROP TRIGGER IF EXISTS %s_auto_updated_at ON %s', name, tbl);
Contributor

@levkk levkk Apr 13, 2022

Why do you need this whole function?

Contributor Author

The other function is the trigger itself. This function makes it easy to add that trigger to a table with a one-liner.

Contributor

I see. Why do you need the trigger? :)

sql/test.sql Outdated

-- features as variadic arguments
7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4) AS score;
SELECT pgml.model_regression('Red Wine', 'wine_quality_red', 'quality');
Contributor

We need versioning for the models. The most common use case is training a model once per week, for example, and promoting it based on better scoring.

Contributor Author

Yep, that will be multiple models per single project.

Contributor Author

@montanalow montanalow left a comment

Need to add more docs and tests

train(it, y_column="weight", name="test", save=False)
class TestRegression(unittest.TestCase):
    def test_init(self):
        pgml.model.train("Test", "regression", "test", "test_y")
Contributor Author

TODO


SELECT * FROM pgml.model_versions;
SELECT pgml.train('Red Wine Scores', 'regression', 'wine_quality_red', 'quality');
SELECT pgml.predict('Red Wine Scores', 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
Contributor

@levkk levkk Apr 14, 2022

How do you know which version of the model you're predicting against, if you train multiple?

Contributor Author

For now, the "Best" model gets promoted at the end of the training runs. We can build a real promotion system to handle all the bells and whistles.

Contributor

That's really cool!
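A minimal sketch of that promotion idea (illustrative only, not the actual pgml code), assuming a hypothetical `candidates` list of (fitted_model, test_error) pairs produced by a training run:

```python
# Hypothetical promotion step: after a training run, keep the candidate with
# the lowest test error; the caller would record it as the deployed version.
def promote_best(candidates):
    """candidates: list of (fitted_model, test_error) tuples from one run."""
    best_model, best_error = None, None
    for model, error in candidates:
        if best_error is None or error < best_error:
            best_model, best_error = model, error
    return best_model
```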

y.append(y_)

# Split into training and test sets
if (self.test_sampling == 'random'):
Contributor

Don't need parentheses in Python for if-statements.

if (self.test_sampling == 'random'):
    return train_test_split(X, y, test_size=self.test_size, random_state=0)
else:
    if (self.test_sampling == 'first'):
Contributor

@levkk levkk Apr 14, 2022

What does this technique do?

Contributor Author

For something like time series data, you don't want to randomly sample the test data. You want to use a chunk of data at the end, after all the training data, to avoid leaking information from the future into the model.
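A minimal sketch of the difference (not the actual pgml implementation), assuming the rows in X and y are already ordered chronologically:

```python
from sklearn.model_selection import train_test_split

def split(X, y, test_sampling, test_size=0.25):
    if test_sampling == 'random':
        # Shuffled split: fine when rows are independent of each other.
        return train_test_split(X, y, test_size=test_size, random_state=0)
    # Chronological split: train on the earlier rows and keep the final chunk
    # for testing, so nothing from the "future" leaks into training.
    split_at = int(len(X) * (1 - test_size))
    return X[:split_at], X[split_at:], y[:split_at], y[split_at:]
```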

""")
model = Model()
model.__dict__ = dict(result[0])
model.__init__()
Contributor

@levkk levkk Apr 14, 2022

You don't need to call __init__ explicitly, I don't think? __init__ gets called when you create the object; e.g. Model() is effectively a call to __new__() and then __init__().

Contributor

I understand now: when you assign __dict__ directly, __init__() isn't called. Nice.
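A small illustration of the point, using a stand-in Model class rather than the pgml one: constructing the object runs __init__, but overwriting __dict__ afterwards only swaps the attributes, which is why the reviewed code re-runs __init__ explicitly.

```python
class Model:
    def __init__(self):
        # Derived state that a database row would not carry.
        self._algorithm = None

model = Model()                               # Model() runs __new__() and then __init__()
model.__dict__ = {"name": "Red Wine Scores"}  # replaces every attribute; __init__() is NOT re-run
model.__init__()                              # restores the derived state, as in the reviewed code
print(model.name, model._algorithm)           # Red Wine Scores None
```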

@levkk
Contributor

levkk commented Apr 14, 2022

Btw you can use black to auto-format/lint the code, e.g.

pip install black
black ./

def algorithm(self):
    if self._algorithm is None:
        if self.pickle is not None:
            self._algorithm = pickle.loads(self.pickle)
Contributor

@levkk levkk Apr 14, 2022

I saw you pickling weights only in the previous iteration of this code; did you decide to pickle the whole object instead? I think in general they should be pretty small, so this will work, at least until we get to NNs with lots of layers?

Contributor Author

Yeah, even the larger NNs we used at IC are only a few megabytes. I think this will likely work up to the point where someone wants to run something like GPT-3...
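A minimal sketch of that round trip with a small scikit-learn estimator (illustrative only; in pgml the bytes would live in a table column rather than a local variable):

```python
import pickle
from sklearn.linear_model import LinearRegression

X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
fitted = LinearRegression().fit(X, y)

# Serialize the whole fitted estimator, not just its weights; for models like
# this the payload is tiny, and even smallish NNs tend to be a few megabytes.
blob = pickle.dumps(fitted)

# Later, e.g. at prediction time, the bytes are loaded back into an estimator.
restored = pickle.loads(blob)
print(restored.predict([[4.0]]))  # ~[8.0]
```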

best_model = None
best_error = None
if objective == "regression":
    algorithms = ["linear", "random_forest"]
Contributor

@levkk levkk Apr 14, 2022

You can put classes here, just like in Ruby:

algorithms = [LinearRegression, RandomForestRegressor]

although, maybe that's not a good idea...

Contributor Author

I was figuring that at some point we'll want the user to be able to pass algorithm names in from SQL as an option as well.
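One way that could look (a sketch under the assumption that algorithm names arrive as strings from SQL; the registry below is illustrative, not the actual pgml mapping):

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical registry: SQL passes an algorithm name, Python looks up the class.
REGRESSION_ALGORITHMS = {
    "linear": LinearRegression,
    "random_forest": RandomForestRegressor,
}

def make_algorithm(name):
    try:
        return REGRESSION_ALGORITHMS[name]()
    except KeyError:
        raise ValueError(f"unknown algorithm: {name}")
```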

Montana Low and others added 3 commits April 14, 2022 10:14
class TestModel(unittest.TestCase):
    def test_the_world(self):
        plpy.add_mock_result(
            [{"id": 1, "name": "Test", "objective": "regression", "created_at": time.time(), "updated_at": time.time()}]
Contributor

You can just wrap psycopg2 and get a more realistic test, e.g. test your schema, etc.
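A minimal sketch of what a psycopg2-backed test could look like, assuming a throwaway database reachable via a hypothetical PGML_TEST_DSN environment variable (the table name queried is also illustrative):

```python
import os
import unittest

import psycopg2

class TestSchema(unittest.TestCase):
    def test_projects_table_exists(self):
        # Talks to a real database, so the actual schema is exercised
        # instead of mocking plpy results.
        conn = psycopg2.connect(os.environ["PGML_TEST_DSN"])
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT to_regclass('pgml.projects')")
                self.assertIsNotNone(cur.fetchone()[0])
        finally:
            conn.close()
```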

@levkk levkk merged commit 3624d86 into master Apr 14, 2022
Sasasu added a commit to Sasasu/postgresml that referenced this pull request Aug 18, 2023
pgml is not compatible with plpython: if both pgml and plpython are used in the
same session, PostgreSQL will crash.

minimal reproducible code:

```sql
SELECT pgml.embed('intfloat/e5-small', 'hi mom');

create or replace function pyudf()
returns int as
$$
return 0
$$ language 'plpython3u';
```

the call stack:

```
 Stack trace of thread 161970:
 #0  0x00007efc1429edb8 PyImport_Import (libpython3.9.so.1.0 + 0x9edb8)
 #1  0x00007efc1429f125 PyImport_ImportModule (libpython3.9.so.1.0 + 0x9f125)
 #2  0x00007efb04b0f496 n/a (plpython3.so + 0x10496)
 #3  0x00007efb04b1039d plpython3_validator (plpython3.so + 0x1139d)
 #4  0x0000559d0cdbc5c2 OidFunctionCall1Coll (postgres + 0x6465c2)
 #5  0x0000559d0c9d68bb ProcedureCreate (postgres + 0x2608bb)
 #6  0x0000559d0ca5030c CreateFunction (postgres + 0x2da30c)
 #7  0x0000559d0ce1c730 n/a (postgres + 0x6a6730)
 #8  0x0000559d0cc5a030 standard_ProcessUtility (postgres + 0x4e4030)
 #9  0x0000559d0cc545ed n/a (postgres + 0x4de5ed)
 #10 0x0000559d0cc546e7 n/a (postgres + 0x4de6e7)
 #11 0x0000559d0cc54beb PortalRun (postgres + 0x4debeb)
 #12 0x0000559d0cc55249 n/a (postgres + 0x4df249)
 #13 0x0000559d0cc576f0 PostgresMain (postgres + 0x4e16f0)
 #14 0x0000559d0cbc3e9c n/a (postgres + 0x44de9c)
 #15 0x0000559d0cbc50aa PostmasterMain (postgres + 0x44f0aa)
 #16 0x0000559d0c8ce7d2 main (postgres + 0x1587d2)
 #17 0x00007efc18427cd0 n/a (libc.so.6 + 0x27cd0)
 #18 0x00007efc18427d8a __libc_start_main (libc.so.6 + 0x27d8a)
 #19 0x0000559d0c8cee15 _start (postgres + 0x158e15)
```

this is because PostgreSQL uses dlopen(RTLD_GLOBAL). some symbols then resolve
against the previously opened .so file, while the others use a relative offset
within pgml.so, which causes a null-pointer crash.

this commit hides all symbols except the UDF symbols (ending with `_wrapper`) and
the magic symbols (`_PG_init`, `Pg_magic_func`), so dlopen(RTLD_GLOBAL) resolves
the symbols to the correct addresses.
@aplchian aplchian mentioned this pull request Sep 12, 2023
SilasMarvin pushed a commit that referenced this pull request Oct 5, 2023