IMPC Extraction, Transformation and Loading process to generate the data that supports mousephenotype.org, along with other internal processes.
Download the latest release package from the releases page and decompress it. Then submit your job to your Spark 2 cluster using:

```bash
spark-submit --py-files impc_etl.zip,libs.zip main.py
```
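The release package presumably already contains the two archives passed to `--py-files`. If you ever need to rebuild them yourself, a minimal sketch follows; it assumes the `impc_etl` package directory and a `requirements.txt` at the repository root (the real release process may differ):

```bash
# Bundle the application package for --py-files
zip -r impc_etl.zip impc_etl
# Install third-party dependencies into a folder and bundle them as libs.zip
pip install -r requirements.txt -t libs
(cd libs && zip -r ../libs.zip .)
```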
- Clone the impc-airflow repository:

  ```bash
  git clone https://github.com/mpi2/impc-airflow.git
  cd impc-airflow
  ```
- Download the test archive data:

  ```bash
  cd data/data-archive
  wget https://ftp.ebi.ac.uk/pub/databases/impc/other/data-archive.zip
  unzip data-archive.zip
  cd ../../
  ```
- Clone the impc-etl repository and switch to the dev branch:

  ```bash
  git clone https://github.com/mpi2/impc-etl.git
  cd impc-etl
  git checkout dev
  ```
- Create a symlink from `dags` to `impc-etl`:

  ```bash
  cd ../impc-airflow
  ln -s $PWD/../impc-etl $PWD/dags
  ```
- Start the Docker Compose services:

  ```bash
  docker compose build
  docker compose up -d
  ```
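  Before moving on, you can optionally confirm that the containers started cleanly:

  ```bash
  # List service status and show the most recent log lines
  docker compose ps
  docker compose logs --tail=50
  ```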
- Import the variables using the Airflow web interface: go to localhost:8080, then Settings > Variables > Import Variables.
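  The same import can be done from the command line via the Airflow CLI. A sketch, assuming the webserver service is named `airflow-webserver` in the compose file and your export is `variables.json` (both names are assumptions):

  ```bash
  # Copy the exported variables into the container, then import them
  docker compose cp variables.json airflow-webserver:/tmp/variables.json
  docker compose exec airflow-webserver airflow variables import /tmp/variables.json
  ```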
- Add the connection information: one connection for Apache Spark, and one for the DCC HTTP connection, using the provided credentials.
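  Connections can also be created with the Airflow CLI instead of the web UI. A sketch with placeholder connection IDs, hosts, and credentials; substitute the real values for your setup:

  ```bash
  # Placeholder Spark connection (ID and master URL are examples only)
  docker compose exec airflow-webserver airflow connections add spark_default \
      --conn-type spark --conn-host spark://spark-master --conn-port 7077
  # Placeholder DCC HTTP connection (host and credentials are examples only)
  docker compose exec airflow-webserver airflow connections add impc_dcc_http \
      --conn-type http --conn-host https://dcc.example.org \
      --conn-login "$DCC_USER" --conn-password "$DCC_PASSWORD"
  ```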
- Use your favorite IDE to make your changes, and make sure the project is pointing to the generated venv. To do that with PyCharm, follow the instructions here.
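  If the venv has not already been generated for you, one conventional way to create it from the impc-etl checkout (assuming dependencies are listed in `requirements.txt`):

  ```bash
  # Create a virtual environment and install the project dependencies into it
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  ```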
- (Optional) Run PySpark on Jupyter to test your code locally:

  ```bash
  docker compose --profile debug up
  ```

  Then open the Example Jupyter notebook at localhost:8889.
To regenerate the documentation, run:

```bash
pdoc --html --force --template-dir docs/templates -o docs impc_etl
```