
IMPC ETL Process

IMPC Extraction, Transformation and Loading (ETL) process that generates the data supporting mousephenotype.org, along with other internal processes.

Requirements

  * A Spark 2 cluster to submit the ETL jobs to (see "How to run it")
  * Docker and Docker Compose, for the development environment setup

How to run it

Download the latest release package from the releases page and decompress it. Then submit your job to your Spark 2 cluster using:

spark-submit --py-files impc_etl.zip,libs.zip main.py
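
Depending on your cluster you may need to pass additional standard spark-submit options, such as the master URL and resource settings. In the sketch below the master URL and memory values are placeholders, not project defaults:

spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 4G --executor-memory 8G \
  --py-files impc_etl.zip,libs.zip main.py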

Development environment setup

  1. Clone the impc-airflow repository

    git clone https://github.com/mpi2/impc-airflow.git
    cd impc-airflow
  2. Download test archive data

    cd data/data-archive
    wget https://ftp.ebi.ac.uk/pub/databases/impc/other/data-archive.zip
    unzip data-archive.zip
    cd ../../
  3. Clone the impc-etl repository next to impc-airflow and check out the dev branch:

    cd ..
    git clone https://github.com/mpi2/impc-etl.git
    cd impc-etl
    git checkout dev
  4. Create a symlink from dags to impc-etl:

    cd ../impc-airflow
    ln -s $PWD/../impc-etl $PWD/dags
  5. Start the Docker Compose services (a quick health check is sketched after this list):

    docker compose build
    docker compose up -d
  6. Import the variables using the Airflow web interface: go to localhost:8080, then Settings > Variables > Import Variables (a CLI alternative is sketched after this list).

  7. Add the connection information: one connection for Apache Spark:

    [screenshot: spark_connection_example.png]

    and one for the DCC HTTP connections, using the provided credentials:

    [screenshot: dcc_http_example.png]

    (A CLI alternative is sketched after this list.)

  8. Use your favorite IDE to make your changes, and make sure the project points to the generated venv. To do that in PyCharm, follow the instructions here.

  9. (Optional) Run PySpark on Jupyter to test your code locally:

    docker compose --profile debug up

    Then open the Example Jupyter notebook at localhost:8889.
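
To check on the Docker Compose services from step 5, the standard Compose commands can be used; the service name passed to logs is a placeholder, so list the real names with ps first:

    # list the running services and their health status
    docker compose ps
    # follow the logs of a specific service
    docker compose logs -f <service-name>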
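
As an alternative to the web interface in step 6, variables can also be imported with the Airflow CLI inside a running container. This is a sketch: the Compose service name (airflow-webserver) and the variables file path are assumptions to adapt to your setup:

    # import an Airflow variables file from inside the webserver container
    docker compose exec airflow-webserver airflow variables import /path/to/variables.json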
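
The step 7 connections can likewise be created from the CLI. The connection IDs, hosts, and credentials below are placeholders; match them to the connection IDs the DAGs actually expect:

    # Spark connection (the 'spark' conn-type comes from the Apache Spark provider)
    docker compose exec airflow-webserver airflow connections add spark_default \
      --conn-type spark --conn-host spark://spark-master --conn-port 7077

    # DCC HTTP connection, using the provided credentials
    docker compose exec airflow-webserver airflow connections add dcc_http \
      --conn-type http --conn-host https://dcc.example.org \
      --conn-login <user> --conn-password <password>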

Re-generate the documentation

pdoc --html --force --template-dir docs/templates -o docs impc_etl
