
IMPC ETL Process

IMPC Extraction, Transformation and Loading (ETL) process that generates the data supporting mousephenotype.org, along with other internal processes.

Requirements

  * A Spark 2 cluster to submit the ETL jobs to (see "How to run it")
  * Docker and Docker Compose, for the development environment setup

How to run it

Download the latest release package from the releases page and decompress it. Then submit your job to your Spark 2 cluster using:

spark-submit --py-files impc_etl.zip,libs.zip main.py
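
Depending on your cluster you may need to pass additional standard spark-submit options, such as the master URL and resource settings. In the sketch below the master URL and memory values are placeholders, not project defaults:

spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 4G --executor-memory 8G \
  --py-files impc_etl.zip,libs.zip main.py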

Development environment setup

  1. Clone the impc-airflow repository

    git clone https://github.com/mpi2/impc-airflow.git
    cd impc-airflow
  2. Download test archive data

    cd data/data-archive
    wget https://ftp.ebi.ac.uk/pub/databases/impc/other/data-archive.zip
    unzip data-archive.zip
    cd ../../
  3. Clone the impc-etl repository next to impc-airflow and check out the dev branch:

    cd ..
    git clone https://github.com/mpi2/impc-etl.git
    cd impc-etl
    git checkout dev
  4. Create a symlink from dags to impc-etl:

    cd ../impc-airflow
    ln -s $PWD/../impc-etl $PWD/dags
  5. Start the Docker Compose services (a quick health check is sketched after this list):

    docker compose build
    docker compose up -d
  6. Import the variables using the Airflow web interface: go to localhost:8080, then Settings > Variables > Import Variables (a CLI alternative is sketched after this list).

  7. Add the connection information: one connection for Apache Spark:

    [screenshot: spark_connection_example.png]

    and one for the DCC HTTP connections, using the provided credentials:

    [screenshot: dcc_http_example.png]

    (A CLI alternative is sketched after this list.)

  8. Use your favorite IDE to make your changes, and make sure the project points to the generated venv. To do that in PyCharm, follow the instructions here.

  9. (Optional) Run PySpark on Jupyter to test your code locally:

    docker compose --profile debug up

    Then open the Example Jupyter notebook at localhost:8889.
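
To check on the Docker Compose services from step 5, the standard Compose commands can be used; the service name passed to logs is a placeholder, so list the real names with ps first:

    # list the running services and their health status
    docker compose ps
    # follow the logs of a specific service
    docker compose logs -f <service-name>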
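
As an alternative to the web interface in step 6, variables can also be imported with the Airflow CLI inside a running container. This is a sketch: the Compose service name (airflow-webserver) and the variables file path are assumptions to adapt to your setup:

    # import an Airflow variables file from inside the webserver container
    docker compose exec airflow-webserver airflow variables import /path/to/variables.json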
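
The step 7 connections can likewise be created from the CLI. The connection IDs, hosts, and credentials below are placeholders; match them to the connection IDs the DAGs actually expect:

    # Spark connection (the 'spark' conn-type comes from the Apache Spark provider)
    docker compose exec airflow-webserver airflow connections add spark_default \
      --conn-type spark --conn-host spark://spark-master --conn-port 7077

    # DCC HTTP connection, using the provided credentials
    docker compose exec airflow-webserver airflow connections add dcc_http \
      --conn-type http --conn-host https://dcc.example.org \
      --conn-login <user> --conn-password <password>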

Re-generate the documentation

pdoc --html --force --template-dir docs/templates -o docs impc_etl
