Setup & Configuration
Quick Setup
We assume development or deployment in a macOS or Linux environment.
Note
If you’re not a user in the docker group, you’ll instead need sudo access and must prefix every bash command with sudo -E.
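One way to decide whether the prefix is needed is to inspect the current user’s groups. The sketch below uses a hypothetical helper, `docker_prefix`, which is not part of Rafiki; it takes a space-separated group list (such as the output of `id -nG`) and prints `sudo -E` only when `docker` is absent from it.

```shell
# Hypothetical helper (not part of Rafiki): given a space-separated list of
# group names, print the prefix needed before Docker-related commands.
docker_prefix() {
  # Put one group per line, then look for a line that is exactly "docker".
  if printf '%s\n' "$1" | tr ' ' '\n' | grep -qx 'docker'; then
    : # Already in the docker group: no prefix needed, print nothing.
  else
    echo 'sudo -E'
  fi
}

# Show the prefix for the current user (no output means no prefix is needed).
docker_prefix "$(id -nG)"
```

The result could then be spliced into a command, for example `$(docker_prefix "$(id -nG)") bash scripts/start.sh`.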
Install Python 3.6 such that the `python` and `pip` commands point to the correct installation of Python 3.6 (see Installing Python).
Clone the project at https://github.com/nginyc/rafiki (e.g. with Git).
Set up Rafiki’s complete stack with the setup script:
bash scripts/start.sh
Rafiki Admin and Rafiki Admin Web will be available at 127.0.0.1:3000 and 127.0.0.1:3001 respectively.
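If the setup script fails early, a common first check is whether `python` and `pip` really resolve to Python 3.6. The following is a minimal sketch; the helper names `is_python36` and `pip_is_for_python36` are hypothetical and not part of Rafiki, and the pip check assumes the usual `pip --version` output, which ends in something like `(python 3.6)`.

```shell
# Hypothetical helper: succeed only if the given version string
# (e.g. the output of `python --version`) names Python 3.6 or 3.6.x.
is_python36() {
  case "$1" in
    "Python 3.6"|"Python 3.6."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: succeed only if the given `pip --version` output
# reports that pip belongs to a Python 3.6 installation.
pip_is_for_python36() {
  case "$1" in
    *"(python 3.6)"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Capture stderr too, so a missing command simply fails the check.
if is_python36 "$(python --version 2>&1)"; then
  echo "python points to Python 3.6"
else
  echo "python does not point to Python 3.6"
fi

if pip_is_for_python36 "$(pip --version 2>&1)"; then
  echo "pip points to Python 3.6"
else
  echo "pip does not point to Python 3.6"
fi
```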
To destroy Rafiki’s complete stack:
bash scripts/stop.sh
Scaling Rafiki
Rafiki’s default setup runs on a single machine and runs its workloads only on CPUs.
Rafiki’s model training workers run in Docker containers that extend the Docker image nvidia/cuda:9.0-runtime-ubuntu16.04,
and can make use of CUDA-capable GPUs.
Scaling Rafiki horizontally and enabling GPU usage involves three things: setting up a Network File System (NFS) at a common path across all nodes, installing the nvidia Docker runtime and configuring it as the default runtime on each GPU-bearing node, and putting all these nodes into a single Docker Swarm.
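On GPU-bearing nodes, the default Docker runtime is typically set in the Docker daemon configuration. The sketch below assumes that nvidia-docker2 (which provides `nvidia-container-runtime`) is already installed, that the daemon reads `/etc/docker/daemon.json`, and that the node uses systemd; none of these details is stated in this guide, so treat them as assumptions.

```shell
# Assumption: nvidia-docker2 is installed and provides nvidia-container-runtime.
# Write a candidate config to a scratch file so it can be reviewed before it
# is merged into /etc/docker/daemon.json (assumed daemon config path).
candidate="$(mktemp)"
cat > "$candidate" <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
echo "Candidate daemon config written to $candidate"

# After review, on each GPU-bearing node (requires root). If daemon.json
# already exists, merge these keys by hand instead of overwriting the file:
#   sudo cp "$candidate" /etc/docker/daemon.json
#   sudo systemctl restart docker
```

After the daemon restarts, `docker info` should report the default runtime, which gives a quick way to confirm the change took effect.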
To run Rafiki on multiple machines with GPUs, do the following:
1. If Rafiki is running, stop Rafiki with
bash scripts/stop.sh
2. Have all nodes leave any Docker Swarm they are in
3. Set up NFS such that the master node is an NFS host, other nodes are NFS clients, and the master node shares an ancestor directory containing Rafiki’s project directory. Here are instructions for Ubuntu
4. All nodes should be in a common network. On the master node, change `DOCKER_SWARM_ADVERTISE_ADDR` in the project’s `.env.sh` to the IP address of the master node in the network that your nodes are in
5. For each node (including the master node), ensure the firewall rules allow TCP & UDP traffic on ports 2377, 7946 and 4789
6. For each node that has GPUs:
6.1. Install NVIDIA drivers for CUDA 9.0 or above
6.3. Set the `default-runtime` of Docker to nvidia (e.g. instructions here)
7. On the master node, start Rafiki with
bash scripts/start.sh
8. For each worker node, have the node join the master node’s Docker Swarm
9. On the master node, for each node (including the master node), configure it with the script:
bash scripts/setup_node.sh
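The Docker Swarm and firewall steps above can be sketched as a dry-run script. It rests on several assumptions not stated in this guide: the firewall is managed with `ufw`, and the master node’s swarm already exists by the time workers join. The helper names (`run`, `leave_swarm`, `open_swarm_ports`, `show_join_command`) are hypothetical. By default the script only prints the commands it would run.

```shell
# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# On every node: leave any Docker Swarm the node is currently part of.
# (This fails harmlessly if the node is not in a swarm.)
leave_swarm() {
  run docker swarm leave --force
}

# On every node: allow TCP and UDP traffic on ports 2377, 7946 and 4789.
# Assumption: the firewall is managed with ufw.
open_swarm_ports() {
  for port in 2377 7946 4789; do
    for proto in tcp udp; do
      run sudo ufw allow "${port}/${proto}"
    done
  done
}

# On the master node, once its swarm exists: print the exact
# `docker swarm join ...` command to run on each worker node.
show_join_command() {
  run docker swarm join-token worker
}

leave_swarm
open_swarm_ports
show_join_command
```

Setting `DRY_RUN=0` before calling the functions executes the commands instead of printing them. Once the workers have joined, `docker node ls` on the master node lists the swarm’s members.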
Exposing Rafiki Publicly
Rafiki Admin and Rafiki Admin Web run on the master node.
Change RAFIKI_ADDR in .env.sh to the IP address of the master node
in the network you intend to expose Rafiki in.
Example:
export RAFIKI_ADDR=172.28.176.35
Re-deploy Rafiki. Rafiki Admin and Rafiki Admin Web will be available at that IP address, over ports 3000 and 3001 (by default), assuming incoming connections to these ports are allowed.
Before you expose Rafiki to the public, it is highly recommended to change the master passwords for the database, the server and superadmin (located in `.env.sh` as `POSTGRES_PASSWORD`, `APP_SECRET` & `SUPERADMIN_PASSWORD` respectively).
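One way to produce replacement values is to draw them from the system’s random source. The sketch below uses a hypothetical helper, `gen_secret`, that is not part of Rafiki; it prints `export` lines in the same form as the `RAFIKI_ADDR` example, to be copied into `.env.sh` by hand. Whether Rafiki places any constraints on these values is not stated here, so plain hexadecimal strings are assumed to be acceptable.

```shell
# Hypothetical helper: print 32 random bytes as 64 hexadecimal characters.
gen_secret() {
  # od renders the bytes as hex; tr keeps only the hex digits themselves.
  head -c 32 /dev/urandom | od -An -tx1 | tr -dc '0-9a-f'
}

# Print candidate lines for .env.sh; nothing is written to disk here.
echo "export POSTGRES_PASSWORD=$(gen_secret)"
echo "export APP_SECRET=$(gen_secret)"
echo "export SUPERADMIN_PASSWORD=$(gen_secret)"
```

Each call reads fresh random bytes, so the three values differ from one another.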
Reading Rafiki’s logs
By default, you can read the logs of Rafiki Admin, Rafiki Advisor & any of Rafiki’s workers
in the ./logs directory at the root of the project’s directory on the master node.
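Since the names of the individual log files are not listed here, a generic way to find the most recently written one can help. The sketch below uses a hypothetical helper, `latest_log`, that is not part of Rafiki, and assumes log file names contain no newlines.

```shell
# Hypothetical helper: print the path of the most recently modified file in
# a log directory (default: ./logs). Fails if the directory is empty or missing.
latest_log() {
  dir="${1:-./logs}"
  # ls -t sorts by modification time, newest first.
  newest="$(ls -t "$dir" 2>/dev/null | head -n 1)"
  if [ -n "$newest" ]; then
    echo "$dir/$newest"
  else
    return 1
  fi
}

# Run from the root of the project directory on the master node.
latest_log ./logs || echo "no log files found in ./logs"
```

The result can be followed live with `tail -f "$(latest_log ./logs)"`.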
Troubleshooting
Q: There seem to be connectivity issues amongst containers across nodes!