.. _`setup-configuration`:

Setup & Configuration
====================================================================

.. _`quick-setup`:

Quick Setup
--------------------------------------------------------------------

We assume development or deployment in a MacOS or Linux environment.

1. Install Docker 18 (`Ubuntu `__, `MacOS `__) and, if required, add your user
   to the ``docker`` group (`Linux `__).

   .. note::

      If you're not a user in the ``docker`` group, you'll instead need ``sudo``
      access and will have to prefix every bash command with ``sudo -E``.

2. Install Python 3.6 such that the ``python`` and ``pip`` commands point to the
   correct installation of Python 3.6 (see :ref:`installing-python`).

3. Clone the project at https://github.com/nginyc/rafiki (e.g. with `Git `__).

4. Set up Rafiki's complete stack with the setup script:

   .. code-block:: shell

      bash scripts/start.sh

*Rafiki Admin* and *Rafiki Web Admin* will be available at ``127.0.0.1:3000``
and ``127.0.0.1:3001`` respectively.

To destroy Rafiki's complete stack:

.. code-block:: shell

    bash scripts/stop.sh

Scaling Rafiki
--------------------------------------------------------------------

Rafiki's default setup runs on a single machine and runs its workloads only on CPUs.

Rafiki's model training workers run in Docker containers that extend the Docker
image ``nvidia/cuda:9.0-runtime-ubuntu16.04``, and are capable of leveraging
`CUDA-Capable GPUs `__.

Scaling Rafiki horizontally and enabling GPU usage involves setting up a
*Network File System* (*NFS*) at a common path across all nodes, setting the
default Docker runtime to ``nvidia`` on each GPU-bearing node, and putting all
these nodes into a single Docker Swarm.

.. seealso:: :ref:`architecture`

To run Rafiki on multiple machines with GPUs, do the following:

1. If Rafiki is running, stop Rafiki with ``bash scripts/stop.sh``

2. Have all nodes `leave any Docker Swarm `__ they are in

3. Set up NFS such that the *master node is a NFS host*, *other nodes are NFS clients*,
   and the master node *shares an ancestor directory containing Rafiki's project directory*.
   `Here are instructions for Ubuntu `__

4. All nodes should be in a common network. On the *master node*, change
   ``DOCKER_SWARM_ADVERTISE_ADDR`` in the project's ``.env.sh`` to the IP address
   of the master node in *the network that your nodes are in*

5. For *each node* (including the master node), ensure the
   `firewall rules allow TCP & UDP traffic on ports 2377, 7946 and 4789 `_

6. For *each node that has GPUs*:

   6.1. `Install NVIDIA drivers `__ for CUDA *9.0* or above

   6.2. `Install nvidia-docker2 `__

   6.3. Set the ``default-runtime`` of Docker to ``nvidia``
   (e.g. `instructions here `__)

7. On the *master node*, start Rafiki with ``bash scripts/start.sh``

8. For *each worker node*, have the node
   `join the master node's Docker Swarm `__

9. On the *master* node, for *each node* (including the master node), configure
   it with the script:

   ::

      bash scripts/setup_node.sh

Exposing Rafiki Publicly
--------------------------------------------------------------------

Rafiki Admin and Rafiki Web Admin run on the master node.
Change ``RAFIKI_ADDR`` in ``.env.sh`` to the IP address of the master node
in the network you intend to expose Rafiki in.

Example:

::

    export RAFIKI_ADDR=172.28.176.35

Re-deploy Rafiki. Rafiki Admin and Rafiki Web Admin will be available at that
IP address, over ports 3000 and 3001 (by default), assuming incoming connections
to these ports are allowed.
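After re-deploying, you may want to confirm that both ports are actually reachable from outside the cluster. Below is a minimal sketch of such a check; the address is the example from this section (substitute your own), and ports 3000/3001 are the defaults:

.. code-block:: shell

    #!/usr/bin/env bash
    # Quick reachability check for Rafiki Admin (port 3000) and
    # Rafiki Web Admin (port 3001) at the address set in RAFIKI_ADDR.
    # 172.28.176.35 is the example address from this section.

    check_port() {
        # Returns 0 if an HTTP connection to host $1, port $2 succeeds within 3 seconds.
        curl -s -o /dev/null --connect-timeout 3 "http://$1:$2"
    }

    RAFIKI_ADDR="${RAFIKI_ADDR:-172.28.176.35}"
    for port in 3000 3001; do
        if check_port "$RAFIKI_ADDR" "$port"; then
            echo "Port $port on $RAFIKI_ADDR is reachable"
        else
            echo "Port $port on $RAFIKI_ADDR is NOT reachable - check your firewall rules"
        fi
    done

Run this from a machine *outside* the cluster, since the goal is to verify that the ports are open to the network you are exposing Rafiki in.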
**Before you expose Rafiki to the public, it is highly recommended that you
change the master passwords for the superadmin, server and database (located in
``.env.sh`` as ``SUPERADMIN_PASSWORD``, ``APP_SECRET`` & ``POSTGRES_PASSWORD``)**

Reading Rafiki's logs
--------------------------------------------------------------------

By default, you can read the logs of Rafiki Admin & any of Rafiki's workers
in the ``./logs`` directory at the root of the project directory on the
master node.

Troubleshooting
--------------------------------------------------------------------

Q: There seem to be connectivity issues amongst containers across nodes!

A: `Ensure that containers are able to communicate with one another through the Docker Swarm overlay network `__
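A common cause of cross-node connectivity issues is a firewall blocking the ports Docker Swarm needs (see the earlier step on ports 2377, 7946 and 4789). The sketch below probes the TCP ports from one node towards a peer node; the peer address shown is a placeholder, and UDP 7946/4789 cannot be probed reliably this way:

.. code-block:: shell

    #!/usr/bin/env bash
    # Probe the TCP ports Docker Swarm requires on a peer node.
    # 192.0.2.10 is a placeholder address - substitute a real node's IP.

    tcp_open() {
        # Returns 0 if a TCP connection to host $1, port $2 succeeds within 3 seconds.
        timeout 3 bash -c "</dev/tcp/$1/$2" 2>/dev/null
    }

    PEER="${PEER:-192.0.2.10}"
    for port in 2377 7946; do
        if tcp_open "$PEER" "$port"; then
            echo "TCP port $port on $PEER is open"
        else
            echo "TCP port $port on $PEER is blocked or closed"
        fi
    done

If a port is blocked, adjust the node's firewall rules and have the node re-join the Swarm.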