Monitoring

When you run jobs, it’s your responsibility to make sure they are running as intended and are not over-requesting resources. Our Grafana page is a great place to see what your jobs are doing.

To get an idea of how many resources your jobs are using, go to the namespace dashboard and choose your namespace. Your memory and CPU usage, shown as a percentage of what you requested, should be as close to 100% as possible. Also check the GPU dashboard for your namespace to make sure GPU utilization is above 40%, and ideally close to 100%.
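As a rough illustration of these targets, the sketch below computes usage as a percentage of the request and checks the 40% GPU minimum. It assumes the percentage is simply usage divided by request; the function names and the sample numbers are hypothetical and are not taken from the dashboards.

```python
GPU_MIN_UTILIZATION = 40.0  # minimum GPU utilization from the guidance above, in percent


def percent_of_request(used, requested):
    """Return usage as a percentage of the requested amount."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return 100.0 * used / requested


def gpu_utilization_ok(utilization_percent):
    """Return True when GPU utilization is above the 40% minimum."""
    return utilization_percent > GPU_MIN_UTILIZATION


# Hypothetical numbers: a pod that requested 4 CPU cores and 16 GiB of memory.
cpu_pct = percent_of_request(used=3.0, requested=4.0)
mem_pct = percent_of_request(used=4.0, requested=16.0)
print(f"CPU: {cpu_pct:.0f}% of request, memory: {mem_pct:.0f}% of request")
```

In this made-up case the memory figure is far below 100%, which would suggest lowering the memory request.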

When checking memory utilization, make sure to use the Memory Usage (RSS) column. The plain Memory Usage column also includes the disk cache, which can grow indefinitely and therefore overstates what your job actually needs.

Comet.ml

If you are training models such as neural networks or statistical models with tools such as Python, TensorFlow, or PyTorch, it is common to plot real-time statistics in a tool such as TensorBoard. TensorBoard is an excellent real-time visualization tool, but it requires you to launch the TensorBoard process and keep track of the log files, which are extra steps to deal with in a cluster environment such as the PRP. An alternative is http://comet.ml, which is free for academic users and provides a similar set of functions to TensorBoard (plus a Bayesian hyperparameter tuning tool). Comet.ml stores everything on its website, so there are no logs to maintain or servers to run, which makes it easy to deploy on a distributed cluster like the PRP.
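A minimal sketch of logging from a training job to Comet.ml is shown below, assuming the comet_ml Python package is installed in your container image and your API key is available in a COMET_API_KEY environment variable. The project name, the metric name, the loss values, and the log_losses helper are hypothetical placeholders, not part of the PRP setup.

```python
import os


def log_losses(experiment, losses):
    """Log one loss value per training step to a Comet experiment."""
    for step, loss in enumerate(losses):
        experiment.log_metric("train_loss", loss, step=step)
    return len(losses)


def main():
    # Imported here so the helper above stays usable without comet_ml installed.
    from comet_ml import Experiment

    experiment = Experiment(
        api_key=os.environ["COMET_API_KEY"],  # assumed to be set in the pod
        project_name="prp-demo",  # hypothetical project name
    )
    experiment.log_parameters({"learning_rate": 0.01})
    # Stand-in for a real training loop that produces one loss per step.
    log_losses(experiment, [0.9, 0.5, 0.3])
    experiment.end()


# Only contact Comet.ml when an API key has been configured.
if __name__ == "__main__" and os.environ.get("COMET_API_KEY"):
    main()
```

Reading the key from the environment keeps it out of the job's source code; how you provide that variable to the pod is up to your own setup.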