Troubleshooting

  1. Installation Guide
  2. Installation FAQs and Troubleshooting
  3. Basic Management Operations
  4. How to Manage Users and Groups
  5. How to Set Up Storage
  6. How to Set Up Virtual Clusters
  7. How to Add and Remove Nodes
  8. How to use CPU Nodes
  9. How to Customize Cluster by Plugins
  10. Troubleshooting (this document)
  11. How to Uninstall OpenPAI
  12. Upgrade Guide

NVIDIA GPU is Not Detected

If your job cannot use a GPU, check the following items on the corresponding worker node:

  1. The NVIDIA driver is installed correctly. Run nvidia-smi to confirm.
  2. nvidia-container-runtime is installed and configured as the default runtime of Docker. Run docker info -f "{{json .DefaultRuntime}}" to confirm.
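
The two checks above can be combined into one small script to run on the worker node. This is a minimal sketch, not part of OpenPAI: the function names are hypothetical, and it assumes the NVIDIA runtime is registered with Docker under the name "nvidia", which may differ in your /etc/docker/daemon.json.

```shell
# Hypothetical helpers; paste into a shell on the worker node.

# Returns 0 if the value printed by `docker info` names the expected runtime.
# $1: output of docker info -f "{{json .DefaultRuntime}}" (a quoted JSON string)
# $2: expected runtime name (ASSUMPTION: defaults to "nvidia")
is_expected_runtime() {
  reported=$(printf '%s' "$1" | tr -d '"')   # strip the JSON quotes
  [ "$reported" = "${2:-nvidia}" ]
}

check_gpu_node() {
  # Check 1: nvidia-smi must exist and must be able to reach the driver.
  if ! command -v nvidia-smi >/dev/null 2>&1 || ! nvidia-smi >/dev/null 2>&1; then
    echo "FAIL: nvidia-smi is missing or cannot communicate with the driver"
    return 1
  fi
  # Check 2: Docker's default runtime must be the NVIDIA runtime.
  runtime=$(docker info -f '{{json .DefaultRuntime}}' 2>/dev/null)
  if ! is_expected_runtime "$runtime"; then
    echo "FAIL: default Docker runtime is ${runtime:-unknown}"
    return 1
  fi
  echo "OK: driver and default runtime look correct"
}
```

Call check_gpu_node after defining the functions. If the second check fails, the default runtime setting in the Docker daemon configuration is the first thing to inspect.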

If the number of GPUs shown in the webportal is wrong, check the hivedscheduler and virtual cluster (VC) configuration.

A Certain Node is Lost

If the node is only temporarily lost, wait for it to recover.
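
Because OpenPAI runs on Kubernetes, one way to see whether a node has recovered is to ask Kubernetes for node status. The sketch below assumes kubectl is installed and configured for the cluster, which this document does not state; list_not_ready_nodes is a hypothetical helper name.

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads the output of `kubectl get nodes --no-headers` on stdin.
list_not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (ASSUMPTION: kubectl can reach the cluster):
#   kubectl get nodes --no-headers | list_not_ready_nodes
#
# To block until a specific node reports Ready, or give up after 5 minutes:
#   kubectl wait --for=condition=Ready node/<node-name> --timeout=300s
```

An empty result from list_not_ready_nodes means every node currently reports a Ready status.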

If you want to remove the node from your cluster, refer to How to Add and Remove Nodes.

A Certain PAI Service is Not Working

To triage the problem, check the service's log on the Kubernetes Dashboard. After the problem is addressed, restart the service using paictl.py:

./paictl.py service stop -n <service-name>
./paictl.py service start -n <service-name>
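
If you restart services often, the two commands above can be wrapped so that the start step only runs when the stop step succeeds. This is a convenience sketch with a hypothetical function name; it assumes it is run from the directory that contains paictl.py.

```shell
# Hypothetical wrapper around the stop/start pair shown above.
restart_pai_service() {
  if [ -z "$1" ]; then
    echo "usage: restart_pai_service <service-name>" >&2
    return 2
  fi
  # Start the service only if stopping it succeeded.
  ./paictl.py service stop -n "$1" && ./paictl.py service start -n "$1"
}

# Usage:
#   restart_pai_service <service-name>
```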