Troubleshooting
- Installation Guide
- Installation FAQs and Troubleshooting
- Basic Management Operations
- How to Manage Users and Groups
- How to Set Up Storage
- How to Set Up Virtual Clusters
- How to Add and Remove Nodes
- How to use CPU Nodes
- How to Customize Cluster by Plugins
- Troubleshooting (this document)
- How to Uninstall OpenPAI
- Upgrade Guide
NVIDIA GPU is Not Detected
If your job cannot use a GPU, check the following items on the corresponding worker node:
- The NVIDIA drivers should be installed correctly. Use nvidia-smi to confirm.
- nvidia-container-runtime is installed and configured as the default runtime of docker. Use docker info -f "{{json .DefaultRuntime}}" to confirm.
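The two checks above can be combined into a small script to run on the worker node. This is a minimal sketch, not an OpenPAI tool: the function names are hypothetical, and it assumes the default runtime is reported under the name "nvidia".

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of OpenPAI.

# Succeeds when the output of `docker info -f "{{json .DefaultRuntime}}"`
# names the nvidia runtime. The runtime name "nvidia" is an assumption.
is_nvidia_default_runtime() {
    # The JSON output is a quoted string, so strip the quotes first.
    runtime=$(printf '%s' "$1" | tr -d '"')
    [ "$runtime" = "nvidia" ]
}

# Runs both checks on the current node and reports the first failure.
check_gpu_node() {
    if ! nvidia-smi >/dev/null 2>&1; then
        echo "FAIL: nvidia-smi did not run successfully"
        return 1
    fi
    if ! is_nvidia_default_runtime "$(docker info -f '{{json .DefaultRuntime}}' 2>/dev/null)"; then
        echo "FAIL: nvidia is not the default docker runtime"
        return 1
    fi
    echo "OK: driver and default runtime look correct"
}
```

After sourcing the script on the worker node, call check_gpu_node. A FAIL line points to the item in the list above that needs attention.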
If the number of GPUs shown in the webportal is wrong, check the hivedscheduler and VC (virtual cluster) configuration.
A Certain Node is Lost
If the node is only temporarily lost, you can wait until it recovers.
If you want to remove the node from your cluster, refer to How to Add and Remove Nodes.
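To see which nodes are currently affected, one option is to filter the node list reported by Kubernetes. The sketch below assumes that a lost node shows up with a status other than Ready in kubectl get nodes, and that kubectl is configured for the cluster. The helper name list_not_ready_nodes is hypothetical.

```shell
#!/bin/sh
# Sketch only: list_not_ready_nodes is a hypothetical helper, not part of OpenPAI.

# Reads `kubectl get nodes --no-headers` output on stdin and prints the name
# of every node whose STATUS column (second field) does not start with "Ready".
list_not_ready_nodes() {
    awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (assumes kubectl can reach the cluster):
#   kubectl get nodes --no-headers | list_not_ready_nodes
```

An empty result means Kubernetes currently reports every node as Ready.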
A Certain PAI Service is Not Working
You can view the service's log on the Kubernetes Dashboard to diagnose the problem. After the problem is addressed, restart the service using paictl.py:
./paictl.py service stop -n <service-name>
./paictl.py service start -n <service-name>
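If you restart services often, the two commands above can be wrapped in a small helper. This is a sketch under assumptions: restart_pai_service and the PAICTL variable are hypothetical names, and the script expects to run from the directory containing paictl.py unless PAICTL is overridden.

```shell
#!/bin/sh
# Sketch only: restart_pai_service and PAICTL are hypothetical names.

# Path to paictl.py; override PAICTL when running from another directory.
PAICTL=${PAICTL:-./paictl.py}

# Stops the named service, then starts it again only if the stop succeeded.
restart_pai_service() {
    if [ -z "$1" ]; then
        echo "usage: restart_pai_service <service-name>" >&2
        return 2
    fi
    "$PAICTL" service stop -n "$1" && "$PAICTL" service start -n "$1"
}
```

Chaining the two commands with && avoids starting a service whose stop step failed, so the error from the stop step stays visible.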