Today we will learn how to troubleshoot common errors in OpenStack setups. OpenStack is one of the most widely used open source platforms for building public and private clouds. It lets you build and manage complex data center infrastructure with relative ease. It has been in the industry for more than six years and is continuously evolving to meet the demands of modern computing. If you have a pool of compute, storage and network resources, you can run OpenStack on top of that pool to build a working cloud. The latest release of OpenStack, known as Mitaka, has greatly simplified cloud setup and management, but you may still run into issues when creating instances, setting up networks and routers, or configuring object storage. Let’s see how to troubleshoot the different components of OpenStack.
OpenStack Components
Here are the main components usually present in medium to large scale OpenStack installations. We will go through troubleshooting tips for each of them in detail.
- Controller
- Compute
- Network (Neutron)
- Image Service
- Dashboard (Horizon)
Let’s review how to troubleshoot each of these components in detail.
Troubleshooting Controller
The Controller is the most important component of an OpenStack setup. It is responsible for communication between all other components (compute, network, storage, etc.). The Controller runs the message broker service and uses it to facilitate communication between all pillars of the cloud. It also runs the database service, where all other OpenStack services store their databases. The Controller is usually the main face of the cloud: it is where the dashboard (Horizon) typically runs, and its interface may be exposed to the outside world to provide access to the cloud. If you are seeing errors on the Controller, here are the log files you should check to identify and correct them.
/var/log/keystone/keystone.log > Check this file if you are facing authentication related errors on different services.
/var/log/messages > Check this file if you are seeing access errors between different cloud nodes.
/var/log/firewalld > Check this file if you are having a hard time getting your services to bind to certain ports/IPs.
The following services must be running on the Controller node(s); if any one of them fails, your cloud will show errors. (Use the following commands to verify service status on a CentOS system; you can use a similar utility on Ubuntu to verify service status.)
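As a rough illustration of such a check, the Python sketch below asks systemd for the state of each service through `systemctl is-active` and reports the ones that are not active. The unit names in ASSUMED_CONTROLLER_SERVICES are assumptions based on a typical CentOS install (database, message broker, cache, web server), not names taken from this article, and find_inactive is a hypothetical helper; adjust both to your deployment.

```python
import subprocess

# Assumed unit names for a CentOS controller; the article does not list them,
# so change this tuple to match the services in your own deployment.
ASSUMED_CONTROLLER_SERVICES = ("mariadb", "rabbitmq-server", "memcached", "httpd")


def systemctl_is_active(service):
    """Return the state string printed by `systemctl is-active <service>`."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", service],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
    except OSError:
        # systemctl is missing or cannot be executed on this machine.
        return "unknown"
    return result.stdout.strip() or "unknown"


def find_inactive(services, probe=systemctl_is_active):
    """Return {service: state} for every service whose state is not 'active'."""
    states = {name: probe(name) for name in services}
    return {name: state for name, state in states.items() if state != "active"}


if __name__ == "__main__":
    for name, state in find_inactive(ASSUMED_CONTROLLER_SERVICES).items():
        print("{}: {}".format(name, state))
```

The probe argument exists only so the helper can be exercised without systemd, for example in a test; on a real node the default is enough.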
Troubleshooting OpenStack Compute
Compute (Nova) is the main component responsible for running virtual machines and managing everything related to them. Compute can be a single node or a set of nodes, depending on your infrastructure. If you are unable to launch new instances, the problem most likely lies in the Compute component. Compute related issues can be investigated on both the Controller and Compute nodes. Here are some common log files you should look into if you are seeing compute related errors.
/var/log/nova/nova-api.log > This log file is located on both Controller and Compute nodes. Open this file to see exactly which error your compute component throws when you run a compute related operation from Horizon or the command line.
/var/log/nova/nova-cert.log > Check this file if your compute node is throwing errors related to secure layer protocols (SSL certificates). This file is available on the Controller node only.
/var/log/nova/nova-novncproxy.log > If you are able to launch instances but cannot access their VNC console, this is the log file to look at. It is located on the Controller node only.
/var/log/nova/nova-compute.log > This is the most important log file and is located on Compute nodes only. If you are unable to launch new instances, use this log file to identify the exact source of the problem.
/var/log/nova/nova-api-metadata.log > If your OpenStack instances are complaining about instance metadata, check this file on both Controller and Compute nodes to find the problem.
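These files can grow large, so it helps to filter them down to the lines that matter. The sketch below is one possible way to do that in Python: it pulls ERROR and CRITICAL entries out of a log file. It assumes the services use the default OpenStack log layout (timestamp, process ID, level, logger name, message); find_errors and LOG_LINE are hypothetical names, and a customized log format would need a different pattern.

```python
import re

# Assumed default log layout, for example:
#   2016-05-10 12:00:02.456 4242 ERROR nova.compute.manager [req-1] message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) (?P<message>.*)$"
)


def find_errors(path, levels=("ERROR", "CRITICAL")):
    """Yield (timestamp, logger, message) for each line logged at the given levels."""
    with open(path, errors="replace") as handle:
        for line in handle:
            match = LOG_LINE.match(line.rstrip("\n"))
            # Lines that do not follow the assumed layout are skipped.
            if match and match.group("level") in levels:
                yield (
                    match.group("timestamp"),
                    match.group("logger"),
                    match.group("message"),
                )
```

For example, iterating over find_errors("/var/log/nova/nova-compute.log") on a Compute node and printing each tuple gives a quick view of what failed during an instance launch.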
In case of compute related errors, always make sure that all required services are up and running. Here are the compute services that should always be running.
The following compute related services should be in “Running” status on Controller nodes.
The following services should be in “Running” status on Compute nodes.
Troubleshooting Networking (Neutron) Component
Neutron is the networking component of OpenStack. This is where you create networks, routers, VPNs and so on, and all traffic coming into the OpenStack cloud is first filtered at the Neutron level, so Neutron must be working properly for network connectivity and for communication among virtual machines. In old versions of OpenStack, networking was part of Compute (Nova), but the OpenStack development team has since split it out of Nova and made it a separate component. Many features are being added to Neutron so that it can cope with the growing needs of modern network virtualization. Neutron/Network is usually an independent node just like Controller and Compute, but sometimes the Controller and Neutron components are installed on the same machine, which works too. Let’s see how to troubleshoot Neutron related issues and which services should be running on the Controller, Compute and Network nodes for Neutron to work successfully.
/var/log/neutron/server.log > If you are unable to create networks or routers, this is the very first log file to check. It is located on the Neutron/Network node.
/var/log/neutron/openvswitch-agent.log > It is located on both Network and Compute nodes. If your virtual machines are failing to reach the external network or virtual routers, check this file to identify the exact errors.
/var/log/neutron/metadata-agent.log > This file can be found on either the Controller or Compute node. It stores common Neutron errors related to metadata.
/var/log/neutron/vpn-agent.log > If you have the VPN component of Neutron enabled, this log file stores VPN related errors. If your site-to-site VPNs are not working or you are having issues with IPsec, this is the place to look for the root cause of the problem.
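The list above is essentially a lookup from symptom to log file, and it can be kept handy as a small table in code. The Python sketch below is only an illustration of that idea: the paths and node names come from the list above, while the symptom labels and the where_to_look helper are hypothetical names chosen for this example.

```python
# Each entry: symptom label -> (log file, node(s) named in the list above).
# The labels are hypothetical; None means no node is named for that file.
NEUTRON_LOGS = {
    "create-network-or-router-fails": ("/var/log/neutron/server.log", "Network"),
    "vm-cannot-reach-external-network": (
        "/var/log/neutron/openvswitch-agent.log",
        "Network and Compute",
    ),
    "metadata-errors": ("/var/log/neutron/metadata-agent.log", "Controller or Compute"),
    "vpn-or-ipsec-fails": ("/var/log/neutron/vpn-agent.log", None),
}


def where_to_look(symptom):
    """Return a one-line hint for a known symptom label, or raise KeyError."""
    path, nodes = NEUTRON_LOGS[symptom]
    if nodes is None:
        return "check {}".format(path)
    return "check {} on the {} node(s)".format(path, nodes)
```

The same pattern could be extended with entries for the other components if you keep a runbook for your cloud.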
Alright, let’s see which services need to be running on the Controller, Compute and Neutron nodes.
On the Controller node, make sure the following command returns the service status as “Running”:
The following commands verify that all Neutron related services are running fine on the Network node. If any one of them returns a failed status, Neutron is unlikely to function fully.
The following services must be in “Running” state on Compute nodes.
Troubleshooting OpenStack Image Service (Glance)
The OpenStack image service, also called Glance, is responsible for storing cloud images for various operating systems; OpenStack uses these images to spin up new instances. Glance is not as complex a component as Neutron or Compute, so it is fairly easy to troubleshoot. Let’s see which log files to check when there are problems with Glance images. Please note that Glance is usually installed on the Controller node, so you should find the files and services mentioned below on the Controller.
/var/log/glance/api.log > This file stores all communication between Glance and the OpenStack Dashboard, so if you are unable to perform a Glance related operation from Horizon, check the logs in this file to find the exact cause of the problem.
/var/log/glance/registry.log > This is another file to look at when facing problems with Glance and its images. It stores logs for different operations between Keystone, Glance and Compute.
On the Controller node, the following command should return the service status as “Running”; otherwise your OpenStack setup will be unable to launch new instances and existing instances will show unpredictable behavior.
Troubleshooting OpenStack Dashboard (Horizon)
Horizon is the dashboard for OpenStack; operations within your cloud infrastructure can be performed from Horizon. If Horizon is not loading or not responding to external requests, the problem is probably with the Apache web server or Memcached. Horizon runs on the Apache web server, so to troubleshoot Horizon related errors we need to dig into the web server logs.
/etc/httpd/logs/error_log > This file should contain the Apache web server error logs. Check it to find out what is wrong with the Horizon Dashboard.
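When Horizon stops responding, the most recent entries in this file are usually the interesting ones. As a minimal sketch, the Python helper below returns the last few lines of a log file without loading the whole file into memory; tail is a hypothetical helper name, and the default of 20 lines is an arbitrary choice.

```python
from collections import deque


def tail(path, count=20):
    """Return the last `count` lines of the file at `path`, without newlines."""
    with open(path, errors="replace") as handle:
        # A bounded deque keeps only the newest `count` lines while reading.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
```

For example, printing each line of tail("/etc/httpd/logs/error_log", 50) right after reproducing the problem in the browser shows what Apache logged for that request.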
The following services should be in “Running” state on the Controller node for the Dashboard to operate successfully.
Conclusion
OpenStack is really good at logging errors, so almost all OpenStack related issues can be identified and resolved using the log files mentioned above. We just need to make sure we are looking in the correct log files and reading the relevant information in them. Hope you enjoyed this article; please let us know in the comments!