Diagnosing Problems With Containerized Apps

Glitch
4 min read · Jan 15, 2023


Steps of problem diagnosis

Monitoring the status of the system

Discovering problems

Analyzing problems

Locating causes

Proposing solutions

Applying the solution

Verifying whether the problem has been solved.

As the classical iceberg model shows, the visible symptoms are often only a small part of the problem. In complex systems and applications, problems may have deeper causes. So the first difficult task in solving a problem is to analyze the symptoms and find the cause. Analyzing and diagnosing the problem is the most critical step: only by locating the cause can we really solve it. The cause may be, for example, hardware problems, performance bottlenecks, code defects, or bottlenecks in the network model.

For containerized applications, problem diagnosis becomes more complex. Traditional applications are deployed on servers or virtual machines, where you can fully monitor the resources an application uses and the processes it runs. The running state of the whole application is transparent to users. With containers, however, users cannot directly observe the state of applications. In container clusters in particular, users cannot directly monitor resource usage and system processes. In a Kubernetes cluster, for example, users typically see only the status of pods and cannot monitor containers and the applications inside them in real time the way they can with traditional applications. When a fault shows up as an abnormal pod, the underlying cause can vary: the cluster state may be abnormal, a security group policy may be incorrect, physical resources may be insufficient, and so on. For containerized applications, then, the key work in diagnosis is to analyze and identify the symptoms in order to find the cause of the problem.
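To make the gap between symptom and cause concrete, here is a minimal Python sketch that collects what a pod object exposes: its phase and any conditions that are not satisfied. It assumes the pod has the shape of the JSON printed by `kubectl get pod <name> -o json`. The sample pod, its name, and its message text are made up for illustration, and `pod_symptoms` is a hypothetical helper, not part of any Kubernetes client.

```python
# Hypothetical pod object, trimmed to the fields read below. In practice it
# could be parsed from the output of `kubectl get pod <name> -o json`.
SAMPLE_POD = {
    "metadata": {"name": "web-0"},
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/3 nodes are available: 3 Insufficient cpu.",
            }
        ],
    },
}


def pod_symptoms(pod):
    """Collect the visible symptoms of a pod: its phase and failing conditions."""
    status = pod.get("status", {})
    symptoms = {"phase": status.get("phase", "Unknown"), "failing": []}
    for cond in status.get("conditions", []):
        # A condition whose status is not "True" is a symptom worth reporting.
        if cond.get("status") != "True":
            symptoms["failing"].append(
                (cond.get("type"), cond.get("reason"), cond.get("message"))
            )
    return symptoms


print(pod_symptoms(SAMPLE_POD))
```

The output only describes the symptom (a pending pod that could not be scheduled). Deciding whether the cause is insufficient physical resources, an abnormal cluster, or something else still requires the analysis described above.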

Kubernetes basic architecture

Kubernetes is now the most popular open source orchestration and management tool for Docker containers. To better understand the relationship between the symptoms and the underlying causes of Kubernetes cluster failures, it helps to review the overall architecture of Kubernetes. Kubernetes is a distributed system, and a Kubernetes cluster includes master nodes and multiple computing nodes. The master node includes four important components: etcd, the API server, the controller manager, and the scheduler.

The computing node consists of two main components:

kubelet

kube-proxy

etcd is a distributed, reliable key-value store for the most critical data of the cluster. It saves the state of the whole system. The API server provides the only entry point for resource operations, along with authentication, authorization, access control, and API registration and discovery mechanisms. The scheduler is responsible for resource scheduling: it assigns pods to the appropriate machines according to the scheduling policy. The controller manager is responsible for maintaining the state of the cluster, including fault detection, automatic scaling, and rolling updates. In particular, a Kubernetes cluster can also interact with other cloud resources through the controller manager. On a computing (worker) node, the kubelet maintains the life cycle of containers and manages volumes and networking, and kube-proxy provides service discovery and load balancing for cluster services. In addition, Docker acts as the container runtime, which manages images and actually runs pods and containers.
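As a quick reference, the architecture described above can be written down as a small Python table. This is only a sketch: the `Component` class and `components_on` helper are hypothetical names chosen for illustration, the component names use the hyphenated form of the usual binaries, and the role strings simply paraphrase the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    node: str  # "master" or "worker"
    role: str


# Components and roles as described in the text above.
COMPONENTS = [
    Component("etcd", "master", "key-value store holding the cluster state"),
    Component("kube-apiserver", "master", "only entry point for resource operations"),
    Component("kube-scheduler", "master", "assigns pods to machines"),
    Component("kube-controller-manager", "master", "maintains the cluster state"),
    Component("kubelet", "worker", "container life cycle, volumes, networking"),
    Component("kube-proxy", "worker", "service discovery and load balancing"),
    Component("container runtime", "worker", "manages images, runs pods and containers"),
]


def components_on(node):
    """Return the names of the components that run on the given node type."""
    return [c.name for c in COMPONENTS if c.node == node]


print(components_on("master"))
print(components_on("worker"))
```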

Having reviewed the basic structure and components of Kubernetes, here is a brief introduction to the fault symptoms that may correspond to each component, as a reference for diagnosing problems in a Kubernetes cluster. If kube-proxy is abnormal, the network of pod services may fail. When the kubelet fails, pod states may become abnormal. If resources on the cloud platform are not created properly, the cloud controller may have failed. When kube-controller-manager is abnormal, the system's failure recovery may stop working. If kube-scheduler fails, pods may be scheduled abnormally. kube-apiserver monitors these functional components asynchronously and detects their state. Usually kube-apiserver itself does not fail. The data and state of the cluster are persisted to etcd through kube-apiserver. Therefore, ensuring that etcd is in a normal state should take priority over other problems and faults in the system. Later, some common problems and the general steps of problem diagnosis in a Kubernetes cluster will be introduced.
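The symptom-to-component hints above can be summarized as a lookup table. The Python sketch below is illustrative only: the symptom labels are hypothetical keys invented for this example, `components_to_check` is a hypothetical helper, and the component names are written in the hyphenated form of the usual binaries. It puts etcd first, following the priority stated in the text.

```python
# Symptom label -> component to suspect, taken from the paragraph above.
# The labels themselves are hypothetical names chosen for this sketch.
SUSPECTS = {
    "service_network_failure": "kube-proxy",
    "abnormal_pod_state": "kubelet",
    "cloud_resource_not_created": "cloud controller",
    "failure_recovery_not_working": "kube-controller-manager",
    "abnormal_pod_scheduling": "kube-scheduler",
}


def components_to_check(symptoms):
    """Return the components to inspect, etcd first, without duplicates."""
    ordered = ["etcd"]  # the text gives etcd priority over other faults
    for symptom in symptoms:
        suspect = SUSPECTS.get(symptom)
        if suspect and suspect not in ordered:
            ordered.append(suspect)
    return ordered


print(components_to_check(["abnormal_pod_state", "service_network_failure"]))
```

A table like this is only a starting point for diagnosis. As noted earlier, the same pod symptom can have several different causes.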
