Appearance and Mechanism of a Database Failure
A project team reported a failure case to D-SMART R&D. Reporting such cases is a routine mechanism for improving D-SMART-based monitoring.
They had found that logons to one of their databases had recently, on occasion, been taking a very long time.
After analysis, the on-site DBA found that AUD$ had not been cleaned up for a long time and had grown to tens of millions of rows.
After AUD$ was cleared, the phenomenon has not recurred so far.
Based on this case, they reported an operation and maintenance lesson to the D-SMART project team:
if AUD$ is not cleaned up for a long time, there is a risk of increased session login delay.
After seeing this requirement, my first reaction was that the D-SMART daily inspection should already have an AUD$ check item.
So I asked them to confirm on site whether this check item had been disabled in their template.
It turned out that the check item had not been disabled, and that it raised an alarm every day.
The daily inspection alarm had also been reported to the person in charge at Party A, but because network access was blocked,
the cleanup could not be carried out recently, so the data had backed up.
Having dealt with this, I moved on to other things.
But I kept feeling that something was wrong.
Suddenly I wondered: what is the inherent relationship between an oversized AUD$ and the LOGON timeout?
They do not seem to be directly related,
because LOGON just inserts one row into AUD$;
it does not read the data in the AUD$ table to do any analysis.
There should not be a huge difference between inserting a row into a table with 1000 records and inserting a row into a table with 10 million records.
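The intuition above can be tried out on a small scale. The following is only a rough sketch: it uses an in-memory SQLite table as a stand-in, which is an assumption of mine and not how Oracle stores AUD$, and the table name `audit_log`, the function `time_single_inserts`, and the row counts are all made up for illustration.

```python
import sqlite3
import time


def time_single_inserts(existing_rows, probes=200):
    """Load `existing_rows` rows, then time `probes` single-row inserts.

    Returns (average seconds per insert, final row count).
    SQLite is only a stand-in here; it is not Oracle and not AUD$.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE audit_log (id INTEGER PRIMARY KEY, action TEXT)")
    # Preload the table so the probes run against a table of the given size.
    conn.executemany(
        "INSERT INTO audit_log (action) VALUES (?)",
        (("LOGON",) for _ in range(existing_rows)),
    )
    conn.commit()

    # Each probe mimics one logon writing one audit row and committing.
    start = time.perf_counter()
    for _ in range(probes):
        conn.execute("INSERT INTO audit_log (action) VALUES ('LOGON')")
        conn.commit()
    elapsed = time.perf_counter() - start

    total = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
    conn.close()
    return elapsed / probes, total


small_latency, small_total = time_single_inserts(1_000)
large_latency, large_total = time_single_inserts(200_000)
print(f"1,000-row table:   {small_latency * 1e6:.1f} us per insert")
print(f"200,000-row table: {large_latency * 1e6:.1f} us per insert")
```

The absolute timings depend on the machine, so no expected output is given. The point is only to compare the two averages and see whether table size alone moves them by anything like two orders of magnitude.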
So I asked the on-site DBA on WeChat how he had concluded that AUD$ caused the LOGON timeout.
He said that he had analyzed the INSERT statements against AUD$: during the failure, 244 INSERTs took 128 seconds;
after AUD$ was cleared, 304 INSERTs took 1.25 seconds, so the effect was very obvious.
So I asked him to check whether there was anything like a LOGON trigger,
or any special audit items. His answer was no.
This is puzzling. From the point of view of the mechanism, AUD$ growing to 5GB+ should not make inserting a record 100 times slower and cause LOGON timeouts.
So I believed there must be other reasons for this problem.
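The figures quoted by the on-site DBA can be turned into a per-INSERT latency to see how large the gap really is. This is just a small arithmetic sketch; the variable and function names are my own, and only the four numbers come from the case.

```python
# Figures quoted by the on-site DBA.
inserts_during_failure = 244
seconds_during_failure = 128.0
inserts_after_cleanup = 304
seconds_after_cleanup = 1.25


def per_insert_ms(count, seconds):
    """Average latency of one INSERT, in milliseconds."""
    return seconds / count * 1000.0


before_ms = per_insert_ms(inserts_during_failure, seconds_during_failure)
after_ms = per_insert_ms(inserts_after_cleanup, seconds_after_cleanup)
slowdown = before_ms / after_ms

print(f"during failure: {before_ms:.1f} ms per INSERT")  # 524.6 ms
print(f"after cleanup:  {after_ms:.1f} ms per INSERT")   # 4.1 ms
print(f"slowdown:       {slowdown:.0f}x")                # 128x
```

An average of roughly half a second per single-row INSERT is far more than table size alone would be expected to explain, which is why the numbers point toward something else slowing the writes.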
So I asked him to use hola to export the data to the laboratory.
Unfortunately, their version is V2.1, and because they have to manage thousands of operation and maintenance objects,
they use the distributed deployment mode. The current hola 1.0.2 has a bug:
it cannot export data from a D-SMART distributed deployment.
So I asked him to generate a “Problem Analysis” report for each of the two points in time when the failure occurred that day, and send them over.
Another node had also experienced LOGON timeouts, but according to its problem analysis report the wait events were completely different.
The top ones were IO-related, such as log file sync.
At the same time, we also found GCS log flush sync waits in the system.
Taken together, the increase in AUD$ write latency looks less like the cause of the problem and more like a symptom: insert performance was being affected by other issues.
So I asked the on-site DBA to send both the operating system logs and the database logs.
On the node with GC contention the logs were normal, but on the node with the log file sync delays I found alarm messages about multipath jitter.
From that point, the chain of events became very clear.
Because of the multipath jitter on node 2, the IO latency was unstable.
This degraded the performance of inserts into AUD$, which eventually led to the LOGON timeouts.
An inadvertent discovery thus eliminated a serious hidden danger in a system that is very busy at the end and the beginning of each month, when it has to do a lot of data writing and calculation.
Fortunately, the problem was discovered while the business load was very light.
Otherwise, there would have been big problems at the end of the month.
The discovery of this problem also involved a good deal of chance.
The on-site DBA originally thought the problem had been solved.
Had it not been for the fault case-sharing mechanism between the field and the back end,
this hidden danger would probably not have been discovered until it caused a major problem.
When analyzing problems, in most cases we start only from the appearance, and we consider the problem solved if it does not recur for the time being.
If the person analyzing the problem lacks a deep understanding of database mechanisms,
it is difficult to find the deep-seated problems behind these appearances.
And in practice, an operation and maintenance organization cannot staff its front line with DBAs of such a high level.
It is for this reason that we keep emphasizing that tools are not omnipotent,
and neither is front-line on-site operation and maintenance.
It is necessary to form a good closed loop for problem analysis, in which high-level experts, first-line and second-line operation and maintenance personnel,
and high-quality monitoring data collection and analysis tools are combined into a complete system,
so that problems can be analyzed and solved more efficiently and accurately.
On the issue of appearance and mechanism, I have always emphasized the importance of tracing the source of the problem or the root cause.
I have discussed this issue with some operation and maintenance experts before.
Some people from Internet companies think that when there is a problem with the system,
the high-availability mechanism will take care of it, and cutting off the problematic components will naturally solve the problem.
Others think the key is to get back into operation as soon as possible after a problem occurs; root cause analysis can be done if it is feasible,
and if it is not, the enterprise cannot afford to invest that much in it.
From the perspective of certain scenarios and users, these points are also true.
However, not all companies have the complete high-availability mechanisms of Internet companies,
and not every problem can be solved by restarting.
Therefore, tracing the source of the problem and closing the loop of the problem should still be valuable for most enterprises.
The main obstacles to implementing this fully at present are excessive cost and insufficient capability.
The IT health management mechanism is meant to solve the cost problem of this decentralized analysis.
With the rich data acquisition and diagnostic reports of D-SMART, coupled with the recently launched hola data exchange tool,
third-line experts can serve customers across the country without leaving their homes, and their efficiency has improved significantly.
For yesterday’s problem, including the discussion in the WeChat group and the on-site data collection,
the experts spent only 20 minutes on locating the problem.