Author: Zhang Luodan
Formerly a member of the aikesheng DBA team, now on the lufax DBA team, with a persistent passion for technology!
Source: original contribution
*Produced by the aikesheng open source community. The original content may not be used without authorization. To reprint, please contact the editor and credit the source.
Background
One night, the database hung, with the following symptoms:
- The application reported errors: org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
- Logging in to MySQL was impossible; the login command simply hung with no response
With no other option, we forcibly killed the mysqld process and restarted the database to recover.
The cause of the hang itself is not discussed here; this article only analyzes why MHA did not trigger a failover even though the database was hung.
Conclusion
Let's start with the conclusion. By default, MHA health checks ping the database over a long-lived connection (executing select 1 As Value), and a failover is triggered only after MySQL fails the check 4 times in a row.
In the incident above, only new connections could not be established; existing connections were unaffected. Moreover, the MHA health-check statement is so simple that it is handled entirely in the server layer and never touches the InnoDB layer. MHA therefore considered MySQL healthy and took no action.
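For reference, this default behavior corresponds to settings like the following in the MHA application configuration file. This is only an illustrative sketch: ping_type defaults to SELECT and, per the MHA defaults, ping_interval defaults to 3 seconds, so neither line normally needs to be written explicitly.

[server default]
# default health check: reuse one long-lived connection and run "SELECT 1 As Value"
ping_type=SELECT
# seconds between two checks; 4 consecutive failures are required before failover
ping_interval=3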
Solution
Since version 0.53, MHA has supported the ping_type parameter, which controls how the master's availability is checked. Three values are supported:
- SELECT: connect to MySQL over a long-lived connection and execute select 1 As Value. The connection is reused, but the check is too simple to detect many kinds of failures.
- CONNECT: create a new connection before each execution of select 1 As Value and disconnect it afterwards, which catches more failures at the TCP connection level.
Note: in this case, the MHA monitoring process forks a child process to perform the check.
- INSERT: execute an INSERT statement over an existing MySQL connection, which is better at detecting failures caused by exhausted disk space or disk I/O resources.
After changing ping_type to CONNECT, MHA has to establish a new connection for every health check; if the new connection cannot be established, a failover is triggered. A sketch of the configuration change follows.
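A minimal sketch of the change, reusing the [server default] section of the test configuration shown later (only the lines relevant to health checking; everything else stays unchanged):

[server default]
ping_interval=5
# create a fresh connection for every health check so connection-level failures are caught
ping_type=CONNECT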
Code of the three detection mechanisms:
## If acquiring the advisory lock fails, return 2; return 0 on success and 1 on failure
sub ping_connect($) {
  my $self        = shift;
  my $log         = $self->{logger};
  my $dbh;
  my $rc          = 1;
  my $max_retries = 2;
  eval {
    my $ping_start = [gettimeofday];
    # Try to connect up to max_retries times; give up if the connection keeps failing
    while ( !$self->{dbh} && $max_retries-- ) {
      eval { $rc = $self->connect( 1, $self->{interval}, 0, 0, 1 ); };
      if ( !$self->{dbh} && $@ ) {
        die $@ if ( !$max_retries );
      }
    }
    # Call ping_select
    $rc = $self->ping_select();
    # To hold advisory lock for some periods of time
    $self->sleep_until( $ping_start, $self->{interval} - 1.5 );
    $self->disconnect_if();
  };
  if ($@) {
    my $msg = "Got error on MySQL connect ping: $@";
    undef $@;
    $msg .= $DBI::err if ($DBI::err);
    $msg .= " ($DBI::errstr)" if ($DBI::errstr);
    $log->warning($msg) if ($log);
    $rc = 1;
  }
  return 2 if ( $self->{_already_monitored} );
  return $rc;
}

# Returns 0 on success and 1 on failure
sub ping_select($) {
  my $self = shift;
  my $log  = $self->{logger};
  my $dbh  = $self->{dbh};
  my ( $query, $sth, $href );
  eval {
    $dbh->{RaiseError} = 1;
    $sth = $dbh->prepare("SELECT 1 As Value");
    $sth->execute();
    $href = $sth->fetchrow_hashref;
    if ( !defined($href)
      || !defined( $href->{Value} )
      || $href->{Value} != 1 )
    {
      die;
    }
  };
  if ($@) {
    my $msg = "Got error on MySQL select ping: ";
    undef $@;
    $msg .= $DBI::err if ($DBI::err);
    $msg .= " ($DBI::errstr)" if ($DBI::errstr);
    $log->warning($msg) if ($log);
    return 1;
  }
  return 0;
}

# Returns 0 on success and 1 on failure
sub ping_insert($) {
  my $self = shift;
  my $log  = $self->{logger};
  my $dbh  = $self->{dbh};
  my ( $query, $sth, $href );
  eval {
    $dbh->{RaiseError} = 1;
    $dbh->do("CREATE DATABASE IF NOT EXISTS infra");
    $dbh->do(
"CREATE TABLE IF NOT EXISTS infra.chk_masterha (`key` tinyint NOT NULL primary key,`val` int(10) unsigned NOT NULL DEFAULT '0')"
    );
    $dbh->do(
"INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestamp()"
    );
  };
  if ($@) {
    my $msg = "Got error on MySQL insert ping: ";
    undef $@;
    $msg .= $DBI::err if ($DBI::err);
    $msg .= " ($DBI::errstr)" if ($DBI::errstr);
    $log->warning($msg) if ($log);
    return 1;
  }
  return 0;
}
Test
MHA configuration file
[server default]
manager_log=/Data/mha/log/workdir/my3306tst.log
manager_workdir=/Data/mha/workdir/my3306tst
remote_workdir=/Data/mysql/my3306/mha
master_binlog_dir=/Data/mysql/my3306/log
password=xxx
ping_interval=5
repl_password=xxx
repl_user=xxx
ssh_user=mysql
ssh_port=xxx
user=mha
master_ip_online_change_script="/usr/local/bin/master_ip_online_change"
master_ip_failover_script="master_ip_failover"

[server1]
hostname=xxx
port=3306
candidate_master=1

[server2]
hostname=xxx
port=3306
candidate_master=1
Note: for testing, ping_interval is set to 5 so that a failover can be observed quickly (with 4 consecutive failures required, the failover decision comes roughly 20 seconds after the first failed check). In production it can be adjusted according to how much downtime the business can tolerate.
Simulate the server CPU being saturated so that the database cannot establish new connections.
Write a simple C program as follows:
#include <stdio.h>

int main() {
    while (1);
    return 0;
}
Compile:
gcc -o out test_cpu.c
Execution:
for i in `seq 1 $(cat /proc/cpuinfo | grep "physical id" | wc -l)`; do ./out & done
In addition, run two mysqlslap stress tests:
mysqlslap -c 30000 -i 100 --detach=1 --query="select 1 from dual" --delimiter=";" -uxxx -pxxx -S /xxxx/xxx.sock
- With ping_type=CONNECT, 4 consecutive connection failures trigger a failover.
At that point, the MHA failover log contains the following connection error:
Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'waiting for initial communication packet',system error: 110)
- With ping_type=SELECT, no failover is triggered.
Interested readers can verify this themselves.
MHA health detection mechanism
Call chain:
MasterMonitor.pm | MHA::MasterMonitor::main()
  --> MasterMonitor.pm | MHA::MasterMonitor::wait_until_master_is_dead()
  --> MasterMonitor.pm | MHA::MasterMonitor::wait_until_master_is_unreachable()
  --> HealthCheck.pm   | MHA::HealthCheck::wait_until_unreachable()
  --> HealthCheck.pm   | MHA::HealthCheck::ping_select()   or
      HealthCheck.pm   | MHA::HealthCheck::ping_insert()   or
      HealthCheck.pm   | MHA::HealthCheck::ping_connect()  (depending on ping_type)
After the MHA monitoring process starts, it continuously monitors the status of the master node. The core health-check function is wait_until_unreachable().
PS: when the MHA monitoring process starts, it reads the configuration file and runs a series of checks against the servers listed there, including liveness, version information, slave configuration (read_only, relay_log_purge, log_bin, replication filters, etc.) and SSH connectivity; if any check fails, the monitor will not start.
Inside this function, an infinite loop performs continuous health checks:
1. First, test the connection: 0 is returned if the connection succeeds, otherwise 1.
- If the connection to MySQL succeeds, an advisory lock is acquired; if acquiring the lock fails, the status value returned is 1.
- If the connection to MySQL fails, status value 1 is returned together with the connection error. For the error codes listed below (e.g. 1040 too many connections, 1045 access denied), MHA considers the MySQL process itself to be alive, so it does not trigger a failover and simply keeps retrying the connection check.
our @ALIVE_ERROR_CODES = (
  1040,    # ER_CON_COUNT_ERROR
  1042,    # ER_BAD_HOST_ERROR
  1043,    # ER_HANDSHAKE_ERROR
  1044,    # ER_DBACCESS_DENIED_ERROR
  1045,    # ER_ACCESS_DENIED_ERROR
  1129,    # ER_HOST_IS_BLOCKED
  1130,    # ER_HOST_NOT_PRIVILEGED
  1203,    # ER_TOO_MANY_USER_CONNECTIONS
  1226,    # ER_USER_LIMIT_REACHED
  1251,    # ER_NOT_SUPPORTED_AUTH_MODE
  1275,    # ER_SERVER_IS_IN_SECURE_AUTH_MODE
);
2. After the test connection succeeds, the health check proper runs (one of the three methods above). If the check fails four times in a row, the secondary check script (if defined) is run on the fourth failure; if that check also concludes the master is unreachable, the master is declared dead. A sketch of such a configuration is shown below.
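For the secondary check mentioned above, MHA ships the masterha_secondary_check script, which probes the master from one or more other hosts (the manager appends the master's connection information when invoking it). A hedged configuration sketch, with placeholder host names:

[server default]
secondary_check_script=masterha_secondary_check -s remote_host1 -s remote_host2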
Code of the key function wait_until_unreachable():
# main function
sub wait_until_unreachable($) {
  my $self           = shift;
  my $log            = $self->{logger};
  my $ssh_reachable  = 2;
  my $error_count    = 0;
  my $master_is_down = 0;

  eval {
    while (1) {
      $self->{_tstart} = [gettimeofday];
      ## Decide whether a (re)connection is needed
      if ( $self->{_need_reconnect} ) {
        my ( $rc, $mysql_err ) =
          $self->connect( undef, undef, undef, undef, undef, $error_count );
        if ($rc) {
          if ($mysql_err) {
            # If the error code is in ALIVE_ERROR_CODES, no failover is triggered.
            # A common case is a wrong user password, which does not trigger a switch.
            if (
              grep ( $_ == $mysql_err,
                @MHA::ManagerConst::ALIVE_ERROR_CODES ) > 0
              )
            {
              $log->info(
"Got MySQL error $mysql_err, but this is not a MySQL crash. Continue health check.."
              );
              # Go straight to the next iteration
              $self->sleep_until();
              next;
            }
          }
          $error_count++;
          $log->warning("Connection failed $error_count time(s)..");
          $self->handle_failing();

          if ( $error_count >= 4 ) {
            $ssh_reachable = $self->is_ssh_reachable();
            # is_secondary_down() returns 1 if the master is down, 0 otherwise
            $master_is_down = 1 if ( $self->is_secondary_down() );
            # Leave the loop once the master is confirmed down
            last if ($master_is_down);
            $error_count = 0;
          }
          $self->sleep_until();
          next;
        }

        # connection ok
        $self->{_need_reconnect} = 0;
        $log->info(
"Ping($self->{ping_type}) succeeded, waiting until MySQL doesn't respond.."
        );
      }

      # If ping_type is CONNECT, disconnect first
      $self->disconnect_if()
        if ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_CONNECT );

      # Parent process forks one child process. The child process queries
      # from MySQL every <interval> seconds. The child process may hang on
      # executing queries.
      # DBD::mysql 4.022 or earlier does not have an option to set
      # read timeout, executing queries might take forever. To avoid this,
      # the parent process kills the child process if it won't exit within
      # <interval> seconds.
      my $child_exit_code;
      eval {
        # Call the detection function matching ping_type
        if ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_CONNECT ) {
          $child_exit_code = $self->fork_exec( sub { $self->ping_connect() },
            "MySQL Ping($self->{ping_type})" );
        }
        elsif ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_SELECT ) {
          $child_exit_code = $self->fork_exec( sub { $self->ping_select() },
            "MySQL Ping($self->{ping_type})" );
        }
        elsif ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_INSERT ) {
          $child_exit_code = $self->fork_exec( sub { $self->ping_insert() },
            "MySQL Ping($self->{ping_type})" );
        }
        else {
          die "Not supported ping_type!\n";
        }
      };
      if ($@) {
        my $msg = "Unexpected error heppened when pinging! $@";
        $log->error($msg);
        undef $@;
        $child_exit_code = 1;
      }

      if ( $child_exit_code == 0 ) {
        # ping ok
        ## On success, update the status and reset the error counter
        $self->update_status_ok();
        if ( $error_count > 0 ) {
          $error_count = 0;
        }
        $self->kill_sec_check();
        $self->kill_ssh_check();
      }
      elsif ( $child_exit_code == 2 ) {
        $self->{_already_monitored} = 1;
        croak;
      }
      else {
        ## Ping failed; mark that a reconnect is needed
        # failed on fork_exec
        $error_count++;
        $self->{_need_reconnect} = 1;
        $self->handle_failing();
      }
      $self->sleep_until();
    }
    $log->warning("Master is not reachable from health checker!");
  };
  if ($@) {
    my $msg = "Got error when monitoring master: $@";
    $log->warning($msg);
    undef $@;
    return 2 if ( $self->{_already_monitored} );
    return 1;
  }
  return 1 unless ($master_is_down);
  return ( 0, $ssh_reachable );
}

1;