Failure analysis | Database hung but MHA did not fail over


Author: Zhang Luodan

Formerly a member of the aikesheng DBA team, now with the lufax DBA team; relentlessly passionate about technology.

Source: original contribution

*Produced by the aikesheng open source community. The original content may not be used without authorization; for reprints, please contact the editor and cite the source.

Background

One night the database hung. The symptoms were:

  • The application reported org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
  • New logins were impossible: entering the login command simply hung with no response

With no better option, we force-killed the process and restarted the database to recover.

This article will not go into why the database hung; it only analyzes why MHA did not trigger a failover even though the database was hung.

Conclusion

Let's start with the conclusion. By default, MHA performs its health check over a long-lived connection by pinging the database (executing select 1 As Value); only after the check fails 4 consecutive times does it trigger a failover.
In this incident only new connections could not be established; existing connections were unaffected. Moreover, MHA's health-check statement is extremely simple: it is answered entirely in the server layer and never touches the InnoDB layer. MHA therefore considered MySQL healthy and made no decision.
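
For reference, these are the statements behind the two styles of check, extracted from the HealthCheck.pm code quoted later in this article (the SELECT form is answered entirely in the server layer, while the INSERT form writes through the storage engine and can therefore also catch disk-full or I/O-exhaustion failures):

-- ping_type=SELECT (default): server layer only
SELECT 1 As Value;

-- ping_type=INSERT: goes through the storage engine (InnoDB by default)
CREATE DATABASE IF NOT EXISTS infra;
CREATE TABLE IF NOT EXISTS infra.chk_masterha
  (`key` tinyint NOT NULL PRIMARY KEY, `val` int(10) unsigned NOT NULL DEFAULT '0');
INSERT INTO infra.chk_masterha VALUES (1, unix_timestamp())
  ON DUPLICATE KEY UPDATE val = unix_timestamp();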

Solution

Starting with version 0.53, MHA supports the ping_type parameter, which sets how the master's availability is checked. Three values are supported:

  • SELECT: connect to MySQL over a long-lived connection and execute select 1 As Value. The connection is reused, but the check is too simple to catch many kinds of failure.
  • CONNECT: create a new connection before each execution of select 1 As Value and disconnect it afterwards, which catches more failures at the TCP connection level.

Note: with this setting, the MHA monitoring process forks a child process to perform the check.

  • INSERT: execute an INSERT statement over an existing MySQL connection, which is better at detecting failures caused by exhausted disk space or disk I/O resources.

Changing ping_type to CONNECT makes MHA create a new connection for every health check; if the new connection cannot be established, the failover is triggered.
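
With a configuration file like the one used in the test below, that is a one-line change in the [server default] section (a sketch; the upper-case value matches the PING_TYPE_* constants in the MHA code, and SELECT is what you get when the parameter is omitted):

[server default]
ping_type=CONNECT

Because the configuration is only read when the monitoring process starts, masterha_manager needs to be restarted for the change to take effect.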

Code of the three detection mechanisms (from HealthCheck.pm):

## Returns 2 if the advisory lock could not be acquired (the master is already monitored), 0 on success, 1 on failure
sub ping_connect($) {
  my $self = shift;
  my $log  = $self->{logger};
  my $dbh;
  my $rc          = 1;
  my $max_retries = 2;
  eval {
    my $ping_start = [gettimeofday];
    # Try to connect up to max_retries times; die if the final attempt still fails
    while ( !$self->{dbh} && $max_retries-- ) {
      eval { $rc = $self->connect( 1, $self->{interval}, 0, 0, 1 ); };
      if ( !$self->{dbh} && $@ ) {
        die $@ if ( !$max_retries );
      }
    }
    # Call ping_select
    $rc = $self->ping_select();

    # To hold advisory lock for some periods of time
    $self->sleep_until( $ping_start, $self->{interval} - 1.5 );
    $self->disconnect_if();
  };
  if ($@) {
    my $msg = "Got error on MySQL connect ping: $@";
    undef $@;
    $msg .= $DBI::err if ($DBI::err);
    $msg .= " ($DBI::errstr)" if ($DBI::errstr);
    $log->warning($msg) if ($log);
    $rc = 1;
  }
  return 2 if ( $self->{_already_monitored} );
  return $rc;
}

# Returns 0 on success, 1 on failure
sub ping_select($) {
  my $self = shift;
  my $log  = $self->{logger};
  my $dbh  = $self->{dbh};
  my ( $query, $sth, $href );
  eval {
    $dbh->{RaiseError} = 1;
    $sth = $dbh->prepare("SELECT 1 As Value");
    $sth->execute();
    $href = $sth->fetchrow_hashref;
    if ( !defined($href)
      || !defined( $href->{Value} )
      || $href->{Value} != 1 )
    {
      die;
    }
  };
  if ($@) {
    my $msg = "Got error on MySQL select ping: ";
    undef $@;
    $msg .= $DBI::err if ($DBI::err);
    $msg .= " ($DBI::errstr)" if ($DBI::errstr);
    $log->warning($msg) if ($log);
    return 1;
  }
  return 0;
}


# Returns 0 on success, 1 on failure
sub ping_insert($) {
  my $self = shift;
  my $log  = $self->{logger};
  my $dbh  = $self->{dbh};
  my ( $query, $sth, $href );
  eval {
    $dbh->{RaiseError} = 1;
    $dbh->do("CREATE DATABASE IF NOT EXISTS infra");
    $dbh->do(
"CREATE TABLE IF NOT EXISTS infra.chk_masterha (`key` tinyint NOT NULL primary key,`val` int(10) unsigned NOT NULL DEFAULT '0')"
    );
    $dbh->do(
"INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestamp()"
    );
  };
  if ($@) {
    my $msg = "Got error on MySQL insert ping: ";
    undef $@;
    $msg .= $DBI::err if ($DBI::err);
    $msg .= " ($DBI::errstr)" if ($DBI::errstr);
    $log->warning($msg) if ($log);
    return 1;
  }
  return 0;
}

Test

MHA configuration file:

[server default]
manager_log=/Data/mha/log/workdir/my3306tst.log
manager_workdir=/Data/mha/workdir/my3306tst
remote_workdir=/Data/mysql/my3306/mha
master_binlog_dir=/Data/mysql/my3306/log
password=xxx
ping_interval=5
repl_password=xxx
repl_user=xxx
ssh_user=mysql
ssh_port=xxx
user=mha
master_ip_online_change_script="/usr/local/bin/master_ip_online_change"
master_ip_failover_script="master_ip_failover"

[server1]
hostname=xxx
port=3306
candidate_master=1

[server2]
hostname=xxx
port=3306
candidate_master=1

Note: ping_interval is set to 5 during testing so the failover is quick to observe. In production it can be adjusted according to how much interruption the business can tolerate.

Simulate a server whose CPUs are saturated so that the database cannot accept new connections.
Write a simple C program:

#include <stdio.h>
int main()
{
    while(1);
    return 0;
}

Compile:

gcc -o out test_cpu.c

Run one copy per CPU:

for i in `seq 1 $(cat /proc/cpuinfo | grep "physical id" | wc -l)`; do ./out & done

In addition, run two mysqlslap stress tests:

mysqlslap -c 30000 -i 100 --detach=1 --query="select 1 from dual" --delimiter=";" -uxxx -pxxx -S /xxxx/xxx.sock

  • With ping_type=CONNECT, 4 consecutive connection failures trigger the failover. At that point the MHA log shows the connection error:

Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'waiting for initial communication packet', system error: 110)

  • With ping_type=SELECT, no failover is triggered.

Interested readers can try this out for themselves.
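
For reference, the test can be driven roughly as follows; the configuration path is a placeholder, and masterha_manager / masterha_check_status are the standard MHA manager commands:

# Start the monitor in the background and follow its log (manager_log from the config above)
nohup masterha_manager --conf=/etc/mha/my3306tst.cnf > /tmp/masterha_manager.out 2>&1 &
tail -f /Data/mha/log/workdir/my3306tst.log

# Confirm the monitor is running before injecting the CPU load
masterha_check_status --conf=/etc/mha/my3306tst.cnf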

MHA health check mechanism

Call chain:

MasterMonitor.pm|MHA::MasterMonitor::main() 
-->
MasterMonitor.pm|MHA::MasterMonitor::wait_until_master_is_dead() 
-->
MasterMonitor.pm|MHA::MasterMonitor::wait_until_master_is_unreachable() 
--> 
HealthCheck.pm|MHA::HealthCheck::wait_until_unreachable()
-->
HealthCheck.pm|MHA::HealthCheck::ping_select() or
HealthCheck.pm|MHA::HealthCheck::ping_insert() or
HealthCheck.pm|MHA::HealthCheck::ping_connect() (whichever matches ping_type)

Once the MHA monitoring process has started, it continuously monitors the master's status. The main health-check function is wait_until_unreachable().

PS: while starting up, the MHA monitoring process reads the configuration file and runs a series of checks against the servers listed in it, including liveness, version information, replica configuration (read_only, relay_log_purge, log_bin, replication filters, etc.) and SSH status; if any check fails, the monitor will not start.
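
These startup checks can also be run on their own before starting the monitor, using the standard MHA helper scripts (the configuration path is again a placeholder):

masterha_check_ssh --conf=/etc/mha/my3306tst.cnf
masterha_check_repl --conf=/etc/mha/my3306tst.cnf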

Inside this function an infinite loop performs the health checks continuously:

1. First, test the connection; 0 is returned if the connection is fine, otherwise 1.

  • If the connection to MySQL succeeds, an advisory lock is acquired; if acquiring the lock fails, the status value 1 is returned.
  • If the connection to MySQL fails, status value 1 is returned together with the connection error. For the connection-failure cases below (for example 1040 too many connections or 1045 access denied), MHA considers the MySQL process itself healthy, so it does not trigger a failover and simply keeps repeating the connection check:
our @ALIVE_ERROR_CODES = (
  1040,    # ER_CON_COUNT_ERROR
  1042,    # ER_BAD_HOST_ERROR
  1043,    # ER_HANDSHAKE_ERROR
  1044,    # ER_DBACCESS_DENIED_ERROR
  1045,    # ER_ACCESS_DENIED_ERROR
  1129,    # ER_HOST_IS_BLOCKED
  1130,    # ER_HOST_NOT_PRIVILEGED
  1203,    # ER_TOO_MANY_USER_CONNECTIONS
  1226,    # ER_USER_LIMIT_REACHED
  1251,    # ER_NOT_SUPPORTED_AUTH_MODE
  1275,    # ER_SERVER_IS_IN_SECURE_AUTH_MODE
);

2. Once the test connection succeeds, the health check itself is performed (one of the three methods above). If the check fails four consecutive times, the secondary script (if defined) is also run on the fourth failure; if that check agrees the master is unreachable, the master is considered dead.
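
The "secondary script" mentioned here is the secondary_check_script parameter. A sketch of a typical setting, using the masterha_secondary_check helper shipped with MHA and placeholder host names, looks like this:

[server default]
secondary_check_script=masterha_secondary_check -s remote_host1 -s remote_host2

The helper checks whether the master is still reachable from each of the listed hosts, so MHA only declares the master dead when it is unreachable from more than one vantage point.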

Code of the key function wait_until_unreachable():

# main function
sub wait_until_unreachable($) {
  my $self           = shift;
  my $log            = $self->{logger};
  my $ssh_reachable  = 2;
  my $error_count    = 0;
  my $master_is_down = 0;

  eval {
    while (1) {
      $self->{_tstart} = [gettimeofday];
      ## Determine whether a connection needs to be established
      if ( $self->{_need_reconnect} ) {
        my ( $rc, $mysql_err ) =
          $self->connect( undef, undef, undef, undef, undef, $error_count );
        if ($rc) {
          if ($mysql_err) {
            # If the error code is in ALIVE_ERROR_CODES, no failover is triggered. A common case is a wrong user/password, which therefore never triggers a failover
            if (
              grep ( $_ == $mysql_err, @MHA::ManagerConst::ALIVE_ERROR_CODES )
              > 0 )
            {
              $log->info(
"Got MySQL error $mysql_err, but this is not a MySQL crash. Continue health check.."
              );
              # Next goes directly to the next cycle
              $self->sleep_until();
              next;
            }
          }
          $error_count++;
          $log->warning("Connection failed $error_count time(s)..");
          $self->handle_failing();

          if ( $error_count >= 4 ) {
            $ssh_reachable = $self->is_ssh_reachable();
            # Returning 1 means that the main database is down, and 0 means that the main database is not down
            $master_is_down = 1 if ( $self->is_secondary_down() );
            # If the master is down, break out of the loop
            last if ($master_is_down);
            $error_count = 0;
          }
          $self->sleep_until();
          next;
        }

        # connection ok
        $self->{_need_reconnect} = 0;
        $log->info(
"Ping($self->{ping_type}) succeeded, waiting until MySQL doesn't respond.."
        );
      }
      # If ping_type is CONNECT, disconnect the connection
      $self->disconnect_if()
        if ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_CONNECT );

      # Parent process forks one child process. The child process queries
      # from MySQL every <interval> seconds. The child process may hang on
      # executing queries.
      # DBD::mysql 4.022 or earlier does not have an option to set
      # read timeout, executing queries might take forever. To avoid this,
      # the parent process kills the child process if it won't exit within
      # <interval> seconds.

      my $child_exit_code;
      eval {
        # Call detection function
        if ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_CONNECT ) {
          $child_exit_code = $self->fork_exec( sub { $self->ping_connect() },
            "MySQL Ping($self->{ping_type})" );
        }
        elsif ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_SELECT ) {
          $child_exit_code = $self->fork_exec( sub { $self->ping_select() },
            "MySQL Ping($self->{ping_type})" );
        }
        elsif ( $self->{ping_type} eq $MHA::ManagerConst::PING_TYPE_INSERT ) {
          $child_exit_code = $self->fork_exec( sub { $self->ping_insert() },
            "MySQL Ping($self->{ping_type})" );
        }
        else {
          die "Not supported ping_type!\n";
        }
      };
      if ($@) {
        my $msg = "Unexpected error heppened when pinging! $@";
        $log->error($msg);
        undef $@;
        $child_exit_code = 1;
      }

      if ( $child_exit_code == 0 ) {

        #ping ok
        ## If the ping is successful, update the status and set the counter to 0
        $self->update_status_ok();
        if ( $error_count > 0 ) {
          $error_count = 0;
        }
        $self->kill_sec_check();
        $self->kill_ssh_check();
      }
      elsif ( $child_exit_code == 2 ) {
        $self->{_already_monitored} = 1;
        croak;
      }
      else {  
        ## Ping failed (fork_exec returned a nonzero exit code)
        # failed on fork_exec
        $error_count++;
        $self->{_need_reconnect} = 1;
        $self->handle_failing();
      }
      $self->sleep_until();
    }
    $log->warning("Master is not reachable from health checker!");
  };
  if ($@) {
    my $msg = "Got error when monitoring master: $@";
    $log->warning($msg);
    undef $@;
    return 2 if ( $self->{_already_monitored} );
    return 1;
  }
  return 1 unless ($master_is_down);
  return ( 0, $ssh_reachable );
}

1;

Topics: MHA