AWS Redis连接耗时的问题排查。

排查过程记录如下:

猜测1:php-fpm没有长连接,或长连接没发挥作用?
猜测2:aws的Redis本来连接耗时就有那么长?

验证:

ubuntu@ip-172-30-0-135:~$ ps -ef|grep php-fpm
......
www-data 29402 14023  0 Sep21 ?        00:00:01 php-fpm: pool www
www-data 29405 14023  0 Sep21 ?        00:00:01 php-fpm: pool www
www-data 30353 14023  0 Sep21 ?        00:00:01 php-fpm: pool www
www-data 31393 14023  0 Sep21 ?        00:00:01 php-fpm: pool www
ubuntu@ip-172-30-0-135:~$ sudo  lsof -n -p 29405
......
php-fpm7. 29405 www-data    6u  IPv4 1802702246      0t0       TCP 172.30.0.135:60678->172.30.0.54:6666 (ESTABLISHED)
php-fpm7. 29405 www-data    7u  IPv4 1803131110      0t0       TCP 172.30.0.135:42250->172.30.0.103:6666 (ESTABLISHED)
php-fpm7. 29405 www-data    8u  IPv4 1803403028      0t0       TCP 172.30.0.135:55208->172.30.0.187:6666 (ESTABLISHED)
php-fpm7. 29405 www-data    9u  IPv4 1803403030      0t0       TCP 172.30.0.135:51670->172.30.0.139:6666 (ESTABLISHED)
php-fpm7. 29405 www-data   10u  IPv4 1804681394      0t0       TCP 172.30.0.135:55946->172.30.0.88:6666 (ESTABLISHED)
php-fpm7. 29405 www-data   11u  IPv4 1803411631      0t0       TCP 172.30.0.135:46534->172.30.0.116:6666 (ESTABLISHED)
php-fpm7. 29405 www-data   12u  IPv4 1804692070      0t0       TCP 172.30.0.135:36904->172.30.0.244:6666 (ESTABLISHED)
php-fpm7. 29405 www-data   13u  IPv4 1804692072      0t0       TCP 172.30.0.135:40814->172.30.0.26:6666 (ESTABLISHED)
php-fpm7. 29405 www-data   14u  IPv4 1804692470      0t0       TCP 172.30.0.135:34046->172.30.0.147:6666 (ESTABLISHED)
php-fpm7. 29405 www-data   15u  IPv4 1804692471      0t0       TCP 172.30.0.135:50280->172.30.0.225:7379 (ESTABLISHED)
php-fpm7. 29405 www-data   16u  IPv4 1804692473      0t0       TCP 172.30.0.135:53992->172.30.0.85:6666 (ESTABLISHED)
php-fpm7. 29405 www-data   17u  IPv4 1804692475      0t0       TCP 172.30.0.135:52750->172.30.0.154:6666 (ESTABLISHED)
php-fpm7. 29405 www-data   18u  IPv4 1804738435      0t0       TCP 172.30.0.135:33916->172.30.0.58:6666 (ESTABLISHED)
php-fpm7. 29405 www-data   19u  IPv4 1804692479      0t0       TCP 172.30.0.135:47340->172.30.0.75:6666 (ESTABLISHED)
php-fpm7. 29405 www-data   21u  IPv4 1804697603      0t0       TCP 172.30.0.135:59444->172.30.0.195:6666 (ESTABLISHED)
php-fpm7. 29405 www-data   22u  IPv4 1804738439      0t0       TCP 127.0.0.1:33540->127.0.0.1:6379 (ESTABLISHED)
php-fpm7. 29405 www-data   23u  IPv4 1804697607      0t0       TCP 172.30.0.135:58396->172.30.0.28:6666 (ESTABLISHED)
php-fpm7. 29405 www-data   29u  IPv4 1804797775      0t0       TCP 172.30.0.135:38474->172.30.0.68:6666 (ESTABLISHED)

结论:随意挑一个 php-fpm 进程,lsof 可以查看到当前Redis连接,过几分钟再次 lsof 连接仍然存在,且本地端口无变化,故推断 pconnect 长连接是有效的。

猜测2:aws的Redis本来连接耗时就有那么长?

脚本测试 dummy-Redis的pconnect耗时

root@ip-172-30-0-135:/ebs_web/dummyadmin/build/www/app-access# ./strace-redis-pconnect.sh 
strace: Process 1625 attached
string(48) "pconnect耗时:34465us    main总耗时:37490us"

用strace追踪进程执行过程记录:strace-1625.log

image

结论:跟踪系统调用过程,测试后发现Redis连接会先到 172.30.0.2 去做域名解析,主要耗时都在这个过程。这个耗时非必现且无规律,pconnect 耗时超过 1000us(即1ms)基本都是此问题。

排查脚本

strace-redis-pconnect.sh

#!/bin/bash
/usr/bin/php -d extension=xhprof.so -d extension=redis.so xhprof-redis-pconnect.php 2>&1 &
pid=$(ps -ef|grep xhprof-redis-pconnect.php | grep -v grep | awk '{print $2}')
sleep 1s
strace -o strace-$pid.log -s 1024 -T -p $pid

xhprof-redis-pconnect.php

<?php
define('POKER', TRUE);
define( 'PLATFORM', 'AdminID' );
require_once( './threepoker/Application/config.api.php' );
sleep(3);  //挂起进程,并启用 strace 开始追踪
xhprof_enable();
$baseInfo = Lib_Load::RedisAmSlave()->get('UINF|1234698', true);
$xhprof_data = xhprof_disable();
var_dump('pconnect耗时:' . $xhprof_data['Lib_Redis::__construct==>Redis::pconnect']['wt'] . "us\tmain总耗时:" . $xhprof_data['main()']['wt'] . 'us');
?>

解决方案:

将 Redis服务器域名加入 /etc/hosts 或者直接用 IP 连接,避免域名解析的过程。

加入后测试:

root@ip-172-30-0-135:/ebs_web/dummyadmin/build/www/app-access# ./strace-redis-pconnect.sh 
strace: Process 13829 attached
string(45) "pconnect耗时:768us    main总耗时:3731us"

用strace追踪进程执行过程记录:strace-13829.log

image

添加 hosts 后,已经没有了域名解析过程,经多次测试 pconnect 耗时稳定在 1000us 左右 后记:

image