Linux – “service thesqld stop” times out (then found out that “thesqld dead but subsys locked”)

bashcentosdatabaselinuxMySQL

I installed mysql and server via yum on my 64bit CentOS 5 server. It starts up fine but when I try to stop it it stalls out and then I have to "Ctrl-C" it. Then I run a "service mysqld status" and it shows:

mysqld dead but subsys locked

I run ps aux and mysql is nowhere to be found. Starting mysqld again via "service mysqld start" works fine. Attempting to stop it creates the same problem.

I then realized that /var/lock/subsys/mysqld still exist. When running mysqld I checked /var/run/mysqld/mysqld.pid and it matched with the pid of the running service.

I tried reinstalling mysql and deleting all files and configs but to no avail.

What to do?

EDIT:

I added some echo statements in the /etc/init.d/mysqld file, particularly in the stop function:

stop(){
        if [ ! -f "$mypidfile" ]; then
            # not running; per LSB standards this is "ok"
            action $"Stopping $prog: " /bin/true
            return 0
        fi  
        echo "beginning stop sequence"
        MYSQLPID=`cat "$mypidfile"`
        if [ -n "$MYSQLPID" ]; then
            /bin/kill "$MYSQLPID" >/dev/null 2>&1
            echo "killing pid $MYSQLPID"
            ret=$?
            if [ $ret -eq 0 ]; then
                echo "return code $ret after kill attempt"
                TIMEOUT="$STOPTIMEOUT"
                echo "timeout is set to $STOPTIMEOUT"
                while [ $TIMEOUT -gt 0 ]; do
                    /bin/kill -0 "$MYSQLPID" >/dev/null 2>&1 || break
                    sleep 1
                    let TIMEOUT=${TIMEOUT}-1
                    echo "timeout is now $TIMEOUT"
                done
                if [ $TIMEOUT -eq 0 ]; then
                    echo "Timeout error occurred trying to stop MySQL Daemon."
                    ret=1
                    action $"Stopping $prog: " /bin/false
                else
                    echo "attempting to del lockfile: $lockfile"
                    rm -f $lockfile
                    rm -f "$socketfile"
                    action $"Stopping $prog: " /bin/true
                fi
            else
                action $"Stopping $prog: " /bin/false
            fi
        else
            # failed to read pidfile, probably insufficient permissions
            action $"Stopping $prog: " /bin/false
            ret=4
        fi
        return $ret
}

This is the result I get when I try to stop the service:

[root@server]# service mysqld stop
beginning stop sequence
killing pid 9145
return code 0 after kill attempt
timeout is set to 60
timeout is now 59
timeout is now 58
timeout is now 57
timeout is now 56
timeout is now 55
timeout is now 54
timeout is now 53
timeout is now 52
timeout is now 51
timeout is now 50
timeout is now 49

From looking at the code it seems to me that it will never break out of that while loop and will not be able to delete the lock file. Am I interpreting this wrong? I checked the same file on my other server and it uses the same code. I am dumbfounded.

EDIT:
In the while loop part

 /bin/kill -0 "$MYSQLPID" >/dev/null 2>&1 || break

For some reason it is not recognizing the return code. When service mysqld stop is called, the process has already been killed but not sure why it isn't allowing the loop to break out.

EDIT:
Further testing shows some weird behavior between calling /bin/kill and just calling kill, they apparently return different codes, why??????:

[root@server]# /bin/kill 25200
kill 25200: No such process
[user@server]# echo ${?}
0
[root@server]# kill 25200
-bash: kill: (25200) - No such process
[root@server]# echo ${?}
1

EDIT: I logged on as a non-root user and tried executing "kill" and "/bin/kill" with a surprising results:

[notroot@server ~]$ kill -0 23232
-bash: kill: (23232) - No such process
[notroot@server ~]$ echo $?
1
[notroot@server ~]$ /bin/kill -0 23232
kill 23232: No such process
(No info could be read for "-p": geteuid()=501 but you should be root.)
[notroot@server ~]$ echo $?
0

The "No info could be read" error does not show up in my other servers when executing kill and bin/kill as a nonroot user.

EDIT: Added logging described by quanta and also checked out the mysql log:

After a start and stop, mysql log shows this:

110918 00:11:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
110918  0:11:28 [Note] Plugin 'FEDERATED' is disabled.
110918  0:11:28  InnoDB: Initializing buffer pool, size = 16.0M
110918  0:11:28  InnoDB: Completed initialization of buffer pool
110918  0:11:29  InnoDB: Started; log sequence number 0 44233
110918  0:11:29 [Warning] 'user' entry 'root@server' ignored in --skip-name-resolve mode.
110918  0:11:29 [Warning] 'user' entry '@server' ignored in --skip-name-resolve mode.
110918  0:11:29 [Note] Event Scheduler: Loaded 0 events
110918  0:11:29 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.58-ius'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Distributed by The IUS Community Project
110918  0:11:34 [Note] /usr/libexec/mysqld: Normal shutdown

110918  0:11:34 [Note] Event Scheduler: Purging the queue. 0 events
110918  0:11:34  InnoDB: Starting shutdown...
110918  0:11:39  InnoDB: Shutdown completed; log sequence number 0 44233
110918  0:11:39 [Note] /usr/libexec/mysqld: Shutdown complete

110918 00:11:39 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

Then in the tmp/mysql.log:

kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process

I stopped the stop process midway so I don't have to wait for the timeout. Looks like the process was killed. The problem, I think, is still with the different return codes from "kill" and "/bin/kill"

Best Answer

First things first: a very well-done, systematic and thorough debugging, good job.

On my RHEL 5.6 box I always get a return code of 1 if I try to kill a non-existing pid. I tried as both root and a non-privileged user, both with full path and a with just the command name. I also get only terse kill XXX: No such process, with no elaborate error messages.

It may be a good idea to run rpm -Vv util-linux and see if somebody didn't replace /bin/kill with a new and improved version. Even if rpm verification says the file is pristine, I'd try renaming /bin/kill and copying over a binary from a working machine. If the file replacement helps and you don't uncover a legitimate the source of the change, then regardless of output of rpm verification I'd assume the machine was compromised.

Related Topic