Discussion:
How to debug system reboot/shutdown procedure?
Miroslav Zubcic
2006-11-26 13:27:22 UTC
Hi all.

After ~10 years of using Linux on the "desktop", I have decided to give Solaris 10
a chance on my new nForce4/AMD Opteron workstation, just for the sake of change.
BTW, I'm using Solaris on a couple of company servers (UltraSPARC and x86_64)
without a problem.

I have SYSV init related problems with Solaris 10:

When I shut the system down with "init 0" or "shutdown -g0 -i5 -y", the init
system often fails to power off (or reboot, if rebooting) the machine, leaving me
with half of the services still running and a console login. Then, if I type the
root password, log in, and give the command "who -r", it shows me that I am in
runlevel 5 (or 6 if I wanted a reboot). After that, I can type the same shutdown
command again; it prints the usual warning and then - nothing happens. I'm still
at the console, logged in as root, able to type commands. The only way to shut
down is to change the runlevel to something else (reboot if it was a shutdown
initially, single-user, ...) and then repeat the initial shutdown/reboot command -
after that, the system will (usually) do what I told it to do in the first place.

So, where should I put 'set -x' or 'exec >> /tmplog 2>&1'? I have read
/etc/rc[0,6] and it doesn't look like there is anything for debugging there.
There is no trace of any error in /var/adm/messages either ... nothing. How about
'truss -f -o bla init 0'? Maybe that is too much. :-)
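For what it's worth, the kind of hook being asked about can be sketched like this (the log path and the echo are placeholders for illustration, not the real contents of /etc/rc0 or an rc script):

```shell
#!/bin/sh
# Sketch of the debugging hooks asked about above: redirect everything
# an rc script prints to a log file and trace each command with set -x.
# The log path and the echo body are placeholders, not a real rc script.
LOG=/tmp/rc-debug-demo.log
: > "$LOG"
(
    exec >> "$LOG" 2>&1   # capture stdout and stderr from here on
    set -x                # print each command before it runs
    echo "stopping services"
)
# trace lines (prefixed with '+') and normal output are now in $LOG
```

Dropping the `exec` and `set -x` lines near the top of a suspect script captures everything it does, which is usually enough to see where it hangs.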

I have this problem on this machine (x86), and on one UltraSPARC machine in my
company (but not on the other 4 Solaris 10 machines where I have a root account).
So it looks like the bug is not arch-dependent.

Does anyone have a clue, hint, or suggestion? Thanks ...
--
Man is something that shall be overcome.
-- Friedrich Nietzsche
Andreas F. Borchert
2006-11-26 15:32:30 UTC
Post by Miroslav Zubcic
When I shutdown system with "init 0" or shutdown -g0 -i5 -y, init system often
fails to power off (or reboot if I reboot it) machine, leaving me with half of
services running and console login.
This sounds as if the shutdown procedure never completed.
Did something hang?
Post by Miroslav Zubcic
Then, if I type root passwd, log in, and
give the command "who -r", it shows me that I am in runlevel 5 (or 6 if I wanted
to reboot). After that, I can type the same command again, shutdown command
prints usual warning and - nothing happens - I'm still in the console, logged as
root, able to type commands. Only way to shut down is to change runlevel to
something else (reboot if it was shutdown initialy, single-user ...) and then
repeat initial shutdown/reboot command - after that, system will (usually) do
what I told it to do in the first place.
You can still bypass the whole runlevel machinery and invoke poweroff, halt,
or reboot directly. In that case no regular shutdown procedure takes place;
your file systems are merely synced and a short notice is passed to syslogd --
and even that can be suppressed.

Andreas.
Miroslav Zubcic
2006-11-26 16:57:55 UTC
Post by Andreas F. Borchert
This sounds as if the shutdown procedure never completed.
Did something hang?
It looks like that, but there is nothing suspicious in the process table, and
'svcs -x' produces no output either.
Post by Andreas F. Borchert
Post by Miroslav Zubcic
Only way to shut down is to change runlevel to
something else (reboot if it was shutdown initialy, single-user ...) and then
repeat initial shutdown/reboot command - after that, system will (usually) do
what I told it to do in the first place.
You are still able to ignore the whole runlevel stuff and to invoke
directly poweroff, halt, or reboot.
Yes, this works, but it is ugly. I have PostgreSQL and other software that
doesn't like this.
Post by Andreas F. Borchert
This means that no regular shutdown
procedure will take place, just your file systems will be synced and a
short notice will be passed to syslogd -- and even this can be suppressed.
I do not want errors to be suppressed. I *want* to see errors logged when they
happen.
Andreas F. Borchert
2006-11-26 18:34:03 UTC
Post by Miroslav Zubcic
Post by Andreas F. Borchert
This sounds as if the shutdown procedure never completed.
Did something hang?
It looks like that, but there is nothing suspicious in process table, and also
'svcs -x' doesn't have any output.
Is /sbin/rc[0-6] still running?

Andreas.
Miroslav Zubcic
2006-11-27 08:30:32 UTC
Post by Andreas F. Borchert
Post by Miroslav Zubcic
It looks like that, but there is nothing suspicious in process table, and also
'svcs -x' doesn't have any output.
Is /sbin/rc[0-6] still running?
No. Only /usr/openwin/bin/sys-suspend -h -d :0 is running, and that is the only
difference in the process table.

BTW, I get

Nov 27 02:33:31 hegel ip: [ID 646971 kern.notice] ip_create_dl: hw addr length = 0

on the console and in syslog when I start shutdown or init 0/6, but it doesn't
tell me much about what is wrong, except that networking is involved.
Miroslav Zubcic
2006-11-29 10:11:23 UTC
Post by Miroslav Zubcic
Post by Andreas F. Borchert
Post by Miroslav Zubcic
It looks like that, but there is nothing suspicious in process table, and also
'svcs -x' doesn't have any output.
Is /sbin/rc[0-6] still running?
No. Only /usr/openwin/bin/sys-suspend -h -d :0 is started, and this is all what
is different in process table.
BTW I have
Nov 27 02:33:31 hegel ip: [ID 646971 kern.notice] ip_create_dl: hw addr length = 0
On console and syslog when I start shutdown or init 0/6 this is logged, but it
doesnt tells me much about what is wrong, except that networking is involved.
Solved in a way ...

I have a DSL PPPoE client from rp-pppoe (the Solaris pppoe doesn't support MSS
clamping) and one simple init script linked with S and K links across the
rcX.d runlevels:

#!/bin/sh

case $1 in
start)
        /opt/lsw/sbin/pppoe-start &
        sleep 12 && /opt/tools/bin/fwdyn &
        ;;
stop)
        /opt/lsw/sbin/pppoe-stop
        ;;
status)
        /opt/lsw/sbin/pppoe-status
        ;;
*)
        echo "?" ; exit 1
        ;;
esac

If I stop the pppoe daemon with this script before shutdown, shutdown proceeds
normally. The init script for rp-pppoe does its job fine, and pppoe and the
Solaris pppd are shut down. However, if the exit status of the script
(/opt/lsw/sbin/pppoe-stop) is non-zero for whatever reason (killing the daemon
twice to be sure, for example), the SysV init system gets confused and does not
proceed with the shutdown.

I have put "exit 0" at the end of the script as a workaround, and now the
machine reboots or shuts down (almost always) without problems.
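The workaround amounts to a stop method that swallows the helper's status. A minimal sketch (pppoe_stop below is a stand-in for /opt/lsw/sbin/pppoe-stop; the non-zero return simulates the double-kill failure case):

```shell
#!/bin/sh
# Sketch of the "exit 0" workaround: tolerate a failing stop command so
# the rc machinery always sees a clean status. pppoe_stop is a stand-in
# for /opt/lsw/sbin/pppoe-stop; returning 4 simulates a failed re-kill.
pppoe_stop() {
    return 4
}
do_stop() {
    pppoe_stop || :   # swallow any failure from the real stop script
    return 0          # report success unconditionally
}
do_stop
status=$?
echo "stop method reported: $status"
```

The `|| :` idiom is the portable Bourne-shell way to discard a command's failure without disturbing anything else.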

The SysV shutdown process on Solaris is IMHO very fragile. It shouldn't stall
unrecoverably just because one script does not return exit status 0.

I say "almost always" because sometimes /opt cannot be unmounted at the end of
the shutdown process because it is busy. I would like to know: does Solaris try
to unmount partitions repeatedly, does it then try to remount them read-only,
maybe lockfs them, or will the system be rebooted with some partitions still
mounted? It can work for the next 500 boots, but on boot 501 I may find myself
in single-user mode recovering a corrupt fs.
Darren Dunham
2006-12-04 07:51:56 UTC
Post by Miroslav Zubcic
Solved in a way ...
I have DSL pppoe client from rp-pppoe (Solaris pppoe doesn't support MSS
clamping) and one simple init script linked with S and K links across
#!/bin/sh
case $1 in
start)
/opt/lsw/sbin/pppoe-start &
sleep 12 && /opt/tools/bin/fwdyn &
;;
stop)
/opt/lsw/sbin/pppoe-stop
;;
status) /opt/lsw/sbin/pppoe-status
;;
*) echo "?" ; exit 1
;;
esac
If i stop pppoe daemon with this script before shutdown, shutdown will
proceed normaly. Init script for rp-pppoe is doing it's job ok, and pppoe
and Solaris pppd will be shut down. However, if exit status from script
(/opt/lsw/sbin/pppoe-stop) is for whatever reason (killing daemon twice
for example to be sure) not 0, SYSV init system is getting confused and
does not proceed with shutdown.
I have put "exit 0" on the end of the script as a workaround, and now
machine is rebooted or shutdown (almost always) without problem.
SYSV process on Solaris is IMHO very fragile. It shouldn't stop
unrecoverably because one script is not giving exist status 0.
I'm unable to duplicate this behavior. I've created scripts that sleep
for a few seconds and then exit 4.

The milestone logs have nice messages about the startup script exiting
with code 4, and then things continue. No services stay offline.

[...]
[ Dec 3 23:43:12 Executing stop method
[ Dec 3 23:43:12 Executing stop method (null) ]
[ Dec 3 23:43:51 Enabled. ]
[ Dec 3 23:44:01 Executing start method ("/sbin/rc2 start") ]
Executing legacy init script "/etc/rc2.d/S20sysetup".
Legacy init script "/etc/rc2.d/S20sysetup" exited with return code 0.
Executing legacy init script "/etc/rc2.d/S20test".
Sleeping for 20 seconds.
exiting with code 4
Legacy init script "/etc/rc2.d/S20test" exited with return code 4.
Executing legacy init script "/etc/rc2.d/S70uucp".
Legacy init script "/etc/rc2.d/S70uucp" exited with return code 0.
Executing legacy init script "/etc/rc2.d/S72autoinstall".
Legacy init script "/etc/rc2.d/S72autoinstall" exited with return code 0.
Executing legacy init script "/etc/rc2.d/S73cachefs.daemon".
Legacy init script "/etc/rc2.d/S73cachefs.daemon" exited with return code 0.
Executing legacy init script "/etc/rc2.d/S89PRESERVE".
Legacy init script "/etc/rc2.d/S89PRESERVE" exited with return code 1.
[...]
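The S20test probe in the log above presumably looked something like this; the sleep is shortened here and the exit is captured in a variable so the sketch can run inline (the real script, per the log, slept 20 seconds and exited 4 directly):

```shell
#!/bin/sh
# Rough reconstruction of the S20test probe Darren describes: pause,
# then return a non-zero code, to check that the rc machinery tolerates
# a "failing" legacy script and keeps going.
probe() {
    echo "Sleeping for 2 seconds."
    sleep 2
    echo "exiting with code 4"
    return 4
}
probe
probe_status=$?
```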
Post by Miroslav Zubcic
I say "always", because sometimes /opt cannot be unmounted on the end of
shutdown process because it is busy. I will like to know does Solaris
tries to umount partitions repeadetly, does it try after that to remount
partitions read only, maybe lockfs them or what ... or system will be
rebooted with some partitions still mounted?
Solaris will try to unmount some filesystems just as an artifact of
stopping certain services. For example, as the automounter stops, it tries
to unmount its filesystems. However, filesystems in use cannot be
unmounted at that point.

As the kernel completes shutdown, it will sync and unmount any
filesystems.
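One way to chase a "busy" unmount like the ones discussed here is fuser, which lists processes with files open under a mount point. A hypothetical little wrapper (the helper name is made up; `fuser -c` is the mount-point form on Solaris):

```shell
#!/bin/sh
# Hypothetical helper: report what keeps a mount point busy.
# fuser -c lists PIDs holding files open under the mount; an empty
# result means nothing should block the unmount.
busy_check() {
    mnt=$1
    pids=`fuser -c "$mnt" 2>/dev/null`
    if [ -n "$pids" ]; then
        echo "busy: $mnt ($pids)"
    else
        echo "idle: $mnt"
    fi
}
busy_check /tmp
```

Feeding the reported PIDs to `ps -o pid,args -p` then shows which program (a stray gconfd-2, say) is pinning the filesystem.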
Post by Miroslav Zubcic
It can work for the next boot
500 times, but 501 time I can find myself in single user mode recovering
corruput fs.
If you're getting a corrupt filesystem, something is *very* wrong. This
is Solaris 10, so I presume you're using UFS logging by default?
Perhaps a device error of some type?
--
Darren Dunham ***@taos.com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
Miroslav Zubcic
2006-12-04 10:06:20 UTC
Post by Darren Dunham
I'm unable to duplicate this behavior. I've create scripts that sleep
for a few seconds, then exit 4.
Hm. My script works perfectly when called from the command line, as in
/etc/init.d/pppoe stop
No errors, the pppoe daemon shuts down, and the exit status is 0 (even if I
remove the explicit "exit 0" from the end of the script).

I was wrong about the exit status - it didn't solve my problem. The only
workaround is to run pppoe stop from the command line before running
shutdown -g 0 -i 5 -y.
Post by Darren Dunham
The milestone logs have nice messages about the startup script exiting
with code 4, then they go and continue. No services stay offline.
Thanks for the info! I wasn't aware of the existence of these logs in /var/svc/log.
Post by Darren Dunham
[...]
[ Dec 3 23:43:12 Executing stop method
[ Dec 3 23:43:12 Executing stop method (null) ]
[ Dec 3 23:43:51 Enabled. ]
Hm hm ... My init logs look good too:

stu 22 02:33:45 Executing /sbin/rc5 stop
Executing legacy init script "/etc/rc0.d/K03samba".
Legacy init script "/etc/rc0.d/K03samba" exited with return code 0.
Executing legacy init script "/etc/rc0.d/K05volmgt".
Legacy init script "/etc/rc0.d/K05volmgt" exited with return code 0.
lsvcrun: Service matching "/etc/rc0.d/K06mipagent" doesn't seem to be running.

[snip, skip ...]

Executing legacy init script "/etc/rc0.d/K40pppoe".
Killing pppd (486)
Terminated
Killing pppoe-connect (474)

It looks like everything is OK, but it's not - this is the LAST log record. No
services are stopped AFTER this script runs. Strange, huh? But if I run this
script from the command line just before I type shutdown, then I get this:

Executing legacy init script "/etc/rc0.d/K40pppoe".
pppoe-stop: The pppoe-connect script (PID 465) appears to have died

... and then the shutdown continues OK.

I do not know what the f...
Post by Darren Dunham
Solaris will try to unmount some filesystems just as an artifact of
stopping certain services. Like as the automounter stops, it tries to
unmount those filesystems. However filesystems in use cannot be
unmounted at that point.
I have my own services in /opt/lsw, and they have proper SMF and init scripts;
they are terminated during the shutdown process. I suspect something nasty like
gconfd-2 holding a file descriptor (Firefox starts it; I do not use GNOME), or
something like that. Maybe the shutdown procedure should send SIGKILL at the
end if any services or programs remain alive?
Post by Darren Dunham
As the kernel completes shutdown, it will sync and unmount any
filesystems.
Hm. That would be good, but the last messages on the console say the filesystems
were busy.
Post by Darren Dunham
Post by Miroslav Zubcic
It can work for the next boot
500 times, but 501 time I can find myself in single user mode recovering
corruput fs.
If you're getting a corrupt filesystem, something is *very* wrong.
No, I haven't yet been in a situation requiring an fsck session. But I'm /afraid/
that some day it will happen, despite logging/journaling, if the shutdown
procedure fails to cleanly unmount all filesystems. Usually /opt and /export are
reported as busy. That is, the console prints "in.ndpd terminated" (or something
like that), then nothing for 1 minute, and after that "/opt busy", "/export busy"
... I can see that on the console for one second before the system is powered off
or rebooted.
Post by Darren Dunham
This is solaris 10, so I presume you're using UFS logging by default?
Perhaps a device error of some type?
Naah ... Sorry, I was not very clear in my first description. I mentioned fsck
as an option because even a logging filesystem can sometimes become corrupted -
once in 300 reboots maybe, I think.
Darren Dunham
2006-12-04 22:50:40 UTC
Post by Miroslav Zubcic
Executing legacy init script "/etc/rc0.d/K40pppoe".
Killing pppd (486)
Terminated
Killing pppoe-connect (474)
It looks like everything is OK, but it's not - it is LAST log record. No
services are stopped AFTER this script runs. Strange huh? But if I run this
Executing legacy init script "/etc/rc0.d/K40pppoe".
pppoe-stop: The pppoe-connect script (PID 465) appears to have died
... and then shutdown continues ok.
I do not know what the f...
This is just a wild guess, but my first assumption is that the K40pppoe
script is trying to kill something, and it might be killing too much
(like one of the shutdown scripts or something...). If you run it by
hand, there's no parent script, so the problem does not arise.

Just something to investigate. You might add a 'set -x' to it to see if
the extra logging is helpful.
Post by Miroslav Zubcic
Post by Darren Dunham
Solaris will try to unmount some filesystems just as an artifact of
stopping certain services. Like as the automounter stops, it tries to
unmount those filesystems. However filesystems in use cannot be
unmounted at that point.
I have my own services in /opt/lsw and they have proper smf and init scripts.
They are terminated in shutdown process. I suspect maybe some nasty thing like
gconfd-2 is holding file descriptor (Firefox runs it, I do not use gnome) or
something like that. Maybe shutdown procedure should send SIGKILL on the end if
any services or programs remained alive?
The kernel will reach a point where no programs run anyway.
Post by Miroslav Zubcic
Post by Darren Dunham
As the kernel completes shutdown, it will sync and unmount any
filesystems.
Hm. This will be good, but last messages on the console are that filesystems
were busy.
That should still be okay.
Post by Miroslav Zubcic
Post by Darren Dunham
Post by Miroslav Zubcic
It can work for the next boot
500 times, but 501 time I can find myself in single user mode recovering
corruput fs.
If you're getting a corrupt filesystem, something is *very* wrong.
No I wasn't in situation to have fsck session. But I'm /afraid/ that
some time this will happen despite of logging/journaling if shutdown
procedure fails to cleanly umount all filesystems. Usually /opt and
/export are reported as busy. That is, it prints on the console
"in.ndpd terminated ... (or something like that) and then nothing for
1 minute, and after that "/opt busy", "/export busy" ... I can see
that on the console for one second before system is powered off or
rebooted.
I believe those are still just messages from user programs/scripts
trying to unmount. The final unmount as the kernel stops should still
succeed.
Post by Miroslav Zubcic
Post by Darren Dunham
This is solaris 10, so I presume you're using UFS logging by default?
Perhaps a device error of some type?
Naah ... Sorry. I was not very clear in my first description. I
mentioned fsck as an option, because even filesystem which is logging
can become sometimes corrupted - for example once in 300 reboots maybe
- I think.
Anything can become corrupted, but there should be a reason: hardware
errors, kernel or filesystem bugs, etc. Number of reboots shouldn't
have anything to do with it.

Make sure you don't try to fsck filesystems after they've been mounted
read/write. Even when you do a single-user boot, several filesystems
may already be mounted read/write and it is not safe to run fsck on
them.
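This precaution can be sketched as a guard that consults the mount table before any fsck. The helper name and the MNTTAB override are inventions for illustration (on Solaris the table is /etc/mnttab; options live in the fourth field):

```shell
#!/bin/sh
# Sketch of the precaution above: refuse to fsck a device whose
# filesystem is currently mounted read/write. /etc/mnttab is the
# Solaris mount table; the MNTTAB variable exists only so this
# sketch can be pointed at a test file.
MNTTAB=${MNTTAB:-/etc/mnttab}
is_mounted_rw() {
    dev=$1
    # mnttab fields: device, mount point, fstype, options, time
    awk -v d="$dev" '$1 == d && $4 ~ /(^|,)rw(,|$)/ { found = 1 }
                     END { exit !found }' "$MNTTAB"
}
# usage: is_mounted_rw /dev/dsk/c0t0d0s7 && echo "do not fsck this"
```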
Miroslav Zubcic
2006-12-12 18:41:14 UTC
(Sorry for the late answer, Darren; I was busy ...)
Post by Darren Dunham
This is just a wild guess, but my first assumption is that the K40pppoe
script is trying to kill something, and it might be killing too much
(like one of the shutdown scripts or something...). If you run it by
hand, there's no parent script, so the problem does not arise.
I thought of that too; it is natural to suspect something like that.
However, I double-checked K40pppoe AND /opt/lsw/sbin/pppoe-stop, including the
pppoe.conf file that pppoe-stop sources for shell variables. Nothing there.
It kills pppoe-start, pppoe-connect and the pppd daemon, and it does so
correctly - in pure Bourne shell, with if...then...else checks, in a safe way.
Post by Darren Dunham
Just something to investigate. You might add a 'set -x' to it to see if
the extra logging is helpful.
Last weekend I played with this problem again. Temporarily, I modified
/sbin/rc6 like this:

exec >> /root/rc6.stdout
exec 2>> /root/rc6.stderr

...

for f in /etc/rc0.d/K*; do
        f2="`basename $f`"
        if [ -s $f ]; then
                case $f in
                *.sh)   /lib/svc/bin/lsvcrun -s $f stop ;;
                *)      /usr/bin/truss -o /root/bla.$f2 \
                            -v all -f /lib/svc/bin/lsvcrun $f stop ;;
                esac
        fi
done

Here are the results:

rc6.stdout:
...
Legacy init script "/etc/rc0.d/K34ncalogd" exited with return code 0.
Executing legacy init script "/etc/rc0.d/K40pppoe".
Killing pppd (487)
Killing pppoe-connect (478)

rc6.stderr:
...
lsvcrun: Service matching "/etc/rc0.d/K34ncalogd" doesn't seem to be running.
+ basename /etc/rc0.d/K40pppoe
f2=K40pppoe
+ [ -s /etc/rc0.d/K40pppoe ]
+ /usr/bin/truss -o /root/bla.K40pppoe -v all -f /lib/svc/bin/lsvcrun
/etc/rc0.d/K40pppoe stop

No errors, but the pppoe stop is the last script executed. After that the
reboot process dies.

Here is the relevant part of bla.K40pppoe - the truss(1) output:

fstat64(1, 0x08046F70) = 0
write(1, " E x e c u t i n g l e".., 52) = 52
schedctl() = 0xFEFC1000
fork1() = 2095
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
close(5) = 0
read(3, 0x08047CFC, 1) = 0
waitid(P_PID, 2095, 0x08047BF0, WEXITED|WTRAPPED) (sleeping...)
Received signal #15, SIGTERM, in waitid() [default]
siginfo: SIGTERM pid=487 uid=0

I repeat: the pppoe scripts kill nothing but the PIDs recorded in /var/run by
pppd, pppoe, etc. PID 487 was pppd, NOT lsvcrun.

At the end of pppoe-stop (I trussed it too) there is:

1966: kill(487, SIGTERM) = 0

488 was NOT the PID of the lsvcrun process.

I think something is wrong with this undocumented lsvcrun binary. The PID
number is small because it was created during the boot procedure (the pppoe
software forks Sun's pppd(1M) ...).

If somebody is interested, the problem can be reproduced: download and compile
rp-pppoe from http://www.roaringpenguin.com/files/download/rp-pppoe-3.8.tar.gz,
configure it for a normal DSL line, and put a simple stop/start init script in
the runlevels. I can reproduce this problem on two machines - an x86 at home
and one older UltraSPARC.

I really don't know how to debug this. I have 10 years of Linux/Unix behind me;
I'm not a beginner ... I asked a colleague and described the problem to him; we
even looked at the scripts on the UltraSPARC system together - nothing ... a
total mystery.
James Carlson
2006-12-13 12:29:35 UTC
Post by Miroslav Zubcic
fstat64(1, 0x08046F70) = 0
write(1, " E x e c u t i n g l e".., 52) = 52
schedctl() = 0xFEFC1000
fork1() = 2095
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
close(5) = 0
read(3, 0x08047CFC, 1) = 0
waitid(P_PID, 2095, 0x08047BF0, WEXITED|WTRAPPED) (sleeping...)
Received signal #15, SIGTERM, in waitid() [default]
siginfo: SIGTERM pid=487 uid=0
I repeat: pppoe scripts are NOT killing anything other than pids from /var/run
from pppd, pppoe etc ... PID 487 was pppd, NOT lsvcrun.
What options are you using with pppd?

By default, pppd kills everything in its process group if there's a
"connect" or "disconnect" script running when SIGTERM is received. If
you're using the "nodetach" option for something that ought to be
running as a daemon, then you could have strange side-effects like
this.
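The mechanism James describes rests on process groups. A quick generic illustration (nothing pppd-specific) of why it matters: a background helper shares its parent's process group, so a group-wide TERM reaches both.

```shell
#!/bin/sh
# Generic illustration of the point above: a background child shares
# its parent's process group, so a signal sent to the whole group
# (kill with a negative PID, as pppd can do around its connect and
# disconnect scripts) would hit both processes.
sleep 5 &
child=$!
parent_pg=`ps -o pgid= -p $$ | tr -d ' '`
child_pg=`ps -o pgid= -p $child | tr -d ' '`
echo "parent pgid=$parent_pg child pgid=$child_pg"
kill "$child" 2>/dev/null   # tidy up the demo child
```

If lsvcrun and the K40pppoe script end up in the same group as pppd, a group kill from pppd would explain the rc sequence dying exactly there.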
--
James Carlson, KISS Network <***@sun.com>
Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Miroslav Zubcic
2006-12-18 19:00:04 UTC
Post by James Carlson
Post by Miroslav Zubcic
waitid(P_PID, 2095, 0x08047BF0, WEXITED|WTRAPPED) (sleeping...)
Received signal #15, SIGTERM, in waitid() [default]
siginfo: SIGTERM pid=487 uid=0
I repeat: pppoe scripts are NOT killing anything other than pids from /var/run
from pppd, pppoe etc ... PID 487 was pppd, NOT lsvcrun.
What options are you using with pppd?
Nothing suspicious:

(ps -ef)
/usr/bin/pppd pty /opt/lsw/lib/pppoe/pppoe -p /var/run/pppoe.conf-pppoe.pid.ppp

$ cat /etc/ppp/peers/htnet-dsl
persist
user "xxxxx"
noauth
noipdefault
defaultroute
noaccomp
noccp
nobsdcomp
nodeflate
nopcomp
novj
novjccomp
Post by James Carlson
By default, pppd kills everything in its process group if there's a
"connect" or "disconnect" script running when SIGTERM is received. If
you're using the "nodetach" option for something that ought to be
running as a daemon, then you could have strange side-effects like
this.
Nope.

$ ptree `pgrep pppd`
491 /bin/sh /opt/lsw/sbin/pppoe-connect
498 /usr/bin/pppd pty /opt/lsw/lib/pppoe/pppoe -p
/var/run/pppoe.conf-pppoe.pid.ppp
528 sh -c /opt/lsw/lib/pppoe/pppoe -p /var/run/pppoe.conf-pppoe.pid.pppoe
-I elxl0
529 /opt/lsw/lib/pppoe/pppoe -p /var/run/pppoe.conf-pppoe.pid.pppoe -I
elxl0 -T 80
Darren Dunham
2006-12-13 17:11:41 UTC
Post by Miroslav Zubcic
...
lsvcrun: Service matching "/etc/rc0.d/K34ncalogd" doesn't seem to be running.
+ basename /etc/rc0.d/K40pppoe
f2=K40pppoe
+ [ -s /etc/rc0.d/K40pppoe ]
+ /usr/bin/truss -o /root/bla.K40pppoe -v all -f /lib/svc/bin/lsvcrun
/etc/rc0.d/K40pppoe stop
No errors, but pppoe stop is the last script executed. After that
reboot process dies.
That's pretty odd.

Some other ideas:

Can you recreate this in a zone? I think the lsvcrun stuff should be
similar there. You'd have the advantage of being able to test the
shutdown behavior without actually shutting down the machine. And you
could truss running processes live from the global zone.

I'd actually be more interested at this point in trussing the shutdown
shell (lsvcrun parent). It must be dying. Does it do so through an
exit or some signal?
m***@gmail.com
2006-12-04 08:20:03 UTC
Hi all,

I have been using Solaris 10 on x86 with 1 GB of RAM as a desktop (KDE) for
regular use, and I have a similar problem.

I have to use poweroff, as shutdown and init don't work properly in KDE.
But in JDS (Java Desktop System), the graphical shutdown gives the correct
result; I suspect it uses poweroff.

What should I do to shut down my machine properly in KDE?
Darren Dunham
2006-12-04 22:52:09 UTC
Post by m***@gmail.com
Hi all,
I have been using solaris 10 on x86, 1GB RAM as a desktop (KDE) for
regular use.I do have similar problem.
i hav to use poweroff . as shutdown and init dont work properly in KDE.
Can you give more information about what they do instead? What behavior
do you see? What command line do you run? Do any 'rc' scripts appear
in 'ps' output?
Tim Bradshaw
2006-11-26 18:19:49 UTC
Post by Miroslav Zubcic
Has anyone some clue, hint, sugestion? Thx ...
Chances are it is some script which is hanging on the way down. I'd
use ptree etc to find it (it should be fairly easy to locate whatever
runs the scripts, though it is probably not init now but some SMF
thing), and then poke at it to see what it's waiting for.
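ptree is the Solaris tool Tim means. Where it is not at hand, a rough portable equivalent can be sketched with plain ps (the function name is made up for this sketch):

```shell
#!/bin/sh
# Portable sketch of what ptree shows: walk a process's ancestry with
# ps, to find which parent (init, an rc script, an SMF method) is
# actually waiting on a hung shutdown script.
ancestry() {
    pid=$1
    while [ "$pid" -gt 1 ] 2>/dev/null; do
        ps -o pid= -o comm= -p "$pid"
        pid=`ps -o ppid= -p "$pid" | tr -d ' '`
    done
}
ancestry $$
```

Run against the PID of a stuck script, the chain usually names the hung ancestor directly; on Solaris, `ptree <pid>` gives the same picture in one command.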

--tim