What to do if a partition hangs in an error state?


Sometimes a partition hangs in an error state and no new jobs can be started.

The output of llstat in this situation looks like:

<11> R10 13496 jzam0609 error 0 0 0 V  111   0.59
<12> R10 13498 jzam0609 error 0 0 0 V  111   0.50

Midplane usage:
---------------                       #nodes                                         
 +------------------------------------------+
 | R10-M1: <-->                           0 |
 | R10-M0: <-->                           0 |
 +------------------------------------------+    
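
To check for this situation automatically, the job rows of such output can be scanned for the state "error". The following is only a minimal sketch: the column order (index, partition, job id, user, state) is an assumption inferred from the two sample rows above, and the function name partitions_in_error is hypothetical, not part of llstat.

```python
import re

# Assumed row layout, inferred only from the sample rows above:
#   <index> partition job-id user state ...
ROW = re.compile(r"^<(\d+)>\s+(\S+)\s+(\d+)\s+(\S+)\s+(\S+)")


def partitions_in_error(llstat_output):
    """Return the partition names of all rows whose state column is 'error'."""
    found = []
    for line in llstat_output.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # skip headers, the midplane table, and blank lines
        partition, state = match.group(2), match.group(5)
        if state == "error" and partition not in found:
            found.append(partition)
    return found


sample = """\
<11> R10 13496 jzam0609 error 0 0 0 V  111   0.59
<12> R10 13498 jzam0609 error 0 0 0 V  111   0.50
"""
print(partitions_in_error(sample))  # ['R10']
```

The sketch only reads text that has already been captured; it does not call llstat itself, so it can be tried on saved output first.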

If a partition hangs in this error state, the problem can be solved by initiating a reallocation of the partition with:

    sched_bgl -b partition name

After 5 to 10 minutes the reservation system will reboot the partition and reallocate it for the user.
last change 07.04.2006 | Michael Stephan