[Ekhi-users] Ekhi queue down due to high temperature

Martin Gutierrez martin00gutierrez at gmail.com
Mon Oct 19 10:47:01 CEST 2020


Hi Iñigo, after discussing with the group, we identified some issues with the "fair share" of the queue. We think they are important and could be improved.

We would like the cluster usage history to be taken into account.
Right now priority seems to be based purely on job size (in terms of core and RAM usage). This means that a big project that can be divided into small pieces will block the cluster for everyone else who needs a single bulky job, even if they hardly ever use the cluster.
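
To illustrate what we mean, here is a minimal sketch in Python. It is not how the scheduler on Ekhi works; it only assumes that priority could combine a job-size term with a fair-share term that decreases as a user's recent usage grows. The function names, the weights and the 2^(-usage/share) form are our own assumptions for the example.

```python
def fair_share_factor(recent_usage: float, allotted_share: float) -> float:
    """Return a value in (0, 1]: 1.0 for no recent usage, lower for heavy users.

    recent_usage   -- fraction of the cluster the user consumed recently (assumed input)
    allotted_share -- fraction of the cluster the user is entitled to (assumed input)
    """
    if allotted_share <= 0:
        raise ValueError("allotted_share must be positive")
    return 2.0 ** (-recent_usage / allotted_share)


def job_priority(size_factor: float, recent_usage: float, allotted_share: float,
                 w_size: float = 1.0, w_fair: float = 2.0) -> float:
    """Weighted sum of a job-size term and a usage-history term (hypothetical weights)."""
    return w_size * size_factor + w_fair * fair_share_factor(recent_usage, allotted_share)


# A user who has not used the cluster submits one bulky job (low size factor).
idle_user = job_priority(size_factor=0.2, recent_usage=0.0, allotted_share=0.25)
# A heavy user submits yet another small job (high size factor).
heavy_user = job_priority(size_factor=1.0, recent_usage=0.75, allotted_share=0.25)
```

With these example weights the bulky job of the idle user ends up ahead of the small job of the heavy user, which is the behaviour we are asking for.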

We would like a single user's jobs to be executed in the order they were submitted to the queue, when possible.
Since the queue prioritizes small jobs, a user who has to run both long and short jobs will see the long jobs blocked by the prioritized small ones. So, when possible, a single user's jobs should be ordered by the date they were submitted to the queue.
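
One possible way to picture this, as a sketch only: keep the positions the scheduler gave to each user, but fill each user's positions with that user's jobs in submission order. The function name and the tuple layout below are hypothetical, chosen just for the example.

```python
from collections import defaultdict, deque


def reorder_per_user_fifo(queue):
    """Reorder jobs so that each user's jobs run in submission order.

    queue -- list of (user, job_id, submit_time) tuples in scheduler priority order.
    The slots assigned to each user are kept; only the jobs inside them are swapped.
    """
    jobs_by_user = defaultdict(list)
    for user, job_id, submit_time in queue:
        jobs_by_user[user].append((submit_time, job_id))

    # Oldest submission first for every user.
    fifo = {user: deque(sorted(jobs)) for user, jobs in jobs_by_user.items()}

    reordered = []
    for user, _, _ in queue:
        submit_time, job_id = fifo[user].popleft()
        reordered.append((user, job_id, submit_time))
    return reordered


# The scheduler put ana's short jobs ahead of her older long job.
queue = [("ana", "short1", 3), ("bo", "b1", 1), ("ana", "long1", 1), ("ana", "short2", 4)]
fixed = reorder_per_user_fifo(queue)
```

Other users are not affected: bo's job stays in the same position, and only the order among ana's own jobs changes.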

Reserve resources when a job has been at top priority for a long time.
Most of our jobs need less than one node. A two-node job would get stuck until two nodes are free at the same time. In our case the cluster is full most of the time, meaning the job would remain stuck in the queue for a long time even if prioritized.
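
As a rough sketch of the idea (not the actual scheduler logic): once idle nodes are held for the top-priority job, other jobs may still start on them only if they finish before the held job is expected to start. All names and parameters below are hypothetical.

```python
def can_start_now(job_nodes: int, job_hours: float, idle_nodes: int,
                  held_nodes: int, hours_until_held_job_starts: float) -> bool:
    """Decide whether a waiting job may start without delaying the held job.

    idle_nodes -- nodes that are currently free
    held_nodes -- how many of those free nodes are being kept for the top-priority job
    """
    if job_nodes > idle_nodes:
        return False
    # Nodes that are not held can be used freely.
    if job_nodes <= idle_nodes - held_nodes:
        return True
    # Otherwise the job has to finish before the held job is due to start.
    return job_hours <= hours_until_held_job_starts
```

For example, with one idle node held for a two-node job that should start in five hours, a two-hour job can still use that node, while a ten-hour job has to wait.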

Each user should be allowed four nodes for their long jobs, regardless of the number of long jobs already running.
Although we expected the cluster to receive mainly short jobs, right now some of us are dealing with bulky and expensive calculations all at the same time.
If someone is already using the four nodes, the rest of the users are forced to wait even if the cluster is free.
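
A minimal sketch of the rule we have in mind, assuming the current four-node limit for long jobs is shared by all users and would become a per-user limit instead. The function and argument names are hypothetical.

```python
def may_start_long_job(user: str, job_nodes: int, long_nodes_in_use: dict,
                       idle_nodes: int, per_user_cap: int = 4) -> bool:
    """Allow a long job if the user stays within their own cap and nodes are free.

    long_nodes_in_use -- maps each user to the nodes their long jobs already occupy
    """
    already_used = long_nodes_in_use.get(user, 0)
    return job_nodes <= idle_nodes and already_used + job_nodes <= per_user_cap


# ana already fills her four nodes; bo has no long jobs running.
in_use = {"ana": 4}
bo_allowed = may_start_long_job("bo", 2, in_use, idle_nodes=10)
ana_allowed = may_start_long_job("ana", 1, in_use, idle_nodes=10)
```

Here bo can start a long job on the free nodes even though ana has reached her own limit.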

On Aug 21 2020, at 10:32 pm, Inigo Aldazabal Mensa <inigo.aldazabalm at ehu.eus> wrote:
> Hi all,
>
> It seems you are having some problems running jobs on ekhi. But some
> don't, as you can see with "squeue".
>
> In order to troubleshoot your problems please report:
> 1.- Directory you did "sbatch" from
> 2.- slurm.script used
> 3.- JobID
>
> ekhi11 had some problems and it should be fixed now.
>
> Iñigo
>
> On Wed, 19 Aug 2020 18:50:42 +0200 Inigo Aldazabal Mensa <inigo.aldazabalm at ehu.eus> wrote:
>
> > Hi,
> >
> > Again we had problems with the CFM Data Center 2 cooling system
> > that forced me to cancel all running jobs and shut down all Ekhi
> > computing nodes :-((
> >
> > Currently I'm working remotely most of the time (in fact, again, I
> > was on a free day today) but tomorrow I'll go to the CFM and talk to
> > the cooling system technicians to check this. Also, a new cooling
> > system to avoid these problems is on the way, but it is still a couple
> > of weeks from being installed.
> >
> > Depending on my talk with the technicians tomorrow I'll power back on
> > Ekhi computing nodes, but I can not say for sure. I'll keep you
> > informed.
> >
> > Bests (to say something),
> >
> > Iñigo
> >
