[Ekhi-users] Ekhi SLURM configuration

Antonella Meninno ameninno001 at ikasle.ehu.eus
Wed Oct 28 16:59:30 CET 2020


Hi Iñigo,

I'd like to know whether there has been any change to the queue system, and
how it works now. In particular, I'd like to know whether the absolute
priority given to small jobs has been removed.

I would say that giving absolute priority to small jobs presents an issue:
the large jobs get stuck and never finish. I think this should change, in
order to give space to large jobs too.

Bests,

Antonella Meninno

On Mon, Oct 19, 2020 at 1:50 PM Inigo Aldazabal Mensa
<inigo.aldazabalm at ehu.eus> wrote:

> Hi Martin,
>
> (General note: please start a new thread for new topics, do not
> follow old ones!)
>
> Yes, I have to set up the queue system so that usage history is taken into
> account. I'll put this on my todo list and try to do it during this
> week.
>
> Technically I have to set up the "Slurm Accounting Database", which will
> allow us to use all the features of the "Multifactor Priority
> Plugin", and also "Quality of Service" limits (see below):
>
> https://slurm.schedmd.com/priority_multifactor.html
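>
> For the curious, the accounting setup in slurm.conf involves roughly the
> following (a sketch only; the host name and weight values are illustrative
> assumptions, not what will end up in Ekhi's configuration):
>
>   AccountingStorageType=accounting_storage/slurmdbd
>   AccountingStorageHost=ekhi             # host running slurmdbd (assumption)
>   JobAcctGatherType=jobacct_gather/cgroup
>   # with accounting data available, past usage can enter the priority:
>   PriorityWeightFairshare=10000
>   PriorityDecayHalfLife=7-0              # usage "forgotten" with a 7-day half-life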
>
> More below...
>
> On Mon, 19 Oct 2020 10:47:01
> +0200 Martin Gutierrez <martin00gutierrez at gmail.com> wrote:
>
> > Hi Iñigo, after discussing with the group we identified some issues with
> > the "free share" of the queue. We think they are important and
> > could be improved.
> >
> > We would like the cluster usage history to be taken into account.
> > Right now priority seems to be based purely on job size (in terms of
> > core and RAM usage). This means that a big project that can be divided
> > into small pieces will block the cluster for everyone else who
> > needs a single big bulky job, even if they hardly ever use the
> > cluster.
>
> Yes, now small jobs are prioritized (option PriorityFavorSmall=YES)
>
> https://slurm.schedmd.com/priority_multifactor.html#jobsize
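>
> Concretely, the relevant lines in our slurm.conf look roughly like this
> (the weight below is an illustrative value, not necessarily the exact one):
>
>   PriorityType=priority/multifactor
>   PriorityFavorSmall=YES           # small jobs get a larger job-size factor
>   PriorityWeightJobSize=1000       # how much that factor counts in the total priority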
>
> > We would like that, for a single user, jobs are executed in the order
> > they were submitted to the queue, when possible. Since the
> > queue prioritizes small jobs, a user who has to run both long and
> > short jobs will see the long jobs blocked by the prioritized small
> > ones. So, when possible, a single user's jobs should be ordered by
> > the date they were submitted to the queue.
>
> Strict ordering I see as very difficult to achieve, but *not* prioritizing
> small jobs is easy and just a matter of changing the previous option.
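>
> That change would just be something like the following in slurm.conf,
> plus a "scontrol reconfigure" to make slurmctld re-read the file (a
> sketch; the exact policy is still to be decided):
>
>   PriorityFavorSmall=NO            # stop boosting small jobs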
>
> > Reserving resources when a job has been at top priority for a long
> > amount of time. Most of our jobs use less than one node. A two-node
> > job would get stuck until two nodes are freed. In our case
> > the cluster is full most of the time, meaning the job would remain
> > stuck in the queue for a long time even if prioritized.
>
> Slurm should be clever enough to leave an "empty" node waiting in order
> to fit a two-node job. This should work mostly OK. Again, not giving
> higher priority to small jobs can improve this.
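>
> This is the backfill scheduler: it plans a reservation for the highest
> priority job and only lets lower-priority jobs jump ahead if they fit in
> the gap. The relevant knobs look roughly like this (illustrative values,
> not our exact ones):
>
>   SchedulerType=sched/backfill
>   # how far ahead (in minutes) backfill plans reservations
>   SchedulerParameters=bf_window=1440,bf_resolution=300,bf_continue
>
> Note that this only works well if jobs request realistic --time limits.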
>
> > Each user should be allowed up to four nodes for their long jobs,
> > regardless of the number of long jobs already running. Although we
> > expected the cluster to receive mainly short jobs, right now some of
> > us are dealing with bulky and expensive calculations all at the same
> > time. If someone is then using the four nodes, the rest of the users
> > are forced to wait even if the cluster is free.
>
> This kind of user limit can be applied (I still have to check exactly
> the level of detail that can be specified) using what in Slurm is
> called Quality of Service (QOS), but for this I first have to set up the
> Slurm Accounting Database:
>
> https://slurm.schedmd.com/qos.html#limits
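>
> As an example of the kind of limit this allows (the QOS name, user name
> and numbers are just illustrative, nothing is decided yet):
>
>   # create a QOS for long jobs and cap each user at 4 nodes within it
>   sacctmgr add qos long
>   sacctmgr modify qos where name=long set MaxTRESPerUser=node=4
>   # allow a user to submit with it (then: sbatch --qos=long job.sh)
>   sacctmgr modify user where name=SOMEUSER set qos+=long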
>
>
> Should you be interested, you can take a look at Ekhi's Slurm configuration
> file, /etc/slurm/slurm.conf.
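>
> For instance, "scontrol show config | grep -i Priority" prints the
> priority-related settings currently in effect, without needing to open
> the file.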
>
> In any case I'll talk to Ion about these ideas and come back with a
> better-defined plan.
>
> Bests,
>
> Iñigo
>
>
> > On Aug 21 2020, at 10:32 pm, Inigo Aldazabal Mensa
> > <inigo.aldazabalm at ehu.eus> wrote:
> > > Hi all,
> > >
> > > It seems some of you are having problems running jobs on ekhi, but
> > > others are not, as you can see with "squeue".
> > >
> > > In order to troubleshoot your problems please report:
> > > 1.- Directory you did "sbatch" from
> > > 2.- slurm.script used
> > > 3.- JobID
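> > >
> > > For reference, a minimal slurm.script looks something like this (the
> > > job name, core count and program are placeholders, not a recipe):
> > >
> > >   #!/bin/bash
> > >   #SBATCH --job-name=test
> > >   #SBATCH --nodes=1
> > >   #SBATCH --ntasks=48          # adjust to the cores you actually need
> > >   #SBATCH --time=01:00:00
> > >   #SBATCH --output=slurm-%j.out
> > >
> > >   srun ./my_program
> > >
> > > "sbatch" prints "Submitted batch job <JobID>" when it accepts the job,
> > > and "squeue -u $USER" lists the JobIDs of your queued and running jobs.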
> > >
> > > ekhi11 had some problems and it should be fixed now.
> > >
> > > Iñigo
> > >
> > > On Wed, 19 Aug 2020
> > > 18:50:42 +0200 Inigo Aldazabal Mensa <inigo.aldazabalm at ehu.eus>
> > > wrote:
> > > > Hi,
> > > >
> > > > Again we had problems with the CFM Data Center 2 cooling system
> > > > that forced me to cancel all running jobs and shut down all Ekhi
> > > > computing nodes :-((
> > > >
> > > > Currently I'm working remotely most of the time (in fact, again, I
> > > > was on a day off today), but tomorrow I'll go to the CFM and talk
> > > > to the cooling system technicians to check this. Also, a new
> > > > cooling system to avoid these problems is on the way, but it is
> > > > still a couple of weeks away from being installed.
> > > >
> > > > Depending on my talk with the technicians tomorrow I'll power
> > > > the Ekhi computing nodes back on, but I cannot say for sure. I'll
> > > > keep you informed.
> > > >
> > > > Bests (to say something),
> > > >
> > > > Iñigo
> > > >
> _______________________________________________
> Ekhi-users mailing list
> Ekhi-users at list.ehu.eus
> http://list.ehu.eus/mailman/listinfo/ekhi-users
>

