[Ekhi-users] Ekhi SLURM configuration

Inigo Aldazabal Mensa inigo.aldazabalm at ehu.eus
Thu Oct 29 07:37:21 CET 2020


Hi Antonella,

I'm testing the queue system changes, as they have implications for the
present setup that I have to check first. This is a major change to the
basics of the accounting and scheduling system, and I want to coordinate
with all of you in case the running jobs and existing accounts are
affected by it.

That being said, removing the higher priority for small jobs is an
independent, minor change and, should Ion agree, can be done straight
away.
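
In practice that should be roughly a one-line change in
/etc/slurm/slurm.conf (a sketch, to be checked against our current
configuration):

  # stop favoring small jobs in the job size priority factor
  PriorityFavorSmall=NO

followed by an "scontrol reconfigure" on the controller to pick it up.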

Bests,

Iñigo




On Wed, 28 Oct 2020 16:59:30 +0100
Antonella Meninno <ameninno001 en ikasle.ehu.es> wrote:

> Hi Iñigo,
> 
> I'd like to know if there has been any change to the queue system,
> and how it works now. In particular, I'd like to know if the absolute
> priority for small jobs has been removed.
> 
> I would say that giving absolute priority to small jobs presents an
> issue: large jobs get stuck and never finish. I think this should
> change, in order to leave room for large jobs too.
> 
> Bests,
> 
> Antonella Meninno
> 
> On Mon, Oct 19, 2020 at 1:50 PM Inigo Aldazabal Mensa
> <inigo.aldazabalm en ehu.eus> wrote:
> 
> > Hi Martin,
> >
> > (General note: please start a new thread for new topics, do not
> > reply to old ones!)
> >
> > Yes, I have to set up the queue system so that history is taken into
> > account. I'll put this on my todo list and try to do it sometime this
> > week.
> >
> > Technically, I have to set up the "Slurm Accounting Database", which
> > will allow us to use all the features of the "Multifactor Priority
> > Plugin" and also "Quality of Service" limits (see below):
> >
> > https://slurm.schedmd.com/priority_multifactor.html
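> >
> > For reference, the relevant slurm.conf pieces would look roughly like
> > this (the values are placeholders, not the final settings):
> >
> >   AccountingStorageType=accounting_storage/slurmdbd
> >   PriorityType=priority/multifactor
> >   PriorityDecayHalfLife=7-0        # how fast past usage is forgotten
> >   PriorityWeightFairshare=10000    # weight of the usage history factor
> >   PriorityWeightAge=1000           # weight of time spent waiting
> >   PriorityWeightJobSize=0          # job size no longer dominates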
> >
> > More below...
> >
> > On Mon, 19 Oct 2020 10:47:01 +0200,
> > Martin Gutierrez <martin00gutierrez en gmail.com> wrote:
> >  
> > > Hi Iñigo, after discussing with the group we identified some
> > > issues with the "free share" of the queue. We think they are
> > > important and could be improved.
> > >
> > > We would like the cluster usage history to be taken into account.
> > > Right now priority seems to be based purely on job size (in terms
> > > of core/RAM usage). This means that a big project that can be
> > > divided into small pieces will keep blocking the cluster for all
> > > the others who need a single big, bulky job, even if those others
> > > have hardly used the cluster at all.
> >
> > Yes, now small jobs are prioritized (option PriorityFavorSmall=Yes)
> >
> > https://slurm.schedmd.com/priority_multifactor.html#jobsize
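> >
> > To see how the different factors add up for your pending jobs you can
> > use, for example:
> >
> >   sprio -l    # per-job priority breakdown (age, job size, ...)
> >   sprio -w    # the factor weights currently configured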
> >  
> > > We would like that, for a single user, jobs are executed in the
> > > order they were submitted to the queue whenever possible. Since the
> > > queue prioritizes small jobs, a user who has to run both long and
> > > short jobs will see the long jobs blocked by the prioritized small
> > > ones. So, when possible, a single user's jobs should be ordered by
> > > the date they were submitted to the queue.
> >
> > Strict ordering would be very difficult to achieve, but *not*
> > prioritizing small jobs is easy and just a matter of changing the
> > option above.
> >
> > > We would also like resources to be reserved when a job has been at
> > > top priority for a long time. Most of our jobs use less than one
> > > node. A two-node job would get stuck until two whole nodes are
> > > freed, and since in our case the cluster is full most of the time,
> > > the job would remain stuck in the queue for a long time even if it
> > > is prioritized.
> >
> > Slurm should be clever enough to leave an "empty" node waiting in
> > order to fit a two-node job, so this should mostly work OK. Again,
> > not giving higher priority to small jobs should improve this.
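> >
> > This is what Slurm's backfill scheduler is meant to handle; assuming
> > the usual setup, it amounts to something like this in slurm.conf:
> >
> >   SchedulerType=sched/backfill
> >   SchedulerParameters=bf_continue,bf_window=2880   # look 2 days ahead
> >
> > Backfill works much better when jobs request realistic time limits
> > (sbatch --time=...), since that is what lets small jobs slot into the
> > gaps without delaying the reserved big job.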
> >  
> > > Each user should be allowed up to four nodes for their long jobs,
> > > regardless of the number of long jobs already running. Although we
> > > expected the cluster to receive mainly short jobs, right now some
> > > of us are dealing with bulky and expensive calculations all at the
> > > same time. If someone is using the four nodes, the rest of the
> > > users are forced to wait even if the rest of the cluster is free.
> >
> > This kind of per-user limit can be applied (I still have to check
> > exactly the level of detail that can be specified) using what Slurm
> > calls Quality of Service (QOS), but for this I first have to set up
> > the Slurm Accounting Database.
> >
> > https://slurm.schedmd.com/qos.html#limits
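> >
> > Once the accounting database is in place, a per-user limit of that
> > kind could look roughly like this (the QOS name "long" is just a
> > placeholder):
> >
> >   sacctmgr add qos long
> >   sacctmgr modify qos long set MaxTRESPerUser=node=4
> >
> > and jobs submitted with "sbatch --qos=long ..." would then be capped
> > at four nodes per user.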
> >
> >
> > Should you be interested, you can take a look at Ekhi's Slurm
> > configuration file, /etc/slurm/slurm.conf.
> >
> > In any case I'll talk to Ion about these ideas and come back with a
> > better-defined plan.
> >
> > Bests,
> >
> > Iñigo
> >
> >  
> > > On Aug 21 2020, at 10:32 pm, Inigo Aldazabal Mensa
> > > <inigo.aldazabalm en ehu.eus> wrote:  
> > > > Hi all,
> > > >
> > > > It seems some of you are having problems running jobs on Ekhi,
> > > > while others are not, as you can see with "squeue".
> > > >
> > > > In order to troubleshoot your problems, please report:
> > > > 1.- the directory you ran "sbatch" from
> > > > 2.- the slurm.script used
> > > > 3.- the JobID
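> > > >
> > > > If you are not sure of the JobID, "squeue -u $USER" lists your
> > > > jobs with their IDs, and "scontrol show job <JobID>" shows the
> > > > full details of a particular one.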
> > > >
> > > > ekhi11 had some problems and it should be fixed now.
> > > >
> > > > Iñigo
> > > >
> > > > On Wed, 19 Aug 2020 18:50:42 +0200,
> > > > Inigo Aldazabal Mensa <inigo.aldazabalm en ehu.eus> wrote:
> > > > > Hi,
> > > > >
> > > > > Again we had problems with the CFM Data Center 2 cooling
> > > > > system, which forced me to cancel all running jobs and shut
> > > > > down all Ekhi computing nodes :-((
> > > > >
> > > > > Currently I'm working remotely most of the time (in fact,
> > > > > again, today was a day off for me), but tomorrow I'll go to
> > > > > the CFM and talk to the cooling system technicians to check
> > > > > this. Also, a new cooling system to avoid these problems is on
> > > > > the way, but it is still a couple of weeks away from being
> > > > > installed.
> > > > >
> > > > > Depending on my talk with the technicians tomorrow I'll power
> > > > > the Ekhi computing nodes back on, but I cannot say for sure.
> > > > > I'll keep you informed.
> > > > >
> > > > > Bests (to say something),
> > > > >
> > > > > Iñigo
> > > > >  

