[Ekhi-users] Ekhi SLURM configuration

Inigo Aldazabal Mensa inigo.aldazabalm at ehu.eus
Thu Oct 29 09:56:17 CET 2020


On Thu, 29 Oct 2020 09:35:48 +0100
Ion Errea <ion.errea en ehu.eus> wrote:

> Hi Iñigo,
> 
> I agree with that change. So you can go on

Done, let's see.

I expect to check the major changes today in a test environment I'm
setting up for this purpose. I'll report back on the implications for
the present setup.

Iñigo

> 
> Let’s see if that improves the performance.
> 
> Bests,
> 
> Ion Errea
> 
> Fisika Aplikatua 1 saila, Gipuzkoako Ingeniaritza Eskola, and
> Centro de Física de Materiales (CSIC-UPV/EHU),
> University of the Basque Country (UPV/EHU)
>          
> Manuel de Lardizabal 5, 20018 Donostia, 
> Basque Country, Spain
> 
> Tel:      +34 943 01 8417
> Email:  ion.errea en ehu.eus
> Web:    http://ionerrea.wordpress.com/
> 
> > On 29 Oct 2020, at 07:37, Inigo Aldazabal Mensa
> > <inigo.aldazabalm en ehu.eus> wrote:
> > 
> > Hi Antonella,
> > 
> > I'm testing the queue system changes, as they have implications for
> > the present setup that I have to check out first. It is a major
> > change in the basics of the accounting and scheduling system, and I
> > want to coordinate with all of you in case the running jobs and
> > present accounts are affected by it.
> > 
> > That being said, removing the higher priority for small jobs is an
> > independent, minor change and, should Ion agree, can be done
> > straight away.
> > 
> > Bests,
> > 
> > Iñigo
> > 
> > 
> > 
> > 
> > On Wed, 28 Oct 2020 16:59:30 +0100
> > Antonella Meninno <ameninno001 en ikasle.ehu.es> wrote:
> >   
> >> Hi Iñigo,
> >> 
> >> I'd like to know if there has been any change to the queue system,
> >> and how it works now. In particular, I'd like to know if the
> >> absolute priority for small jobs has been removed.
> >> 
> >> I would say that giving absolute priority to small jobs presents an
> >> issue: the large jobs get stuck and never finish. I think that this
> >> should change, in order to give room to large jobs too.
> >> 
> >> Bests,
> >> 
> >> Antonella Meninno
> >> 
> >> On Mon, Oct 19, 2020 at 1:50 PM Inigo Aldazabal Mensa
> >> <inigo.aldazabalm en ehu.eus> wrote:
> >>   
> >>> Hi Martin,
> >>> 
> >>> (General note: please start a new thread for new topics, do not
> >>> follow old ones!)
> >>> 
> >>> Yes, I have to set up the queue system so that usage history is
> >>> taken into account. I'll put this on my todo list and try to do it
> >>> during this week.
> >>> 
> >>> Technically, I have to set up the "Slurm Accounting Database",
> >>> which will allow us to use all the features of the "Multifactor
> >>> Priority Plugin", and also "Quality of Service" limits (see below):
> >>> 
> >>> https://slurm.schedmd.com/priority_multifactor.html
> >>> 
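> >>> Roughly, this means running the slurmdbd accounting daemon (it
> >>> needs a MariaDB/MySQL database behind it) and pointing slurm.conf
> >>> at it. A minimal sketch, with weights that are only illustrative:
> >>> 
> >>>   # store job records through the Slurm accounting daemon
> >>>   AccountingStorageType=accounting_storage/slurmdbd
> >>>   # multifactor priority, so usage history (fair-share) counts
> >>>   PriorityType=priority/multifactor
> >>>   PriorityDecayHalfLife=7-0
> >>>   PriorityWeightFairshare=10000
> >>>   PriorityWeightAge=1000
> >>> 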
> >>> More below...
> >>> 
> >>> On Mon, 19 Oct 2020 10:47:01 +0200
> >>> Martin Gutierrez <martin00gutierrez en gmail.com> wrote:
> >>>   
> >>>> Hi Iñigo, after discussing with the group we identified some
> >>>> issues with the "fair share" of the queue. We think they are
> >>>> important and could be improved.
> >>>> 
> >>>> We would like the cluster usage history to be taken into
> >>>> account. Right now priority seems to be based purely on job size
> >>>> (in terms of core/RAM usage). This means that a big project that
> >>>> can be divided into small pieces will block the cluster for all
> >>>> the others that need a single big, bulky job, even if they have
> >>>> hardly used the cluster before.
> >>> 
> >>> Yes, now small jobs are prioritized (option
> >>> PriorityFavorSmall=Yes)
> >>> 
> >>> https://slurm.schedmd.com/priority_multifactor.html#jobsize
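> >>> 
> >>> For the record, removing that boost would essentially be a change
> >>> of these lines in /etc/slurm/slurm.conf (a sketch, to be confirmed
> >>> in the test setup first):
> >>> 
> >>>   # do not favour small jobs, and drop the job size factor from
> >>>   # the priority computation altogether
> >>>   PriorityFavorSmall=NO
> >>>   PriorityWeightJobSize=0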
> >>>   
> >>>> We would like that, for a single user, jobs are executed in the
> >>>> order they were submitted to the queue when possible. Since the
> >>>> queue prioritizes small jobs, a user that has to run both long
> >>>> and short jobs will see the long jobs blocked by the prioritized
> >>>> small jobs. So, when possible, a single user's jobs should be
> >>>> ordered by the date they were submitted to the queue.
> >>> 
> >>> Strict ordering would be very difficult to achieve, but *not*
> >>> prioritizing small jobs is easy: it is just a matter of changing
> >>> the option above.
> >>> 
> >>>> Reserving resources when a job has been at the top priority for a
> >>>> long time. Most of our jobs need less than one node. A two-node
> >>>> job would get stuck until two full nodes are freed. In our case
> >>>> the cluster is full most of the time, meaning the job would
> >>>> remain stuck in the queue for a long time even if prioritized.
> >>> 
> >>> Slurm should be clever enough to leave an "empty" node waiting in
> >>> order to fit a two-node job. This should work mostly OK. Again,
> >>> not giving higher priority to small jobs would improve this.
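> >>> 
> >>> This relies on the backfill scheduler, which reserves nodes for
> >>> the highest-priority pending job and only lets lower-priority jobs
> >>> jump ahead if they would not delay it. Roughly (the parameters are
> >>> just an example, I still have to check what ekhi currently uses):
> >>> 
> >>>   SchedulerType=sched/backfill
> >>>   # how far ahead, in minutes, backfill plans its reservations
> >>>   SchedulerParameters=bf_window=2880,bf_continue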
> >>>   
> >>>> Each user should be allowed up to four nodes for their long jobs,
> >>>> regardless of the number of long jobs already running. Although
> >>>> we expected the cluster to receive mainly short jobs, right now
> >>>> some of us are dealing with bulky and expensive calculations all
> >>>> at the same time. If someone is then using the four nodes, the
> >>>> rest of the users are forced to wait even when the rest of the
> >>>> cluster is free.
> >>> 
> >>> This kind of per-user limit can be applied (I still have to check
> >>> exactly what level of detail can be specified) using what Slurm
> >>> calls Quality of Service (QOS), but for this I first have to set
> >>> up the Slurm Accounting Database.
> >>> 
> >>> https://slurm.schedmd.com/qos.html#limits
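> >>> 
> >>> Once the accounting database is in place, a per-user node cap
> >>> could be set more or less like this (the QOS name and the limit
> >>> are only an example, nothing is decided yet):
> >>> 
> >>>   # create a QOS and cap the nodes a single user can hold in it
> >>>   sacctmgr add qos name=long
> >>>   sacctmgr modify qos long set MaxTRESPerUser=node=4
> >>> 
> >>> Jobs would then request it with "sbatch --qos=long ...".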
> >>> 
> >>> 
> >>> Should you be interested, you can take a look at Ekhi's Slurm
> >>> configuration file, /etc/slurm/slurm.conf.
> >>> 
> >>> In any case I'll talk to Ion about these ideas and come back with
> >>> a better defined plan.
> >>> 
> >>> Bests,
> >>> 
> >>> Iñigo
> >>> 
> >>>   
> >>>> On Aug 21 2020, at 10:32 pm, Inigo Aldazabal Mensa
> >>>> <inigo.aldazabalm en ehu.eus> wrote:    
> >>>>> Hi all,
> >>>>> 
> >>>>> It seems some of you are having problems running jobs on Ekhi,
> >>>>> but others are not, as you can see with "squeue".
> >>>>> 
> >>>>> In order to troubleshoot your problems, please report:
> >>>>> 1.- The directory you ran "sbatch" from
> >>>>> 2.- The slurm.script used
> >>>>> 3.- The JobID (see below for how to find it)
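> >>>>> 
> >>>>> If you are not sure about the JobID, something like this should
> >>>>> give it (a generic example):
> >>>>> 
> >>>>>   # list your own jobs with their JobIDs
> >>>>>   squeue -u $USER
> >>>>>   # show the submit directory and script path of a given job
> >>>>>   scontrol show job <JobID>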
> >>>>> 
> >>>>> ekhi11 had some problems, but it should be fixed now.
> >>>>> 
> >>>>> Iñigo
> >>>>> 
> >>>>> On Wed, 19 Aug 2020 18:50:42 +0200
> >>>>> Inigo Aldazabal Mensa <inigo.aldazabalm en ehu.eus> wrote:
> >>>>>> Hi,
> >>>>>> 
> >>>>>> Again we had problems with the CFM Data Center 2 cooling
> >>>>>> system, which forced me to cancel all running jobs and shut
> >>>>>> down all Ekhi computing nodes :-((
> >>>>>> 
> >>>>>> Currently I'm working remotely most of the time (in fact,
> >>>>>> again, I was on a day off today), but tomorrow I'll go to the
> >>>>>> CFM and talk to the cooling system technicians to check this.
> >>>>>> Also, a new cooling system to avoid these problems is on the
> >>>>>> way, but it is still a couple of weeks away from being
> >>>>>> installed.
> >>>>>> 
> >>>>>> Depending on my talk with the technicians tomorrow I'll power
> >>>>>> the Ekhi computing nodes back on, but I cannot say for sure.
> >>>>>> I'll keep you informed.
> >>>>>> 
> >>>>>> Bests (to say something),
> >>>>>> 
> >>>>>> Iñigo
> >>>>>>   
> 

