AAASwitch_UZH1_Campus


UZH.1 roadmap: update 27.05.2010

  1. Provide login node with SLES10 (ID)

    • qsub to Schroedinger

    • same UIDs/users as on Schroedinger

    • spool/scratch partition shared with all Schroedinger compute nodes

    • firewall req's: open ports from UZH (possibly the world?)

      • 2135/tcp

      • 2811/tcp

      • 30000--35000/tcp

    • MEETING on Tuesday June 8th at 14:00 (ID: Christian; GC3: Mike + Riccardo)

  2. prepare filesystem layout (ID + GC3)

    • sessiondir (spool directory for jobs; also temporary space for transferring files to/from the client)

    • runtimedir (hosts application configuration / env. variables)

    • cachedir (caches repeated transfer of the same file from a client; useful e.g. for BioInf databases or HEP analysis data)

  3. set up special queue (subset of Schroedinger nodes: 1 rack; 24 hours duration)

  4. request hostcert (GC3)

    • need to know final hostname: idesl4.uzh.ch
  5. compile/package ARC sw on SLES10 (GC3)

  6. Applications used for testing:

    • GAMESS
  7. User setup:

    • GC3 cluster uses Schroedinger users

    • Only Schroedinger users will be allowed into Schroedinger via ARC (at least initially), and they will be mapped to their UNIX account.

    • GC3 to act as single point of contact for enabling SLCS users

    • discuss with Luzian if GC3 can be delegated the power to enable SLCS

  8. (eventually) enable Nagios sensors on idesl4.uzh.ch
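
The firewall requirements in item 1 can be written down as rules. The following is a minimal sketch that only prints candidate rules and applies nothing. It assumes the login node filters with iptables, which the notes do not state, and it leaves out any source restriction because the question "UZH or the world" is still open.

```shell
#!/bin/sh
# Sketch only: print (do not apply) iptables rules for the ports in item 1.
# Assumption: iptables is the packet filter in use; the notes do not say so.

# Ports from the roadmap; a range is written LOW:HIGH, as iptables expects.
ARC_PORTS="2135 2811 30000:35000"

print_arc_rules() {
    for port in $ARC_PORTS; do
        echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
    done
}

print_arc_rules
```

Once the allowed source network is decided, a `-s` option with that network could be added to each printed rule.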

Followup meeting 19.02.2010

  • Participants: Christian (CB), Vincent (VK), Alexander (AG), Riccardo (RM), Mike (MP), Sergio (SM)

Current status of UZH.1 project

  • Campus grid cannot reach stated goals.

  • What is realistic to do:

    • have the gc3 cluster run the small non-HPC jobs that would otherwise land on Schroedinger

    • allow seamless submission of jobs to either cluster: users should not care where the job lands

    • Have the solution integrated with SMSCG

  • We could request an extension from SWITCH: no problem within the calendar year

    • We will try to complete it on time and ask for an extension only if needed

Preconditions on Schroedinger

  • plan for a solution that is independent of Schroedinger

  • i.e., don't touch the existing SCH setup

Approach to follow

  • Make sure SGE on idgc3 can submit to the SGE master on SCH

    • users on idgc3 need to come from the SCH LDAP

    • then users which are local to idgc3 will not be accepted by the SCH SGE

  • applications should sit in similar paths on GC3 and SCH

    • this is a constraint coming from the Grid middleware

    • We have to verify whether this is indeed a constraint or whether there could be a workaround

  • Lustre availability

    • Lustre will not be available on the GC3 cluster

    • so users will have to explicitly request Lustre in submission scripts

    • or better specify "allow grid" in the submission, in case they don't need SCH

  • Scratch dir: where is it located?

    • application-specific: each application can have its own set of environment variables

    • we would like to have a generic $SCRATCH env. var. pointing to a generic scratch area

    • grid apps right now do not need much scratch -- local scratch will do

    • no local scratch on SCH, need to use Lustre

    • better name it $GRID_SCRATCH so policies for it can differ from the local jobs policies

  • Accounting

    • Info on the database includes the "cell" (cluster) name, so no problem in sending all acct data to the same DB (assuming no SGE uses the default cluster name of "default")
  • Issue with architecture/system specific configuration

    • should users put "if uname == ..." everywhere in their scripts?

    • no, env. vars will be defined, use those: e.g., call "$APPS/gamess" instead of "/whatever/path/gamess"

    • have a predefined set of installed applications

    • one responsible person for each application, who writes a HOWTO that other sysadmins will use for installing the application on SCH and GC3

    • users can only use the installed applications -- no submission of arbitrary binaries.

  • Application deployment: provide two sets of the same binaries

    • one optimized for SCH

    • one safe and sound for Grid usage

  • what about the condor pool?

    • postpone until next semester (summer pause)
  • SCH has two SGE masters (master1 & master2)

    • not yet connected to IP network (but planned)

    • but they only have to talk to the SGE master on GC3,

    • so we only need to open a hole in the firewall for that single IP
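
The environment-variable conventions discussed above ($APPS for application paths, $GRID_SCRATCH for scratch space) could be used in a job script roughly as follows. This is a sketch under assumptions: the fallback values are hypothetical and exist only so that the script runs outside the clusters, and the helper names `app_path` and `make_workdir` are invented for illustration.

```shell
#!/bin/sh
# Sketch of a cluster-neutral job script: no "if uname == ..." tests and
# no hard-coded paths. APPS and GRID_SCRATCH are the variables proposed
# in the notes; the defaults below are hypothetical placeholders.

: "${APPS:=/opt/apps}"                  # hypothetical default
: "${GRID_SCRATCH:=${TMPDIR:-/tmp}}"    # hypothetical default

# Resolve an installed application through $APPS.
app_path() {
    echo "$APPS/$1"
}

# Create a private working directory under the grid scratch area.
make_workdir() {
    mktemp -d "$GRID_SCRATCH/job.XXXXXX"
}

workdir=$(make_workdir)
echo "would run $(app_path gamess) in $workdir"
rmdir "$workdir"
```

Because the script refers only to the two variables, the same file can be submitted to either cluster as long as each site defines them.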

Preliminary plan

  • Hooking SGE/SCH into SGE/GC3: MP + CB

  • users on LDAP on GC3: MP

  • create "Grid" realm on SGE (for external Grid users): ID

  • layout for the Panasas FS: ID + SM

  • GAMESS install: MP

  • Shared filesystem between the two systems ( how to mount Panasas on "idgc3"? )

    • only frontend node on "idgc3" has public IP

    • mount Panasas on FE node only and stage from there?

    • need to think about it; several possible solutions:

      • put the two clusters in the same mgmt subnet (don't like because of security implications)

      • re-export via NFS and have nodes on "idgc3" use public IP

      • stage everything (sounds like the best one)

    • stage should be enough for the GAMESS use cases

  • UNIX groups are written in stone in LDAP

    • each user has one and only one UNIX group, which is its home institution

    • use "projects" in SGE to allow submission

    • problem is with the Grid MW: it needs a shared FS, usually group-writable; if users do not all belong to the same group, we need to make it world-writable.

    • maybe create a group local to the machine

    • UPDATE 2010-04-08: Since we are creating gridXXX generic grid user accounts, we shall have a grid group, and every UZH user that uses ARC job submission will be a member of this group. (This allows us to keep the ARC sessiondir directory with 770 permissions.)
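
As an illustration of the 770 arrangement in the update above, here is a minimal sketch that creates a session directory usable only by its owner and one group. The function name `make_sessiondir` is hypothetical, and the demo call uses a throwaway directory and the caller's own group so that it runs without root; on the real system the group would be the grid group.

```shell
#!/bin/sh
# Sketch: create a session directory with mode 770 for a given group.
# Path and group are arguments; the demo values are placeholders,
# not the real site settings.

make_sessiondir() {
    dir=$1
    group=$2
    mkdir -p "$dir" &&
    chgrp "$group" "$dir" &&
    chmod 770 "$dir"
}

# Demo in a temporary location with the caller's primary group (numeric GID).
demo=$(mktemp -d)
make_sessiondir "$demo/session" "$(id -g)"
ls -ld "$demo/session"
rm -rf "$demo"
```

With mode 770, users outside the group cannot even list the directory, which is the point of putting every ARC user into one group instead of making the area world-writable.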

Technical Solutions

SGE File Staging

Apparently SGE does not support file staging the way PBS does, where a dedicated directive is provided ( "-W stagein=..." and "-W stageout=..." ). According to the documentation found at http://gridengine.sunsource.net/project/gridengine/howto/filestaging/ there are two possible approaches:

  • DRMAA - but in this case, the application to be submitted needs to be DRMAA-aware (not really feasible for our needs)

  • Trick SGE with -v (set environment variables) and prologue/epilogue scripts

For the moment we chose the latter variant and tested it on the idgc3grid01 cluster.

  • The prologue/epilogue scripts will be executed on the compute node (need to make sure the path will be identical on both systems)

  • Scripts will be executed on behalf of the user (using user's id)

  • Prologue and epilogue take effect only when the UZH_SGE_IN and/or UZH_SGE_OUT variables are set in the job submission

UZH_SGE_IN UZH_SGE_OUT syntax

The UZH_SGE_IN/UZH_SGE_OUT syntax mimics the TORQUE syntax for file stage-in/out, see http://www.clusterresources.com/torquedocs21/6.3filestaging.shtml

Each of the UZH_SGE_IN and UZH_SGE_OUT environment variables is a comma-separated list of copy-specs. Each copy-spec is a pair of exec-node-path and frontend-node-path, separated by the character @. The exec-node-path is an absolute pathname (e.g., "/tmp/session/123456"). The frontend-node-path is composed of a hostname, a colon character :, and an absolute pathname (e.g., "idgc3grid01:/export/grid/session/1233456").

The UZH_SGE_IN and UZH_SGE_OUT are translated into a series of "rsync" invocations: each copy-spec of the form X@Y triggers the execution of "rsync -a Y X" (on stage-in) or "rsync -a X Y" (on stage-out). In particular, "rsync" conventions on recursive directory copies apply.

All pathnames must be absolute; the effect of passing a relative pathname in the exec-node-path or frontend-node-path is unspecified.
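
To make the syntax concrete, the sketch below prints the "rsync" commands that a given UZH_SGE_IN or UZH_SGE_OUT value would trigger, without running them. The helper name `show_staging` and the example paths are illustrative assumptions, not part of the deployed scripts.

```shell
#!/bin/sh
# Sketch: dry-run translation of copy-specs into rsync commands.
# $1 = direction (in|out), $2 = comma-separated list of copy-specs

show_staging() {
    direction=$1
    old_ifs=$IFS
    IFS=,                           # split the list at commas
    for spec in $2; do
        exec_path=${spec%%@*}       # part before the first @
        frontend=${spec#*@}         # part after the first @
        if [ "$direction" = in ]; then
            echo "rsync -a $frontend $exec_path"
        else
            echo "rsync -a $exec_path $frontend"
        fi
    done
    IFS=$old_ifs
}

# Example value with a single copy-spec (paths are examples only).
UZH_SGE_IN="/tmp/session/123456@idgc3grid01:/export/grid/session/123456"
show_staging in "$UZH_SGE_IN"
```

A job would then be submitted with something like `qsub -v UZH_SGE_IN=... job.sh`. Note that `qsub -v` itself takes a comma-separated list of variables, so a value holding several copy-specs may need to be exported in the submitting shell and passed with `-V` instead; this has not been verified against the setup described here.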

How to install the prologue/epilogue scripts

# qconf -sconf | grep -e prolog -e epilog

prolog                       /share/apps/bin/file_trans_prolog.sh

epilog                       /share/apps/bin/file_trans_epilog.sh

Sample prologue/epilogue scripts

cat /share/apps/bin/file_trans_prolog.sh

#!/usr/bin/env perl


# prologue script to stage files in to the exec node file system


# do nothing if UZH_SGE_IN is not defined or empty

exit 0 if not $ENV{'UZH_SGE_IN'};


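foreach my $copyspec (split(/,/, $ENV{'UZH_SGE_IN'})) {

  # a copy-spec is EXEC-NODE-PATH@HOST:FRONTEND-PATH; split at the first @ only

  my ($local, $remote) = split(/@/, $copyspec, 2);

  # list-form system() runs rsync without a shell, so odd characters in paths are safe

  system('rsync', '-a', $remote, $local) == 0 or warn "stage-in failed for '$copyspec'\n";

}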

cat /share/apps/bin/file_trans_epilog.sh

#!/usr/bin/env perl


# epilogue script to stage files out of the exec node file system


# do nothing if UZH_SGE_OUT is not defined or empty

exit 0 if not $ENV{'UZH_SGE_OUT'};


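foreach my $copyspec (split(/,/, $ENV{'UZH_SGE_OUT'})) {

  # a copy-spec is EXEC-NODE-PATH@HOST:FRONTEND-PATH; split at the first @ only

  my ($local, $remote) = split(/@/, $copyspec, 2);

  # list-form system() runs rsync without a shell, so odd characters in paths are safe

  system('rsync', '-a', $local, $remote) == 0 or warn "stage-out failed for '$copyspec'\n";

}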

Architecture

[UZH.1 campus architecture 20100219.eps][3]

[3]: UZH.1_campus_architecture_20100219.eps (UZH.1 campus architecture 20100219.eps)
