
OPS/FAQs/Redundancy_guidelines_for_grid_services

This page provides guidelines on how to implement redundancy mechanisms in several grid services. It should be considered a work in progress, since new solutions can come up at any time.

Clients Redundancy

Information Systems

In gLite 3.2 you can set up Top-BDII redundancy for the clients (gLite-WNs and gLite-UIs) by configuring them with the following definition in the YAIM site-info.def configuration file:

  • BDII_LIST: Optional variable defining a list of top-level BDIIs to support automatic failover in the GFAL clients. The syntax is my-bdii1.$MY_DOMAIN:port1[,my-bdii2.$MY_DOMAIN:port2[...]]. If the first top-level BDII cannot be contacted, the second one is used in its place, and so on.
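
For example, a site-info.def entry listing two top-level BDIIs (hostnames here are hypothetical) could look like:

```shell
# Hypothetical site-info.def fragment: two top-level BDIIs for failover.
# GFAL clients will try them in the order they are listed.
BDII_LIST="topbdii1.example.org:2170,topbdii2.example.org:2170"
```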

YAIM will set the LCG_GFAL_INFOSYS environment variable in the /etc/profile.d/grid-env.sh file on the gLite-WNs/UIs, as in the following example:

root@wn001 ~# grep GFAL /etc/profile.d/grid-env.sh
gridenv_set         "LCG_GFAL_INFOSYS" "topbdii.core.ibergrid.eu:2170,topbdii01.ncg.ingrid.pt:2170,gridii01.ifca.es:2170,bdii.pic.es:2170"
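
The failover order can be sketched as follows: GFAL-based tools walk the comma-separated list from left to right until one of the BDIIs answers. A minimal illustration of that order (hostnames hypothetical, the echo stands in for the real LDAP query):

```shell
# Sketch of the client-side failover order (hypothetical hostnames).
# A real GFAL client queries each endpoint in turn until one responds.
export LCG_GFAL_INFOSYS="topbdii1.example.org:2170,topbdii2.example.org:2170"
for bdii in $(echo "$LCG_GFAL_INFOSYS" | tr ',' ' '); do
    echo "trying $bdii"    # here the client would issue an LDAP query
done
```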

Data Management

  1. lcg_utils: The data management tools (lcg_utils) contact the information system for every operation (lcg-cr, lcg-cp, ...). So, if your client is properly configured with redundancy for the information system (see the previous point), the lcg_utils tools will use that mechanism transparently. Be aware that lcg-infosites does not work with multiple BDIIs; only GFAL, lcg_utils, lcg-info and glite-sd-query do.
  2. LFC: By design, a VO must have a single LFC server available in read-write mode, but there can be additional servers configured in read-only mode and synchronized with the main server. Unfortunately, there is no automatic redundancy mechanism to switch between LFCs: the user has to choose which one to use by setting the LFC_HOST environment variable at run time. Keep in mind that no new data can be registered if the main LFC fails and the user switches to a read-only LFC.
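
Switching to a read-only replica is therefore a manual step, along these lines (the replica hostname is hypothetical):

```shell
# Manual failover to a read-only LFC replica (hostname hypothetical).
# Read operations keep working against the replica; registrations will fail.
export LFC_HOST=lfc-ro.example.org
# lfc-ls /grid/myvo            # listing works on the read-only copy
# lcg-cr ... (registration)    # would fail: the replica is read-only
```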

WMS Job submission

You can specify a space-separated list of WMS hostnames per VO, at configuration time, using

  • VO_"vo-name"_WMS_HOSTS: Optional variable specifying a space-separated list of WMS hostnames supported by the VO.

If the first WMS cannot be contacted, the second declared WMS will be used in its place, and so on.
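
As an illustration, a site-info.def entry for a hypothetical VO "myvo" with two WMSs could be:

```shell
# Hypothetical site-info.def fragment: two WMSs for VO "myvo",
# tried in the order they are listed.
VO_MYVO_WMS_HOSTS="wms01.example.org wms02.example.org"
```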

Issuing Proxies

If a VO has more than one VOMS server available, the client should be configured with the whole set of VOMS services. If the main server is not available when a user issues the voms-proxy-init --voms "vo-name" command, the secondary servers are tried instead.

To implement this mechanism, the clients should use the following settings in the YAIM site-info.def configuration file:

  • VO_"vo-name"_VOMSES: This variable contains the vomses file parameters needed to contact a VOMS server. Multiple VOMS servers can be given if the parameters are enclosed in single quotes. The syntax should be 'vo_nickname voms_server_hostname port voms_server_host_cert_dn vo_name gt_version', where gt_version is optional and refers to the version of the Globus Toolkit the VOMS server is running. This argument is needed to know how to contact the VOMS server, which is done differently depending on the GT version.

  • VO_"vo-name"_VOMS_SERVERS: A list of the VOMS servers used to create the DN grid-map file. The format is 'vomss://"host-name":8443/voms/"vo-name"?/"vo-name"'.

  • VO_"vo-name"_VOMS_CA_DN: DN of the CA that signs the VOMS server certificate. Multiple values can be given if enclosed in single quotes. Note that there must be as many entries as in the VO_"vo-name"_VOMSES variable: there is a one-to-one relationship between the elements of both lists, so the order must be respected.
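
Putting the three variables together, a sketch for a hypothetical VO "myvo" with a main and a secondary VOMS server (all hostnames, ports and DNs invented for illustration) might look like:

```shell
# Hypothetical site-info.def fragment for VO "myvo" with two VOMS servers.
# VO_MYVO_VOMS_CA_DN entries must line up one-to-one with VO_MYVO_VOMSES.
VO_MYVO_VOMSES="'myvo voms1.example.org 15001 /DC=org/DC=example/CN=voms1.example.org myvo' 'myvo voms2.example.org 15001 /DC=org/DC=example/CN=voms2.example.org myvo'"
VO_MYVO_VOMS_CA_DN="'/DC=org/DC=example/CN=Example CA' '/DC=org/DC=example/CN=Example CA'"
VO_MYVO_VOMS_SERVERS="vomss://voms1.example.org:8443/voms/myvo?/myvo vomss://voms2.example.org:8443/voms/myvo?/myvo"
```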

Storing Proxies

The clients use a PX server definition which is set by the YAIM configuration variable PX_HOST and accessed at run time via the MYPROXY_SERVER environment variable. If the default PX server is not available to store the user credentials, the user can choose a different PX server by changing the MYPROXY_SERVER environment variable at run time. In this sense, there is no automatic redundancy for this service: the user has to choose manually which PX server to use.
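
That manual switch amounts to something like the following (alternative server name hypothetical):

```shell
# Manual failover to an alternative MyProxy (PX) server; hostname hypothetical.
export MYPROXY_SERVER=myproxy2.example.org
# myproxy-init -s "$MYPROXY_SERVER" ...   # credentials now go to the new server
```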

If the user would like to execute very long jobs (see the IBERGRID Proxy Renewal guidelines), the WMS must be able to renew the user proxy, so the user should tell the WMS which PX server is being used. By default, the WMS will try to contact the PX server initially configured in the PX_HOST YAIM configuration variable. It is also possible to set a different PX server per VO, using:

  • VO_"vo-name"_PX_HOST: MyProxy server supported by the VO.

However, if the user overrides the default by resetting the MYPROXY_SERVER environment variable, the WMS will not be aware of that change, and the user will have to force it by including MyProxyServer="proxy_server" in the JDL.
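
A minimal JDL fragment carrying that attribute (server name and job details hypothetical) could be produced like this:

```shell
# Write a JDL that tells the WMS which PX server to use for proxy renewal
# (hostname and job details are hypothetical).
cat > long_job.jdl <<'EOF'
Executable    = "/bin/sleep";
Arguments     = "86400";
MyProxyServer = "myproxy2.example.org";
EOF
cat long_job.jdl
```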

Core services redundancy mechanisms

VOMS and LFC

Since VOMS and LFC servers are considered single points of failure in the infrastructure, it is common practice for VOs to set up one main server (configured in read-write mode) and secondary servers configured in read-only mode. The secondary servers synchronize with the main server via MySQL replication, allowing users to continue working when the main server is out of contact. Some guidelines to set up the MySQL replication mechanism are available in IBERGRID MySQL Replication Mechanism.

The clients should be automatically configured to use all of the VO's available VOMS servers, with the main server being the first machine contacted. In the LFC case, it is the user's responsibility to switch (manually) between servers (see the previous item on "Clients Redundancy").

PX

If the PX server is lost during the execution of a very long job, there is no alternative way to renew user proxies, and the job will fail with a "Proxy expired" message.

