SGE supports interactive jobs through qsh, qrsh and qlogin. While qsh focuses on X11 sessions, qrsh and qlogin are used to run interactive commands (qlogin is mostly used to start an interactive shell, whereas qrsh runs the command and exits).
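As a rough illustration of that split, here is a minimal Python sketch that builds the argument list for each tool. The helper name interactive_argv is made up for this example, and the rule that only qrsh takes a command simply mirrors the usage described above; it is not a full description of the SGE command-line interface.

```python
import shlex

# Hypothetical helper, not part of SGE: it only assembles an argv list.
TOOLS = ("qsh", "qrsh", "qlogin")


def interactive_argv(tool, command=None):
    """Return the argv list for an SGE interactive job.

    Assumption: a command is only passed to qrsh, which runs it and exits;
    qsh and qlogin are started without one.
    """
    if tool not in TOOLS:
        raise ValueError(f"unknown SGE interactive tool: {tool}")
    argv = [tool]
    if command is not None:
        if tool != "qrsh":
            raise ValueError("in this sketch only qrsh takes a command")
        # Split the command the way a POSIX shell would.
        argv.extend(shlex.split(command))
    return argv
```

On a submit host the resulting list could be handed to subprocess.run(), for example subprocess.run(interactive_argv("qrsh", "hostname")).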
Tag Archives: Cluster
Dell cluster headnode down
This is the first time the Dell cluster has gone down since I took over the Unix/Linux side. At about 10:20, I got a call from a colleague saying that the cluster head node was not accessible and did not even respond to ping. Consequently, the entire cluster was temporarily unavailable to the business.
How does Ganglia monitor and collect node info
Ganglia is one of the most widely used open source cluster monitoring tools. Its documentation page has detailed information on how to configure gmond and gmetad, but lacks an architectural overview of how Ganglia collects information from each monitored node.
Fixed the problem where cexec in C3 hangs
We use C3 to run the same command on all the nodes. This week I found that the “cexec” command no longer works. It simply hangs after the command is issued and times out only after a very long wait.
Again, to work with machines
As the former Unix admin has left the company, I need to take responsibility for these Unix/Linux machines. While the company expects me to do more high-level things than playing with physical machines and environments, I still need to do some basic work as long as there is nobody else to do it.
Clarified a few confusions about LAM-MPI
I had a few misunderstandings about LAM-MPI, and they were cleared up today while working on the cluster project. Referring back to my last post on MPI, the conclusion I drew there was wrong. The daemon I started can only be used by me. Every user has to start their own lamd before running a program with the R SNOW or Rmpi package.
Rmpi/SNOW runs job only at headnode, fixed
About puzzlebird
So, here is something about me.
Expertise:
HPC architecture, performance tuning (system/program profiling),
Linux/Unix, Mac OS X, Oracle, MySQL, Perl, C, Shell, PHP, XML, Joomla
Prototype code to run batch jobs with SGE
A user once ran thousands of “agrep” jobs on our 20-CPU Solaris machine and took all the CPUs. I am leading the effort to introduce cluster computing to the bioinformatics group, so this is a good chance to convert these scripts to run in an SGE cluster environment.
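One way such a conversion could look is an SGE array job, where each task handles one input file. The Python sketch below only generates the job script text. The function name, the job name, the list file with one input path per line, and the output naming are all assumptions made for illustration, not the prototype described in the post.

```python
import shlex


def make_array_job_script(pattern, list_file, n_tasks, job_name="agrep_batch"):
    """Build the text of an SGE array job script that runs agrep once per task.

    Assumptions for this sketch: list_file holds one input path per line,
    and task number i processes line i of that file.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    lines = [
        "#!/bin/sh",
        f"#$ -N {job_name}",      # job name
        "#$ -cwd",                # run in the submission directory
        f"#$ -t 1-{n_tasks}",     # array job: one task per input file
        # SGE sets SGE_TASK_ID for each task; use it to pick a line.
        f'INPUT=$(sed -n "${{SGE_TASK_ID}}p" {shlex.quote(list_file)})',
        f'agrep {shlex.quote(pattern)} "$INPUT" > "agrep.$SGE_TASK_ID.out"',
    ]
    return "\n".join(lines) + "\n"
```

The returned text would be written to a file and submitted with qsub, which lets the scheduler spread the tasks across cluster slots instead of having every job land on one machine at once.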
Fix “NIS account” issue using nscd
We encountered a problem with the current cluster: about 5-8% of jobs fail with the error “can not find password entry for user ‘xxx’. User may not exists or NIS error”. This error hits submitted jobs at random and only affects NIS users (local accounts work perfectly well).