10/25/07
--------
Summary
 - fix for drms_server exit
 - updated nightly jsoc build script to work on the new tree
 - prepare to upgrade hmidb to PostgreSQL 8.2
 - work in progress: Slony-I

drms_server exit
================
Key fixes:
 - added drms_abort_now(), drms_disconnect_now(), and
   drms_send_commandcode_noecho().
 - added signal block for SIGPIPE so that write() can fail without
   terminating the process.

All fixes checked in. I am going to follow this up by closely watching
idle transactions on our database machine. Currently there are two
sources of idle transactions:
 - SUMS. I described the cause of it a long time ago and it should be
   fixed. The fix is extremely simple: COMMIT after the select statement.
 - drms_server. Don't fret if you get an email from me asking you to
   describe how you run a particular module. This will help me track
   down any exit problem in drms_server.

jsoc=# select * from pg_stat_activity where current_query ~* 'transaction';
   datid   | datname | procpid | usesysid  |  usename   |     current_query     |          query_start          |         backend_start         |  client_addr  | client_port
-----------+---------+---------+-----------+------------+-----------------------+-------------------------------+-------------------------------+---------------+-------------
 117833083 | jsoc    |    3753 | 117161306 | production | <IDLE> in transaction | 2007-10-25 10:32:27.922195-07 | 2007-10-24 16:31:23.662128-07 | 172.24.103.80 |       55318
 117833083 | jsoc    |    3758 | 117161306 | production | <IDLE> in transaction | 2007-10-25 10:32:27.915275-07 | 2007-10-24 16:33:05.688361-07 | 172.24.103.80 |       55384
 117833083 | jsoc    |    5830 | 117831133 | schou      | <IDLE> in transaction | 2007-10-25 10:08:32.169667-07 | 2007-10-25 10:08:31.893354-07 | 172.24.103.32 |       42422
(3 rows)

The first two are from SUMS. The third is a drms_server process created
by show_keys. It was waiting for SUMS.
  show_keys ds hmi_ground.lev0[185000-185304,435851-436026] key FSN segfile -p -r -q

----------------------------------------------------------------------
Summary: stupendous exit routes in socket connection

For a given drms_server process, let the client module processes
connecting to it be c_0, c_1, c_2, etc., and let their corresponding
server threads, created by drms_server, be s_0, s_1, s_2, etc.

When a client module process c_i receives a signal to exit, e.g.,
SIGINT, the signal is caught and c_i runs sighandler(), which in turn
calls drms_abort_now(). That sends DRMS_DISCONNECT through the socket
with status = 1, and then c_i exits.

The server thread s_i could be doing a number of things at that point.
If it is idle, it reads the command from the socket and echoes it back
through the socket. If c_i has already exited, the socket is no longer
there and write() fails (it returns an error instead of killing the
process because we have blocked SIGPIPE); s_i then calls exit(), which
runs atexit_action(). If c_i is still there, the echo succeeds and s_i
calls Exit(), which calls exit().

In either case, the exit procedure might be blocked in
drms_server_abort() by another server thread s_j that currently holds
the lock on the mutex. s_j may be processing any of the following
commands: DRMS_ROLLBACK, DRMS_COMMIT, DRMS_NEWSERIES, DRMS_DROPSERIES,
DRMS_NEWSLOTS, DRMS_SLOT_SETSTATE, DRMS_GETUNIT. When s_j releases the
lock, the exit procedure proceeds in drms_server_abort() without any
chance of further blocking.

In the case of a self-started drms_server, the client module is the
parent process of the drms_server process, so the signals received by
the client module are passed on to the drms_server process. The signal
thread receives these signals and proceeds to call Exit(). The server
thread may also reach drms_server_abort(), because of a DRMS_DISCONNECT
command or a broken pipe, but only one of the two proceeds into
drms_server_abort() because of the lock on the mutex.
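
For illustration only, here is a minimal sketch of the "block SIGPIPE so
that write() can fail" idea described above. It is not the drms_server
code: block_sigpipe() and echo_to_client() are hypothetical names, and I
am assuming the mask is installed in the main thread before any server
thread is created, so that the threads inherit it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: block SIGPIPE for the calling thread.
 * Assumption: called early in main(), before any thread is created,
 * so every thread created later inherits the mask. A program that is
 * already multithreaded would use pthread_sigmask() instead.
 * Returns 0 on success, -1 on error. */
int block_sigpipe(void)
{
    sigset_t set;

    if (sigemptyset(&set) != 0 || sigaddset(&set, SIGPIPE) != 0)
        return -1;
    return sigprocmask(SIG_BLOCK, &set, NULL);
}

/* Hypothetical helper: write len bytes to a peer that may have exited.
 * With SIGPIPE blocked, a vanished peer makes write() return -1 with
 * errno set to EPIPE instead of terminating the process.
 * Returns 0 if everything was written, -1 on error (errno is kept). */
int echo_to_client(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any byte went out: retry */
            return -1;      /* e.g. EPIPE: the client is gone */
        }
        p += n;             /* partial write: advance and keep going */
        len -= (size_t)n;
    }
    return 0;
}
```

In terms of the notes above, a return of -1 with errno == EPIPE is the
point where s_i finds that c_i has already exited and takes the exit()
route; a return of 0 corresponds to the successful echo.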

----------------------------------------------------------------------
Summary: mundane exit route in direct connection

There is only one process, which creates a signal thread, and a sums
thread when needed. Without logging, the process exits promptly without
having to wait for any outstanding SUMS requests. With logging,
drms_server_close_session() cannot complete, since the log SU needs to
be committed, which cannot happen until the current SUMS requests get
served. This is due to our head-of-line blocking queue design. The
process cannot exit until these two outstanding SUMS requests are
served.

----------------------------------------------------------------------
Unresolved issue: program exits without proper SUM_close()
 - can't just detach the sums_thread, since it depends on a shared data
   structure in the mother thread that would go away when the program
   exits.

----------------------------------------------------------------------

Prepare to upgrade hmidb to PostgreSQL 8.2
==========================================
Since the internal data format changes between major releases, data has
to be backed up and restored.

Looked into upgrading hmidb2 from 8.1 to 8.2. This is time consuming
because of the size of the database: 858GB, compared to 10GB for
hmidb:jsoc.

Options:
1) Data upgrade is supported only by exporting and importing data using
   pg_dump
   - slow and causes a long downtime
   - extra disk space needed
   - requires both versions (old & new)
2) pg_migrator
   - faster than 1), but it still has a long downtime when the on-disk
     structure has changed
   - previous version of PostgreSQL is needed
   - no downgrade
   - requires both versions (old & new)