01/17/08

Summary
- Fixed stdout reopen problem
- Fixed a bunch of long-standing compiler warnings: getpeername(), connect(), etc.
- Did more testing on begin/end_transaction()
- SUMS caveats
- DRMS/SUMS garbage collection
- Started writing DRMS code for remote SUMS
- Ran tests on -static flag for pthreads

Fixed stdout reopen problem
---------------------------
Rewrote the restore_stdeo() function: changed from fclose(stdout) to close(STDOUT_FILENO).

More testing on begin/end_transaction()
---------------------------------------
Just to make sure I haven't broken anything by shuffling things around.

One limitation with begin/end_transaction(): they are server-side functions. If a module uses them, it can only be built in the direct-connect version. This means irregularity in the makefiles.

Still not committed due to pending SUMS issues.

SUMS caveats
------------
I ran into a couple of problems with SUMS while testing the DRMS transaction restart code, which is needed for the lev0 code.

Memory leak: Valgrind, the memory checker, reports memory leaks that I can't locate. The case is very simple: if the same program repeatedly calls SUM_open and SUM_close with something in between (SUM_get, for example), the memory is not freed at the end of SUM_close, so one observes free memory depleting. I wrote a small test program at ~karen/src/karen_test/sum_test3.c. The problem remains with or without threads (we use threads in DRMS, which is why I needed to test it).

The memory leak is alarming. I saw at least 8K leaked per open/close pair. It would render the transaction restart model unusable if we don't address it. Alternatively, I can avoid starting a new SUM object for every transaction, i.e., keep using the same SUM object. This also has a side effect: all SUs accessed by a SUM object are protected somehow, i.e., they can't be removed from the disk until the SUM object is closed.
If we run this type of module for a long time, lev0 processing for example, we potentially clog up disk storage space.

Resource limit? A related problem is that I can't run SUM_open/SUM_close pairs in quick succession: I get a "__get_myaddress: socket: Too many open files" error. In my test program, I have to add sleep(1) at the end of each open/close pair. It seems to me that some resource is being held somewhere.

DRMS/SUMS garbage collection
----------------------------
Revisiting the DRMS/SUMS garbage collection issue from Jun 25, 2007. Got feedback from Rick. Need clarification.

DRMS Records
- Permanent (permanently stored in the DRMS database)
- Transient (exist within a session and are deleted from the DRMS database at the end of the session)
Both are specified in drms_create_records().

SU (I am not sure about the terminology. Jim, please correct me.)
- Permanent (archive = 1 in series definition or -A option for drms_server)
- Temporary (archive = 0 in series definition and no -A option for drms_server): they will be removed from the disk after the retention time runs out.

We have four combinations of the above:
1. Permanent record, Permanent SU
2. Permanent record, Temporary SU
3. Transient record, Permanent SU
4. Transient record, Temporary SU

I think Jim is considering case 2, where upon removing the temporary SU, the permanent record in the DRMS database is also removed. I recall from our meetings that people suggested such segment-less records may still be of value... I generally feel uncomfortable with SUMS removing DRMS records, and vice versa, in such an implicit manner.

Cases 3 and 4 involve cleaning up the SUs. In addition, delete_series also creates garbage SUs. Cleanup for this latter case and for case 3 is not yet implemented.

Remote SUMS
-----------
Need a way to assign the higher bits in sunum from SUMS.

Trouble with drms_su_getsudir() and drms_su_getsudirs().
The latter may have part of the record sets in local SUMS.

Ran tests on -static flag for pthreads
--------------------------------------
~karen/jsoc/tt/tc.c

Number of different pids shown from my test program with -static:

lws       2
  Linux lws.Stanford.EDU 2.4.21-sgi305rp05060914_10173 #1 SMP Thu Jun 9 14:16:44 PDT 2005 ia64 ia64 ia64 GNU/Linux
  NPTL 0.60

kehcheng  1
  Linux kehcheng.Stanford.EDU 2.6.18-53.1.4.el5 #1 SMP Thu Nov 29 12:18:58 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
  NPTL 2.5

l5m5      2
  Linux l5-m5.Stanford.EDU 2.6.9-42.0.3.ELsmp #1 SMP Thu Oct 5 16:29:37 CDT 2006 x86_64 x86_64 x86_64 GNU/Linux
  NPTL 2.3.4

corona    2
  Linux corona.Stanford.EDU 2.6.9-42.0.3.ELsmp #1 SMP Thu Oct 5 15:04:03 CDT 2006 i686 i686 i386 GNU/Linux
  NPTL 2.3.4

Note: In DRMS, all threads must have the same pid for the socket connect version to work.

----------------------------------------------------------------------
quotes from kehcheng's email on Jul 20, 2006

Karen,

I've written a simple pthread test program ~kehcheng/test.c and tried it on various linux platforms. The number of processes created by the program (as shown by the "ps -m" command) for each platform is

                lws   kehcheng   corona
static link      4       4         1
dynamic link     3       1         1

The version of the pthread library for each platform (as returned by the "getconf GNU_LIBPTHREAD_VERSION" command) is NPTL 0.60 for lws, NPTL 2.3.4 for kehcheng, and NPTL 2.3.6 for corona.

NPTL, or Native Posix Thread Library, is the Posix conforming library that replaced the old and much maligned linux threads. Based on RHEL3, lws' NPTL is apparently too primitive to achieve the goal of one process for all threads. The FC4 based corona on the other hand behaves in exactly the way a modern platform should behave. I expect RHEL5/SLES10 and later platforms to be like that, too.

While you're debugging DRMS, if something works only when statically linked, then I'm afraid it's going to break later.
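
For reference, the "number of different pids" check from the -static tests above can be sketched roughly as follows. This is not the contents of tc.c or test.c; the names count_distinct_thread_pids, record_pid and MAX_THREADS are made up for illustration. The idea is only that each thread calls getpid() and the caller counts how many distinct values come back, including its own.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_THREADS 64   /* hypothetical upper bound for this sketch */

/* Worker: store the pid seen by this thread into its own slot. */
static void *record_pid(void *arg)
{
    pid_t *slot = arg;
    *slot = getpid();
    return NULL;
}

/* Start nthreads workers and return the number of distinct pids seen by
   the workers plus the calling thread, or -1 if a thread can't be created. */
int count_distinct_thread_pids(int nthreads)
{
    pthread_t tid[MAX_THREADS];
    pid_t pid[MAX_THREADS + 1];
    int i, j, total, distinct = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, record_pid, &pid[i]) != 0) {
            /* Wait for the workers already started before giving up. */
            for (j = 0; j < i; j++)
                pthread_join(tid[j], NULL);
            return -1;
        }
    }
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);

    /* The caller's own pid is the last entry. */
    pid[nthreads] = getpid();
    total = nthreads + 1;

    /* Count entries that do not repeat an earlier entry. */
    for (i = 0; i < total; i++) {
        int seen = 0;
        for (j = 0; j < i; j++) {
            if (pid[j] == pid[i]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

A small main() that prints count_distinct_thread_pids(4), built once with "gcc -pthread" and once with "gcc -static -pthread", would give one number per platform and link mode. A result of 1 is what the socket connect version of DRMS needs.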