01/17/08

Summary
- Fixed stdout reopen problem
- Fixed a bunch of long-standing compiler warnings: getpeername(),
  connect(), etc.
- Did more testing on begin/end_transaction()
- SUMS caveats
- DRMS/SUMS garbage collection
- Started on writing DRMS code for remote SUMS
- Ran tests on -static flag for pthreads

Fixed stdout reopen problem
---------------------------

Rewrote the restore_stdeo() function: it now calls
close(STDOUT_FILENO) instead of fclose(stdout).
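
For reference, a minimal sketch of the descriptor-level approach, not the
actual restore_stdeo() code. After fclose(stdout) the stdout stream can no
longer be used portably, whereas closing only the descriptor leaves the
stream intact so it can be pointed back at the original output. The names
redirect_stdout(), restore_stdout() and saved_stdout are hypothetical.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper state: copy of the original stdout descriptor. */
static int saved_stdout = -1;

/* Send stdout to a file while remembering where it used to go. */
int redirect_stdout(const char *path)
{
    int fd;

    fflush(stdout);                       /* drain pending output first */
    saved_stdout = dup(STDOUT_FILENO);
    if (saved_stdout < 0)
        return -1;
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (dup2(fd, STDOUT_FILENO) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

/* Put stdout back. Only the descriptor is closed; the FILE stream
 * itself stays valid, which is the point of avoiding fclose(stdout). */
int restore_stdout(void)
{
    if (saved_stdout < 0)
        return -1;
    fflush(stdout);                       /* flush into the redirect target */
    close(STDOUT_FILENO);
    if (dup2(saved_stdout, STDOUT_FILENO) < 0)
        return -1;
    close(saved_stdout);
    saved_stdout = -1;
    return 0;
}
```

Calling redirect_stdout("some.log"), printing, then calling
restore_stdout() should leave printf() usable on the original output.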


More testing on begin/end_transaction()
---------------------------------------

Just to make sure I haven't broken anything by shuffling things around.

One limitation with begin/end_transaction(): they are server-side
functions. A module that uses them can only be built in the
direct-connect version, which means an irregularity in the makefiles.

Still not committed due to pending SUMS issues.

SUMS caveats
------------

I ran into a couple of problems with SUMS while testing the DRMS
transaction restart code, which is needed for the lev0 code.

Memory leak:
Valgrind, the memory checker, reports memory leaks which I can't
locate.  The case is very simple: if a program repeatedly calls
SUM_open and SUM_close with something in between, SUM_get for
example, the memory is not freed at the end of SUM_close.  So one
observes the free memory depleting.

I wrote a small program at ~karen/src/karen_test/sum_test3.c. The
problem remains with or without threads (we use threads in DRMS,
that's why I need to test it). 

The memory leak is alarming. I saw at least 8K per open/close pair.
It would render the transaction restart model unusable if we don't
address it.

Alternatively, I can avoid starting a new SUM object for every
transaction, i.e., keep using the same SUM object. This also has a
side effect: all SUs accessed by a SUM object are protected somehow,
i.e., they can't be removed from the disk until the SUM object is
closed. If we run this type of module for a long time, lev0 processing
for example, we could clog up disk storage space.

Resource limit?
Another related problem is that I can't run SUM_open/SUM_close pairs
in quick succession: I get a "__get_myaddress: socket: Too many open
files" error. In my test program, I have to add sleep(1) at the end of
each open/close pair. It seems to me some resource is being held
somewhere.
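
The error text points at file descriptors (sockets) as the resource in
question. A minimal diagnostic sketch, assuming a POSIX system:
count_open_fds() is a hypothetical helper, not part of SUMS or DRMS, that
counts the descriptors currently open in the process. Calling it before
and after each open/close pair would show whether descriptors accumulate.

```c
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/* Count descriptors currently open in this process by probing each
 * possible descriptor number with fcntl(F_GETFD). Returns -1 on error. */
int count_open_fds(void)
{
    struct rlimit rl;
    long max;
    long fd;
    int n = 0;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    /* Cap the scan so an "unlimited" or very large limit stays cheap. */
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > 65536)
        max = 65536;
    else
        max = (long)rl.rlim_cur;

    for (fd = 0; fd < max; fd++) {
        if (fcntl((int)fd, F_GETFD) != -1)
            n++;                          /* descriptor is open */
    }
    return n;
}
```

If the count grows by a fixed amount per open/close pair, descriptors are
being leaked; if it only grows when pairs run back to back, they are
probably being released late.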

DRMS/SUMS garbage collection
----------------------------

Revisit the DRMS/SUMS garbage collection issue from Jun 25, 2007.
Got feedback from Rick. Need clarification.

DRMS Records
- Permanent (permanently stored in the DRMS database)
- Transient (exist within a session, go away at the end of a session,
   deleted from the DRMS database at the end of a session)
Both are specified in drms_create_records()

SU (I am not sure about the terminology. Jim, please correct me.)
- Permanent (archive = 1 in series definition or -A option for drms_server)
- Temporary (archive = 0 in series definition and no -A  option for
   drms_server): they will be removed from the disk after retention
   time runs out.

We have four combinations of the above:
1. Permanent record, Permanent SU
2. Permanent record, Temporary SU
3. Transient record, Permanent SU
4. Transient record, Temporary SU

I think Jim is considering case 2, where upon removing the temporary
SU, the permanent records in the DRMS database are also removed. I
recall from our meetings that people suggested such segment-less
records may still be of value... I generally feel uncomfortable with
SUMS removing DRMS records, and vice versa, in such an implicit manner.

Cases 3 and 4 involve cleaning up the SUs. In addition, delete_series
also creates garbage SUs. Cleanup for this latter case and for case 3
is not yet implemented.
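
To keep the case numbering straight, here is a small sketch that maps a
record/SU lifetime pair to the case numbers used above. The enum and
function names are hypothetical and do not exist in DRMS or SUMS; the
only rule encoded is the one stated above, that SU cleanup comes up for
transient records (cases 3 and 4).

```c
/* Hypothetical labels for the two lifetimes discussed above. */
typedef enum { RECORD_PERMANENT = 0, RECORD_TRANSIENT = 1 } record_kind_t;
typedef enum { SU_PERMANENT = 0, SU_TEMPORARY = 1 } su_kind_t;

/* Return the case number (1..4) in the order listed above. */
int gc_case(record_kind_t rec, su_kind_t su)
{
    return 1 + 2 * (rec == RECORD_TRANSIENT) + (su == SU_TEMPORARY);
}

/* Cases 3 and 4: the record goes away at the end of the session,
 * so its SU needs to be cleaned up. */
int su_needs_cleanup(record_kind_t rec, su_kind_t su)
{
    return gc_case(rec, su) >= 3;
}
```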

Remote SUMS
-----------
Need a way to assign the higher bits in sunum from SUMS.
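
One possible layout, purely as an illustration: reserve the top bits of a
64-bit sunum for a site identifier and leave the rest for the locally
assigned number. The 16/48 split, the macro names and the functions below
are all assumptions, not an agreed SUMS format.

```c
#include <stdint.h>

/* Assumed split: 16 high bits identify the site, 48 low bits are the
 * number assigned by that site's SUMS. */
#define SUNUM_LOCAL_BITS 48
#define SUNUM_LOCAL_MASK ((UINT64_C(1) << SUNUM_LOCAL_BITS) - 1)

/* Combine a site id and a locally assigned number into one sunum. */
uint64_t sunum_pack(uint16_t site, uint64_t local)
{
    return ((uint64_t)site << SUNUM_LOCAL_BITS) | (local & SUNUM_LOCAL_MASK);
}

/* Extract the site id from the high bits. */
uint16_t sunum_site(uint64_t sunum)
{
    return (uint16_t)(sunum >> SUNUM_LOCAL_BITS);
}

/* Extract the locally assigned number from the low bits. */
uint64_t sunum_local(uint64_t sunum)
{
    return sunum & SUNUM_LOCAL_MASK;
}
```

With site id 0 reserved for the local SUMS, existing sunums would keep
their current values under this scheme.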

Trouble with drms_su_getsudir() and drms_su_getsudirs(). The latter may
have part of the record sets in local SUMS.

Ran tests on -static flag for pthreads
--------------------------------------

~karen/jsoc/tt/tc.c

Number of different pids shown from my test program with -static:
lws       2
Linux lws.Stanford.EDU 2.4.21-sgi305rp05060914_10173 #1 SMP Thu Jun 9
14:16:44 PDT 2005 ia64 ia64 ia64 GNU/Linux
NPTL 0.60

kehcheng  1
Linux kehcheng.Stanford.EDU 2.6.18-53.1.4.el5 #1 SMP Thu Nov 29
12:18:58 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
NPTL 2.5

l5m5      2
Linux l5-m5.Stanford.EDU 2.6.9-42.0.3.ELsmp #1 SMP Thu Oct 5 16:29:37
CDT 2006 x86_64 x86_64 x86_64 GNU/Linux 
NPTL 2.3.4

corona    2
Linux corona.Stanford.EDU 2.6.9-42.0.3.ELsmp #1 SMP Thu Oct 5 15:04:03
CDT 2006 i686 i686 i386 GNU/Linux 
NPTL 2.3.4

Note: In DRMS, all threads must have the same pid for the socket
connect version to work.
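
A minimal sketch of this kind of check (not the actual tc.c): each thread
records getpid() and the distinct values are counted. The function name
count_distinct_pids() and the MAX_THREADS limit are hypothetical. On a
POSIX-conforming NPTL all threads share one pid, so the result should
be 1; a larger value reproduces the behaviour seen above.

```c
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_THREADS 16

/* Thread body: store this thread's view of the process id. */
static void *record_pid(void *arg)
{
    *(pid_t *)arg = getpid();
    return NULL;
}

/* Start nthreads threads and return the number of distinct pids seen
 * by them and by the calling thread. Returns -1 on error. */
int count_distinct_pids(int nthreads)
{
    pthread_t tid[MAX_THREADS];
    pid_t pid[MAX_THREADS + 1];
    int i, j, started = 0, distinct = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    pid[0] = getpid();                    /* calling thread */
    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, record_pid, &pid[i + 1]) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    if (started != nthreads)
        return -1;

    /* Count values not already seen earlier in the array. */
    for (i = 0; i <= nthreads; i++) {
        int seen = 0;
        for (j = 0; j < i; j++)
            if (pid[j] == pid[i])
                seen = 1;
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

Build it with -pthread, once with and once without -static, to compare
the two link modes on a given host.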

----------------------------------------------------------------------
Quoted from kehcheng's email of Jul 20, 2006:
Karen,

I've written a simple pthread test program ~kehcheng/test.c
and tried it on various linux platforms.  The number of processes
created by the program (as shown by the "ps -m" command) for
each platform is 
                lws    kehcheng      corona
static link      4        4            1
dynamic link     3        1            1

The version of the pthread library for each platform (as returned
by the "getconf GNU_LIBPTHREAD_VERSION" command) is NPTL 0.60
for lws, NPTL 2.3.4 for kehcheng, and NPTL 2.3.6 for corona.

NPTL, or Native Posix Thread Library, is the Posix conforming
library that replaced the old and much maligned linux threads.
Based on RHEL3, lws' NPTL is apparently too primitive to achieve
the goal of one process for all threads.  The FC4 based corona
on the other hand behaves in exactly the way a modern platform 
should behave.  I expect RHEL5/SLES10 and later platforms to be 
like that, too.

While you're debugging DRMS, if something works only when statically
linked, then I'm afraid it's going to break later.