05-01-08 -------- - Bug fix - Precision in DRMS query and insert - Precision in PostgreSQL pg_dump . Why study pg_dump? . Why not binary format? . Keh-Cheng's analysis . Keh-Cheng's experience with pg_dump -Fc on 8.3 . My experience with pg_dump -Fc on 8.1 Bug fix -------- 1. fixed problem with drms_keylist_memsize() It used to only allows keywords, added recnum upon Phil's request. For now, treat recnum differently from the rest of the keywords. In general, there are a couple of ways to query column names in a table SELECT a.attname FROM pg_attribute a, pg_class c, pg_namespace n WHERE c.oid = a.attrelid and c.relname='tablename' and n.nspname = 'tablenamespace' and a.attnum > 0 and NOT a.attisdropped select column_name, table_name, table_schema from information_schema.columns where table_name = 'tablename' and table_schema = 'tableschema'; 2. fixed a memory bug in drms_keylist_memsize() Precision in DRMS query and insert ---------------------------------- Checked precision for in and out params in DRMS query and insert. db_bulk_insert_array_insert() uses copy binary db_bulk_insertv() is a variant of the above query: DB_Binary_Result_t *db_query_bin(DB_Handle_t *dbin, char *query_string) in: always text format? out: binary this is what drms_record.c uses drms_retrieve_records() called by drms_open_records() DB_Binary_Result_t *db_query_bin_array(DB_Handle_t *dbin, char *query, int n_args, DB_Type_t *intype, void **argin ) DB_Binary_Result_t *db_query_binv(DB_Handle_t *dbin, char *query, ...); a variant of the above in: query string contains ?'s or params, whose values are passed in binary out: query result in either binary or text format Precision in PostgreSQL pg_dump ------------------------------- Why study pg_dump? I started looking into the precision issue in PostgreSQL, I chose pg_dump for the following reasons: - pg_dump is the primary tool to migrate between major PostgreSQL software releases - pg_dump uses COPY command in text mode, the same as Slony-I (slony1-dump.sh for example, as well as others). ---------------------------------------------------------------------- Why not binary format? pg_dump stopped supporting binary format for portability, i.e., binary data can't move across computer architecture, one might run into problems like big/little endianess. Binary is also problematic for version change, e.g., suppose the internal format for a data type changed between version. In summary, pg_dump is in text to preserve platform and version independence. http://archives.postgresql.org/pgsql-hackers/2007-12/msg00143.php ---------------------------------------------------------------------- Keh-Cheng's analysis I learned quite a few things from Keh-Cheng who is an expert on scientific computation. Here is a quote from his email on Sat Apr 26, 2008 "Having examined the actual on-disk contents of a few test tables and the corresponding pg_dump output, I believe the current version (8.3.1) of postgresql handles double precision correctly albeit very inefficiently. Unless there is some undocumented way of dumping IEEE floating-point numbers in binary mode, it is SINGLE precision values that we need to worry about. Here is why: Pg_dump reads double/single-precision IEEE numbers in the table and converts them to ASCII with a format string equivalent to "%.17g"/"%.8g". For IEEE double with a 53-bit mantissa, one unit in the last place (ulp) has a worst-case relative round-off error of 1/2 * 2^(-53) = 5.55e-17 while an ASCII string with 17 significant figures has a worst-case round-off error of 1/2 * 10^(-16) = 5e-17. Since the decimal representation has slightly denser "ticmarks" than the binary representation, a conversion from binary to decimal (during dump) followed by a conversion from decimal back to binary (during restore) will always yield the same IEEE double precision number that was originally in the table. This is OK but very inefficient. (Binary/decimal conversions typically cost thousands of CPU cycles.) Doing the same analysis for IEEE single (24-bit mantissa) and 8-significant-figure decimal numbers, one finds the latter to have HIGHER round-off error (5e-8 vs 2.98e-8). In other words, an IEEE single converted to decimal with the "%.8g" format CANNOT be expected to be converted back to the original value. If you compile and run the attached test program, you'll see that of the 1 million randomly chosen numbers between 0 and 1, approximately 0.7% of them fail the double conversion test. (As expected, all the failure cases have leading significant digits 10, 11, 12, or 13 because they have the highest relative round-off errors in the range.) If you give the decimal conversion just one more significant figure by changing the "%.8g" format to "%.9g", then all of them pass." Keh-Cheng also located two places in the src code to rectify the problem with floats. I went on and browsed two related papers on this subject: How to read floating point numbers accurately William D. Clinger How to print floating-point numbers accurately Guy L. Steele, Jr., Jon L. White Discussion on this subject from PostgreSQL mailing list http://qaix.com/postgresql-database-development/203-042-float-output-precision-questions-read.shtml ---------------------------------------------------------------------- Keh-Cheng's experience with pg_dump -Fc on 8.3 Keh-Cheng reported promising results with pg_dump -Fc. This custom option appearantly gets floats correctly in 8.3 ---------------------------------------------------------------------- My experience with pg_dump -Fc on 8.1 I did not have such good luck with 8.1. The precision problem remains. Unfortunately that's the issue we have to face, i.e., moving from a 8.1 server to a 8.3 server. I followed Keh-Cheng's advice to change the src code. I ran pg_dump and pg_restore from this recompiled src, the precision problem was not fixed. It could have something to do with I am still running 8.1 server without the src change. Shall try that next.