02/21/08 -------- Summary - Performance of GROUP BY and ORDER BY - Development timeline for DRMS prime key Performance of GROUP BY and ORDER BY ------------------------------------ I tracked the history of prime key related development (see the next topic) to help myself to define testing targets for GROUP/ORDER BY. Q: Can we improve GROUP/ORDER BY performance by composite index or any other index? This is the ultimate question for this performance test. I can't answer that yet. Q: Where is GROUP BY used? A: We use GROUP BY at the end of a query to filter out the latest version. Besides this "outermost" GROUP BY, there is also "inner" GROUP BY to avoid selecting among earlier version records (query anomaly). Q: What affects GROUP BY performance most? A: The number of records to be grouped by. For highly selective queries where we expect only a handful of returns, GROUP/ORDER BY performance does not matter. It only matters if we return thousands of records. For example, hmi_ground.hk_0017_0015 has about half million records in it, a naive implementation of GROUP BY results in more than one minute query time. This can be reduced to about 20 seconds. Still pretty bad from a DB query point of view. If we only optimize the most frequent queries, i.e., highly selective ones, we don't have a problem. ---------------------------------------------------------------------- Here are some partial test results. Four sets of data from Rick's simulation: /tmp06/rick/tilefile* No. of records table (k1) (k2) (k3) (k1,k2,k3) 01CR 197,856 21MB 7872kB 4360kB 4360kB 11MB 04CR 787,523 82MB 44MB 16CR 3,153,839 329MB 178MB 67CR 13,207,410 1376MB 744MB Same table layout (same keyword names) for all four datasets. recnum | bigint | not null default nextval('mpk_01cr_recnum_seq'::regclass) k1 | text | k2 | double precision | k3 | double precision | k4 | double precision | k5 | double precision | k6 | double precision | I used 'text' for k1 which is a time string. I could have stored it as double. Notation: (k1): index on k1 (k2): index on k2 (k3): index on k3 (k1, k2, k3): composite index on (k1, k2, k3) Q: Does "ORDER BY" use index (k1, k2, k3)? Tried on all four data sets. It depends on server configuration. I only observed that one set of server conf params causes it to use index while another does not. I can't yet explain why. When index is used, the performance gain is more than a factor of 10. I haven't been able to tweak server configuration to make GROUP BY use index. Shall keep trying. Development timeline for DRMS prime key --------------------------------------- DRMS prime keys are used to identify the latest version of a record (in other words, records with the same prime key values are considered different version of the same record). 0. Originally implemented with C function db_maxbygroup() 1. Oct 2006. I changed it to SQL "GROUP BY". Also added "ORDER BY" at the same time 2. Apr 2007. Query anomaly discovered by Phil. Such query anomaly occurs when querying on both prime key and non-prime key. The correct query results is to apply query condition on the latest version of the records. Note that the original implementation with C function db_maxbygroup() can not fulfill such query. A straightforward implementation is to apply "GROUP BY" on the entire series then narrow it down with "WHERE" clause. This was done on April 2007. The performance of "GROUP BY" on a series with a few hundred thousand records is on the order of tens of seconds (was a hundred before optimization). A few things I used to optimize things: a. select * vs select column names b. changed server conf to change query plan from "Sort" to "HashAggregate" c. create views 3. Jun 2007. I created a query formulation for combined prime + non-prime key queries. It was further refined in July 2007. The idea is to avoid running "GROUP BY" on the whole series, instead apply "WHERE" clause and then filter out records that are not the latest version. 4. Sep-Oct 2007. Query performance tests on large tables (100M records) The emphasis was on highly selective queries, i.e., each query returns 1 or 2 records, GROUP/ORDER BY is therefore not an issue. We learned a few things about composite index. For details, see sun.stanford.edu/~karen/jsoc092707 and jsoc100407. The bottomline is that composite index on (k1, k2) is no good for a query that's on k2 only. 5. Jan 2008. Rick wanted more than five prime keys. Five is currently the limit. This is due to the fact that DRMS prime key is implemented as DB composite index. Due to the overhead of composite index, I don't recommend creating a composite index with more than five keys. We decided to have a separate line "DB index" in jsd to specify additional DB indexed keys. The question is whether to keep the composite index.