Analysis of data distribution strategy of greenplus
Greenplus is an MPP database, which is composed of a master and multiple segments (a standby master can also be configured optionally). Its data will be distributed on different segments according to the set distribution strategy.
In version 6, gp provides three strategies: random distribution, replication distribution and hash distribution.
random distribution
When creating a table, use the "distributed random" clause.
This strategy will make the data randomly distributed to each segment. Even two rows of exactly the same data may be scattered to different segments. Although random distribution can make the data evenly distributed to all segments (without data skew), the data will still be redistributed according to the association key during table association analysis, so this strategy is rarely used in the production environment.
Replication distribution
When creating a table, use the "DISTRIBUTED REPLICATED" clause.
This policy will send data to all segments, that is, all segments have all the data of the table. Therefore, during table association analysis, data redistribution can be reduced, but the data will be saved to all segments, so a large amount of duplicate data will be generated. Therefore, this strategy is suitable for some small tables.
hash distribution
When rebuilding the table, use "DISTRIBUTED BY (column, [...])" Clause.
This policy requires the user to specify which columns are used as distribution keys, and the distribution keys must be a subset of the primary key. gp will calculate the hash key value according to the value of the distribution key, and then calculate the segment to which the data is allocated according to the key value. Users can specify different distribution keys for different tables in combination with their own data characteristics and the laws of data analysis in the future, so as to provide good data storage and data analysis performance.
hash process
The call stack is directly posted here, focusing on the analysis of the directDispatchCalculateHash function:
call Stack
#0 cdbhashinit (h=0x2e4e738) at cdbhash.c:161 #1 0x0000000000b05017 in directDispatchCalculateHash (plan=0x2e4dce8, targetPolicy=0x2e4e178, hashfuncs=0x2e4e6b8) at cdbmutate.c:197 #2 0x0000000000b0a989 in sri_optimize_for_result (root=0x2e4cf18, plan=0x2e4dce8, rte=0x2e4cd88, targetPolicy=0x7ffe5fca0ec0, hashExprs_p=0x7ffe5fca0ed0, hashOpfamilies_p=0x7ffe5fca0ec8) at cdbmutate.c:3560 #3 0x0000000000810d6e in adjust_modifytable_flow (root=0x2e4cf18, node=0x2e4e068, is_split_updates=0x2e4d9b8) at createplan.c:6608 #4 0x00000000008108bd in make_modifytable (root=0x2e4cf18, operation=CMD_INSERT, canSetTag=1 '\001', resultRelations=0x2e4e038, subplans=0x2e4dfe8, withCheckOptionLists=0x0, returningLists=0x0, is_split_updates=0x2e4d9b8, rowMarks=0x0, epqParam=0) at createplan.c:6471 #5 0x0000000000817e24 in subquery_planner (glob=0x2cbcf70, parse=0x2d7cd80, parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0, subroot=0x7ffe5fca11b8, config=0x2e4cee8) at planner.c:907 #6 0x0000000000816d1d in standard_planner (parse=0x2d7cd80, cursorOptions=0, boundParams=0x0) at planner.c:345 #7 0x0000000000816904 in planner (parse=0x2cbd080, cursorOptions=0, boundParams=0x0) at planner.c:200 #8 0x00000000008e8f4a in pg_plan_query (querytree=0x2cbd080, cursorOptions=0, boundParams=0x0) at postgres.c:959 #9 0x00000000008e8ffd in pg_plan_queries (querytrees=0x2d7b458, cursorOptions=0, boundParams=0x0) at postgres.c:1018 #10 0x00000000008ea3e8 in exec_simple_query ( query_string=0x2cbc0d8 "insert INTO hash values (1,'asdf','fdsa','qwer');") at postgres.c:1748 #11 0x00000000008ef189 in PostgresMain (argc=1, argv=0x2c9bc10, dbname=0x2c9bac0 "postgres", username=0x2c9baa8 "gpadmin") at postgres.c:5242 #12 0x000000000086db12 in BackendRun (port=0x2cc5830) at postmaster.c:4811 #13 0x000000000086d1da in BackendStartup (port=0x2cc5830) at postmaster.c:4468 #14 0x0000000000869424 in ServerLoop () at postmaster.c:1948 #15 0x00000000008689c3 in PostmasterMain (argc=6, argv=0x2c99c20) at postmaster.c:1518 #16 0x0000000000774e33 in main (argc=6, argv=0x2c99c20) at main.c:245
directDispatchCalculateHash
Only the key codes and comments of the directDispatchCalculateHash function are posted here:
static void directDispatchCalculateHash(Plan *plan, GpPolicy *targetPolicy, Oid *hashfuncs) { // ..... The above code is omitted // Create a cdbHash environment for the currently inserted data session // It mainly includes: // 1. Number of segment s of the current gp // 2. reduce function from hash key value to segment // 3. The distribution key of the table and the function for calculating hash key corresponding to the distribution key type h = makeCdbHash(targetPolicy->numsegments, targetPolicy->nattrs, hashfuncs); // Initialize cdbHash, mainly to initialize the hashkey value cdbhashinit(h); // Traverse all distribution keys // nattrs is the number of distribution keys for (i = 0; i < targetPolicy->nattrs; i++) { // Calculate hash key value cdbhash(h, i + 1, values[i], nulls[i]); } // According to the hash key calculated above, // Then calculate which segment the data should be mapped to hashcode = cdbhashreduce(h); // ...... The following code is omitted }
cdbhash
void cdbhash(CdbHash *h, int attno, Datum datum, bool isnull) { uint32 hashkey = h->hash; // ...... Omit some non critical code /* rotate hashkey left 1 bit at each step */ hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0); if (!isnull) { FunctionCallInfoData fcinfo; uint32 hkey; InitFunctionCallInfoData(fcinfo, &h->hashfuncs[attno - 1], 1, InvalidOid, NULL, NULL); fcinfo.arg[0] = datum; fcinfo.argnull[0] = false; hkey = DatumGetUInt32(FunctionCallInvoke(&fcinfo)); /* Check for null result, since caller is clearly not expecting one */ if (fcinfo.isnull) elog(ERROR, "function %u returned NULL", fcinfo.flinfo->fn_oid); hashkey ^= hkey; } // ...... Omit some non critical code h->hash = hashkey; }
analysis:
1. InitFunctionCallInfoData this macro expands to:
#define InitFunctionCallInfoData(Fcinfo, Flinfo, Nargs, Collation, Context, Resultinfo) \ do { \ (Fcinfo).flinfo = (Flinfo); \ (Fcinfo).context = (Context); \ (Fcinfo).resultinfo = (Resultinfo); \ (Fcinfo).fncollation = (Collation); \ (Fcinfo).isnull = false; \ (Fcinfo).nargs = (Nargs); \ } while (0)
This is mainly used to initialize the fcinfo structure. The fcinfo type is FunctionCallInfoData, which is defined as: typedef Datum (*PGFunction) (FunctionCallInfo fcinfo);.
FunctionCallInfoData is a general input parameter structure used to pass callback functions,
Of which:
a. The flinfo field is a structure with the type FmgrInfo. The most important structure is fn_addr field, which stores the address of the hash callback function actually called later.
b. The nargs field indicates the number of input parameters of the callback function, which is fixed to 1 here, indicating that the number of input parameters of all hash functions is only 1.
2. The arg field in FunctionCallInfoData indicates the callback function input parameter list. Only the datum assignment is used here. As can be seen from the outer function, this value is the value of the current column.
Therefore, it can be determined from here that the input parameters of the hash callback function used by the distribution key are transmitted through the encapsulated FunctionCallInfoData structure, and there is only one input parameter of the hash function used in the end, which is the value of the distribution key.
3. The expanded FunctionCallInvoke is (* (fcinfo) - > flinfo - > FN_ Addr) (fcinfo)), that is, the hash callback function is really called here, and the previously assigned fcinfo is used as the parameter.
4. Finally, the return value of the hash callback function is forcibly converted to uint32 type. After XOR operation with the previously calculated hash key, it is saved as the last hash key in the hash in the current cdbHash environment, that is, the final assignment: H - > hash = hashkey.
summary
In the outer layer, first create a hash environment for the current session, then traverse each distribution key to make a hash calculation, and make a reduce according to the final hash key value to calculate the segment id.
In the inner layer, first initialize the input parameters of the general callback function, then call the callback function, and perform an XOR operation with the previous hash key value to obtain the current hash key.
Analysis of hash callback function
smallint / int / bigint type
smallint type, the corresponding hash function is hashint2,
int type, the corresponding hash function is hashint4,
bigint type. The corresponding hash function is hashint8,
The specific implementation is as follows:
#define PG_GETARG_DATUM(n) (fcinfo->arg[n]) #define PG_GETARG_INT16(n) DatumGetInt16(PG_GETARG_DATUM(n)) #define PG_GETARG_INT32(n) DatumGetInt32(PG_GETARG_DATUM(n)) #define PG_GETARG_INT64(n) DatumGetInt64(PG_GETARG_DATUM(n)) Datum hashint2(PG_FUNCTION_ARGS) { return hash_uint32((int32) PG_GETARG_INT16(0)); } Datum hashint4(PG_FUNCTION_ARGS) { return hash_uint32(PG_GETARG_INT32(0)); } Datum hashint8(PG_FUNCTION_ARGS) { /* * The idea here is to produce a hash value compatible with the values * produced by hashint4 and hashint2 for logically equal inputs; this is * necessary to support cross-type hash joins across these input types. * Since all three types are signed, we can xor the high half of the int8 * value if the sign is positive, or the complement of the high half when * the sign is negative. */ int64 val = PG_GETARG_INT64(0); uint32 lohalf = (uint32) val; uint32 hihalf = (uint32) (val >> 32); lohalf ^= (val >= 0) ? hihalf : ~hihalf; return hash_uint32(lohalf); }
After expanding the macro, you can observe that smallint, int and bigint are actually hash functions called at the bottom_ The only difference is uinthash_ Input parameter of uint32.
When the type is smallint or int, the input parameter is itself. When the type is bigint, the length of the type is 8 bytes, so it needs to be handled: when the value to be hashed is greater than or equal to 0, the XOR value of the upper 4 bytes and the 4th byte is used for hashing; When the value to be hashed is less than 0, the reverse number of the high 4 bytes is used to hash with the low 4 byte XOR value.
char / varchar / text type
char type. The corresponding hash function is hashbpchar,
text / varchar type, corresponding hash function: hashtext,
The specific implementation is as follows:
typedef struct varlena text; #define PG_GETARG_DATUM(n) (fcinfo->arg[n]) #define PG_DETOAST_DATUM_PACKED(datum) \ pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum)) #define DatumGetTextPP(X) ((text *) PG_DETOAST_DATUM_PACKED(X)) #define PG_GETARG_TEXT_PP(n) DatumGetTextPP(PG_GETARG_DATUM(n)) Datum hashtext(PG_FUNCTION_ARGS) { text *key = PG_GETARG_TEXT_PP(0); Datum result; /* * Note: this is currently identical in behavior to hashvarlena, but keep * it as a separate function in case we someday want to do something * different in non-C locales. (See also hashbpchar, if so.) */ result = hash_any((unsigned char *) VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key)); /* Avoid leaking memory for toasted inputs */ PG_FREE_IF_COPY(key, 0); return result; } typedef struct varlena BpChar; #define PG_GETARG_DATUM(n) (fcinfo->arg[n]) #define PG_DETOAST_DATUM_PACKED(datum) \ pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum) #define DatumGetBpCharPP(X) ((BpChar *) PG_DETOAST_DATUM_PACKED(X)) #define PG_GETARG_BPCHAR_PP(n) DatumGetBpCharPP(PG_GETARG_DATUM(n)) Datum hashbpchar(PG_FUNCTION_ARGS) { BpChar *key = PG_GETARG_BPCHAR_PP(0); char *keydata; int keylen; Datum result; keydata = VARDATA_ANY(key); keylen = bcTruelen(key); result = hash_any((unsigned char *) keydata, keylen); /* Avoid leaking memory for toasted inputs */ PG_FREE_IF_COPY(key, 0); return result; }
After expanding the above macro and comparing these three types of hash functions, it is not difficult to find that the bottom of their hash functions are the same, all through hash_any function is used for calculation, and the input parameters are their own string value and string length.
Attachment: hash functions corresponding to all types
type | alias | function |
---|---|---|
smallint | int2 | hashint2 |
integer | "int | int4" |
bigint | int8 | hashint8 |
bit | bithash | |
bit varying | varbit | bithash |
boolean | bool | hashchar |
bytea | hashvarlena | |
character[(n)] | char[(n)] | hashbpchar |
character varying[(n)] | varchar[(n)] | hashtext |
text | hashtext | |
cidr | hashinet | |
date | hashint4 | |
inet | hashinet | |
interval | interval_hash | |
jsonb | jsonb_hash | |
macaddr | hashmacaddr | |
numeric | decimal | hash_numeric |
time[without time zone] | float4 | time_hash |