Lab 3: Query Optimization
Exercise 1: Selectivity Estimation Based on Histograms
If you have just read this chapter you may be a little confused: why does it talk so much about statistics? Because the essence of query optimization is knowing, before executing a statement, which plan will run faster. Statistics let us estimate the cost of each alternative so that we can choose a better execution order.
Before doing Exercise 1, I strongly recommend carefully reading the Optimizer outline in the learning documents provided for this chapter, especially the Filter Selectivity part of section 2.2, which explains histogram-based selectivity estimation (if it is still unclear, keep reading; the principle behind the code is described in detail below). The estimate assumes that the values inside each bucket are evenly distributed (an idealization; real data is rarely uniform). Once you understand the principle, you can start writing code.
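Concretely, the estimate works like this (a sketch that mirrors the code below, not the writeup's exact notation). For a predicate f < const, let b be the bucket containing const, with height h_b (number of values in it), width w_b and left edge b_left, and let ntups be the total number of values in the histogram. Then:

selectivity(f < const) ≈ ( sum of heights of all buckets entirely left of const + h_b * (const - b_left) / w_b ) / ntups

Every other operator (<=, >, >=, =, !=) can be reduced to this case, which is exactly what estimateSelectivity below does.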
-
IntHistogram.java implements histogram-based selectivity estimation for integer values, so the constructor needs to store the minimum value, the maximum value (used to compute how wide each bucket is), and the number of buckets.
public IntHistogram(int buckets, int min, int max) {
    // some code goes here
    this.buckets = new int[buckets];
    this.min = min;
    this.max = max;
    this.width = (1. + max - min) / this.buckets.length;
}
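The walkthrough below uses two small helpers, getIndex and addValue, which are not shown explicitly. Here is a minimal sketch of what they might look like, assuming the fields set up in the constructor above plus an ntups counter (an int field, starting at 0) for the total number of values added; the names are mine, not necessarily those of the official skeleton:

private int getIndex(int v) {
    // Map a value to the index of the bucket that contains it.
    if (v < min || v > max) {
        throw new IllegalArgumentException("value " + v + " is out of range");
    }
    return (int) ((v - min) / width);
}

public void addValue(int v) {
    // Record one occurrence of v: bump its bucket and the total count.
    if (v >= min && v <= max) {
        buckets[getIndex(v)]++;
        ntups++;
    }
}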
Then implement the simple helpers getIndex and addValue (sketched above). Finally, implement the hardest method in this class, estimateSelectivity, which returns a selectivity based on the operator and the constant passed in. Take LESS_THAN and the constant 8 as an example: to estimate the fraction of values smaller than 8, count how many values fall below it. If the bucket width is 3, the first two buckets lie entirely below 8, which gives the following for loop.
for (int i = 0; i < index; ++i) {
    cnt += buckets[i];
}
cnt now holds the number of values in the buckets entirely below 8. Then about 1/3 of the third bucket is also below 8 (assuming min is 1, the third bucket covers 7, 8 and 9, and only the 7s are below 8). How many 7s are there? Since values are assumed to be evenly distributed within a bucket and the bucket holds buckets[index] values, (1/3) * buckets[index] of them are 7s. Add that to cnt and return cnt / ntups as the selectivity. In the code below, buckets[index] / width * (v - index * width - min) is exactly this (1/3) * buckets[index]; the term (v - index * width - min) measures how far v lies past the left edge of its own bucket.
public double estimateSelectivity(Predicate.Op op, int v) {
    // some code goes here
    if (op.equals(Predicate.Op.LESS_THAN)) {
        if (v <= min) return 0.0;
        if (v >= max) return 1.0;
        final int index = getIndex(v);
        double cnt = 0;
        for (int i = 0; i < index; ++i) {
            cnt += buckets[i];
        }
        cnt += buckets[index] / width * (v - index * width - min);
        return cnt / ntups;
    }
    if (op.equals(Predicate.Op.LESS_THAN_OR_EQ)) {
        return estimateSelectivity(Predicate.Op.LESS_THAN, v + 1);
    }
    if (op.equals(Predicate.Op.GREATER_THAN)) {
        return 1 - estimateSelectivity(Predicate.Op.LESS_THAN_OR_EQ, v);
    }
    if (op.equals(Predicate.Op.GREATER_THAN_OR_EQ)) {
        return estimateSelectivity(Predicate.Op.GREATER_THAN, v - 1);
    }
    if (op.equals(Predicate.Op.EQUALS)) {
        return estimateSelectivity(Predicate.Op.LESS_THAN_OR_EQ, v)
                - estimateSelectivity(Predicate.Op.LESS_THAN, v);
    }
    if (op.equals(Predicate.Op.NOT_EQUALS)) {
        return 1 - estimateSelectivity(Predicate.Op.EQUALS, v);
    }
    return 0.0;
}
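As a quick sanity check, here is a tiny usage sketch matching the example above: 3 buckets over the range 1..9, so each bucket has width 3.

IntHistogram h = new IntHistogram(3, 1, 9);
for (int v = 1; v <= 9; v++) {
    h.addValue(v);
}
// Buckets [1,4) and [4,7) lie entirely below 8, plus about 1/3 of [7,10),
// so the estimate is (3 + 3 + 1) / 9 ≈ 0.78, which matches the true fraction 7/9.
double sel = h.estimateSelectivity(Predicate.Op.LESS_THAN, 8);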
Exercise 2: TableStats Implementation
In the previous exercise we implemented histogram-based selectivity estimation. In this exercise we build, for every table in the Database, a histogram over each of its fields, and use them to estimate the cost of a query.
-
In TableStats.java, the constructor stores the I/O cost of accessing a page, the DbFile to be accessed, the table id, the number of fields in the table, the number of tuples in the table, the number of pages in the DbFile, and a histogram for each field. The constructor first scans the table to find the minimum and maximum value of each integer field and to count the tuples; with the min and max known, it constructs a histogram for each field and then adds every field value of every tuple to the corresponding histogram. With that, the TableStats constructor is done.
public TableStats(int tableid, int ioCostPerPage) {
    // For this function, you'll have to get the
    // DbFile for the table in question,
    // then scan through its tuples and calculate
    // the values that you need.
    // You should try to do this reasonably efficiently, but you don't
    // necessarily have to (for example) do everything
    // in a single scan of the table.
    // some code goes here
    numTuples = 0;
    this.tableid = tableid;
    this.ioCostPerPage = ioCostPerPage;
    intHistogramHashMap = new HashMap<Integer, IntHistogram>();
    stringHistogramHashMap = new HashMap<Integer, StringHistogram>();
    dbFile = Database.getCatalog().getDatabaseFile(tableid);
    numPages = ((HeapFile) dbFile).numPages();
    TupleDesc td = dbFile.getTupleDesc();
    numFields = td.numFields();
    Type types[] = getTypes(td);
    int[] mins = new int[numFields];
    int[] maxs = new int[numFields];
    TransactionId tid = new TransactionId();
    SeqScan scan = new SeqScan(tid, tableid, "");
    try {
        scan.open();
        // first pass(es): find the min and max of every integer field
        for (int i = 0; i < numFields; i++) {
            if (types[i] == Type.STRING_TYPE) {
                continue;
            }
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            while (scan.hasNext()) {
                if (i == 0) numTuples++;
                Tuple tuple = scan.next();
                IntField field = (IntField) tuple.getField(i);
                int val = field.getValue();
                if (val > max) max = val;
                if (val < min) min = val;
            }
            scan.rewind();
            mins[i] = min;
            maxs[i] = max;
        }
        scan.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    // build one histogram per field, using the min/max found above
    for (int i = 0; i < numFields; i++) {
        Type type = types[i];
        if (type == Type.INT_TYPE) {
            IntHistogram intHistogram = new IntHistogram(NUM_HIST_BINS, mins[i], maxs[i]);
            intHistogramHashMap.put(i, intHistogram);
        } else {
            StringHistogram stringHistogram = new StringHistogram(NUM_HIST_BINS);
            stringHistogramHashMap.put(i, stringHistogram);
        }
    }
    // second scan: add every field value of every tuple to its histogram
    addValueToHist();
}

private Type[] getTypes(TupleDesc td) {
    int numFields = td.numFields();
    Type[] types = new Type[numFields];
    for (int i = 0; i < numFields; ++i) {
        Type t = td.getFieldType(i);
        types[i] = t;
    }
    return types;
}

private void addValueToHist() {
    TransactionId tid = new TransactionId();
    SeqScan scan = new SeqScan(tid, tableid, "");
    try {
        scan.open();
        while (scan.hasNext()) {
            Tuple tuple = scan.next();
            for (int i = 0; i < numFields; ++i) {
                Field field = tuple.getField(i);
                if (field.getType() == Type.INT_TYPE) {
                    int val = ((IntField) field).getValue();
                    intHistogramHashMap.get(i).addValue(val);
                } else {
                    String val = ((StringField) field).getValue();
                    stringHistogramHashMap.get(i).addValue(val);
                }
            }
        }
        scan.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The rest of the class is a handful of small accessor methods: the estimated I/O cost of scanning the whole table, the estimated number of tuples after applying a selectivity factor, the average selectivity of an operator on a field when the constant is not known, the selectivity of a predicate with a known constant (shown below), and the total number of tuples.
public double estimateSelectivity(int field, Predicate.Op op, Field constant) {
    // some code goes here
    double selectivity;
    if (constant.getType() == Type.INT_TYPE) {
        IntField intField = (IntField) constant;
        selectivity = intHistogramHashMap.get(field).estimateSelectivity(op, intField.getValue());
    } else {
        StringField stringField = (StringField) constant;
        selectivity = stringHistogramHashMap.get(field).estimateSelectivity(op, stringField.getValue());
    }
    return selectivity;
}
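The remaining accessors are essentially one-liners. Here is a minimal sketch of how they might look, assuming the fields (numPages, ioCostPerPage, numTuples) stored by the constructor above; check your skeleton for the exact signatures:

// Estimated I/O cost of a full sequential scan: ioCostPerPage per page.
public double estimateScanCost() {
    return numPages * ioCostPerPage;
}

// Estimated number of tuples left after applying a predicate with the given selectivity.
public int estimateTableCardinality(double selectivityFactor) {
    return (int) (numTuples * selectivityFactor);
}

// Total number of tuples in the table.
public int totalTuples() {
    return numTuples;
}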
Exercise 3: Join Cardinality Implementation
Having finished the cost estimates for scans and selections, we now need to estimate the cost of joins (the JOIN keyword in SQL).
-
The JoinOptimizer.java class contains all the methods needed to order joins and to compute join costs. In this exercise you will write the methods that estimate the cost and cardinality of a join, specifically:
Implement estimateJoinCost(LogicalJoinNode j, int card1, int card2, double cost1, double cost2). This method estimates the cost of join j, given that the left input has cardinality card1, the right input has cardinality card2, scanning the left input costs cost1, and accessing the right input costs cost2. You can assume the join is a nested-loops join and apply the formula from the writeup (discussed below).
Implement estimateJoinCardinality(LogicalJoinNode j, int card1, int card2, boolean t1pkey, boolean t2pkey). This method estimates the number of tuples output by join j, given that the left input has cardinality card1, the right input has cardinality card2, and the flags t1pkey and t2pkey indicate whether the left and right join fields (respectively) are unique (primary keys).
-
Let's start with estimateJoinCost. If you have forgotten how to compute a join cost, the formula is given in section 2.2.2 of the official Lab 3 writeup; you only need to map the parameters passed to estimateJoinCost onto the terms of that formula:

joincost(t1 join t2) = scancost(t1) + ntups(t1) x scancost(t2) + ntups(t1) x ntups(t2)

For example, with scancost(t1) = 1000, scancost(t2) = 2000, ntups(t1) = 100 and ntups(t2) = 500, the estimate is 1000 + 100 x 2000 + 100 x 500 = 251000. But many of you probably want to know where this formula comes from.
It comes from the nested-loops join, which simply loops over the two tables. Its pseudocode is as follows:
for each tuple r in R do
    for each tuple s in S do
        if r and s satisfy the join condition
            then output the tuple <r, s>
The outer table R is scanned once (cost1), the inner table S is rescanned once per outer tuple (card1 * cost2), and comparing every pair of tuples adds a CPU cost of card1 * card2. The estimateJoinCost function is as follows:
public double estimateJoinCost(LogicalJoinNode j, int card1, int card2,
        double cost1, double cost2) {
    if (j instanceof LogicalSubplanJoinNode) {
        // A LogicalSubplanJoinNode represents a subquery.
        // You do not need to implement proper support for these for Lab 3.
        return card1 + cost1 + cost2;
    } else {
        // Insert your code here.
        // HINT: You may need to use the variable "j" if you implemented
        // a join algorithm that's more complicated than a basic
        // nested-loops join.
        double cost = cost1 + card1 * cost2 + card1 * card2;
        return cost;
    }
}
-
Next come estimateJoinCardinality and the helper it delegates to, estimateTableJoinCardinality. estimateJoinCardinality determines how many tuples a join operation will produce (its join cardinality); the actual calculation happens in estimateTableJoinCardinality.
public int estimateJoinCardinality(LogicalJoinNode j, int card1, int card2,
        boolean t1pkey, boolean t2pkey, Map<String, TableStats> stats) {
    if (j instanceof LogicalSubplanJoinNode) {
        // A LogicalSubplanJoinNode represents a subquery.
        // You do not need to implement proper support for these for Lab 3.
        return card1;
    } else {
        return estimateTableJoinCardinality(j.p, j.t1Alias, j.t2Alias,
                j.f1PureName, j.f2PureName, card1, card2, t1pkey, t2pkey,
                stats, p.getTableAliasToIdMapping());
    }
}
-
Next we analyze estimateTableJoinCardinality. Section 2.2.4 of the official Lab 3 writeup already explains quite clearly how to write this function, but let's walk through it anyway.
First, be clear about what this function returns: the estimated number of tuples produced if the join succeeds. There are several cases.
-
When the predicate's Op is EQUALS:
- When the join field of Table 1 is a primary key, card2 is returned.
- When the join field of Table 1 is not a primary key but the field of Table 2 is, card1 is returned.
- When neither field is a primary key, the larger of card1 and card2 is returned.
-
When the predicate's Op is not EQUALS (a range comparison):
- The official writeup suggests using a fixed fraction of the cross product; here we return 0.3 * card1 * card2. The 30% factor is just a heuristic.
-
Finally, if the computed card comes out less than or equal to 0, return 1 instead.
If you understand these situations, you can write the estimateTableJoinCardinality function.
public static int estimateTableJoinCardinality(Predicate.Op joinOp,
        String table1Alias, String table2Alias, String field1PureName,
        String field2PureName, int card1, int card2, boolean t1pkey,
        boolean t2pkey, Map<String, TableStats> stats,
        Map<String, Integer> tableAliasToId) {
    int card = 1;
    // some code goes here
    if (joinOp == Predicate.Op.EQUALS) {
        if (t1pkey) {
            card = card2;
        } else if (t2pkey) {
            card = card1;
        } else {
            card = card1 > card2 ? card1 : card2;
        }
    } else {
        double temp = 0.3 * card1 * card2;
        card = (int) temp;
    }
    return card <= 0 ? 1 : card;
}
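A quick sanity check of the three cases with made-up numbers (the table and field names, and the two maps, are not used by the implementation above, so placeholders are fine):

// t1's join field is a primary key: each of the 500 right tuples matches at most one left tuple.
int a = estimateTableJoinCardinality(Predicate.Op.EQUALS, "t1", "t2", "f1", "f2",
        100, 500, true, false, null, null);   // -> 500
// Neither side is a key: fall back to the larger input.
int b = estimateTableJoinCardinality(Predicate.Op.EQUALS, "t1", "t2", "f1", "f2",
        100, 500, false, false, null, null);  // -> 500
// Non-equality predicate: a fixed 30% of the cross product.
int c = estimateTableJoinCardinality(Predicate.Op.GREATER_THAN, "t1", "t2", "f1", "f2",
        100, 500, false, false, null, null);  // -> 0.3 * 100 * 500 = 15000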
-
Exercise 4: orderJoins Implementation
Although Exercise 4 asks for only one function, it relies on several existing helpers that you need to understand before you can write orderJoins completely.
-
orderJoins returns a List<LogicalJoinNode>: the join order retrieved, at the end, through the getOrder method of the PlanCache class. PlanCache is a helper class that stores the best plan found so far for each subset of joins.
The loop count is determined by the number of joins (joins.size()) in JoinOptimizer: for each subset size, enumerate all subsets of the joins with enumerateSubsets, and for every LogicalJoinNode in a subset let computeCostAndCardOfSubplan compute the cost of joining it last. Keep the cheapest plan and save it in the PlanCache.
public List<LogicalJoinNode> orderJoins(
        Map<String, TableStats> stats,
        Map<String, Double> filterSelectivities, boolean explain)
        throws ParsingException {
    // Not necessary for labs 1--3
    // See the Lab 3 writeup for some hints as to how this function
    // should work.
    // some code goes here
    // Replace the following
    int numJoinNodes = joins.size();
    PlanCache pc = new PlanCache();
    Set<LogicalJoinNode> wholeSet = null;
    for (int i = 1; i <= numJoinNodes; i++) {
        Set<Set<LogicalJoinNode>> setOfSubset = this.enumerateSubsets(this.joins, i);
        for (Set<LogicalJoinNode> s : setOfSubset) {
            if (s.size() == numJoinNodes) wholeSet = s;
            Double bestCostSofar = Double.MAX_VALUE;
            CostCard bestPlan = new CostCard(); // This class is used to save plan, cost and card
            for (LogicalJoinNode toRemove : s) {
                CostCard plan = computeCostAndCardOfSubplan(stats, filterSelectivities,
                        toRemove, s, bestCostSofar, pc);
                if (plan != null) {
                    bestCostSofar = plan.cost;
                    bestPlan = plan;
                }
            }
            if (bestPlan.plan != null) {
                pc.addPlan(s, bestPlan.cost, bestPlan.card, bestPlan.plan);
            }
        }
    }
    return pc.getOrder(wholeSet);
}