Explain sort-merge join algorithm in query processing.

153views

written 8.4 years ago by

teamques10 ★ 68k

If the records of R and S are physically sorted by value of the join attributes AandB, respectively, the join can be implemented in the most efficient way possible.
Both files are scanned concurrently in order of the join attributes, matching the records that have the same values for A and B.
One cursor is maintained per table and the cursor moves forward.
If the files are not sorted, they may be sorted first by using external sorting. In this method, pairs of file blocks are copied into memory buffers in order and the records of each file are scanned only once each for matching with the other file unless both A and B are non-key attributes, in which case the method needs to be modified slightly

Sort-Merge Join Algorithm:

R(i) refers to the ith record in file R.

S(j) refers to jth record in file S

sort the tuples in R on attribute A;  //assume R has n tuples (records) 

sort the tuples in S on attribute B;  //assume S has m tuples (records) 

seti ← 1, j ← 1; 
while (i ≤ n) and (j ≤ m) 
do { 
     if R(i)[A] > S(j)[B] 
     then set j ← j + 1 
     elseif R(i)[A] < S(j)[B] 
     then set i ← i + 1 
     else  {
      output the combined tuple <R(i), S(j)> to T;
         //R(i)[A] = S(j)[B], so we output a matched tuple *) 

(* output other tuples that match R(i), if any *) 
set I ← j + 1; 
while (l ≤ m) and (R(i)[A] = S(l)[B]) 
do
{ output the combined tuple <R(i), S(l)> to T; 
set l ← l + 1 
}
(* output other tuples that match S(j), if any *) 
set k ← i + 1; 
while (k ≤ n) and (R(k)[A] = S(j)[B]) 
do
{ output the combined tuple <R(k), S(j)> to T; 
set k ← k + 1 
} 
seti ← k, j ← l
}
}

Best Case Cost: (M+N) when already sorted.

Worst Case Cost: M log M + N log N + (M+N)

ADD COMMENT EDIT