written 2.6 years ago by |
Data Allocation
- Deciding where to locate data
- Allocation strategies:
- Centralized data allocation
- Entire database is stored at one site
- Partitioned data allocation
- Database is divided into several disjointed parts (fragments) and stored at several sites
- Centralized data allocation
- Replicated data allocation
- Copies of one or more database fragments are stored at several sites
- Data distribution over a computer network is achieved through data partition, data replication, oracombination of both
Fragment Allocation/1
Fragment allocation problem
Given are:
- fragmentsF=(F1, F2,.. Fl
- network sitesS(S1, S2,. Sm)
- and applicationsQ=(q.2..a)
Find: the "optimal" distribution ofFtoS
Optimality
- Minimal cost
- Communication+storage+processing (read and update)
- Cost in terms of time (usually)
- Performance
- Response time and/or throughput
- Constraints
- Per site constraints (storage and processing)
- Response time and/or throughput
Fragment Allocation/2
- Required information
- Database Information
- selectivity of fragments
- size ofafragment
- Database Information
Application Information
- RR: number of read accesses ofaquery q, toafragmentF
- UR: number of update accesses of query q, toafragment F,
- uj:amatrix indicating which queries updates which fragments,
- rj:a similar matrix for retrievals
- originating site of each query
- Site Information
- USC: unit cost of storing data atasite S.
- LPC: cost of processing one unit of data atasiteS
- Network Information
- communication cost/frame between two sites
- frame size
Fragment Allocation/3
We discuss an allocation model which attempts to
minimize the total cost of processing and storage
meet certain response time restrictions
General Form: $\min$ (Total Cost)
subject to
response time constraint
storage constraint
processing constraint
Decision variable $x_{\bar{j}}$ $$ x_{i j}= \begin{cases}1 & \text { if fragment } F_{i} \text { is stored at site } S_{j} \\ 0 & \text { otherwise }\end{cases} $$
Fragment Allocation/4
The total cost function has two components: storage and query processing. $$ T O C=\sum_{S_{i} \in S} \sum_{F_{j} \in F} S T C_{j k}+\sum_{q \in Q} Q P C_{i} $$
Storage cost of fragment $F_{j}$ at site $S_{k}$ : $$ \operatorname{STC}_{j k}=U S C_{k}=\operatorname{size}\left(F_{j}\right)=x_{i j} $$ where $U S C_{k}$ is the unit storage cost at site $k$ - Query processing cost for a query $q_{i}$ is composed of two components: - composed of processing cost (PC) and transmission cost (TC) $Q P C_{i}=P C_{i}+T C_{i}$ **Fragment Allocation/5** - Processing cost is a sum of three components: - access cost (AC), integrity contraint cost (IE), concurency control cost (CC) $$ P C_{i}=A C_{i}+\mathbb{E}_{i}+C C_{i} $$
Access cost: $$ A C_{i}=\sum_{s \in S} \sum_{F_{i} F}\left(U R_{i j}+R R_{i j}\right)=x_{i j}=L P C_{k} $$ where $L P C_{k}$ is the unit process cost at site $k$ - Integrity and concurrency costs: - Can be similarly computed, though depends on the specific constraints **Fragment Allocation/6** - The transmission cost is composed of two components: - Cost of processing updates (TCU) and cost of processing retrievals (TCR) $$ T C_{i}=T C U_{i}+T C R_{i} $$
Cost of updates:
Inform all the sites that have replicas + a short confirmation message back $T C U_{i}=\sum_{S_{u} \in S} \sum_{F_{i} \in F} u_{i j} *$ (update message cost $+$ acknowledgment cost) - Retrieval cost: - Send retrieval request to all sites that have a copy of fragments that are needed + sending back the results from these sites to the originating **Fragment Allocation/7** - Modeling the constraints - Response time constraint for a query $q_{i}$ execution time of $q_{i} \leq \max$. allowable response time for $q_{i}$ - Storage constraints for a site $S_{k}$ $\sum_{F \in F}$ storage requirement of $F_{j}$ at $S_{k} \leq$ storage capacity of $S_{k}$
Processing constraints for a site $S_{k}$ $\sum_{q \in O}$ processing load of $q$ at site $S_{k} \leq$ processing capacity of $S_{k}$
Fragment Allocation/8
Solution Methods
- The complexity of this allocation model/problem is NP-complete
- Correspondence between the allocation problem and similar
problems in other areas
- Plant location problem in operations research
- Knapsack problem
- Network flow problem
- Plant location problem in operations research
Hence, solutions from these areas can be re-used
Use different heuristics to reduce the search space
- Assume that all candidate partitionings have been determined together with their associated costs and benefits in terms of query processing.
- The problem is then reduced to find the optimal partitioning and placement for each relation
- Ignore replication at the first step and find an optimal non-replicated solution
- Replication is then handeled inasecond step on top of the previous non-replicated solution.