written 5.6 years ago by | modified 4.6 years ago by |
estimate the number of distinct in this stream using Flajolet- Martin algorithm
written 5.6 years ago by | modified 4.6 years ago by |
estimate the number of distinct in this stream using Flajolet- Martin algorithm
written 5.6 years ago by | • modified 5.6 years ago |
PCY algorithm exploits the observation that there may be much unused space in main memory on the first pass of apraisy.
In first pass only a bash function is applied on pair of item so that they bashes to a bucket.
we hash each pair and add 1 to the bucket.
At the end of 1st pass each bucket has count.
If count of bucket is > = support then the bucket is frequent bucket.
we can define a candidate pair C2 to be those pair { i , j } such that
1) i and j are frequent items.
2) { i , j } hashes to a frequent bucket.
Multistage uses several successive hash tables to reduce further the number of candidate pairs.
But requires more passes than normal PCY.
Document data store are used for storing, retrieving and managing document oriented information.
Hierarchical data structures can be directly stored in document database.
Document data store uses a key structure.
Document path is used like a key to success the leaf values of a document tree structure.
- For eg. Employee [id = '2300'] / Address / street / text ( )
- eg of document store: Mongo DB, Couch DB, Couch Base.
In column based store, data is stoned into columns and these columns are logically grouped into column families.
splitting the data column size helps in faster retrieval of data
example : Big Table, H Base, Hyper Table.