Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.
Datasets ‘whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze’ are also referred to as big data.
‘Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.’
In short, the term Big data applies to information that can’t be processed or analyzed using traditional processes or tools.
Characteristics of Big Data:
Big data can be characterized by 3Vs: the extreme volume of data, the wide variety of types of data and the velocity at which the data must be processed.
Figure: characteristics of Big Data
Volume:
Volume refers to the vast amounts of data generated every second. We are not talking terabytes but zettabytes or brontobytes (the latter is an informal term, not a standard unit prefix). If we take all the data generated in the world between the beginning of time and 2008, the same amount of data will soon be generated every minute. This makes most data sets too large to store and analyze using traditional database technology. New big data tools use distributed systems so that we can store and analyze data across databases that are dotted around anywhere in the world.
By some estimates, 90% of all data ever created was created in the past two years, and from now on the amount of data in the world is expected to double every two years. By 2020, we are projected to have 50 times the amount of data we had in 2011. The sheer volume of the data is enormous, and a very large contributor to the ever-expanding digital universe is the Internet of Things, with sensors all over the world in all kinds of devices creating data every second.
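The two growth figures above can be compared with a little arithmetic. The following is a minimal sketch that assumes simple exponential growth (the growth model and the function names are assumptions made for illustration, not something the estimates themselves state). Under that assumption, doubling every two years gives roughly a 22-fold increase over nine years, while a 50-fold increase between 2011 and 2020 would imply doubling about every 1.6 years, so the figures are best read as rough estimates rather than exact rates.

```python
import math


def growth_factor(years, doubling_period_years):
    """Growth multiple after `years` if the quantity doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


def implied_doubling_period(years, factor):
    """Doubling period (in years) needed to grow by `factor` over `years`."""
    return years / math.log2(factor)


span = 2020 - 2011  # nine years

# Growth over 2011-2020 if data doubles every two years (assumed exponential model).
factor_doubling_every_2y = growth_factor(span, 2)

# Doubling period that a 50-fold increase over the same span would imply.
period_for_50x = implied_doubling_period(span, 50)

print(f"Doubling every 2 years over {span} years: x{factor_doubling_every_2y:.1f}")
print(f"50x over {span} years implies doubling every {period_for_50x:.2f} years")
```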
Variety:
Different Types: Variety describes different formats of data that do not lend themselves to storage in structured relational database systems. These include a long list of data such as documents, emails, social media text messages, video, still images, audio, graphs, and the output from all types of machine-generated data from sensors, devices, RFID tags, machine logs, cell phone GPS signals, DNA analysis devices, and more. This type of data is characterized as unstructured or semi-structured and has existed all along. In fact, some studies estimate that it accounts for 90% or more of the data in organizations.
Different Sources: Variety is also used to mean data from many different sources, both inside and outside of the company. What has changed is the realization that, through analysis, this data can yield new and valuable insights not previously available.
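To make the point about formats concrete, here is a small sketch of why semi-structured records fit poorly into a single fixed relational table. The three JSON records, their field names, and the example URL are all invented for illustration. Forcing them into one table with a column per field leaves most cells empty, and nested values such as the location do not map onto a simple column at all.

```python
import json

# Hypothetical semi-structured records from three different sources.
raw_records = [
    '{"user": "a", "text": "hello", "geo": {"lat": 19.07, "lon": 72.87}}',
    '{"user": "b", "image_url": "http://example.com/1.jpg"}',
    '{"sensor": "t-17", "reading": 21.4}',
]

records = [json.loads(line) for line in raw_records]

# A fixed schema would need one column for every field seen in any record.
all_fields = sorted({key for record in records for key in record})

# Build the "wide table"; fields a record lacks become None (like SQL NULL).
table = [[record.get(field) for field in all_fields] for record in records]

missing_cells = sum(cell is None for row in table for cell in row)
total_cells = len(table) * len(all_fields)

print("columns:", all_fields)
print(f"{missing_cells} of {total_cells} cells are empty")
```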
Velocity:
Data-In-Motion: Data scientists like to talk about data-at-rest and data-in-motion. One meaning of Velocity is to describe data-in-motion, for example, the stream of readings taken from a sensor or the web log history of page visits and clicks by each visitor to a web site. This can be thought of as a fire hose of incoming data that needs to be captured, stored, and analyzed. Consistency and completeness of fast-moving streams of data are one concern. Matching them to specific outcome events, a challenge raised under Variety, is another. Velocity also incorporates the characteristics of timeliness or latency: is the data being captured at a rate or with a lag time that makes it useful?
Lifetime of Data Utility: A second dimension of Velocity is how long the data will be valuable. Is it permanently valuable, or does it rapidly age and lose its meaning and importance? Understanding this dimension of Velocity in the data you choose to store will be important in discarding data that is no longer meaningful and may in fact mislead.
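Both dimensions of Velocity described above, latency and lifetime of utility, can be expressed as simple time comparisons. The sketch below is only an illustration: the `Reading` class, the sensor names, the timestamps, and the ten-minute freshness threshold are all assumptions, and a real system would choose its threshold according to how quickly its own data ages.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Reading:
    sensor_id: str
    value: float
    event_time: datetime    # when the measured event happened
    arrival_time: datetime  # when the reading was captured by our system


def latency(reading):
    """Lag between the event and its capture."""
    return reading.arrival_time - reading.event_time


def still_useful(reading, now, max_age):
    """True if the event is recent enough to be worth keeping."""
    return now - reading.event_time <= max_age


# Fixed "current time" so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

readings = [
    Reading("s1", 20.5, now - timedelta(seconds=5), now - timedelta(seconds=3)),
    Reading("s2", 21.0, now - timedelta(hours=2), now - timedelta(hours=1)),
]

# Keep only readings whose events are at most ten minutes old (assumed threshold).
fresh = [r for r in readings if still_useful(r, now, timedelta(minutes=10))]

for r in readings:
    print(r.sensor_id, "latency:", latency(r))
print("fresh sensors:", [r.sensor_id for r in fresh])
```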
Value:
Although Value is frequently shown as the fourth leg of the Big Data stool, Value does not differentiate Big Data from not-so-big data. It is equally true of both big and little data that if we are making the effort to store and analyze it, then it must be perceived to have value.
Big Data, however, is perceived as having incremental value to the organization, and many users report having found actionable relationships in Big Data stores that they could not find in small stores. Certainly, if in the past we were storing data about groups of customers and are now storing data about each customer individually, then the granularity of our findings is much finer, and we approach the desired end goal of offering each customer a personalization-of-one in their experience with us.
There are at least four additional characteristics that pop up in the literature from time to time. All of these share the same definitional problem as Value: they may describe data, but not uniquely Big Data.
Veracity: What is the provenance of the data? Does it come from a reliable source? Is it accurate and, by extension, complete?
Variability: There are several potential meanings for Variability. Is the data consistent in terms of availability or interval of reporting? Does it accurately portray the event reported? When data contains many extreme values it presents a statistical problem to determine what to do with these ‘outlier’ values and whether they contain a new and important signal or are just noisy data.
Viscosity: This term is sometimes used to describe the latency or lag time in the data relative to the event being described. We found that this is just as easily understood as an element of Velocity.
Virality: Defined by some users as the rate at which the data spreads; how often it is picked up and repeated by other users or events.
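The outlier question raised under Variability can be illustrated with one common rule of thumb, Tukey's interquartile-range (IQR) fences. This is a sketch of just one possible approach, not a method prescribed by the text above, and the sample values are invented. Flagging a value only marks it for inspection; deciding whether it is noise or a new and important signal still requires domain knowledge.

```python
import statistics

# Hypothetical sensor readings with one extreme value.
values = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]

# Quartiles: statistics.quantiles with n=4 returns the three cut points Q1, Q2, Q3.
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Tukey's fences: values more than 1.5 * IQR beyond the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
typical = [v for v in values if lower_fence <= v <= upper_fence]

print("fences:", lower_fence, upper_fence)
print("flagged for inspection:", outliers)
```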