Written 8.7 years ago
The following are the actual types of data that contribute to the ever-growing collection of data referred to as big data. Specifically, we focus on data created outside of an organization, which can be grouped into two broad categories: structured and unstructured.
Structured Data
1. Created
Created data is just that: data that businesses purposely create, generally for market research. This may consist of customer surveys or focus groups. It also includes more modern methods of research, such as creating a loyalty program that collects consumer information or asking users to create an account and log in while they are shopping online.
2. Provoked
A Forbes article defined provoked data as, “Giving people the opportunity to express their views.” Every time a customer rates a restaurant, an employee, a purchasing experience or a product, they are creating provoked data. Rating sites, such as Yelp, also generate this type of data.
3. Transacted
Transactional data is also fairly self-explanatory. Businesses collect data on every transaction completed, whether the purchase is completed through an online shopping cart or in-store at the cash register. Businesses also collect data on the steps that lead to a purchase online. For example, a customer may click on a banner ad that leads them to a product page, which then spurs a purchase. As explained by the Forbes article, “Transacted data is a powerful way to understand exactly what was bought, where it was bought, and when. Matching this type of data with other information, such as weather, can yield even more insights. (We know that people buy more Pop-Tarts at Walmart when a storm is predicted.)”
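As a rough illustration of matching transacted data with other information such as weather, the sketch below joins sales records to forecasts by date and store, then compares average units sold under each forecast. All records, field names and the units_by_forecast helper are hypothetical and invented for this example; they do not come from the Forbes article or any real retailer.

```python
from collections import defaultdict

# Hypothetical transaction records: (date, store, product, quantity sold)
transactions = [
    ("2024-03-01", "store_a", "toaster pastry", 12),
    ("2024-03-02", "store_a", "toaster pastry", 30),
    ("2024-03-01", "store_b", "toaster pastry", 10),
    ("2024-03-02", "store_b", "toaster pastry", 11),
]

# Hypothetical weather forecasts keyed by (date, store)
forecasts = {
    ("2024-03-01", "store_a"): "clear",
    ("2024-03-02", "store_a"): "storm",
    ("2024-03-01", "store_b"): "clear",
    ("2024-03-02", "store_b"): "clear",
}


def units_by_forecast(transactions, forecasts):
    """Return the average quantity per transaction record for each forecast label."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for date, store, _product, quantity in transactions:
        # Records without a matching forecast are grouped under "unknown".
        label = forecasts.get((date, store), "unknown")
        totals[label] += quantity
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


averages = units_by_forecast(transactions, forecasts)
print(averages)
```

With real data, a difference between the groups would still need a proper statistical check before drawing conclusions.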
4. Compiled
Compiled data consists of giant databases of data collected on every U.S. household. Companies like Acxiom collect information on things like credit scores, location, demographics, purchases and registered cars that marketing companies can then access for supplemental consumer data.
5. Experimental
Experimental data is created when businesses experiment with different marketing pieces and messages to see which are most effective with consumers. You can also look at experimental data as a combination of created and transactional data.
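A minimal sketch of how experimental data might be summarized: each marketing message is shown to a group of customers (the created part), the resulting purchases are counted (the transactional part), and the conversion rates are compared. The variant names, counts and the conversion_rates helper are assumptions made up for illustration.

```python
# Hypothetical results: message variant -> (customers shown, purchases made)
results = {
    "subject_line_a": (1000, 50),
    "subject_line_b": (1000, 65),
}


def conversion_rates(results):
    """Return purchases divided by customers shown for each variant."""
    return {
        variant: purchases / shown
        for variant, (shown, purchases) in results.items()
        if shown > 0  # skip variants that were never shown
    }


rates = conversion_rates(results)
best = max(rates, key=rates.get)  # variant with the highest conversion rate
print(rates, best)
```

This only ranks the variants; deciding whether the difference is meaningful would require a significance test, which is left out here.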
Unstructured Data
People in the business world are generally very familiar with the types of structured data mentioned above. However, unstructured data is a little less familiar, not because there’s less of it, but because before technologies like NoSQL and Hadoop came along, harnessing unstructured data wasn’t possible. In fact, most data being created today is unstructured. Unstructured data, as the name suggests, lacks structure. It can’t be gathered based on clicks, purchases or a barcode, so what is it exactly?
6. Captured
Captured data is created passively due to a person’s behavior. Every time someone enters a search term on Google, that is data that can be captured for future benefit. The GPS info on our smartphones is another example of passive data that can be captured with big data technologies.
7. User-generated
User-generated data consists of all of the data individuals are putting on the Internet every day. From tweets, to Facebook posts, to comments on news stories, to videos put up on YouTube, individuals are creating a huge amount of data that businesses can use to better target consumers and get feedback on products.
Big data is made up of many different types of data. The seven listed above comprise types of external data included in the big data spectrum. There are, of course, many types of internal data that contribute to big data as well, but hopefully breaking down the types of data helps you to better see why combining all of this data into big data is so powerful for business.
Another way to classify big data is into five different types:
Sensors/meters and activity records from electronic devices: This kind of information is produced in real time. The number and periodicity of the observations vary: sometimes they depend on a lapse of time, sometimes on the occurrence of an event (for example, a car passing through a camera's field of view), and sometimes on manual manipulation (which, strictly speaking, is the same as the occurrence of an event). The quality of this kind of source depends mostly on the capacity of the sensor to take accurate measurements in the way expected.
Social interactions: This is data produced by human interactions through a network such as the Internet. The most common is the data produced in social networks. This kind of data has both qualitative and quantitative aspects that are of interest to measure. Quantitative aspects are easier to measure than qualitative ones. The former involve counting the number of observations grouped by geographical or temporal characteristics, while the quality of the latter relies mostly on the accuracy of the algorithms applied to extract meaning from the contents, which are commonly unstructured text written in natural language. Examples of analyses made from this data are sentiment analysis and trending-topic analysis;
Business transactions: Data produced as a result of business activities can be recorded in structured or unstructured databases. When it is recorded in structured databases, the most common problem in analyzing that information and deriving statistical indicators is the large volume of information and the rate at which it is produced, because this data is sometimes produced at a very fast pace: thousands of records can be produced in a second when big companies such as supermarket chains record their sales. But this kind of data is not always produced in formats that can be stored directly in relational databases. An electronic invoice is an example: it has more or less a structure, but to put the data it contains in a relational database we need to apply some process to distribute that data across different tables (in order to normalize it according to relational database theory), and the invoice may not be in plain text (it could be a picture, a PDF, an Excel record, etc.). One problem here is that this process takes time and, as previously said, the data may be produced too fast, so we would need different strategies to use it: processing it as it is without putting it in a relational database, discarding some observations (by which criteria?), using parallel processing, etc. The quality of information produced from business transactions is tightly related to the capacity to get representative observations and to process them;
Electronic files: These are unstructured documents, statically or dynamically produced, which are stored or published as electronic files, such as Internet pages, videos, audio, PDF files, etc. They can have contents of special interest, but those contents are difficult to extract; different techniques can be used, such as text mining, pattern recognition, and so on. The quality of our measurements will rely mostly on the capacity to extract and correctly interpret all the representative information from those documents;
Broadcastings: This mainly refers to video and audio produced in real time. Getting statistical data from the contents of this kind of electronic data is, for now, too complex and requires large computational and communications power. Once the problems of converting "digital-analog" contents to "digital-data" contents are solved, we will face complications in processing it similar to those found in social interactions.
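Returning to the electronic invoice mentioned under business transactions, the sketch below shows one possible way to distribute a semi-structured invoice across normalized relational tables. It assumes the invoice has already been extracted from its original format (picture, PDF, etc.) into a Python dict; the table layout, field names and the store_invoice helper are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical invoice, already parsed from its original format into a dict
invoice = {
    "invoice_id": "INV-001",
    "customer": {"customer_id": "C-42", "name": "Example Shop"},
    "lines": [
        {"product": "milk", "quantity": 2, "unit_price": 1.50},
        {"product": "bread", "quantity": 1, "unit_price": 2.25},
    ],
}


def store_invoice(conn, invoice):
    """Split one invoice into customer, invoice header and invoice line rows."""
    customer = invoice["customer"]
    # A customer may appear on many invoices, so insert it only once.
    conn.execute(
        "INSERT OR IGNORE INTO customers VALUES (?, ?)",
        (customer["customer_id"], customer["name"]),
    )
    conn.execute(
        "INSERT INTO invoices VALUES (?, ?)",
        (invoice["invoice_id"], customer["customer_id"]),
    )
    conn.executemany(
        "INSERT INTO invoice_lines VALUES (?, ?, ?, ?)",
        [
            (invoice["invoice_id"], line["product"], line["quantity"], line["unit_price"])
            for line in invoice["lines"]
        ],
    )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE invoices (
        invoice_id TEXT PRIMARY KEY,
        customer_id TEXT REFERENCES customers(customer_id)
    );
    CREATE TABLE invoice_lines (
        invoice_id TEXT REFERENCES invoices(invoice_id),
        product TEXT,
        quantity INTEGER,
        unit_price REAL
    );
    """
)
store_invoice(conn, invoice)

# Once normalized, statistical indicators become simple queries.
total = conn.execute(
    "SELECT SUM(quantity * unit_price) FROM invoice_lines WHERE invoice_id = ?",
    ("INV-001",),
).fetchone()[0]
print(total)
```

The extraction step that precedes this (reading a picture or PDF) is usually the slow part, which is why the alternatives listed above, such as processing data as it is or in parallel, may be needed when invoices arrive quickly.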