5 V's of Big Data

Big Data: It is not a single technology; it is the enormous collection of data points generated from many sources at very high speed. Properly analyzed, all this data yields valuable information that can be put to good use in virtually every field.

“A revolution that will transform how we live, work and think.”

The 5 V's that define Big Data


1. Volume: Volume refers to the sheer scale of data being generated from various sources, now measured in exabytes and zettabytes. A couple of decades ago, most big firms collected and stored data about their employees only.


But now these firms collect not only employee data but also details about their clients, partners, and the products and services they deal in, and all of this leads to ever more data. By one popular estimate, the amount of data generated from the beginning of time until 2003 is equivalent to the amount we now generate every two days. That is volume.

2. Variety: There are mainly three types of data to consider: structured, semi-structured, and unstructured. We are most familiar with structured data, which takes the form of plain text (a person's name) or numbers (their age) stored in databases. The other two types are what is new in big data.


Unstructured data takes the form of PDF files, video files, audio files, images, tweets, likes, comments, and so on. Semi-structured data takes the form of XML files, JSON files, emails, JavaScript files, server log files, sensor data, etc. These are the varieties of data we generate from sources such as mobile devices, satellites, social media networks, and IT and non-IT organizations.

3. Velocity: When huge volumes of varied data arrive from many sources, the data has to be processed fast, which we call analysis of streaming data. In other words, velocity is the speed at which data travels in from sources such as machines, business processes, networks, mobile devices, and social media sites. The flow from these sources is gigantic and constant, and it must be stored and processed quickly, which is not possible with traditional data-processing applications.


4. Veracity: Data collected and stored from various sources, in different forms, is often inaccurate. Veracity means dealing with poor-quality data at huge volume (for example, Twitter posts full of hashtags, typos, abbreviations, and colloquial speech) that is imprecise and uncertain. Big data and analytics technology nevertheless allows us to work with these types of data.


5. Value: Whether the data is big or little, and no matter where it comes from or in what format, it should have value, meaning we can put it to proper use for a valid purpose. The significance, worth, or usefulness of the data to those consuming it is arguably the most important V for firms and organizations. Data in itself has no importance or utility; it is valuable data that yields useful information.

On an ending note, all these V's of Big Data are covered in any Big Data Hadoop training. Hope you find it interesting!

Big data and Hadoop facts

Big data means a lot of data. Experts say that big data fits one or more of four V's: volume, velocity, veracity, and variety. We are living in the age of big data, and the facts below illustrate this to some extent.

Over 90% of all the data in the world was created in the past two years. By 2020, the amount of digital information in existence is expected to grow from 3.2 zettabytes to 40 zettabytes. The total amount of data captured and stored by industry doubles roughly every 1.2 years. And in just two days we create as much information as we did from the beginning of time until 2003.

All of these trends gave rise to the need for a system that can handle big data and analyze it at a fast rate. That is how Hadoop came into existence, although many other systems and frameworks were, and still are, used for handling big data.

Big data has been around for a long time; in fact, high volumes of data can be handled with massively parallel processing (MPP) databases, such as those offered by Greenplum, Aster Data, and Vertica, and those vendors are now incorporating Hadoop into their platforms.

At the storage layer of Hadoop is HDFS, the Hadoop Distributed File System, which is essentially a way to create clustered, distributed storage and can run on ordinary servers. HDFS is fast, secure, and fault tolerant.
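
To make this concrete, here is a minimal sketch of writing a file to HDFS and reading it back through Hadoop's FileSystem Java API; the namenode address hdfs://namenode:8020 and the file path are placeholder assumptions:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS replicates its blocks for fault tolerance.
            Path path = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }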

MapReduce is the processing core of Hadoop: it pushes computation out to the data nodes so that each node processes its data locally, which makes it fast and very powerful.
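
The canonical illustration is word count. The sketch below, written against the standard Hadoop MapReduce Java API, runs a mapper locally on each input split to emit (word, 1) pairs and a reducer that sums them; input and output paths come from the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Mapper: runs on the data nodes, emitting (word, 1) for every token.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sums the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }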

Hadoop is not in itself an analytics platform; it can be used alongside a traditional analytics platform, and a common approach to analyzing the data is to write MapReduce jobs in the R programming language.

Hadoop can also be used for archiving and for ETL, which stands for extract, transform, and load. Moreover, Hadoop can be used for filtering: the platform provides many opportunities for extracting, transforming, and processing data.

Scaling data is a major concern in the data world, and Apache Accumulo can be used with Hadoop for this. Accumulo is inspired by Google's Bigtable design and is built on top of Hadoop. It comes with a few improvements over Bigtable; for example, it provides cell-based access control and server-side programming, and key-value pairs can be modified at various points in the data-management process.
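
As a sketch of the cell-based access control just mentioned, here is a write with a visibility label using the Accumulo 1.x Java client; the instance name, ZooKeeper address, credentials, and table name are all placeholder assumptions:

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.security.ColumnVisibility;

    public class AccumuloWrite {
        public static void main(String[] args) throws Exception {
            // Placeholders: adjust for your own instance.
            Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
                    .getConnector("user", new PasswordToken("secret"));

            BatchWriter writer = conn.createBatchWriter("mytable", new BatchWriterConfig());
            Mutation m = new Mutation("row1");
            // Cell-level access control: only scans authorized for "public" see this cell.
            m.put("info", "name", new ColumnVisibility("public"), "Alice");
            writer.addMutation(m);
            writer.close();
        }
    }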

Components of Hadoop

Hive: Apache Hive is a data warehouse application that provides a high-level language for expressing data analysis programs. It offers a SQL-like environment.

Pig: Apache Pig provides a high-level language for expressing transformations over large datasets. Pig's textual language is called Pig Latin.
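
As a small illustration of Hive's SQL-like environment, a HiveQL query can be submitted from Java through the HiveServer2 JDBC driver. This is a sketch only: the endpoint, credentials, and the sales table are assumptions, and the hive-jdbc driver must be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // Ensure the Hive JDBC driver is registered.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // HiveServer2 endpoint; host, port, and credentials are placeholders.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 // A SQL-like HiveQL query over a hypothetical "sales" table.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT product, SUM(amount) FROM sales GROUP BY product")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }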


Top 10 Repositories of Big Data

A repository is a storage location where things can be stored and retrieved. A repository of big data is where big data is stored, ready to be used for analytics, mining, and more. Thanks to such public sources, there is no need to build your own massive data repository before starting with big data analytics. The following are ten of the best data sources available, the top 10 repositories for big data.

Google trends http://www.google.com/trends/explore

Google Trends is a public web facility provided by Google Inc., based on Google Search, that shows how often a specific search term is entered relative to the total search volume across various regions of the world. Google Trends displays its results as a graph: the horizontal axis represents time, and the vertical axis shows how often a term is searched for relative to the total number of searches.

Data.gov http://data.gov

Data.gov is a U.S. government website for open datasets, launched in late May 2009 by Vivek Kundra, the then Federal Chief Information Officer (CIO) of the United States. It stores all sorts of information on topics such as climate, business, education, and agriculture.

Healthdata.gov https://www.healthdata.gov/

Healthdata.gov provides health-related data for free: a comprehensive catalog of data sets relevant to all aspects of health.

Facebook Graph https://developers.facebook.com/docs/graph-api

In most cases, a user's Facebook profile is at least partly public. Facebook provides the Graph API as a way of querying the huge amount of information that users choose to share with the world.

Google Finance https://www.google.com/finance

Google Finance is a website launched on March 21, 2006 by Google Inc. It provides updated, real-time stock data and also aggregates Google News and Google Blog Search articles about each corporation.

New York Times http://developer.nytimes.com/docs

The New York Times provides a searchable, indexed archive of its news articles through an openly accessible developer API.

DBPedia http://wiki.dbpedia.org

Wikipedia contains millions of pieces of data on virtually every subject that exists in the world. DBpedia is a project to turn that content into a public, freely distributable database that anyone can analyze.

Google Books Ngrams http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

When you enter words into the Google Books Ngram Viewer, it searches and analyzes the full text of the millions of books digitized as part of the Google Books project.

Amazon Web Services public datasets http://aws.amazon.com/datasets

Amazon Web Services hosts a variety of public data sets that anyone can access for free. The data sets are hosted in two possible formats: Amazon Elastic Block Store (EBS) snapshots and Amazon Simple Storage Service (S3) buckets.

The CIA World Factbook https://www.cia.gov/library/publications/the-world-factbook/

The CIA World Factbook provides facts on the history, population, economy, government, infrastructure, and military of 267 countries and territories.

 


Apache HBase: Growing Popularity in Industry

Apache HBase is an open-source NoSQL database built on top of the Hadoop Distributed File System (HDFS). It is a column-oriented database that provides storage and quick access to large quantities of data. It is modeled after Google's Bigtable and handles huge data tables. It also lets users perform insert, update, and delete operations.

HBase, which began as a sub-project of the Apache Hadoop project, is now used to provide real-time read and write access to big data.
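
Here is a minimal sketch of such real-time writes and reads using the HBase Java client API (1.0+); the users table and its info column family are hypothetical and assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write: a cell is addressed by row key, column family, and qualifier.
                Put put = new Put(Bytes.toBytes("user#1001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);

                // Read the value back in real time.
                Result result = table.get(new Get(Bytes.toBytes("user#1001")));
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }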

Important Features of Apache HBase:

  • Data: It can handle any type of data, whether structured, semi-structured, or unstructured.
  • Tables: Tables are sparsely populated.
  • Scalability: Horizontal scalability; capacity is increased by adding servers.
  • SQL access: Data can be queried interactively.
  • Schemas: Flexible schemas let users add columns on the fly.
  • High availability: Multiple master nodes ensure continuous access to data.
  • Full consistency: Guards against node failures and simultaneous writes to the same record.
  • Automatic sharding: Transparently and efficiently scales your data out across the machines in the cluster.
  • Security: Table- and column-family-level access is secured via Kerberos.

Stable release – 0.98.4, 21 July 2014

Website – hbase.apache.org

Written in – Java

License – Apache License 2.0

Working with HBase:

HBase mainly uses Log-Structured Merge trees (LSM trees) to store and query data. It supports compression, in-memory caching, Bloom filters, and very fast scans. HBase tables can also serve as both the input and output of MapReduce jobs.
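
For example, a map-only job can take an HBase table as its input using the TableMapReduceUtil helper. This is a sketch: it assumes the hypothetical users table from the earlier example and discards its output for brevity:

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class HBaseTableScanJob {
        // Mapper whose input records are HBase rows: (row key, Result).
        public static class NameMapper extends TableMapper<Text, IntWritable> {
            @Override
            protected void map(ImmutableBytesWritable row, Result value, Context context)
                    throws IOException, InterruptedException {
                byte[] name = value.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                if (name != null) {
                    context.write(new Text(Bytes.toString(name)), new IntWritable(1));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(HBaseConfiguration.create(), "scan users");
            job.setJarByClass(HBaseTableScanJob.class);
            TableMapReduceUtil.initTableMapperJob(
                    "users", new Scan(), NameMapper.class,
                    Text.class, IntWritable.class, job);
            job.setNumReduceTasks(0);                         // map-only example
            job.setOutputFormatClass(NullOutputFormat.class); // discard output for brevity
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }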

Top users of Apache HBase:

  • Adobe runs about 30 nodes of HDFS, Hadoop, and HBase in clusters ranging from 5 to 14 nodes, across both production and development.
  • Facebook uses HBase to power its Messages infrastructure.
  • Twitter runs HBase across its entire Hadoop cluster.
  • Yahoo uses HBase to store document fingerprints for detecting near-duplicates, on a cluster of a few nodes that also runs HDFS and MapReduce.
  • StumbleUpon uses HBase as a real-time data storage and analytics platform.
  • Filmweb runs a small cluster of 3 HBase nodes, mainly to handle its web-cache persistency layer.
  • OpenLogic stores all the world's open source packages, files, and lines of code in HBase, for both analytical and near real-time access.

And many more – http://wiki.apache.org/hadoop/Hbase/PoweredBy

When not to use HBase?

  • When you are dealing with only a few thousand rows.
  • When your hardware amounts to fewer than 5 DataNodes.
  • When you need cross-record transactions or joins.

Column-family in HBase:

  • A table schema defines only its column families (see the sketch after this list).
  • Columns in Apache HBase are grouped into column families.
  • All columns in a column family share the same prefix.
  • Physically, all members of a column family are stored together on the filesystem.
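
As a sketch, a table and its column families can be declared through the HBase Admin API, here using the HBase 1.x HTableDescriptor and HColumnDescriptor classes; the users table and its families are hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CreateTable {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // The schema names only the column families; qualifiers are added on the fly.
                HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("users"));
                desc.addFamily(new HColumnDescriptor("info"));     // e.g. info:name, info:email
                desc.addFamily(new HColumnDescriptor("activity")); // e.g. activity:lastLogin
                admin.createTable(desc);
            }
        }
    }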

Three major components of HBase:

  • HMaster: Stores all the metadata, such as how each table is split into regions.
  • HRegionServer: The regions of a split table are stored and served by the region servers.
  • HBase client: Connects to the master server and the region servers.

Data Model in HBase:

  • HBase is a key-value store.
  • Values are stored against a multi-dimensional key: row key, column family, column qualifier, and timestamp (illustrated below).
  • The result is a data model with multi-dimensional columns.
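
The sketch below makes those dimensions explicit by printing the row, column family, qualifier, timestamp, and value of every cell a Get returns, again assuming the hypothetical users table:

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InspectCells {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                Result result = table.get(new Get(Bytes.toBytes("user#1001")));
                for (Cell cell : result.rawCells()) {
                    // Each cell = (row, family, qualifier, timestamp) -> value.
                    System.out.printf("%s / %s:%s @ %d = %s%n",
                            Bytes.toString(CellUtil.cloneRow(cell)),
                            Bytes.toString(CellUtil.cloneFamily(cell)),
                            Bytes.toString(CellUtil.cloneQualifier(cell)),
                            cell.getTimestamp(),
                            Bytes.toString(CellUtil.cloneValue(cell)));
                }
            }
        }
    }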

Conclusion: The goal of this blog is to introduce you to Apache HBase, its uses, and its structure. In an upcoming blog we will look at the implementation of an HBase table in Hadoop.
