April | 2015 | Big Data & Hadoop

Apache HBase is an open-source NoSQL database built on top of the Hadoop Distributed File System (HDFS). It is a column-oriented database that provides storage for, and quick access to, large quantities of data. It is modeled after Google’s Bigtable and is designed for tables holding huge volumes of data. It also allows users to perform insert, update, and delete operations.

HBase began as a sub-project of Apache Hadoop and is now a top-level Apache project, used to provide real-time read and write access to big data.

Important Features of Apache HBase:

Data: handles any type of data, whether structured, semi-structured, or unstructured.
Tables: sparsely populated tables are stored efficiently, because empty cells take up no space.
Scalability: horizontal scalability, where capacity is increased by adding servers.
SQL access: data can be queried interactively, typically through an SQL layer such as Apache Phoenix or Apache Hive, since HBase has no native SQL interface.
Schemas: flexible schemas, where users can add columns on the fly.
High availability: running multiple master nodes ensures continuous access to data.
Full consistency: guards against node failures and simultaneous writes to the same record.
Automatic sharding: tables are transparently and efficiently split and distributed across the machines in the cluster.
Security: table-level and column-family-level access control, with authentication via Kerberos.

Stable release – 0.98.4, 21 July 2014

Website – hbase.apache.org

Written in – Java

License – Apache License 2.0

Working with HBase:

HBase mainly uses log-structured merge trees (LSM trees) to store and query data. It supports compression, in-memory caching, Bloom filters, and very fast scans. HBase tables can also serve as both the input and the output of MapReduce jobs.
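
The LSM idea mentioned above can be illustrated with a toy model. The sketch below is plain Python, not HBase code: writes go to an in-memory buffer (HBase calls its equivalent the MemStore), the buffer is flushed to immutable sorted segments (HFiles in HBase), and reads check the newest data first. The class name ToyLSM and the entry-count flush threshold are assumptions made for illustration only.

```python
import bisect


class ToyLSM:
    """Minimal LSM sketch: an in-memory buffer plus immutable sorted segments."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical, counted in entries
        self.memstore = {}   # recent writes, held in memory
        self.segments = []   # flushed sorted runs, oldest first

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffer out as one sorted, immutable segment."""
        if self.memstore:
            self.segments.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        """Check memory first, then segments from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for segment in reversed(self.segments):
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None

    def compact(self):
        """Merge all segments into one; newer values overwrite older ones."""
        merged = {}
        for segment in self.segments:  # oldest first, so later writes win
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


store = ToyLSM()
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 10), ("d", 4)]:
    store.put(key, value)
```

Because every write is an append to memory followed by a sequential flush, writes stay cheap; the cost is that a read may have to consult several segments, which is why periodic compaction (and, in HBase, Bloom filters) matters.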

Top users of Apache HBase:

Adobe currently has about 30 nodes running HDFS, Hadoop, and HBase, in clusters ranging from 5 to 14 nodes, in both production and development.
Facebook uses HBase to power its Messages infrastructure.
Twitter runs HBase across its entire Hadoop cluster.
Yahoo uses HBase to store document fingerprints for detecting near-duplicates, and has a cluster of a few nodes that runs HDFS and MapReduce.
StumbleUpon uses HBase as a real-time data storage and analytics platform.
Filmweb has just started a small cluster of 3 HBase nodes, mainly to handle its web cache persistence layer.
OpenLogic stores all the world’s open source packages, files, and lines of code in HBase for both analytical and near-real-time access.

And many more – http://wiki.apache.org/hadoop/Hbase/PoweredBy

When not to use HBase?

When you’re dealing with only a few thousand rows.
When your cluster has fewer than 5 DataNodes.
When you need cross-record transactions or joins.

Column-family in HBase:

A table schema defines only the table’s column families.
Columns in Apache HBase are grouped into column families.
All columns in a column family share the same prefix, which is the family name followed by a colon.
Physically, all members of a column family are stored together on the filesystem.
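
To make the prefix rule concrete, here is a small Python sketch (not HBase client code) that groups the cells of one row by column family. The family names "info" and "stats", the sample values, and the helper group_by_family are all hypothetical.

```python
from collections import defaultdict


def group_by_family(cells):
    """Group {'family:qualifier': value} cells under their column family prefix."""
    families = defaultdict(dict)
    for column, value in cells.items():
        family, _, qualifier = column.partition(":")
        families[family][qualifier] = value
    return dict(families)


# Hypothetical row with two column families, "info" and "stats".
row = {
    "info:name": "Ada",
    "info:email": "ada@example.com",
    "stats:logins": 42,
}

by_family = group_by_family(row)
```

Here by_family["info"] holds the name and email cells, mirroring the way the members of one family are kept together on disk.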

Three major components of HBase:

HBase Master (HMaster): stores metadata, such as how each table is split.
HRegionServer: the pieces of a split table (regions) are stored on region servers.
HBase client: connects to the master server and the region servers.
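
As a rough illustration of how these components fit together, the Python sketch below shows a client picking the region server responsible for a row key from a map of region start keys. The server names, the start keys, and the locate function are hypothetical, and the lookup is simplified: it only demonstrates the idea that each region covers a sorted range of row keys.

```python
import bisect

# Hypothetical region map: (start_key, server), sorted by start key.
REGION_MAP = [
    ("", "regionserver-1"),
    ("g", "regionserver-2"),
    ("p", "regionserver-3"),
]


def locate(row_key, region_map=REGION_MAP):
    """Return the server whose region covers row_key."""
    starts = [start for start, _ in region_map]
    # The owning region is the last one whose start key is <= row_key.
    i = bisect.bisect_right(starts, row_key) - 1
    return region_map[i][1]
```

With this map, locate("apple") resolves to the first server and locate("zebra") to the last, so reads and writes for a given key always go to one region server.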

Data Model in HBase:

HBase is a key-value store.
Values are stored in a multi-dimensional format: each value is addressed by row key, column family, column, and timestamp.
Data Model with Multi-Dimensional Columns.
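
One way to picture this multi-dimensional model is as nested maps. The Python sketch below is an in-memory stand-in, not the HBase API: a value is addressed by row key, column family, column qualifier, and timestamp, and a read returns the newest version. The function names put and get and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

# table[row_key][family][qualifier][timestamp] = value
table = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(row_key, family, qualifier, value, timestamp):
    """Store one cell version; older versions are kept alongside it."""
    table[row_key][family][qualifier][timestamp] = value


def get(row_key, family, qualifier):
    """Return the value with the highest timestamp, or None if the cell is absent."""
    versions = table.get(row_key, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    return versions[max(versions)]


# Hypothetical rows; note that each row stores only the columns it uses.
put("user1", "info", "name", "Ada", timestamp=1)
put("user1", "info", "name", "Ada L.", timestamp=2)
put("user2", "info", "email", "bob@example.com", timestamp=1)
```

Reading user1's name returns the version written at timestamp 2, while user2 simply has no name cell at all, which is what makes sparse tables cheap to store.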

Conclusion: The goal of this blog is to introduce you to Apache HBase, its uses, and its structure. In my upcoming blog, we’ll look at implementing an HBase table in Hadoop.


APACHE HBASE- Growing popularity in Industry