Big data has clearly attracted significant attention across the IT world, and the last two years have seen substantial investment in big data projects along with a wave of new startups. According to a market report published by Transparency Market Research, the global big data market was worth USD 6.3 billion in 2012 and is expected to reach USD 48.3 billion by 2018. In 2014, companies will begin reporting the benefits of their big data projects and implementing analytics applications on top of their big data infrastructure. The bottom line: the big data hype may subside, but the real-world results of these projects, and the returns on the investments, should surface this year. As firms continue to implement such projects, demand for big data talent will keep rising, making this career path both challenging and rewarding.
Among existing big data technologies, Apache Hadoop and its various components are the most popular solutions for managing big data. Since these technologies are specifically designed to handle massive amounts of data using a distributed computing framework, there is huge demand for the right kind of big data talent. Typically, a big data analyst should have a good working knowledge of MapReduce programming in order to query and analyze data stored in systems such as Hadoop. Java is the most popular language for writing MapReduce programs on Hadoop, with Hive and Pig as higher-level alternatives. One can also use other languages such as R, Python, Ruby, Perl, and C++ to execute MapReduce programs on Hadoop. These, along with Hive and Pig, are considered the non-Java big data languages for querying data stored in Hadoop.
Among the programming options available for Hadoop, Java, Hive, and Pig are the most native by design. Java is the most mature of the three, since the Hadoop ecosystem itself is built on Java; although non-Java alternatives exist, most code examples and supporting packages available online are Java-based. Hive, by contrast, is a SQL-like language that is interpreted on top of Hadoop. It is primarily a data-warehousing component, used for higher-level abstractions such as joins and nested operations on data. Pig, developed at Yahoo, has an entirely different programming model. It is used to analyze large data sets, and it significantly reduces the time otherwise spent writing MapReduce programs in Java.
As mentioned earlier, other non-Java languages such as R, Python, Perl, and C++ can also be used to write and execute MapReduce programs on Hadoop. Unlike Java, Pig, and Hive, these languages rely on the Hadoop streaming component, which runs MapReduce jobs whose map and reduce steps are written as external scripts. Broadly, MapReduce has two key phases: Map, which transforms a set of input data into key-value pairs, and Reduce, which aggregates the values for each key. Using the streaming component, one can write both the Map and Reduce steps as scripts in a non-Java language and execute them against data stored in Hadoop. Another non-Java alternative is the RHadoop project, developed by Revolution Analytics, which provides powerful open-source tools for analyzing data stored in Hadoop; its rmr2 package allows MapReduce programs to be written in R syntax. Overall, many languages and frameworks sit on top of MapReduce, but there is no single solution for handling big data, and each option has different strengths and weaknesses.
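To make the Map and Reduce phases concrete, here is a minimal word-count sketch in Python written in the streaming style. On a real cluster, the mapper and reducer would be separate scripts reading lines from stdin and printing tab-separated key-value pairs, with Hadoop performing the sort-and-shuffle between the two phases; the in-process simulation below (the function names and sample data are illustrative, not part of any Hadoop API) only shows the data flow.

```python
# Minimal word count in the Hadoop Streaming style.
# On a cluster, mapper() and reducer() would be standalone scripts;
# Hadoop sorts the mapper output by key before the reducer sees it.
# Here that shuffle step is simulated with sorted().

from itertools import groupby


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def reducer(pairs):
    """Reduce phase: sum the counts for each key (word).
    Assumes the input pairs arrive sorted by key, as Hadoop guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))


def run_job(lines):
    """Simulate mapper -> shuffle (sort by key) -> reducer."""
    shuffled = sorted(mapper(lines))  # Hadoop does this between phases
    return dict(reducer(shuffled))


if __name__ == "__main__":
    sample = ["big data big demand", "big talent"]
    print(run_job(sample))
    # -> {'big': 3, 'data': 1, 'demand': 1, 'talent': 1}
```

On an actual cluster, the equivalent mapper and reducer scripts would be submitted through the hadoop-streaming jar, passed in via its -mapper and -reducer options along with the HDFS input and output paths.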