Read about old and new solutions on 150 pages of the magazine.
HADOOP Compendium CONTENT LIST:
Grokking the Menagerie: An Introduction to the Hadoop Software Ecosystem
by Blake Matheny
The Hadoop ecosystem offers a rich set of libraries, applications, and systems with which you can build scalable big data applications. As a newcomer to Hadoop it can be a daunting task to understand all the tools available to you, and where they all fit in. Knowing the right terms and tools can make getting started with this exciting set of technologies an enjoyable process.
Hands-on with Hadoop
by David Arthur
The Apache Hadoop ecosystem is home to a variety of libraries and back-end services that enable the storage and processing of vast amounts of data. At the lowest level there is a complex and robust Java API. Many higher level abstractions have come about in the past years. We will look at the Java API, Apache Hive, Apache Pig, and as an added bonus Apache Solr.
Data Collection and Management in the Era of Big Data
by Aaron Kimball
Conventional industries such as telecommunications, healthcare, retail, and finance as well as high-tech sectors like online gaming, video and media streaming and mobile applications are now beginning to accumulate data at web scale. To stay current, software engineers will need to gain expertise in skills for collecting and managing Big Data. In this article we will explore practical lessons in how to collect and manage large data from diverse data sources.
Introduction to MapReduce
by Rémy Saissy
Processing and querying massive amounts of data, indexing the web… We hear about Hadoop and MapReduce a lot these days. But how does MapReduce work? How does one read the output of a job? That is the discussion at hand.
Hadoop: solving problems with MapReduce
by Sergey Enin
A new type of computer science problem has appeared within the last 10–20 years: new technologies provide access to large amounts of data. This raises a question: how can such huge data sets be stored, processed, and analyzed effectively?
Data clustering using MapReduce: A Look at Various Clustering Algorithms Implemented with MapReduce Paradigm
by Varad Meru
Data Clustering has always been one of the most sought after problems to solve in Machine Learning due to its high applicability in industrial settings and the active community of researchers and developers applying clustering techniques in various fields, such as Recommendation Systems, Image Processing, Gene sequence analysis, Text analytics, etc., and in competitions such as Netflix Prize and Kaggle.
Writing Hive UDFs and serdes
by Alex Dean
In this article you will learn how to write a user-defined function (“UDF”) to work with the Apache Hive platform. We will start gently with an introduction to Hive, then move on to developing the UDF and writing tests for it. We will write our UDF in Java, but use Scala’s SBT as our build tool and write our tests in Scala with Specs2.
Distributed coordination with ZooKeeper
by Scott Leberknight
Consider a distributed system with multiple servers, each of which is responsible for holding data and performing operations on that data. This could be a distributed search engine, a distributed build system, or even something like Hadoop which has both a distributed file system and a Map/Reduce data processing framework that operates on the data in the file system.
Curator and Exhibitor: A better way to use and manage Apache ZooKeeper
by Jordan Zimmerman
Apache ZooKeeper is an invaluable component for service-oriented architectures. It effectively solves the difficult problem of distributed coordination (locking, leader selection, etc.). However, it can be tricky to use correctly. Curator and Exhibitor were written to make using and operating ZooKeeper easier and less error-prone.
Bucket Cache: A CMS, Heap Fragmentation and Big Cache on HBASE Solution
by Chunhui Shen
In contrast to the current cache mechanisms on HBase (LruBlockCache and SlabCache), this article introduces a new block cache called BucketCache. It can greatly reduce CMS and heap fragmentation caused by JVM GC, and it also supports a large cache space for high read performance by using a high-speed disk such as Fusion-io.
Hybrid approach to enable real-time queries to end users
by Benoit Perroud
Since it became an Apache Top Level Project in early 2008, Hadoop has established itself as the de facto industry standard for batch processing. The two layers composing its core, HDFS and MapReduce, are strong building blocks for data processing. Running data analysis and crunching petabytes of data is no longer fiction. But the MapReduce framework does have two major downsides: query latency and data freshness.
One of Alibaba.com’s Stories with Zookeeper
by Fuqiang Wang
Apache Zookeeper (a.k.a. ZK) is a coordination service, mostly for distributed systems. Most people like to compare Zookeeper with Google’s Chubby, but I don’t think they are made for exactly the same purpose. Why is Zookeeper a coordination service while Chubby is usually called a distributed lock service?
Pattern, a Machine Learning Library for Cascading
by Paco Nathan
Pattern is a machine learning project within the Cascading API, which is used for building Enterprise data workflows. Cascading provides an abstraction layer on top of Apache Hadoop and other distributed computing topologies.
Direction of Hadoop Development
by Ted Yu
Coming back from HBTC 2012 (Hadoop and Big Data Technology Conference), I was overwhelmed by the wide adoption of HBase in China. Prominent companies such as Taobao (Alibaba), Huawei, Intel and IBM are all endorsing HBase. In the US, HBase has actually become NoSQL and this trend is even more visible in China.
Apache Drill: Newcomer in the Hadoop Ecosystem
by Jacques Nadeau & Ted Dunning
Apache Drill is a new open source project that does ad hoc queries on a variety of data sources. Why Drill? The answer is simple: there is a need for speed. Hadoop is wildly popular and successful, but it’s based on batch processing and has high inherent latency. There is a need for ad hoc, real-time query and analysis of large data sets, and that’s where the speed and flexibility of Drill come in.
Evolution of Cassandra
by Edward Capriolo
When approached with the concept of writing an article about Cassandra, I thought about how long I have been using it for, and how much it has changed. I distinctly remember the first version of Cassandra I used was 0.6.0 which was released on April 20, 2010.
Interview with Jonathan Ellis
by Stefan Edlich
With the upcoming release of Apache Cassandra 1.2 we’re all curious about the changes and new solutions included in this version. Jonathan Ellis, CTO and co-founder of DataStax, explains what has been tweaked in an interview with Stefan Edlich.
Finding the right solution: Using the right tool for Big Data problems
by Eddie Satterly
Given the current landscape of tools and the multitude of problems technology professionals are asked to solve with big data, what is the right solution?
An Odyssey of Cassandra
by Eric Lubow
Not everyone starts out by processing hundreds of millions or even billions of events per day. In fact, most people never get to that point or even have the prospect of getting to that point. But in case you do, there are a few tools and tips to help you ingest, process, analyze, and store all that data.
Apache Cassandra Quick Tour
by Terry Cho
Cassandra is a distributed database system. It was donated to the Apache open source community by Facebook in 2008. Cassandra is based on the Google Bigtable data model and the Amazon Dynamo distributed architecture. It does not use SQL and is optimized for handling large volumes of data and transactions. Although Cassandra is implemented in Java, clients can be written in other languages (it supports Ruby, Perl, Python, Scala, PHP, etc.).
Getting Started with Cassandra, using Node.js
by Russell Bradberry
Although Cassandra is classified as a NoSQL database, it is important to note that NoSQL is a misnomer. A more correct term would be NoRDBMS, as SQL is only a means of accessing the underlying data. In fact, Cassandra comes with an SQL-like interface called CQL. CQL is designed to take away a lot of the complexity of inserting and querying data via Cassandra.
Reporting with Cassandra
by Jose Avila
Quite commonly the need arises to analyze data over time and provide quick and easy access to those statistics via a dashboard. Providing access to real-time stats can be difficult as the quantity of data being analyzed grows. This article will cover one potential solution for storing and querying statistical data with Apache Cassandra for real-time report generation.
Cassandra: internal storage
by Sergey Enin
Developers often ask how Apache Cassandra's storage works internally: which algorithms and data structures are implemented to ensure such efficient writes, as well as many other interesting features. This article describes them.
Improving Cassandra’s Uptime with Virtual Nodes
by Richard Low
Cassandra is designed to run in large clusters where node failure could be a common occurrence. It may be a temporary network failure, or maybe the PSU has failed and someone has to go onsite to fix it. Whatever the cause or length of the issue, Cassandra can cope with the failure and end users will not notice.
COTS to Cassandra
by Christopher Keller
At NASA’s Advanced Supercomputing Center, we have been running a pilot project for about eighteen months using Apache Cassandra to store IT security event data. While it’s not certain that a Cassandra based solution will go into production eventually, I’d like to share my experiences during the journey.
Cassandra in the Real World: Migrating from a legacy database
by Jason Brown
At Netflix, we use Apache Cassandra for nearly all of our production data. However, the road to get to that point was neither simple nor straightforward, and I’d like to share a real world use case of migrating from a legacy RDBMS solution to Cassandra.
Cassandra Performance: An Ops Perspective. C* at Formspring.me
by Martin Cozzi
As a startup that had grown to 1MM users in less than a month, we made AWS our weapon of choice when it came to trying things out, moving fast and failing even faster. The decision to move to Cassandra came as we reached the limits of our current infrastructure and decided to plan for the worst: a Bieberapocalypse, that is, what would happen if Bieber joined the site and suddenly got millions of followers.