Baidu is the company that built and maintains China’s largest search engine. As the volume of data on its servers has grown, query times have risen to as much as several hours. Desperate for a solution, the company has invested heavily in alternatives that could perform the same queries in under 30 seconds.
Their Trials Have Pointed Out Several Promising Approaches to the Issue
The primary issues preventing efficiency were quickly identified. First, they ruled out MapReduce. This is a programming model that was introduced by Google, and an open source implementation of it is included in Apache Hadoop. Simply put, this framework could not handle the quantity of data that needed to be processed quickly enough. Tests using Spark SQL, the SQL query module of Apache Spark, produced dramatically better times, but they still averaged around 10 minutes. This led to the discovery of the other major problem.
The second primary issue came down to networking. Because Baidu housed data across multiple sites, each query required transfers between them, and the resulting network load was ultimately responsible for the long delays. The search for a solution led them to a Berkeley research project named Tachyon.
Tachyon has since been renamed Alluxio, and it works by creating a virtual storage pool. In essence, remotely connected storage devices or systems can be virtually combined and treated as a single unit. This allows frequently accessed data to be stored on the computing nodes that need it, dramatically reducing the volume of data transferred between sites and improving query times accordingly.
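The effect of keeping frequently accessed data on the compute node can be illustrated with a small simulation. This is only a sketch of the general read-through caching idea, not Alluxio's API or implementation: the `RemoteStore` and `LocalCache` classes, the least-recently-used eviction rule, and the sample keys are all assumptions made for illustration.

```python
from collections import OrderedDict


class RemoteStore:
    """Hypothetical stand-in for storage at a remote site; counts cross-site reads."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.remote_reads = 0

    def read(self, key):
        self.remote_reads += 1  # every call here represents a network transfer
        return self.blocks[key]


class LocalCache:
    """Hypothetical read-through LRU cache held on the compute node."""

    def __init__(self, remote, capacity):
        self.remote = remote
        self.capacity = capacity
        self.entries = OrderedDict()

    def read(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        value = self.remote.read(key)  # cache miss: fetch across the network
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used block
        return value


remote = RemoteStore({"a": b"A", "b": b"B", "c": b"C"})
cache = LocalCache(remote, capacity=2)

# Six reads of two hot blocks trigger only two remote transfers.
for key in ["a", "b", "a", "a", "b", "a"]:
    cache.read(key)
print(remote.remote_reads)  # prints 2
```

In this toy run, four of the six reads never leave the node, which is the same mechanism, at a much smaller scale, that cuts traffic between sites.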
Combining Spark SQL with Alluxio reduced the search times to an average of 10 to 15 seconds, and these numbers apply to remotely stored data. While Baidu has not fully deployed Alluxio, it intends to rely on it heavily. Initially, the focus has been on changing systems that handle images and image analysis.
When this move is complete, a significant portion of their system’s load will be optimized, saving large sums on future operation and development costs. Other major data centers can be similarly improved by targeting the most frequently used and data intensive stores first.
Many Other Companies Are Researching Applications of Alluxio
These include IBM, Intel, Alibaba Group, Barclays, EMC and Pivotal. Beyond pooling remote storage, the single system created by Alluxio lets users optimize their storage arrays. Storage can be tiered by device type. For example, while the top tier uses in-memory storage, a second tier can be assigned to flash drives while a third goes to platter drives.
This enables users of all varieties to build cost effective storage systems tailored to their functions. It is also worth noting that Alluxio can work with any framework; it is not limited to Spark.
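The tiering described above can be sketched as a small model in which blocks enter the fastest tier and are demoted to slower tiers as each one fills. This is a simplified illustration, not Alluxio's actual eviction or promotion logic: the `TieredStore` class, the tier names, the capacities, and the oldest-first demotion rule are all assumptions.

```python
from collections import OrderedDict


class TieredStore:
    """Hypothetical tiered store: fastest tier first, oldest block demoted on overflow."""

    def __init__(self, tiers):
        # tiers is a list of (name, capacity_in_blocks), ordered fastest to slowest
        self.names = [name for name, _ in tiers]
        self.capacity = [cap for _, cap in tiers]
        self.levels = [OrderedDict() for _ in tiers]

    def _insert(self, level, key, value):
        if level == len(self.levels):
            return  # fell off the slowest tier; in this sketch the block is dropped
        tier = self.levels[level]
        tier[key] = value
        if len(tier) > self.capacity[level]:
            old_key, old_value = tier.popitem(last=False)  # oldest block in this tier
            self._insert(level + 1, old_key, old_value)  # demote it one tier down

    def put(self, key, value):
        for tier in self.levels:
            tier.pop(key, None)  # drop any stale copy before rewriting
        self._insert(0, key, value)

    def get(self, key):
        for tier in self.levels:
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)  # promote the block to the fastest tier
                return value
        raise KeyError(key)

    def tier_of(self, key):
        for name, tier in zip(self.names, self.levels):
            if key in tier:
                return name
        return None


store = TieredStore([("MEM", 1), ("SSD", 1), ("HDD", 2)])
for name in ["x", "y", "z"]:
    store.put(name, name.upper())

print(store.tier_of("z"), store.tier_of("y"), store.tier_of("x"))  # prints MEM SSD HDD
```

Reading `x` back with `store.get("x")` moves it to the memory tier and pushes the other two blocks down a level, so the most recently used data always sits on the fastest media.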
Each year it becomes increasingly clear that storage is the bottleneck of data management. Advances in storage handling are essential to keeping pace with other technological improvements. Further developing ways to implement and deploy Alluxio across massive server systems holds exciting promise and may be the key for servers to keep up with demand in coming years.
Katrina is a product specialist with a server rack engineering and accessories company.