You’re entering the 3rd phase of Big Data. Do you have the right infrastructure?

It’s been a little over ten years since Hadoop was born out of work at Yahoo.  Early adoption was driven by scattered groups running research projects on MapReduce, Hive, and HBase.  Then came department-level workloads for business intelligence and insights.  These drove the second phase and its need for real-time processing and mixed workloads, served by tools like YARN and Spark.  For Spark, customers stuffed more DRAM into their existing Hadoop clusters but otherwise relied on their existing SATA hard drives for storage.  We’re now entering the third phase - the Enterprise phase - where big data is available as-a-service across business units and departments.  The workloads span the spectrum from short-lived ephemeral clusters to long-lived, shared data lakes, and from analytics to transaction processing to hybrid processing.  And as Spark replaces MapReduce, SSDs are increasingly deployed as customers realize that DRAM alone doesn’t deliver complete performance.


In the meantime, hardware innovation has accelerated.  Clusters deployed circa 2012 - when Spark was introduced - can now be replaced, upgraded, or expanded with new scalable building blocks from Intel, Supermicro, and Attala Systems.


Existing clusters can now be replaced, upgraded, or expanded with superior capability at a fraction of the footprint, power consumption, and TCO.

But beyond density alone, Attala Systems, Supermicro and Intel have collaborated on a solution that addresses the third phase, the Enterprise phase.  Attala’s highly available, high-performance storage solution is built on an Intel FPGA-based scale-out fabric that uses standard Ethernet to connect big data workloads to Intel NVMe SSDs housed in a 1U JBOF, with minimal added latency or cost.  The solution enables Phase 3 of Big Data:

  • It delivers near-native NVMe SSD performance for Spark, with the ability to scale capacity separately from compute
  • For multi-tenancy or big-data private clouds (or the on-prem part of a hybrid cloud), it can “carve and isolate” the scale-out SSD capacity between multiple users or mixed workloads (whether on bare metal, in KVM virtual machines, or in Kubernetes containers) with hardware-enabled QoS isolation
  • Its “parallel data sharing” of stored data enables compute-node HA/failover and gives parallel compute instances access to common data

At the 2017 Supercomputing Conference (SC17) in Denver, the three companies are showing how this solution applies to two commonly used HPC cluster-compute and file-system solutions:

  • Multi-tenant Spark: To support multiple users (departments, test/dev, QA, production) and the need to spin up instant clusters, virtualization (e.g. KVM) solves half the problem by avoiding the need to dedicate CPU and memory to different apps.  Attala storage solves the other half by instantly provisioning, “carving and isolating” Intel NVMe SSD resources for multiple users.  The result is an unprecedented multi-tenant solution for instant clusters with lower cost and better-than-bare-metal performance (a configuration sketch follows this list)
  • Lustre for small-file performance: While most data resides in big files, most files (70%) are small, with sizes below Lustre’s default 1MB stripe size.  Lustre 2.11 speeds up small-file performance, by up to 4X, via Data-on-MDT (Metadata Target) in tests on the same NVMe-based storage.  The performance and high availability are enabled by Attala storage and Intel® Optane™ SSD DC P4800X Series drives, using “parallel data sharing” to enable Lustre failover and scale-out capability
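
As a rough illustration of the Spark half of this picture, here is a minimal Scala sketch of a tenant job that points its shuffle and spill scratch space at an NVMe-backed mount while reading from a shared data lake.  The mount path (/mnt/nvme0), the data-lake paths, and the memory setting are assumptions for illustration, not details from the SC17 demo.

```scala
// Minimal sketch only: assumes the carved-out NVMe namespace is already
// attached to each Spark worker and mounted at /mnt/nvme0 (path, data-lake
// locations, and memory size are illustrative assumptions).
import org.apache.spark.sql.SparkSession

object TenantJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tenant-a-etl")
      // Send shuffle and spill scratch data to the NVMe-backed mount so that
      // anything overflowing DRAM lands on low-latency flash rather than
      // SATA disks. (On YARN, yarn.nodemanager.local-dirs takes precedence.)
      .config("spark.local.dir", "/mnt/nvme0/spark-tmp")
      .config("spark.executor.memory", "8g")
      .getOrCreate()

    // Read from the shared data lake and write tenant-scoped results back.
    val events = spark.read.parquet("hdfs:///datalake/events")
    events.groupBy("department").count()
      .write.mode("overwrite").parquet("hdfs:///datalake/tenant-a/event_counts")

    spark.stop()
  }
}
```

In this kind of setup, the per-tenant isolation would come from the fabric’s hardware QoS rather than from Spark itself; the job simply sees its carved-out capacity as local NVMe storage.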

Attala’s composable storage infrastructure will be demonstrated with Spark in the Intel Systems booth (Booth 1301), and with Lustre in the Supermicro booth (Booth 1611), at SC17, November 14-16, 2017, at the Colorado Convention Center in Denver, Colorado.

 

Taufik Ma