Overview

Traditionally, data processing could be delineated into online operations (e.g. OLTP) versus batch/offline analytics (e.g. decision support systems). But with the rise of real-time digital businesses, the last decade has seen the emergence of real-time analytics with diverse uses across industries: for example, risk and profitability analysis in the financial trading sector, real-time ads and marketing in eCommerce, fraud prevention, real-time security, and more recently, the real-time analytics required on IoT data flows.

The data processing frameworks that support real-time analytics place high demands on the underlying platform. It must enable both a high number of transactions per second and, more importantly, low latency across multiple operations: for example, to respond to an e-shopper before they lose patience, to make a stock trade in time, or to block a security threat before it occurs. The platform must also support constant change, both in the type and amount of data and in the workload itself, as developers seek to innovate on their services.

Benefits

Versus direct-attach SSDs

  • Support data growth on-demand via “per-node”, instantly scalable storage
  • Avoid scale-out costs and painful re-sharding when all you need is more storage capacity
  • Improve HA: Reduce rebuild/resync times when compute nodes fail
  • Improve HA: prevent a single SSD failure from taking out an entire compute node (applies when the node has no local RAID)
  • Reduce costs by avoiding overprovisioning
  • Reduce costs by scaling capacity as you grow

Versus other networked storage

  • Near-native NVMe SSD performance and latency, delivered predictably
  • Higher transactions per second; faster user and transaction response
  • Lower CPU and database licensing costs
  • Reduce costs by supporting multiple scale-out database nodes with fewer storage nodes

Challenges

Admins and operators face painful trade-offs when implementing their databases on direct-attached storage versus networked storage.

Direct-attached storage. Great performance, but…

  • Scaling out for the wrong reasons: Whether with structured, columnar, or NoSQL databases, admins do have the option of supporting data growth by scaling out compute nodes with direct-attached storage. But scaling out triggers painful dataset re-sharding or re-syncing, sometimes with operational downtime. Even for databases with auto-sharding, scaling out incurs additional compute costs when, oftentimes, only storage capacity is required, especially since scale-out typically occurs in replica sets of three nodes.
  • Cost of overprovisioning: The brute-force solution is to overprovision, e.g. to build racks of servers packed to the gills with up to two dozen high-end SSDs each in anticipation of future dataset growth. Of course, that’s not affordable.
  • Long rebuild/resync times: When a node fails, scale-out databases recover by rebuilding onto a spare node. Long rebuild times get even longer (up to hours) as direct-attached capacity grows, driven by the overprovisioning described above. During rebuilds, operational performance deteriorates as an active “donor” node gives up performance and the network is saturated by terabytes of rebuild traffic (a back-of-the-envelope estimate follows this list).
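
As a rough illustration of why rebuilds stretch into hours, consider the following back-of-the-envelope estimate. The dataset size, link speed, and bandwidth share are assumed figures for illustration, not measurements of any particular deployment.

    # Back-of-the-envelope rebuild-time estimate (illustrative figures only).

    def rebuild_hours(dataset_tb: float, link_gbps: float,
                      rebuild_share: float = 0.5) -> float:
        """Hours to resync a failed node's dataset over the network.

        rebuild_share models the fraction of link bandwidth the rebuild
        actually gets while competing with production traffic.
        """
        dataset_bits = dataset_tb * 8e12            # terabytes -> bits
        effective_bps = link_gbps * 1e9 * rebuild_share
        return dataset_bits / effective_bps / 3600  # seconds -> hours

    # An overprovisioned node holding 48 TB, resyncing over a shared 25 Gb link:
    print(f"{rebuild_hours(dataset_tb=48, link_gbps=25):.1f} hours")  # ~8.5 hours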

Existing networked storage.  Good flexibility but…

  • Poor database performance and response times: Existing solutions squander the native performance and latency of the SSDs they contain. Whereas an SSD’s latency is typically below 100 usec, existing solutions incur latencies of several thousand usec, and they dilute IOPS by up to 100X (illustrated in the sketch after this list). For virtualized database nodes, further performance dilution occurs in the hypervisor layer as storage is virtualized, e.g. as SCSI vdisks presented to a VM.
  • High database licensing costs: An operator can compensate for low database performance caused by low storage performance with more CPU cores and memory. But because software licensing is oftentimes tied to CPU cores, the licensing fees can be substantial.
  • Excessive costs to support database clusters: Even if storage latency isn’t a factor, networked-storage solutions still need sufficient IOPS to support multiple compute nodes across multiple clusters, e.g. twenty MongoDB nodes and ten MySQL nodes. Even with the advent of high-capacity SSDs (up to 16TB), existing networked-storage solutions, with their poor IOPS, require multiple storage nodes to support these clusters.
  • Unpredictable database performance and response times: It’s a familiar story, the application and database performed well in test and dev, but when deployed in the production environment, with multiple clusters and nodes sharing the networked storage, performance issues and finger-pointing abound. Many existing solutions lack QoS capabilities to ensure performance on a per-host or per-volume basis, and almost none include the capability to monitor actual performance. As a result, the operator suffers through lengthy root-cause and time-to-innocence fire drills while the business suffers.
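
As a simple illustration of the latency gap described above, the sketch below estimates how many serialized I/O operations a single database thread can complete per second at different storage latencies. The 100 usec and several-thousand-usec figures come from the text; the model (one outstanding I/O per thread) is a simplifying assumption.

    # How storage latency caps a latency-bound database thread
    # (one outstanding I/O at a time; latency figures from the text).

    def ops_per_second(latency_usec: float) -> float:
        """Serialized I/O operations per second at a given round-trip latency."""
        return 1_000_000 / latency_usec

    local_ssd = ops_per_second(100)     # native SSD: ~100 usec
    legacy_san = ops_per_second(3000)   # networked storage: thousands of usec

    print(f"local SSD  : {local_ssd:>7,.0f} ops/s per thread")   # 10,000
    print(f"legacy SAN : {legacy_san:>7,.0f} ops/s per thread")  # ~333, a ~30x drop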

Solution

Attala creates the ideal platform for databases and real-time analytics via the following critical capabilities:

  • Hardware-based, networked SSD resources
    • To preserve the performance of locally installed SSDs, the Attala solution implements an end-to-end, 100% hardware path from the server, across the network, to the SSDs and back. Databases running on the server will be hard-pressed to discern any performance difference versus local SSDs: the fabric adds only 5 usec of latency, enabling total access latencies below 20 usec (Intel Optane or Samsung Z-NAND SSDs) or below 100 usec (typical 3D NAND SSDs); a worked latency example follows this list. The same efficient hardware enables a single 2RU data node to deliver up to 10M IOPS, creating the performance density necessary to support multi-cluster, multi-node databases.
    • For virtualized databases, the storage above is delivered directly to the VM, bypassing any performance dilution in the hypervisor, via virtual SSDs and SR-IOV technology.
  • Resource automation & persistent volumes
    • Either the user or the operator uses the Attala GUI or API to provision or add SSD resources for a given database node or cluster, either at preboot or at runtime (a hypothetical API sketch follows this list). During provisioning, the user/operator does not need to be aware of the available pool inventory; SSD resources are autonomously selected and bound to the server. If a database compute node fails, its volume can be re-attached to a spare node to avoid lengthy node rebuilds. An example of “ridiculously fast MongoDB replica recovery” can be found here.
    • A separate operator interface, including alerts, is provided to ensure that the SSD resource pool continues to have sufficient performance and capacity headroom for growth. To control costs, the operator can expand capacity only when needed, based on this complete visibility into pool utilization.
  • Performance Automation and QoS
    • Volumes are allocated to a database node or cluster along with hardware-enforced QoS parameters that ensure the database gets consistent, predictable performance, eliminating any noisy-neighbor concerns. Furthermore, the solution includes hardware-based performance monitoring on each volume, and that information is available to both the operator and the database developer to minimize crippling war-rooms when application performance puts a business on hold.
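
The latency claim under “Hardware-based, networked SSD resources” reduces to simple arithmetic: total access latency is the SSD’s media latency plus the roughly 5 usec added by the hardware fabric path. The media latencies below are assumed, illustrative values chosen only to be consistent with the text’s sub-20 usec and sub-100 usec totals.

    # Total access latency = SSD media latency + fabric overhead.

    FABRIC_OVERHEAD_USEC = 5  # end-to-end hardware path, per the text

    media_latencies = {
        "Intel Optane / Samsung Z-NAND": 12,  # assumed media latency (usec)
        "typical 3D NAND": 90,                # assumed media latency (usec)
    }

    for media, media_usec in media_latencies.items():
        total = media_usec + FABRIC_OVERHEAD_USEC
        print(f"{media}: {media_usec} + {FABRIC_OVERHEAD_USEC} = {total} usec")
    # -> 17 usec (below 20) and 95 usec (below 100), matching the text.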
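
To make the provisioning and QoS workflow concrete, here is a minimal sketch of what driving such an API could look like. The endpoint URL, resource paths, field names, and token are all hypothetical illustrations, not Attala’s documented API.

    # Hypothetical provisioning sketch; endpoint and field names are assumed.
    import requests

    API = "https://attala-mgmt.example.local/api/v1"   # hypothetical endpoint
    HEADERS = {"Authorization": "Bearer <token>"}      # placeholder credential

    # Provision a volume with hardware-enforced per-volume QoS.
    volume_spec = {
        "name": "mongodb-rs0-node1",
        "capacity_gb": 2048,
        "qos": {
            "min_iops": 50_000,    # guaranteed performance floor
            "max_iops": 200_000,   # cap that prevents noisy neighbors
        },
        "attach_to": "compute-node-17",  # pool SSDs are selected autonomously
    }
    volume = requests.post(f"{API}/volumes", json=volume_spec,
                           headers=HEADERS).json()

    # If compute-node-17 fails, re-attach the same volume to a spare node
    # instead of rebuilding the dataset from replicas.
    requests.post(f"{API}/volumes/{volume['id']}/attach",
                  json={"host": "spare-node-3"}, headers=HEADERS)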