New data-intensive applications like data analytics, artificial intelligence and the Internet of things are driving huge growth in enterprise data. With this growth comes a new set of IT architectural considerations that revolve around the concept of data gravity. In this post, I will take a high-level look at data gravity and what it means for your enterprise IT architecture, particularly as you prepare to deploy data-intensive AI and deep learning applications.
What is data gravity?
Data gravity is a metaphor introduced into the IT lexicon by a software engineer named Dave McCrory in a 2010 blog post.1 The idea is that data and applications are attracted to each other, much as physical objects are attracted to each other under the law of gravity. In the current enterprise data analytics context, as datasets grow larger and larger, they become harder and harder to move. So, the data stays put. It's the things attracted to the data, such as applications and processing power, that move to where the data resides.
Why should enterprises pay attention to data gravity?
Digital transformation within enterprises — including IT transformation, mobile devices and the Internet of things — is creating enormous volumes of data that are all but unmanageable with conventional approaches to analytics. Typically, data analytics platforms and applications live in their own hardware and software stacks, and the data they use resides in direct-attached storage (DAS). Analytics platforms — such as Splunk, Hadoop and TensorFlow — like to own the data. So, data migration becomes a precursor to running analytics.
As enterprises mature in their data analytics practices, this approach becomes unwieldy. When you have massive amounts of data in different enterprise storage systems, it can be difficult, costly and risky to move that data to your analytics clusters. These barriers become even higher if you want to run analytics in the cloud on data stored in the enterprise, or vice versa.
In a world of ever-expanding datasets, these new realities point to the need to design enterprise IT architectures in a manner that accounts for data gravity.
How do you get around data gravity?
A first step is to design your architecture around a scale-out network-attached storage (NAS) platform that enables data consolidation. This platform should support a wide range of traditional and next-generation workloads and applications that previously used different types of storage. With this platform in place, you are positioned to manage your data in one place and bring the applications and processing power to the data.
What are the design requirements for data gravity?
Here are some of the high-level design requirements for an enterprise data platform built with data gravity in mind.
Security, data protection and resiliency
An enterprise data platform should have built-in capabilities for security, data protection and resiliency. Security includes authenticating users, authorizing them and controlling/auditing their access to data assets. Data protection and resiliency involve protecting the availability of data against disk, node, network and site failures.
Security, data protection and resiliency should be applied across all applications and data in a uniform manner. This uniformity is one of the advantages of maintaining just one copy of data in a consolidated system, as opposed to having multiple copies of the same data spread across different systems, each of which has to be secured, protected and made resilient independently.
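To make the security requirement concrete, here is a minimal sketch of the authenticate-authorize-audit flow described above. The user table, ACL structure and function names are hypothetical, invented for illustration; a real platform would back these checks with directory services and tamper-resistant audit logs.

```python
import logging

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("audit")

# Hypothetical user credentials and per-dataset ACLs, for illustration only.
USERS = {"alice": "s3cret"}
ACLS = {"sales-2018": {"alice": {"read"}}}

def access(user: str, password: str, dataset: str, action: str) -> bool:
    """Authenticate, authorize, and audit a single data-access request."""
    # Authentication: is the caller who they claim to be?
    if USERS.get(user) != password:
        audit.info("DENY (auth) user=%s", user)
        return False
    # Authorization: is this user allowed to perform this action on this dataset?
    if action not in ACLS.get(dataset, {}).get(user, set()):
        audit.info("DENY (authz) user=%s dataset=%s action=%s", user, dataset, action)
        return False
    # Audit: every allowed access is recorded as well.
    audit.info("ALLOW user=%s dataset=%s action=%s", user, dataset, action)
    return True
```

The point of consolidation is that this one enforcement path covers every application touching the data, rather than each analytics stack re-implementing its own checks.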
Scalability
The data platform must be highly scalable. You might start with 5 TB of storage and then soon find that you need to scale to 50 TB or 100 TB. So, look for a platform that scales seamlessly, from terabytes to petabytes.
That said, you need to choose platforms where personnel and infrastructure costs do not scale in tandem with increases in data. In other words, a 10x increase in data should not bring a 10x increase in personnel and infrastructure costs; rather, these costs should scale at a much lower rate than that of data growth. One way to accomplish this goal is through storage optimization.
Storage optimization
The data platform should enable optimization for both performance and capacity. This requires a platform that supports tiers of storage, so that the most frequently accessed data lives in faster tiers and seldom-used data is held in lower-cost, higher-capacity tiers. The movement of data between tiers should happen automatically, with the system deciding where data belongs based on enterprise policies configured by the users.
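The automatic tier placement described above can be sketched as a simple policy function. The thresholds and tier names here are hypothetical examples of the kind of policy an administrator might configure; an actual platform would evaluate richer criteria (file type, path, size) on a schedule.

```python
from datetime import datetime, timedelta

# Illustrative policy thresholds (hypothetical values an admin might set).
HOT_WINDOW = timedelta(days=30)    # accessed within 30 days -> fast tier
WARM_WINDOW = timedelta(days=180)  # accessed within 180 days -> capacity tier

def choose_tier(last_access: datetime, now: datetime) -> str:
    """Pick a storage tier from a file's last-access time."""
    age = now - last_access
    if age <= HOT_WINDOW:
        return "performance"   # e.g. all-flash nodes
    if age <= WARM_WINDOW:
        return "capacity"      # e.g. hybrid nodes
    return "archive"           # e.g. high-density, low-cost nodes

now = datetime(2018, 6, 1)
print(choose_tier(datetime(2018, 5, 20), now))  # performance
print(choose_tier(datetime(2017, 1, 1), now))   # archive
```

Because the policy runs inside the storage system, applications keep using the same paths while the data quietly moves to the most cost-effective tier.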
Flexible data access
Data analytics and AI platforms change often, so data must be accessible across different platforms and applications, including those you use today and those you might adopt in the future. This means your platform should support the data access interfaces most commonly used by data analytics and AI software. Examples of such interfaces are NFS, SMB, HTTP, FTP, OpenStack Swift-based object access for your cloud initiatives, and native Hadoop Distributed File System (HDFS) access to support Hadoop-based applications.
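The value of multi-protocol access is that one copy of the data can be addressed by many clients at once. The sketch below shows how a single stored object might map to per-protocol access paths; the cluster name, pool, ports and URI shapes are illustrative assumptions, not the syntax of any particular product.

```python
def access_paths(cluster: str, pool: str, obj: str) -> dict:
    """Hypothetical mapping of one stored file to per-protocol access paths."""
    return {
        "nfs":  f"/mnt/{cluster}/{pool}/{obj}",         # POSIX path via an NFS mount
        "smb":  f"\\\\{cluster}\\{pool}\\{obj}",        # UNC path for Windows clients
        "hdfs": f"hdfs://{cluster}:8020/{pool}/{obj}",  # Hadoop/Spark applications
        "http": f"https://{cluster}:8080/{pool}/{obj}", # object/REST-style access
    }

paths = access_paths("datalake01", "analytics", "events/2018/06/01.parquet")
for proto, path in paths.items():
    print(proto, path)
```

Each protocol front-end resolves to the same underlying file, so a Hadoop job, a Windows analyst and a REST client all see one consistent dataset with no copies to migrate.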
These are some of the key considerations as you design your architecture to address data gravity issues when deploying large-scale data analytics and AI solutions. For a closer look at the features of a data lake that can help you avoid the architectural traps that come with data gravity, explore the capabilities of Dell EMC data analytics platforms and AI platforms, all based on Dell EMC Isilon, the foundation for a consolidated data lake.
1. The Register, "We've heard of data gravity – we're just not sure how to defy it yet," January 2, 2018. See also Dave McCrory, "Data Gravity – in the Clouds," December 7, 2010.