Cloud-Native Computing, Workloads, and Elasticity

Over the next few weeks, I will share my perspective on current best practices in big data, the term I will use to cover thinking about analytical data systems: data lakes, data warehouses, and operational data stores. Along the way, I will consider how analytical workloads are evolving with AI and machine learning, discuss data architecture and virtual database technology, preview new hardware technologies (memory and processors) and, most importantly, review the implications of cloud computing for the whole kit and caboodle.

In this article, I want to start laying the groundwork for a discussion of the cloud. We will see how scalable cloud computing makes performance “free”, and then we will see how dedicated resources increase efficiency and further reduce costs. The next article builds on these concepts to describe where cloud database products are likely to evolve.

To begin, let’s describe a workload that runs three ETL scripts and consider what happens when the three scripts run as separate workloads. Imagine an ETL batch job. Batch jobs are isolated: they read data from one or more source systems, perform a series of integration steps as programs, and then load the results with a load utility into a lake, warehouse, or store. By “isolated”, I mean that the computing resources performing the integration steps do not need to interact with other systems.

If you had a dedicated server to run a single ETL script, then as long as it could read the raw source data and the reference data required for integration, there would be no need for connectivity to other systems. With all the ETL data and scripts in hand, the process could run autonomously. If you need to run two ETL scripts at the same time, you can deploy the software on two separate servers; three scripts could run on three separate servers or clusters. As long as you replicate the ETL software and the required data each time, there is no problem running any number of separate ETL systems.
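
To make “isolated” concrete, here is a minimal sketch of such a self-contained ETL script in Python. The file paths, column names, and load step are hypothetical stand-ins; the point is that everything the job needs (source data, reference data, and the load target) is known up front, so the job can run on its own without talking to other systems.

```python
# Minimal sketch of a self-contained ETL job (hypothetical paths and names).
# It reads source and reference data, integrates them, and writes the result
# to a staging file that a bulk-load utility would pick up -- no other systems involved.
import csv

SOURCE_FILE = "/data/source/orders.csv"          # raw source extract
REFERENCE_FILE = "/data/reference/products.csv"  # reference data for integration
OUTPUT_FILE = "/data/staging/orders_enriched.csv"

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(orders, products):
    # Integration step: join each order to its product reference row.
    by_id = {p["product_id"]: p for p in products}
    for order in orders:
        ref = by_id.get(order["product_id"], {})
        order["category"] = ref.get("category", "unknown")
    return orders

def load(rows, path):
    # Stand-in for a warehouse/lake bulk-load utility.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    load(transform(extract(SOURCE_FILE), extract(REFERENCE_FILE)), OUTPUT_FILE)
```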

In a cloud environment, you can easily spin up three separate clusters, run the scripts, and then spin the clusters down, paying only for what you use. The ability to dynamically acquire and release resources in the cloud is called “elasticity.” It is a feature of cloud-native applications, not of every application that happens to run in the cloud. In other words, if you design your ETL software to be self-contained and deploy it using a cloud operating system that manages resources, you can take advantage of the elasticity of the cloud. Tools such as Docker containers and Kubernetes make this possible.
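
As a rough illustration of that idea, the sketch below submits one self-contained ETL script as a Kubernetes Job using the official Python client, so the scheduler acquires resources for the job and releases them when it finishes. The image name, job name, and namespace are hypothetical, and the snippet assumes a cluster is already configured in your kubeconfig; it is a sketch of the pattern, not a complete deployment.

```python
# Sketch: submit one self-contained ETL script as a Kubernetes Job so the
# cluster can acquire resources for it and release them when it ends.
# Assumes the official `kubernetes` Python client and a configured kubeconfig;
# the image, job name, and namespace below are hypothetical.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="etl-script-1"),
    spec=client.V1JobSpec(
        backoff_limit=2,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="etl",
                        image="registry.example.com/etl:latest",  # hypothetical image
                        command=["python", "etl_job.py"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

Run three of these Jobs with different names and arguments and you have three isolated workloads that the cloud can spin up and tear down independently.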

To continue, imagine that the three ETL jobs run against large datasets overnight, and that each takes three hours to complete on a dedicated cluster of twenty-four servers. If the three jobs run simultaneously on that single cluster, they finish in twelve hours. This estimate assumes that the three jobs compete for the servers’ compute 25% of the time. If the work is CPU-bound, this is an optimistic assumption and the elapsed time could be longer.
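
For what it is worth, here is one way to read that estimate (my interpretation, not arithmetic given in the scenario): three three-hour jobs represent nine hours of cluster work, and if a quarter of the shared cluster’s time is lost to contention, only 75% of it is productive, which stretches the elapsed time to twelve hours.

```python
# One way the twelve-hour estimate can be read (an assumption on my part):
# contention wastes 25% of the shared cluster's time, so only 75% is productive.
ideal_hours = 3 * 3          # three jobs, three hours each, run back to back
productive_fraction = 0.75   # 25% of the time lost to contention
elapsed_hours = ideal_hours / productive_fraction
print(elapsed_hours)         # 12.0
```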

The scripts run overnight so they can access dedicated resources. During the day the cluster executes queries, and contention between batch scripts and queries for the processors is difficult to manage. Finally, imagine that these twenty-four cloud servers cost $4 per server per hour including software, or $1,152 per day to run the three ETL scripts, not counting storage costs ($4/server/hour × 24 servers × 12 hours = $1,152).

If our ETL programs are scalable, we could run twice as many servers and complete the work in six hours. Note that the cost is still $1,152 ($4/server/hour × 48 servers × 6 hours = $1,152), and we could double again to finish the job in three hours at the same price ($4 × 96 × 3 = $1,152). This scaling can continue as far as you like, as long as your cloud provider lets you pay in smaller and smaller increments.
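
The arithmetic is easy to check: the bill is the hourly rate times the number of servers times the elapsed hours, and because the server count and the elapsed time move in opposite directions, the product stays fixed. A quick sketch:

```python
# Cost stays constant as a scalable job is spread over more servers:
# rate * servers * hours, with servers doubling while hours halve.
rate = 4.0  # $ per server per hour
for servers, hours in [(24, 12), (48, 6), (96, 3)]:
    print(servers, "servers x", hours, "hours =", f"${rate * servers * hours:,.0f}")
# 24 servers x 12 hours = $1,152
# 48 servers x 6 hours = $1,152
# 96 servers x 3 hours = $1,152
```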

This example illustrates the first important point: if you have self-contained, scalable workloads, you can scale out in the cloud to reduce execution times at no additional cost.

Now let’s see what happens if we run each script on a separate cluster. With dedicated servers, each job takes three hours, and the cost per job is $4/server/hour × 24 servers × 3 hours, or $288. If we spin up 72 servers and run each script as a separate job, all three finish in three hours for $864. The savings come from removing the contention between the three jobs and giving each job dedicated resources.
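
The same arithmetic shows where the savings come from: the shared cluster pays for the hours stretched out by contention, while the three dedicated clusters do not. A quick comparison:

```python
# Shared cluster (contention stretches the run to 12 hours) versus
# three dedicated 24-server clusters finishing in 3 hours each.
rate = 4.0  # $ per server per hour

shared = rate * 24 * 12           # one 24-server cluster, all three jobs, 12 hours
dedicated = 3 * (rate * 24 * 3)   # three 24-server clusters, 3 hours each

print(f"shared:    ${shared:,.0f}")     # $1,152
print(f"dedicated: ${dedicated:,.0f}")  # $864
```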

While this may seem obvious, we are so used to sharing computers that we forget that contention is waste. Whether we are fighting over a disk drive to read or write, over memory, over the CPU caches (L3, L2, and L1), or over instruction fetch and execution, the cost of managing contention adds inefficiency. More on this in a few articles, when I want to talk about how databases can reduce contention, how processor technology helps, and, most importantly, how technology like Intel Optane could play a role in the future.

 

Let me conclude with a few caveats regarding this invented scenario.

  • First, if the scripts are I/O-bound rather than CPU-bound, they may run with less contention. ETL programs that stream data between steps tend to be CPU-bound because they do not perform I/O to spool intermediate results. Some contention will still be there, and the cost estimate takes it into account. If the jobs become more fully CPU-bound because the scripts run in memory, the contention will be greater and the cost difference will be larger.
  • Second, there are start-up costs associated with cloud clusters. Spinning up a machine can take several minutes, and the more servers there are, the more expensive the start-up becomes. We will consider this further in the next post.

 

So far we have highlighted two key points:

  • If we have a scalable system and a self-contained workload, we can deploy cloud computing at scale to reduce execution times at no additional cost. There is no reason to suffer long-running batch jobs.
  • If we have multiple units of work, then where in the past we would have run them concurrently and let them compete for a finite number of computers, with cloud computing we can give each discrete workload its own resources and scale out at a reduced cost.

 

In the next article, we will discuss smaller units of work in a database. With that foundation, we can talk about the power provided by products like Snowflake, and we can show a path for cloud databases to become even more efficient.
