Driving Storage Costs Down for AWS Customers

One of the things that differentiates Amazon Web Services from other technology providers is its commitment to letting customers benefit from continuous cost-cutting innovations and from the economies of scale AWS is able to achieve. As we showed last week, one of the services that is growing rapidly is the Amazon Simple Storage Service (S3).

[image]

AWS today announced a substantial price drop, effective February 1, 2012, for Amazon S3 standard storage to help customers drive their storage costs down. A customer storing 50TB will see on average a 12% drop in cost when they get their Amazon S3 bill for February. Other storage tiers may see even greater cost savings.

These Amazon S3 cost savings will also help drive down the cost of Amazon EBS snapshots and AWS Storage Gateway snapshots. For example, in the US East (Virginia) Region, their cost will drop from $0.14 to $0.125 per Gigabyte.
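To make the snapshot price change concrete, here is a minimal sketch of the arithmetic. Only the two per-Gigabyte prices come from the announcement; the 500 GB snapshot volume and the function name `percent_drop` are made up for illustration.

```python
def percent_drop(old_price: float, new_price: float) -> float:
    """Return the relative price reduction as a percentage."""
    return (old_price - new_price) / old_price * 100.0


# Snapshot prices per Gigabyte quoted above for US East (Virginia).
old_price, new_price = 0.14, 0.125
drop = percent_drop(old_price, new_price)
print(f"Snapshot storage is {drop:.1f}% cheaper")

# Hypothetical example: cost of 500 GB of snapshots before and after.
snapshot_gb = 500
print(f"${snapshot_gb * old_price:.2f} -> ${snapshot_gb * new_price:.2f}")
```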

At a time when on-premises infrastructure costs are rising significantly, it is great to see that AWS can let all of its customers, big and small, benefit from the cost-cutting innovations in storage.

More details can be found in the Forum Announcement, on Jeff Barr's blog and on the Amazon S3 Pricing Page.

Expanding the Cloud - The AWS Storage Gateway

Today Amazon Web Services has launched the AWS Storage Gateway, making the power of secure and reliable cloud storage accessible from customers’ on-premises applications.

We have been working closely with our customers on their requests to bring the power of the Amazon Web Services cloud closer to their existing on-premises compute infrastructures. The Amazon Virtual Private Cloud extends on-premises compute with all the power of AWS, making it elastic, scalable and highly reliable. AWS Identity and Access Management brings together on-premises and cloud identity management. VM Import allows our customers to move virtual machine images from their datacenters to the Cloud and Amazon Direct Connect makes the network latencies and bandwidth between on-premises and AWS more predictable. With the launch of the AWS Storage Gateway our customers can now integrate their on-premises IT environment with AWS’s storage infrastructure.

The AWS Storage Gateway is a service connecting an on-premises software appliance with cloud-based storage. Once the AWS Storage Gateway’s software appliance is installed on a local host, you can mount Storage Gateway volumes to your on-premises application servers as iSCSI devices, enabling a wide variety of systems and applications to make use of them. Data written to these volumes is maintained on your on-premises storage hardware while being asynchronously backed up to AWS, where it is stored in Amazon S3 in the form of Amazon EBS snapshots. Snapshots are encrypted to make sure that customers do not have to worry about encrypting sensitive data themselves. When customers need to retrieve data, they can restore snapshots locally, or create Amazon EBS volumes from snapshots for use with applications running in Amazon EC2.

[image]

Here are three example use cases that we envision for the AWS Storage Gateway. The first one is using the AWS Storage Gateway to back up your data to Amazon S3’s highly reliable storage environment. Amazon S3 is designed to sustain the concurrent loss of data in two facilities, redundantly storing your data on multiple devices across multiple facilities in an AWS Region. So, backing up your data to Amazon S3 means far fewer headaches worrying about your local storage environment.

The second use case is where customers want to move data between local infrastructure and the Amazon Web Services cloud to provide access to applications and other computations running in Amazon EC2. The use of the Amazon EBS snapshot format means the data that was on-premises can be restored as an Amazon EBS volume mounted to an Amazon EC2 instance.

The third use case, cloud-based Disaster Recovery, is a specific variation of the previous two. If there is a failure in your local infrastructure, you can quickly launch a DR environment in Amazon EC2 which will have full access to the data snapshots backed up into Amazon S3 by the AWS Storage Gateway.

For more information on the AWS Storage Gateway, you can visit the detail page. Jeff Barr over at the AWS Developer Blog has more details.

Today is a very exciting day as we release Amazon DynamoDB, a fast, highly reliable and cost-effective NoSQL database service designed for internet scale applications. DynamoDB is the result of 15 years of learning in the areas of large scale non-relational databases and cloud services. Several years ago we published a paper on the details of Amazon’s Dynamo technology, which was one of the first non-relational databases developed at Amazon. The original Dynamo design was based on a core set of strong distributed systems principles resulting in an ultra-scalable and highly reliable database system. Amazon DynamoDB, which is a new service, continues to build on these principles, and also builds on our years of experience with running non-relational databases and cloud services, such as Amazon SimpleDB and Amazon S3, at scale. It is very gratifying to see all of our learning and experience become available to our customers in the form of an easy-to-use managed service.

Amazon DynamoDB is a fully managed NoSQL database service that provides fast performance at any scale. Today’s web-based applications often encounter database scaling challenges when faced with growth in users, traffic, and data. With Amazon DynamoDB, developers scaling cloud-based applications can start small with just the capacity they need and then increase the request capacity of a given table as their app grows in popularity. Their tables can also grow without limits as their users store increasing amounts of data. Behind the scenes, Amazon DynamoDB automatically spreads the data and traffic for a table over a sufficient number of servers to meet the request capacity specified by the customer. Amazon DynamoDB offers low, predictable latencies at any scale. Customers can typically achieve average service-side latencies in the single-digit milliseconds. Amazon DynamoDB stores data on Solid State Drives (SSDs) and replicates it synchronously across multiple AWS Availability Zones in an AWS Region to provide built-in high availability and data durability.

History of NoSQL at Amazon – Dynamo

The Amazon.com ecommerce platform consists of hundreds of decoupled services developed and managed in a decentralized fashion. Each service encapsulates its own data and presents a hardened API for others to use. Most importantly, direct database access to the data from outside its respective service is not allowed. This architectural pattern was a response to the scaling challenges that had confronted Amazon.com through its first 5 years, when direct database access was one of the major bottlenecks in scaling and operating the business. While a service-oriented architecture addressed the problems of a centralized database architecture, each service was still using traditional data management systems. The growth of Amazon’s business meant that many of these services needed more scalable database solutions.

In response, we began to develop a collection of storage and database technologies to address the demanding scalability and reliability requirements of the Amazon.com ecommerce platform. We had been pushing the scalability of commercially available technologies to their limits and finally reached a point where these third party technologies could no longer be used without significant risk. This was not our technology vendors’ fault; Amazon's scaling needs were beyond the specs for their technologies and we were using them in ways that most of their customers were not. A number of outages at the height of the 2004 holiday shopping season can be traced back to scaling commercial technologies beyond their boundaries.

Dynamo was born out of our need for a highly reliable, ultra-scalable key/value database. This non-relational, or NoSQL, database was targeted at use cases that were core to the Amazon ecommerce operation, such as the shopping cart and session service. Any downtime or performance degradation in these services has an immediate financial impact and their fault-tolerance and performance requirements for their data systems are very strict. These services also require the ability to scale infrastructure incrementally to accommodate growth in request rates or dataset sizes. Another important requirement for Dynamo was predictability. This is not just predictability of median performance and latency, but also at the end of the distribution (the 99.9th percentile), so we could provide acceptable performance for virtually every customer.

To achieve all of these goals, we needed to do groundbreaking work. After the successful launch of the first Dynamo system, we documented our experiences in a paper so others could benefit from them. Since then, several Dynamo clones have been built and the Dynamo paper has been the basis for several other types of distributed databases. This demonstrates that Amazon is not the only company that needs better tools to meet its database needs.

Lessons learned from Amazon's Dynamo

Dynamo has been in use by a number of core services in the ecommerce platform, and their engineers have been very satisfied by its performance and incremental scalability. However, we never saw much adoption beyond these core services. This was remarkable because although Dynamo was originally built to serve the needs of the shopping cart, its design and implementation were much broader and based on input from many other service architects. As we spoke to many senior engineers and service owners, we saw a clear pattern start to emerge in their explanations of why they didn't adopt Dynamo more broadly: while Dynamo gave them a system that met their reliability, performance, and scalability needs, it did nothing to reduce the operational complexity of running large database systems. Since they were responsible for running their own Dynamo installations, they had to become experts on the various components running in multiple data centers. Also, they needed to make complex tradeoff decisions between consistency, performance, and reliability. This operational complexity was a barrier that kept them from adopting Dynamo.

During this period, several other systems appeared in the Amazon ecosystem that did meet their requirements for simplified operational complexity, notably Amazon S3 and Amazon SimpleDB. These were built as managed web services that eliminated the operational complexity of managing systems while still providing extremely high durability. Amazon engineers preferred to use these services instead of managing their own databases like Dynamo, even though Dynamo's functionality was better aligned with their applications’ needs.

With Dynamo we had taken great care to build a system that met the requirements of our engineers. After evaluations, it was often obvious that Dynamo was ideal for many database use cases. But ... we learned that engineers found the prospect of running a large software system daunting and instead looked for less ideal design alternatives that freed them from the burden of managing databases and allowed them to focus on their applications.

It became obvious that developers strongly preferred simplicity to fine-grained control as they voted "with their feet" and adopted cloud-based AWS solutions, like Amazon S3 and Amazon SimpleDB, over Dynamo. Dynamo might have been the best technology in the world at the time but it was still software you had to run yourself. And nobody wanted to learn how to do that if they didn’t have to. Ultimately, developers wanted a service.

History of NoSQL at Amazon - SimpleDB

One of the cloud services Amazon developers preferred for their database needs was Amazon SimpleDB. In the 5 years that SimpleDB has been operational, we have learned a lot from its customers.

First and foremost, we have learned that a database service that takes away the operational headache of managing distributed systems is extremely powerful. Customers like SimpleDB’s table interface and its flexible data model. Not having to update their schemas when their systems evolve makes life much easier. However, they most appreciate the fact that SimpleDB just works. It provides multi-data center replication, high availability, and offers rock-solid durability. And yet customers never need to worry about setting up, configuring, or patching their database.

Second, most database workloads do not require the complex query and transaction capabilities of a full-blown relational database. A database service that only presents a table interface with a restricted query set is a very important building block for many developers.

While SimpleDB has been successful and powers the applications of many customers, it has some limitations that customers have consistently asked us to address.

Domain scaling limitations. SimpleDB requires customers to manage their datasets in containers called Domains, which have a finite capacity in terms of storage (10 GB) and request throughput. Although many customers worked around SimpleDB’s scaling limitations by partitioning their workloads over many Domains, this side of SimpleDB is certainly not simple. It also fails to meet the requirement of incremental scalability, something that is critical to many customers looking to adopt a NoSQL solution.

Predictability of Performance. SimpleDB, in keeping with its goal to be simple, indexes all attributes for each item stored in a domain. While this simplifies the customer experience on schema design and provides query flexibility, it has a negative impact on the predictability of performance. For example, every database write needs to update not just the basic record, but also all attribute indices (regardless of whether the customer is using all the indices for querying). Similarly, since the Domain maintains a large number of indices, its working set does not always fit in memory. This impacts the predictability of a Domain’s read latency, particularly as dataset sizes grow.

Consistency. SimpleDB’s original implementation had taken the "eventually consistent" approach to the extreme and presented customers with consistency windows that were up to a second in duration. This meant the system was not intuitive to use and developers used to a more traditional database solution had trouble adapting to it. The SimpleDB team eventually addressed this issue by enabling customers to specify whether a given read operation should be strongly or eventually consistent.

Pricing complexity. SimpleDB introduced a very fine-grained pricing dimension called “Machine Hours.” Although most customers have eventually learned how to predict their costs, it was not really transparent or simple.

Introducing DynamoDB

As we thought about how to address the limitations of SimpleDB and provide 1) the most scalable NoSQL solution available and 2) predictable high performance, we realized our goals could not be met with the SimpleDB APIs. Some SimpleDB operations require that all data for a Domain is on a single server, which prevents us from providing the seamless scalability our customers are demanding. In addition, SimpleDB APIs assume all item attributes are automatically indexed, which limits performance.

We concluded that an ideal solution would combine the best parts of the original Dynamo design (incremental scalability, predictable high performance) with the best parts of SimpleDB (ease of administration of a cloud service, consistency, and a table-based data model that is richer than a pure key-value store). These architectural discussions culminated in Amazon DynamoDB, a new NoSQL service that we are excited to release today.

Amazon DynamoDB is based on the principles of Dynamo, a progenitor of NoSQL, and brings the power of the cloud to the NoSQL database world. It offers customers high availability, reliability, and incremental scalability, with no limits on dataset size or request throughput for a given table. And it is fast – it runs on the latest in solid-state drive (SSD) technology and incorporates numerous other optimizations to deliver low latency at any scale.

Amazon DynamoDB is the result of everything we’ve learned from building large-scale, non-relational databases for Amazon.com and building highly scalable and reliable cloud computing services at AWS. Amazon DynamoDB is a NoSQL database service that offers the following benefits:

Managed. DynamoDB frees developers from the headaches of provisioning hardware and software, setting up and configuring a distributed database cluster, and managing ongoing cluster operations. It handles all the complexities of scaling and partitions and re-partitions your data over more machine resources to meet your I/O performance requirements. It also automatically replicates your data across multiple Availability Zones (and automatically re-replicates in the case of disk or node failures) to meet stringent availability and durability requirements. From our experience of running Amazon.com, we know that manageability is a critical requirement. We have seen many job postings from companies using NoSQL products that are looking for NoSQL database engineers to help scale their installations. We know from our Amazon experiences that once these clusters start growing, managing them becomes the same nightmare that running large RDBMS installations was. Because Amazon DynamoDB is a managed service, you won’t need to hire experts to manage your NoSQL installation—your developers can do it themselves.

Scalable. Amazon DynamoDB is designed to scale the resources dedicated to a table to hundreds or even thousands of servers spread over multiple Availability Zones to meet your storage and throughput requirements. There are no pre-defined limits to the amount of data each table can store. Developers can store and retrieve any amount of data and DynamoDB will spread the data across more servers as the amount of data stored in your table grows.

Fast. Amazon DynamoDB provides high throughput at very low latency. It is also built on Solid State Drives to help optimize for high performance even at high scale. Moreover, because not all attributes are indexed, the cost of read and write operations is low: write operations involve updating only the primary key index, which reduces the latency of both read and write operations. An application running in EC2 will typically see average service-side latencies in the single-digit millisecond range for a 1KB object. Most importantly, DynamoDB latencies are predictable. Even as datasets grow, latencies remain stable due to the distributed nature of DynamoDB's data placement and request routing algorithms.

Durable and Highly Available. Amazon DynamoDB replicates its data over at least 3 different data centers so that the system can continue to operate and serve data even under complex failure scenarios.

Flexible. Amazon DynamoDB is an extremely flexible system that does not force its users into a particular data model or a particular consistency model. DynamoDB tables do not have a fixed schema but instead allow each data item to have any number of attributes, including multi-valued attributes. Developers can optionally use stronger consistency models when accessing the database, trading off some performance and availability for a simpler model. They can also take advantage of the atomic increment/decrement functionality of DynamoDB for counters.

Low cost. Amazon DynamoDB’s pricing is simple and predictable: Storage is $1 per GB per month. Requests are priced based on how much capacity is reserved: $0.01 per hour for every 10 units of Write Capacity and $0.01 per hour for every 50 units of Read Capacity. A unit of Read (or Write) Capacity equals one read (or write) per second of capacity for items up to 1KB in size. If you use eventually consistent reads, you can achieve twice as many reads per second for a given amount of Read Capacity. Larger items will require additional throughput capacity.
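The pricing rules above can be turned into a small worked calculation. This is a sketch under stated assumptions, not an official pricing calculator: it assumes capacity is billed proportionally rather than in whole blocks of 10 or 50 units, that an item larger than 1KB consumes one unit per started KB, and that a month has 730 hours. The function names and the example workload are made up for illustration.

```python
import math

# Prices quoted in the text above.
PRICE_PER_10_WRITE_UNITS_HOUR = 0.01
PRICE_PER_50_READ_UNITS_HOUR = 0.01
STORAGE_PRICE_PER_GB_MONTH = 1.00

# Assumption: average number of hours in a month.
HOURS_PER_MONTH = 730


def capacity_units(ops_per_second, item_kb, eventually_consistent=False):
    """Estimate the capacity units needed for a steady request rate.

    Assumption: each started KB of an item costs one unit per operation.
    Eventually consistent reads need half the Read Capacity.
    """
    units = ops_per_second * max(1, math.ceil(item_kb))
    if eventually_consistent:
        units = units / 2
    return math.ceil(units)


def monthly_cost(read_units, write_units, storage_gb):
    """Estimate the monthly bill for provisioned capacity plus storage."""
    hourly = (write_units / 10) * PRICE_PER_10_WRITE_UNITS_HOUR
    hourly += (read_units / 50) * PRICE_PER_50_READ_UNITS_HOUR
    return hourly * HOURS_PER_MONTH + storage_gb * STORAGE_PRICE_PER_GB_MONTH


# Hypothetical workload: 1,000 strongly consistent reads/s and 200 writes/s
# of 1KB items, with 50 GB stored.
reads = capacity_units(1000, item_kb=1)
writes = capacity_units(200, item_kb=1)
print(reads, writes, round(monthly_cost(reads, writes, storage_gb=50), 2))
```

Switching the same read workload to eventually consistent reads halves the Read Capacity that `capacity_units` returns, which is the "twice as many reads per second" trade-off described above.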

In the current release, customers will have the choice of using two types of keys for primary index querying: Simple Hash Keys and Composite Hash Key / Range Keys:

Simple Hash Key gives DynamoDB the Distributed Hash Table abstraction. The key is hashed over the different partitions to optimize workload distribution. For more background on this please read the original Dynamo paper.

Composite Hash Key with Range Key allows the developer to create a primary key that is the composite of two attributes, a “hash attribute” and a “range attribute.” When querying against a composite key, the hash attribute needs to be uniquely matched but a range operation can be specified for the range attribute: e.g. all orders from Werner in the past 24 hours, or all log entries from server 16 with client IP addresses on subnet 192.168.1.0.
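The two key types can be illustrated with a toy in-memory table. This is only a sketch of the idea, not DynamoDB's implementation or API: the class `CompositeKeyTable`, the MD5-based partition choice, and the fixed partition count are all assumptions made for the example. The hash attribute selects a partition and must match exactly, while items sharing a hash attribute are kept sorted by range attribute so a range can be scanned.

```python
import bisect
import hashlib
from collections import defaultdict


class CompositeKeyTable:
    """Toy table keyed by (hash attribute, range attribute)."""

    def __init__(self, num_partitions=4):
        # Each partition maps a hash attribute value to a list of
        # (range_value, item) pairs kept sorted by range_value.
        self.partitions = [defaultdict(list) for _ in range(num_partitions)]

    def _partition_for(self, hash_value):
        # Hash the key so that keys spread evenly over the partitions.
        digest = hashlib.md5(str(hash_value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.partitions)

    def put(self, hash_value, range_value, item):
        rows = self.partitions[self._partition_for(hash_value)][hash_value]
        keys = [r for r, _ in rows]
        i = bisect.bisect_left(keys, range_value)
        if i < len(rows) and rows[i][0] == range_value:
            rows[i] = (range_value, item)  # same primary key: overwrite
        else:
            rows.insert(i, (range_value, item))

    def query(self, hash_value, low, high):
        """Return items whose range attribute lies in [low, high]."""
        partition = self.partitions[self._partition_for(hash_value)]
        rows = partition.get(hash_value, [])
        keys = [r for r, _ in rows]
        start = bisect.bisect_left(keys, low)
        end = bisect.bisect_right(keys, high)
        return [item for _, item in rows[start:end]]


# Hypothetical data: orders keyed by customer and by hour of purchase.
table = CompositeKeyTable()
table.put("werner", 20, "order-old")
table.put("werner", 110, "order-a")
table.put("werner", 125, "order-b")
table.put("jeff", 120, "order-c")

now = 130
print(table.query("werner", now - 24, now))  # Werner's orders, last 24 hours
```

With a Simple Hash Key only the `_partition_for` step applies: one item per key, looked up by exact match.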

Performance Predictability in DynamoDB

In addition to taking the best ideas of Dynamo and SimpleDB, we have added new functionality to provide even greater performance predictability.

Cloud-based systems have invented solutions to ensure fairness and present their customers with uniform performance, so that no burst load from any customer should adversely impact others. This is a great approach and makes for many happy customers, but often does not give a single customer the ability to ask for higher throughput if they need it.

As satisfied as engineers can be with the simplicity of cloud-based solutions, they would love to specify the request throughput they need and let the system reconfigure itself to meet their requirements. Without this ability, engineers often have to carefully manage caching systems to ensure they can achieve low-latency and predictable performance as their workloads scale. This introduces complexity that takes away some of the simplicity of using cloud-based solutions.

The number of applications that need this type of performance predictability is increasing: online gaming, social graph applications, online advertising, and real-time analytics, to name a few. AWS customers are building increasingly sophisticated applications that could benefit from a database that can give them fast, predictable performance that exactly matches their needs.

Amazon DynamoDB’s answer to this problem is “Provisioned Throughput.” Customers can now specify the request throughput capacity they require for a given table. Behind the scenes, DynamoDB will allocate sufficient resources to the table to predictably achieve this throughput with low-latency performance. Throughput reservations are elastic, so customers can increase or decrease the throughput capacity of a table on-demand using the AWS Management Console or the DynamoDB APIs. CloudWatch metrics enable customers to make informed decisions about the right amount of throughput to dedicate to a particular table. Customers using the service tell us that it enables them to achieve the appropriate amount of control over scaling and performance while maintaining simplicity. Rather than adding server infrastructure and re-partitioning their data, they simply change a value in the management console and DynamoDB takes care of the rest.

Summary

Amazon DynamoDB is designed to maintain predictably high performance and to be highly cost efficient for workloads of any scale, from the smallest to the largest internet-scale applications. You can get started with Amazon DynamoDB using a free tier that enables 40 million requests per month free of charge. Additional request capacity is priced at cost-efficient hourly rates as low as $0.01 per hour for 10 units of Write Capacity or 50 strongly consistent units of Read Capacity (if you use eventually consistent reads you can get twice the throughput at the same cost, or the same read throughput at half the cost). Also, replicated solid state disk (SSD) storage is $1 per GB per month. Our low request pricing is designed to meet the needs of typical database workloads that perform large numbers of reads and writes against every GB of data stored.

To learn more about Amazon DynamoDB, its functionality, APIs, use cases, and service pricing, please visit the detail page at aws.amazon.com/DynamoDB and also the Developer Guide. I am excited to see the years of experience with systems such as Amazon Dynamo result in an innovative database service that can be broadly used by all our customers.

Countdown to What is Next in AWS

Join me at 9AM PST on Wednesday January 18, 2012 to find out what is next in the AWS Cloud. Registration required.

[image]

Today, Amazon Web Services is expanding its worldwide coverage with the launch of a new AWS Region in Sao Paulo, Brazil. This new Region has been highly requested by companies worldwide, and it provides low-latency access to AWS services for those who target customers in South America.

South America is one of the fastest growing economic regions in the world. In particular, South American IT-oriented companies are seeing very rapid growth. Case in point: over the past 10 years IT has risen to become 7% of the GDP in Brazil. With the launch of the South America (Sao Paulo) Region, AWS now provides companies large and small with infrastructure that allows them to get to market faster while reducing their costs, which enables them to focus on delivering value instead of wasting time on non-differentiating tasks.

Local companies are not the only ones who have frequently asked us for a South American Region. Companies from outside South America who would like to start delivering their products and services to the South American market have asked for it as well. Many of these firms have wanted to enter this market for years but had refrained due to the daunting task of acquiring local hosting or datacenter capacity. These companies can now benefit from the fact that the new Sao Paulo Region is similar to all other AWS Regions, which enables software developed for other Regions to be quickly deployed in South America as well.

Several prominent South American customers have been using AWS since the early days. The new Sao Paulo Region provides better latency to South America, which enables AWS customers to deliver higher performance services to their South American end-users. Additionally, it allows them to keep their data inside of Brazil. In the words of Guilherme Horn, the CEO of ÓRAMA, a Brazilian financial services firm and AWS customer: “The opening of the South America Sao Paulo Region will enable greater flexibility in developing new services as well as guarantee that we will always be compliant to the needs of the regulations of the financial markets.”

You can learn more about our growing global infrastructure footprint at http://aws.amazon.com/about-aws/globalinfrastructure. Please also visit the AWS developer blog for more great stories from our South American customers.

Expanding the Cloud - Introducing Amazon ElastiCache

Today AWS has launched Amazon ElastiCache, a new service that makes it easy to add distributed in-memory caching to any application. Amazon ElastiCache handles the complexity of creating, scaling and managing an in-memory cache to free up brainpower for more differentiating activities. There are many success stories about the effectiveness of caching in many different scenarios; besides helping applications achieve fast and predictable performance, it often protects databases from request bursts and brownouts under overload conditions. Systems that make extensive use of caching almost all report a significant reduction in the cost of their database tier. Given the widespread use of caching in many of the applications in the AWS Cloud, a caching service had been high on the request list of our customers.

[image]

Caching has become a standard component in many applications to achieve fast and predictable performance, but maintaining a collection of cache servers in a reliable and scalable manner is not a simple task. These efforts clearly fall into the category of "operational muck": given the widespread usage of caching, maintenance of cache servers is no longer a differentiator, yet everyone has to take it on as a "cost of doing business". Amazon ElastiCache takes away many of the headaches of deploying, operating and scaling the caching infrastructure. A Cache Cluster, which is a set of collaborating Cache Nodes, can be started in minutes. Scaling the total memory in the Cache Cluster is under complete control of the customers, as Cache Nodes can be added and deleted on demand. Amazon CloudWatch can be used to get detailed metrics about the performance of the Cache Nodes. Amazon ElastiCache automatically detects and replaces failed Cache Nodes to protect the cluster from those failure scenarios. Access to the Cache Cluster is controlled using Cache Security Groups, giving customers full control over which application components can access which Cache Cluster.

Amazon ElastiCache is compliant with Memcached, which makes it easy for developers who are already familiar with that system to start using the service immediately. Existing applications, tools and libraries that are using a Memcached environment can simply switch over to using Amazon ElastiCache without much effort.
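A common way applications use such a cache to protect the database tier is the cache-aside pattern, sketched below. To stay runnable without a cache server, the sketch uses an in-memory stand-in (`InMemoryCache`) whose `get`/`set` methods only imitate the style of a Memcached client; it is not a real client library, and `SlowDatabase`, `get_user`, and the key format are hypothetical names for this example.

```python
import time


class InMemoryCache:
    """Stand-in for a cache client with Memcached-style get/set."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # entry expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds=None):
        expires_at = None
        if ttl_seconds is not None:
            expires_at = time.monotonic() + ttl_seconds
        self._data[key] = (value, expires_at)


class SlowDatabase:
    """Pretend database that counts how often it is queried."""

    def __init__(self, rows):
        self.rows = rows
        self.query_count = 0

    def load_user(self, user_id):
        self.query_count += 1
        return self.rows.get(user_id)


def get_user(cache, db, user_id, ttl_seconds=300):
    """Cache-aside read: try the cache first, fall back to the database."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    user = db.load_user(user_id)
    if user is not None:
        cache.set(key, user, ttl_seconds)  # populate for later readers
    return user


cache = InMemoryCache()
db = SlowDatabase({42: {"name": "Werner"}})
first = get_user(cache, db, 42)   # miss: goes to the database
second = get_user(cache, db, 42)  # hit: served from the cache
print(first, second, db.query_count)
```

Because the application only depends on `get` and `set`, swapping the stand-in for a client pointed at a Memcached-compatible endpoint should not require changes to `get_user`.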

For more details on Amazon ElastiCache visit the detail page of the service. For more hands-on information and to get started right away, see Jeff Barr's posting on the AWS Developer Blog. Please note that Amazon ElastiCache is currently available in the US East (Virginia) Region. It will be available in other AWS Regions in the coming months.

Job Openings in AWS - Senior Leader in Database Services

There are some great job openings within Amazon Web Services. I will try to highlight some of those in coming weeks. This week it is an opening for senior leaders with AWS Database Services.

AWS Database Services is responsible for setting the database strategy and delivering distributed structured storage services to our AWS customers. This team is constantly rethinking the assumptions behind how traditional databases were built and constantly working on building the right database architectures suited for the Cloud environment. The database services organization is looking for senior leaders who will be able to hire and lead a large software development team that is responsible for designing and running services at the cutting edge of distributed database technology. These services help our customers build scalable database-driven applications in the cloud and have a significant bottom-line impact on our business.

The ideal candidate will be someone who has built and run large scale distributed systems and/or databases. She (or he) will be able to reason about the standard tradeoffs in building large scale distributed databases and is capable of guiding the team to make these tradeoffs.

For more information: Head of Software Development 

Driving down the cost of Big-Data analytics

The Amazon Elastic MapReduce (EMR) team announced today the ability to seamlessly use Amazon EC2 Spot Instances with their service, significantly driving down the cost of data analytics in the cloud. Many of our Big-Data customers already saw a big drop in their AWS bill last month when the cost of incoming bandwidth was dropped to $0.00. Now, given that historically customers using Spot Instances have seen cost savings of up to 66% over On-Demand Instance prices, Amazon EMR customers are poised to achieve even greater cost savings.

Analyzing vast amounts of data is critical for companies looking to incorporate customer insights into their business, including building recommendation engines or optimizing customer targeting. Hadoop is quickly becoming the preferred tool for this type of large scale data analytics. However, Hadoop users often waste significant intellectual bandwidth on managing clusters and running Hadoop jobs rather than focusing on creating value through analytics. Amazon Elastic MapReduce takes away much of this muck by providing a hosted Hadoop framework that enables businesses, researchers, data analysts, and developers to easily and efficiently spin up resizable clusters for distributed processing of large data sets.

[image]

An interesting observation is that data analytics is no longer the purview of large enterprises. Every young business launching today knows they must integrate data collection and analytics from the start. In order to compete in today’s market, these companies must have a deep understanding of their customers’ behavior, allowing them to continuously improve how they serve them. Launching a business with a minimally viable product and then rapidly iterating in the direction that customers lead them is becoming a standard approach to success. However, this cannot be done without efficient, scalable data analytics. Many of these startups are using Hadoop for data processing and Amazon Elastic MapReduce is the ideal environment for them: it provides instant scalability and lets them focus on analytics while EMR handles the hassle of running the various Hadoop components. Given the initial shoestring budget of many of these new companies, driving down the overall cost of analytics using Spot Instances is a huge benefit.

There are three categories of instances in an Amazon EMR cluster: 1) the Master Instance Group, which contains the Hadoop Master Node that schedules the various tasks, 2) the Core Instance Group, which contains instances that both store the data to be analyzed and run map and reduce tasks, and 3) the Task Instance Group, which only runs map and reduce tasks. For each instance group, you can decide to use On-Demand Instances (possibly from your Reserved Instances pool) or Spot Instances. If you choose to use Spot Instances you provide the bid price you are willing to pay for each instance in that group. If the current Spot Price is below the bid price, the Instance Group will launch. The instance groups for which Spot Instances are appropriate depend on the use case. For example, for data-critical workloads you might decide to run only the Task Group on Spot Instances, with the Core Group on On-Demand, while if you are performing application testing you may decide to run all Instance Groups using Spot Instances.
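To make the choices above concrete, here is a minimal Python sketch of the two example placements and the bid rule. It is a toy model, not the Amazon EMR API: the InstanceGroup class, the plan and can_launch functions, the use-case labels and the prices are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InstanceGroup:
    name: str                          # "master", "core" or "task"
    market: str                        # "ON_DEMAND" or "SPOT"
    bid_price: Optional[float] = None  # USD per instance-hour, Spot groups only


def can_launch(group: InstanceGroup, spot_price: float) -> bool:
    """On-Demand groups always launch; a Spot group launches only when the
    current Spot Price is below its bid price."""
    if group.market == "ON_DEMAND":
        return True
    if group.bid_price is None:
        raise ValueError(f"Spot group {group.name!r} needs a bid price")
    return spot_price < group.bid_price


def plan(use_case: str, bid: float) -> List[InstanceGroup]:
    """Return the market choice per group for the two examples in the text.
    The use-case labels are hypothetical, not EMR terminology."""
    if use_case == "data-critical":
        # Keep the node that schedules and the nodes that hold data on On-Demand.
        return [
            InstanceGroup("master", "ON_DEMAND"),
            InstanceGroup("core", "ON_DEMAND"),
            InstanceGroup("task", "SPOT", bid),
        ]
    if use_case == "testing":
        # Everything on Spot; losing the cluster is acceptable here.
        return [InstanceGroup(n, "SPOT", bid) for n in ("master", "core", "task")]
    raise ValueError(f"unknown use case: {use_case!r}")


if __name__ == "__main__":
    # Hypothetical prices: a bid of $0.10 against a Spot Price of $0.08.
    for group in plan("data-critical", bid=0.10):
        print(group.name, group.market, can_launch(group, spot_price=0.08))
```

In this model a Spot Price at or above the bid simply means the Spot groups do not launch, while the On-Demand groups are unaffected, which is why the data-critical layout keeps the Core Group on On-Demand.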

If you want a quick introduction on how to get started with mixing Spot Instances with On-Demand Instances in an Amazon EMR cluster, watch this Getting Started Video. More details can be found in the Spot Instances Section of the Amazon Elastic MapReduce Developer Guide. The posting on the AWS developer blog also has some more background.

No Server Required - Jekyll & Amazon S3

As some of you may remember, I was pretty excited when Amazon Simple Storage Service (S3) released its website feature, because it meant I could serve this weblog completely from S3. If you have a largely static site you can rely on the enormous power of S3 to make serving your content highly scalable and storing it extremely durable. Amazon S3 is much more than just storage; the network and distributed systems infrastructure that ensures content can be served fast and at high rates, without customers impacting each other, is amazing. Just dropping your website in an S3 bucket brings all that power to you.

And it is not just purely static websites. The increasing sophistication of client-side JavaScript has redefined what dynamic means; where in the past dynamic content would be mainly server generated, today much content is served statically with JavaScript on the client side doing the dynamic modifications. A good example is the comments section on this blog; a few lines of JavaScript and these pages have a dynamic nature with comments, trackbacks and social media discussion showing up as they happen.

But while this blog happily runs out of S3, the process of creating and updating the content still required a server to run my Movable Type installation and hold the database. I took my time to figure out what weblog CMS I was going to use to free me from having to run a server. Of course the easiest would have been to just install WordPress on an Amazon EC2 micro instance, use a plugin to convert the WordPress PHP to static pages, and then sync that to S3. But I really want a setup that allows me to tinker with the blog wherever I am (e.g. at 30,000 feet). Ideally for me my blog content would sit in Dropbox and I would just run a converter to generate a version of the website whenever I want, regardless of which laptop I have with me. This left me with two top choices: Cactus and Jekyll.

Cactus is a static website generator developed by Koen Bok of Made By Sofa (recently acquired by Facebook). It is simple and elegant, as you would expect from someone who has won several design awards. It is written in Python and makes use of the Django templates, which makes it very powerful. Cactus had my preference as learning more about Django was still on my todo list. Although there are some good examples that come with Cactus, it is still early days and there is not much of a community using it. Combine that with the generic power of Django templates and my task list for figuring out each of the pieces for my blog was substantial. I decided to let it rest for a moment (sorry Koen) and get back to it later when I can more easily step into the shoes of others.

Jekyll

Jekyll is also a static website generator. It has been developed by Tom Preston-Werner of GitHub fame. It is in daily use to generate much of GitHub Pages and a whole series of weblogs. Next to that there is a very active community developing plugins and extensions which address a number of things that I want to do with the blog in the future. Jekyll is written in Ruby, uses YAML for metadata management and uses the Liquid template engine to manipulate the content. Let there be no mistake: Jekyll is not a polished high-end dashboard driven CMS; it is best described by TPW's original charge: Blogging like a Hacker. Which suits me just fine.

I have now for the most part replicated the way that my blog was generated in MT, but now using Jekyll. I am still using the same layout and CSS I used with MT, as I prefer to make one change at a time: design comes next. I have regenerated all pages since 2005; the pages before that can be found in the "/historical" section. There are a number of pages in the "categories" section that have not been regenerated, as according to the website statistics not many of those were accessed.

My templates and blog posts are now located in Dropbox and thus locally cached on each machine I use. I simply have to run Jekyll to generate a version of the site and s3cmd takes care of the rest.
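As a rough illustration of that two-step publish, here is a small Python sketch that assembles the two commands and, by default, only prints them. The exact invocations are assumptions, not the commands used for this blog: "jekyll build" with --source and --destination is the form used by current Jekyll releases, "s3cmd sync" is s3cmd's directory sync command, and the Dropbox path and bucket name are hypothetical.

```python
import subprocess
from typing import List


def publish_commands(source_dir: str, bucket: str,
                     output_dir: str = "_site") -> List[List[str]]:
    """Build the two command lines: generate the site, then sync it to S3."""
    build = ["jekyll", "build", "--source", source_dir, "--destination", output_dir]
    # The trailing slash asks s3cmd to upload the directory contents,
    # not the directory itself.
    sync = ["s3cmd", "sync", output_dir.rstrip("/") + "/", f"s3://{bucket}/"]
    return [build, sync]


def publish(source_dir: str, bucket: str, dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them in order."""
    for cmd in publish_commands(source_dir, bucket):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the publish if the build or the sync fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical path and bucket; dry run only prints the commands.
    publish("/Users/me/Dropbox/blog", "example-blog-bucket")
```

Because the source directory lives in Dropbox, the same script would work unchanged on any laptop that has Jekyll and a configured s3cmd installed.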

In the coming days I will clean up the templates and put them on GitHub for others to reuse. I will also submit my converter to transform an MT installation using SQLite into input for Jekyll.

I am grateful to Matt Mullenweg for the magnificent WordPress (it is not your fault I didn't want to run a server), to Koen Bok for the elegant Cactus (I am sure to come back to it when I have more guts and time), and to Tom Preston-Werner for enabling me to Blog like a Hacker.

No Server Required. Amazon S3 FTW!

Expanding the Cloud - The AWS GovCloud (US) Region

Today AWS announced the launch of the AWS GovCloud (US) Region. This new region, which is located on the West Coast of the US, helps US government agencies and contractors move more of their workloads to the cloud by implementing a number of US government-specific regulatory requirements.

The concept of regions gives AWS customers control over the placement of their resources and services. Next to GovCloud (US) there are five general purpose regions: two in the US (one on the west coast and one on the east coast), one in the EU (in Ireland) and two in APAC (in Singapore and Tokyo). There are different considerations when deciding where to allocate resources, with latency and cost being the two obvious ones, but compliance sometimes plays an important role as well. For example, a number of our European customers are subject to data residency requirements when it comes to PII data and they use the EU Region to meet those requirements.

Our government customers sometimes have an additional layer of regulatory requirements given that they at times deal with highly sensitive information, such as defense-related data. These customers are satisfied with the general security controls and procedures in AWS but in these more sensitive cases they often need assurances that only personnel that meet certain requirements, e.g. citizenship or permanent residency, can access their data. AWS GovCloud (US) implements specific requirements of the US government such that agencies at the federal, state and local levels can use the AWS cloud for their more sensitive workloads.

Cloud First

The US Federal Cloud Computing Strategy lays out a “Cloud First” strategy which compels US federal agencies to consider Cloud Computing first as the target for their IT operations:

To harness the benefits of cloud computing, we have instituted a Cloud First policy. This policy is intended to accelerate the pace at which the government will realize the value of cloud computing by requiring agencies to evaluate safe, secure cloud computing options before making any new investments.

By leveraging shared infrastructure and economies of scale, cloud computing presents a compelling business model for Federal leadership. Organizations will be able to measure and pay for only the IT resources they consume, increase or decrease their usage to match requirements and budget constraints, and leverage the shared underlying capacity of IT resources via a network. Resources needed to support mission critical capabilities can be provisioned more rapidly and with minimal overhead and routine provider interaction.

Given the current economic climate, reducing cost within the US federal government is essential, and an aggressive move to the cloud will have a substantial positive impact on the government's IT budget. The move to the cloud is projected to yield, by 2015, a reduction of 30% in IT infrastructure costs, which amounts to $7.2 billion. The application of the Cloud First strategy across all agencies will see many cost savings similar to what the GSA saw when they moved their main portal to the cloud: a savings of $1.7M on a yearly basis while greatly improving uptime and maintainability.

With AWS’s strategy of continuous price reduction as additional economies of scale are achieved, many of these cost savings may become even more substantial without the agencies having to do anything.

Many US federal agencies are already migrating existing IT infrastructure onto the cloud using Amazon Web Services. The Cloud First strategy is most visible with new Federal IT programs, which are all designed to be “Cloud Ready”; many of these applications are launching on AWS from the start, and a number can be found on the AWS Federal use case list.

There were, however, a number of programs that really could benefit from the Cloud but which had unique regulatory requirements, such as ITAR, that blocked migration to AWS. ITAR, the International Traffic in Arms Regulations, stipulates for example that data must be stored in an environment where physical and logical access is restricted to US Persons. There is no formal ITAR certification process, but a review of the ITAR compliance program for AWS GovCloud (US) has been conducted and resulted in a favorable letter of attestation with respect to the stated ITAR objectives. This clears the path for agencies that have IT programs that need to be ITAR-compliant to start using AWS GovCloud (US) for these applications.

This new region, like all other AWS regions, provides FISMA Moderate controls and supports existing AWS security controls and certifications such as SAS-70, ISO 27001 and PCI DSS Level 1.

Government and Big Data

One particular early use case for AWS GovCloud (US) will be massive data processing and analytics. Several agencies in very different parts of the government have needs for data analytics that really put the Big in Big-Data, sometimes several orders of magnitude larger than commonly found in industry. Examples here are certain agencies that work on national security and those that work on economic recovery; their incoming data streams are exploding in size and their needs for collecting, storing, organizing, analyzing and sharing are changing rapidly. It is very difficult for an on-premise IT infrastructure to effectively address the needs of these agencies and the time scales at which they need to operate. The scalability, flexibility and elasticity of AWS make it an ideal environment for the agencies to run their analytics.

Often the data streams that they operate on are not classified in nature, but the combination and aggregation of these streams using complex new algorithms may fall for example under the controls of ITAR. AWS GovCloud (US) will be used by several of these agencies to help them with their Bigger-than-Big-Data needs.

More information

As with all AWS services and regions, information on GovCloud is publicly available on the AWS website. However, given the restrictive nature of this new AWS Region, customers will need to sign an AWS GovCloud Enterprise agreement that requires a manual step beyond the usual self-service signup process. To make use of the services in this region, customers will use the Amazon Virtual Private Cloud (VPC) to organize their AWS resources.

As the name of the region already suggests, we do not envision that over time GovCloud will address only the needs of the US Government and contractors. We are certainly interested in understanding whether there are opportunities in other governments with respect to their specific regulatory requirements that could be solved by a specialized region.

For more details on the AWS GovCloud (US) visit the Federal Government section of the AWS website and the posting on the AWS developer blog.
