AWS Database Blog

Introducing Cost Allocation Tags for Amazon DynamoDB

by Nitin Sagar | in DynamoDB

Nitin Sagar is a senior product manager for DynamoDB.

You can now add tags to your Amazon DynamoDB tables. Tags are simple, user-customizable key and value pairs that are supported by many AWS services. Tagging for DynamoDB provides fine-grained visibility into your DynamoDB bill. You can assign tags to your tables and see cost breakouts per tag.

Let’s consider a real-life scenario in which you have multiple DynamoDB tables for your different development environments—development, staging, and production. You can represent these environments by adding a tag to each table, setting the tag key to Environment and the tag value to Development, Staging, or Production.

Let’s look at how to do this using the DynamoDB console. Before you begin, ensure you have the necessary permissions for the following API operations – ListTagsOfResource and TagResource.

  1. Sign in to the AWS Management Console and open the DynamoDB console at https://console.aws.amazon.com/dynamodb/.
  2. Choose Tables, and then choose the table that you want to modify.
  3. In the Settings tab, choose Tags from the navigation menu.
  4. In the Add Tags section, type Environment for Key and Development for Value. Choose Apply Changes.

 


 

By default, a newly added tag key is inactive for the billing pipeline. You can activate cost allocation tags in the Billing console by following these steps:

  1. Sign in to the AWS Management Console and open the Billing console at https://console.aws.amazon.com/billing/.
  2. Choose Cost Allocation Tags from the navigation menu.
  3. In the User-Defined Cost Allocation Tags section, select the check box next to the Environment tag key and choose Activate.

 


 

When you have activated your cost allocation tags, the AWS Cost Explorer provides an easy way to see a breakdown of your AWS costs by tagged AWS resources:

  1. Sign in to the AWS Management Console and open the Billing console at https://console.aws.amazon.com/billing/.
  2. Choose Cost Explorer from the navigation menu. Then choose Launch Cost Explorer.
  3. From the menu at top left, choose Monthly costs by service. In the Time range section on the right, select the appropriate time range.
  4. In the Filtering section, choose Tag from Filter by.
  5. Choose Environment in the tag key autocomplete field and choose Development in the tag value autocomplete field. Choose Apply.

 


 

Your costs are now filtered by the selected tag (Environment=Development). Costs are shown starting from when you applied your tag to your AWS resources (January 15 onward in our example).
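
If you also want to pull this breakdown programmatically, the Cost Explorer API exposes the same tag filter. The following is a minimal boto3 sketch; the date range, tag values, and grouping are placeholders based on this example.

import boto3

# The Cost Explorer API is served from us-east-1 regardless of where your tables run
ce = boto3.client('ce', region_name='us-east-1')

response = ce.get_cost_and_usage(
    TimePeriod={'Start': '2017-01-15', 'End': '2017-02-01'},   # example date range
    Granularity='MONTHLY',
    Metrics=['UnblendedCost'],
    # Show only costs for resources tagged Environment=Development
    Filter={'Tags': {'Key': 'Environment', 'Values': ['Development']}},
    # Break the filtered costs out by service, as in the console view
    GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}],
)

for period in response['ResultsByTime']:
    for group in period['Groups']:
        print(period['TimePeriod']['Start'],
              group['Keys'][0],
              group['Metrics']['UnblendedCost']['Amount'])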

 


 

You can add up to 50 tags to your DynamoDB tables by using the DynamoDB Management Console, AWS CLI, or AWS SDK. Global secondary indexes (GSIs) and local secondary indexes (LSIs) associated with base DynamoDB tables are automatically tagged with the same tags that are used for the corresponding DynamoDB base tables. Tagging support for DynamoDB is available in all AWS regions.
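
Because tagging is also exposed through the AWS CLI and SDKs, you can script it. The following is a minimal boto3 sketch of the TagResource and ListTagsOfResource operations mentioned earlier; the region, account ID, and table name are placeholders.

import boto3

dynamodb = boto3.client('dynamodb', region_name='us-east-1')   # example region

# Placeholder ARN; replace the account ID and table name with your own
table_arn = 'arn:aws:dynamodb:us-east-1:123456789012:table/MyTable'

# Same operation the console performs when you choose Apply Changes
dynamodb.tag_resource(
    ResourceArn=table_arn,
    Tags=[{'Key': 'Environment', 'Value': 'Development'}]
)

# Verify the tags on the table
print(dynamodb.list_tags_of_resource(ResourceArn=table_arn)['Tags'])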

You can read more about implementing a consistent and effective tagging strategy for your AWS resources by visiting the AWS Tagging Strategies page.

Introducing AWS Schema Conversion Tool Version 1.0.502

by Eran Schitzer | in Schema Conversion Tool (SCT)

Eran Schitzer is a product manager in Amazon Web Services.

We are pleased to announce the recent release of version 1.0.502 of the AWS Schema Conversion Tool (AWS SCT). Two key capabilities included with this release are the use of Amazon Aurora with PostgreSQL compatibility as a target and integration with AWS Database Migration Service (AWS DMS).

Amazon Aurora with PostgreSQL compatibility as a target
Following the announcement of the Amazon Aurora with PostgreSQL compatibility preview at re:Invent, you can now use the AWS SCT to convert your schema for use with a PostgreSQL-compatible Aurora DB cluster.

The PostgreSQL-compatible edition of Amazon Aurora delivers two to three times the performance of PostgreSQL, along with other PostgreSQL capabilities valuable to customers—user-defined functions and data types, complex SQL support, NoSQL and JSON support, and broad application development language support. At the same time, it takes advantage of the scalability, durability, and security capabilities of Amazon Aurora, all for one-tenth the cost of commercial-grade databases such as Oracle Database and Microsoft SQL Server. To sign up for the preview, see the preview webpage.

For more information about converting database schemas, see Converting Database Schema to Amazon RDS by Using the AWS Schema Conversion Tool.

Integration with DMS
After converting a schema with SCT, you can have the associated endpoints and tasks created automatically on a DMS instance. You can run and monitor tasks from either the DMS console or from SCT.

The following procedure shows how to create a DMS task from SCT. Once the schema conversion is complete, the DMS task created through the SCT will migrate the data to the target database engine.

To create and access DMS tasks from SCT, you’ll need to define at least one service profile. You can define one or more service profiles to let SCT access your resources running on AWS, including AWS DMS, Amazon RDS, and AWS Lambda.

  1. Open SCT, choose Settings, and then choose Global settings.
  2. On the AWS Service Profiles tab, type a name for Profile name, enter your AWS access key and AWS secret key, and choose the region where the profile will be created.
  3. Choose Save to save the profile.
  4. Choose Current project settings, and then choose the Project Environment tab. For AWS Service Profile, choose the default profile for the current project, and then choose OK.

    Once you have defined an AWS service profile, you can define, access, and execute DMS tasks from SCT.

  5. Choose the converted schema you’d like to migrate, open the context (right-click) menu for it, and choose Create DMS Task.
  6. In the Create DMS task window, enter the task details—task name, replication instance, and source and target endpoints (you can also create new endpoints). You can change the defaults for the rest of the parameters, if needed.
  7. Choose Create.
    At this point, a DMS task is being created in AWS, as shown in the DMS console.


    You can view, start, stop, resume, delete, refresh, and show logs of the new DMS task in the DMS tasks list in both SCT and DMS. The same operations are also available from the AWS DMS API (see the sketch following these steps).

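
As noted in the final step above, you can also drive the task programmatically. The following boto3 sketch is illustrative only; the region and the task identifier are placeholders and assume the task created by SCT.

import boto3

dms = boto3.client('dms', region_name='us-east-1')   # example region

# Look up the task by its identifier (placeholder name)
task = dms.describe_replication_tasks(
    Filters=[{'Name': 'replication-task-id', 'Values': ['my-sct-task']}]
)['ReplicationTasks'][0]
print(task['Status'])

# Start the task; use 'resume-processing' instead to resume a stopped task
dms.start_replication_task(
    ReplicationTaskArn=task['ReplicationTaskArn'],
    StartReplicationTaskType='start-replication'
)

# Stop it later if needed:
# dms.stop_replication_task(ReplicationTaskArn=task['ReplicationTaskArn'])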

For more information about version 1.0.502, see the AWS Schema Conversion Tool User Guide. You can download SCT from here.

Auditing an Amazon Aurora Cluster

by Sirish Chandrasekaran | in Aurora

Sirish Chandrasekaran is a product manager at Amazon Web Services.

At re:Invent, we announced a number of new features for Amazon Aurora, the MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.

Today we are happy to announce that our advanced auditing feature is now available to all customers.

What is advanced auditing, and why is it important?

Auditing refers to logging a specified set of events and publishing these logs in a way that can be consumed either manually or by another application. These logs can form the bedrock of your governance and regulatory compliance programs. Examples of advanced auditing include log analysis, auditing of user actions (both past events and near real-time threat detection), and support for configurable alarms set on security-related events. Aurora advanced auditing is designed to provide this functionality with minimal impact on database performance.

Getting started with advanced auditing
To get started, enable advanced auditing and then take a look at the audit logs.

Enabling advanced auditing

You can enable and configure advanced auditing by setting the parameters listed following in the DB cluster parameter group. Modifying these parameters doesn’t require a DB cluster restart, and the behaviors are the same as for Aurora DB instance parameters.

You use the server_audit_logging parameter to enable or disable auditing, and the server_audit_events parameter to specify what events to log.

You use the server_audit_excl_users and server_audit_incl_users parameters to specify who gets audited:

  • If server_audit_excl_users and server_audit_incl_users are empty (the default), all users are audited.
  • If you add users to server_audit_incl_users and leave server_audit_excl_users empty, then only those users added to server_audit_incl_users are audited.
  • If you add users to server_audit_excl_users and leave server_audit_incl_users empty, then only those users added to server_audit_excl_users are not audited, and all other users are.
  • If you add the same users to both server_audit_excl_users and server_audit_incl_users, then those users are audited because server_audit_incl_users is given higher priority.

The following list describes each of the advanced auditing parameters in more detail.

server_audit_logging enables or disables audit logging. It defaults to OFF; set it to ON to enable logging.

  • Scope: Global
  • Dynamic: Yes
  • Data type: Boolean
  • Default value: OFF (disabled)

server_audit_events contains the comma-delimited list of events to log. There should be no white space between the list elements.

  • Scope: Global
  • Dynamic: Yes
  • Data type: String
  • Default value: Empty string
  • Valid values: You can log any combination of the following events:
    • CONNECT — Logs successful connections, failed connections, and disconnections. This value includes the user information.
    • QUERY — Logs all query text and query results in plain text, including queries that fail due to syntax or permission errors.
    • QUERY_DCL — Similar to Query, but returns only DCL-type queries (GRANT, REVOKE, and so on).
    • QUERY_DDL — Similar to Query, but returns only DDL-type queries (CREATE, ALTER, and so on).
    • QUERY_DML — Similar to Query, but returns only DML-type queries (INSERT, UPDATE, and so on).
    • TABLE — Logs the tables that were affected by query execution.

server_audit_excl_users contains the comma-delimited list of users whose activity isn’t logged. There should be no white space between the list elements. Connect and disconnect events aren’t affected by this variable; they are always logged if logging for them is specified. Even if a user is listed in server_audit_excl_users, that user’s actions are logged if that user is also specified in server_audit_incl_users, because that setting has higher priority than server_audit_excl_users.

  • Scope: Global
  • Dynamic: Yes
  • Data type: String
  • Default value: Empty string

server_audit_incl_users contains the comma-delimited list of users whose activity is logged. There should be no white space between the list elements. Connect and disconnect events aren’t affected by this variable; they are always logged if logging for them is specified. If a user is listed in server_audit_incl_users, that user is logged even if that user is also specified in server_audit_excl_users, because server_audit_incl_users has higher priority.

  • Scope: Global
  • Dynamic: Yes
  • Data type: String
  • Default value: Empty string
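
Because these parameters are all dynamic, you can also set them from the AWS CLI or an SDK without restarting the cluster. The following is a minimal boto3 sketch, assuming a custom DB cluster parameter group named my-aurora-cluster-params is already attached to your cluster; the group name, region, and values shown are placeholders.

import boto3

rds = boto3.client('rds', region_name='us-east-1')   # example region

rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName='my-aurora-cluster-params',   # placeholder group name
    Parameters=[
        {
            'ParameterName': 'server_audit_logging',
            'ParameterValue': '1',          # 1 = ON; the console may display this as a Boolean
            'ApplyMethod': 'immediate',     # dynamic parameter, no cluster restart required
        },
        {
            'ParameterName': 'server_audit_events',
            'ParameterValue': 'CONNECT,QUERY',   # comma-delimited, no white space
            'ApplyMethod': 'immediate',
        },
    ],
)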

Viewing audit logs

You can view the audit logs by using the AWS Management Console. On the Instances page, select and expand the DB cluster, then choose Logs.
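
You can also list and download the same audit log files through the Amazon RDS API, which is handy if you want to archive or post-process them. A minimal boto3 sketch follows; the instance identifier and region are placeholders.

import boto3

rds = boto3.client('rds', region_name='us-east-1')   # example region
instance_id = 'my-aurora-instance'                   # placeholder instance identifier

# List the audit log files on one instance of the cluster
logs = rds.describe_db_log_files(
    DBInstanceIdentifier=instance_id,
    FilenameContains='audit',
)

for log in logs['DescribeDBLogFiles']:
    marker = '0'
    with open(log['LogFileName'].replace('/', '_'), 'w') as out:
        # Download each file in portions until no more data is pending
        while True:
            portion = rds.download_db_log_file_portion(
                DBInstanceIdentifier=instance_id,
                LogFileName=log['LogFileName'],
                Marker=marker,
            )
            out.write(portion.get('LogFileData', ''))
            if not portion['AdditionalDataPending']:
                break
            marker = portion['Marker']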


If you are familiar with the MariaDB Audit Plugin, you will notice a few differences in Aurora’s approach to auditing.

First, logs are presented somewhat differently:

  • The Aurora advanced auditing time stamp is in Unix time format.
  • Log events are written to multiple files, and the log records aren’t in sequential order. Customers can concatenate and sort the files using the timestamp and query_id fields to produce a single file with ordered events (a Python sketch follows this list). You can do this in Unix as follows: cat audit.log.* | sort -t"," -k1,1 -k6,6
  • The number of files might vary based on DB instance size.
  • File rotation is fixed at 100 MB and isn’t configurable.
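
If you prefer to do the concatenate-and-sort step referenced in the list above in Python rather than with the Unix command, a minimal sketch might look like the following. It assumes the audit files have already been downloaded locally and that individual records do not span lines.

import csv
import glob

rows = []
for path in glob.glob('audit.log.*'):
    with open(path) as f:
        # Records are comma-delimited; timestamp is field 1 and query_id is field 6,
        # both of which come before the query text, so a simple CSV parse is enough here
        rows.extend(csv.reader(f))

# Sort lexicographically by timestamp and query_id, mirroring the sort command above
rows.sort(key=lambda r: (r[0], r[5]))

with open('audit.merged.log', 'w') as out:
    csv.writer(out).writerows(rows)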

Also, enabling Aurora advanced auditing after migration from MySQL works differently. Audit log configuration is set using the parameter group for the DB cluster.

How does Aurora implement advanced auditing?

Although auditing capability is commonly available in both commercial databases and some open source databases, auditing in these databases typically has a significant impact on performance, especially when the load is high. One of our Aurora implementation’s main goals is to provide users with a rich set of information but without sacrificing performance.

Maintaining performance
To understand how we achieve our performance objective, let’s compare our implementation of advanced auditing to that of the MariaDB Audit Plugin. We use this as our comparison point because MySQL Community Edition doesn’t have a native audit log, and the MariaDB Audit Plugin has emerged as the most popular option to fill this void in the open source community.

MariaDB Audit Plugin uses a single thread with a single mutex for processing and writing each event. Although the design strictly preserves the order of events, it can lead to a performance drop due to the bottleneck in writing logs. If we had used a similar approach for Aurora, the performance impact would have been even more significant due to the much higher scalability and expected throughput of the engine.

To maintain our high-performance advantage, we redesigned the event processing and event writing logic. On the input side, we used a latch-free queue to store audit events without blocking other threads. On the output side, we use multiple threads writing events from the latch-free queue to multiple files. The files can be post-processed to present a complete audit log with events in order.


Log format
The audit log is stored separately on each instance’s local (ephemeral) storage. Each Aurora instance spreads log writes across four log files at a time:

  • Encoding: UTF-8
  • File name pattern: audit.log.[0-3].[%Y-%m-%d-%H-%M][-rotation]
  • Location: /rdsdbdata/log/audit/ (on each host)
  • Rotation: Maximum size 100 MB per log file, currently not customer-configurable. When the largest of the four log files reaches 100 MB, the system rotates to a new set of log files.
  • Clean-up: The system cleans up older audit files to free space once a certain space consumption or age threshold is reached.
  • Log format: [timestamp],[serverhost],[username],[host],[connectionid],[queryid],[operation],[database],[object],[retcode]

The fields in each record are the following:

  • timestamp – The Unix time stamp for the logged event, with second precision.
  • serverhost – The name of the instance that the event is logged for.
  • username – The connected user.
  • host – The host that the user connected from.
  • connectionid – The connection ID number for the logged operation.
  • queryid – The query ID number, which can be used for finding the relational table events and related queries. For TABLE events, multiple lines are added.
  • operation – The recorded action type. Possible values are CONNECT, QUERY, READ, WRITE, CREATE, ALTER, RENAME, and DROP.
  • database – The active database, as set by the USE command.
  • object – For QUERY events, the executed query. For TABLE events, the table name.
  • retcode – The return code of the logged operation.

How does our approach compare to the alternatives?

As mentioned above, many databases offer audit log capability but suffer from poor performance when auditing is enabled. We compared the performance of Aurora against MySQL 5.7 with the MariaDB Audit Plugin for a select-only workload on an 8xlarge instance. As the following results show, MySQL performance degrades significantly when auditing is turned on, whereas Aurora sees only a modest reduction: a 65% drop in throughput for MySQL 5.7 versus just 15% for Aurora. In effect, Aurora’s performance edge over MySQL 5.7 more than doubles, to over an order of magnitude, when auditing is turned on.


Advanced auditing is available now, and you can start using it today! You can learn more about this feature in the advanced auditing documentation.

Indexing Metadata in Amazon Elasticsearch Service Using AWS Lambda and Python

by Amit Sharma | in Elasticsearch, Lambda

Amit Sharma (@amitksh44) is a solutions architect at Amazon Web Services.

You can use Amazon S3 to implement a data lake architecture as the single source of truth for all your data. Taking this approach not only allows you to reliably store massive amounts of data but also enables you to ingest the data at a very high speed and do further analytics on it. Ease of analytics is important because as the number of objects you store increases, it becomes difficult to find a particular object—one needle in a haystack of billions.

Objects in S3 contain metadata that identifies those objects along with their properties. When the number of objects is large, this metadata can be the magnet that allows you to find what you’re looking for. Although you can’t search this metadata directly, you can employ Amazon Elasticsearch Service to store and search all of your S3 metadata. This blog post gives step-by-step instructions about how to store the metadata in Amazon Elasticsearch Service (Amazon ES) using Python and AWS Lambda.

Using S3 event notifications and Lambda triggers
In this post, we use S3 event notifications and Lambda triggers to maintain metadata for S3 objects in Amazon ES. S3 notification enables you to receive notifications when certain events happen in your bucket. These events can be for any action in an S3 bucket, such as PUT, COPY, POST, DELETE, and so on. More details about S3 event notifications are available in the AWS documentation.

S3 event notifications integrate with Lambda using triggers. Using this integration, you can write Lambda functions that process Amazon S3 events. To do this, in Amazon S3 you add a bucket notification configuration that identifies the type of event that you want Amazon S3 to publish and the Lambda function that you want to invoke.
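
If you prefer to configure the bucket notification outside the console, the same configuration can be applied with the S3 API. The following boto3 sketch is only an illustration; the bucket name, Lambda function ARN, prefix, and suffix are placeholders, and the Lambda function's resource policy must already allow Amazon S3 to invoke it (the console adds that permission for you when you create the trigger).

import boto3

s3 = boto3.client('s3')

s3.put_bucket_notification_configuration(
    Bucket='my-data-lake-bucket',   # placeholder bucket name
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                # Placeholder function ARN
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:index-metadata',
                'Events': ['s3:ObjectCreated:*'],
                'Filter': {
                    'Key': {
                        'FilterRules': [
                            {'Name': 'prefix', 'Value': 'appTier/appServer1/'},   # placeholder prefix
                            {'Name': 'suffix', 'Value': '.log'},                  # placeholder suffix
                        ]
                    }
                },
            }
        ]
    },
)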


High-level flow between S3 and Lambda

Putting it together
To put all these parts together, you can take the following steps.
Configuring AWS Lambda with Amazon S3
To configure Lambda with S3, start by choosing AWS Lambda on the console.


If this is the first time you’ve created a Lambda function, choose Get Started Now.


Choose Configure triggers.


On the next page, you should be able to select the triggers you want to work with.


Choose the S3 bucket and the type of event that you want to capture. You can leave the Prefix and Suffix fields blank or, based on your use case, fill them in.

For example, if you expect all files to come in a folder called /appTier/appServer1, you can use that path as the Prefix value. Similarly, if you expect the files to arrive with a certain suffix like .log, .jpg, .avi, and so on, you can use that in the Suffix field. Events are triggered for an object only if both the Prefix and Suffix fields are matched. Also, select the Enable Trigger check box.


Next, provide a name and description and choose Python 2.7 as the run-time environment. Because we are going to upload the code separately, choose Upload a .ZIP file for Code entry. Leave the handler information as the default: lambda_function.lambda_handler.


Now, let’s create the AWS Identity and Access Management (IAM) roles and related permissions so that our Lambda function can access the AWS resources we need. To do this, choose Create a new role from template(s), give a name to this new role, and for Policy templates, choose S3 object read-only-permission.


In Advanced settings, leave the Memory, Timeout, and VPC settings at their defaults. Choosing Next creates the Lambda function and also associates the permissions that allow Amazon S3 to invoke it. You can verify this in the S3 console: go to the properties of the S3 bucket you specified earlier and check the Events section, as shown following:


Choose the modify icon to see the details and verify the name of the Lambda function.

Creating the Amazon ES domain
Now, let’s create the Amazon ES domain. Go to Services, and choose Elasticsearch Service in Analytics:


Choose the Get Started button on the front page and type a name for your domain (I chose my-es-cluster):


As shown following, choose an instance type and an instance count (both can be changed later if necessary). We recommend choosing m3.medium or larger if you are planning to put this feature into production. Alternatively, t2.micro is a good choice if you are creating a development environment or a small proof of concept.

For storage, you have choices between instance-based storage and various types of Amazon EBS volumes (General Purpose, Provisioned IOPS and Magnetic). Start with a General Purpose EBS volume and monitor the overall performance with the FreeStorageSpace, JVMMemoryPressure, and CPUUtilization metrics and metrics about query response times before changing the storage type. For a good reference to handling errors and mitigations, see the AWS documentation.

An important question is: How much storage do you need? For example, if every object uploaded to S3 has metadata sized 1 KB and you expect 10 million objects, you should provision a total of at least 20 GB: roughly 10 GB for the primary copy of the data and another 10 GB for the replica. For a more detailed discussion on scaling and capacity planning for Elasticsearch, see the Elasticsearch documentation.
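
The arithmetic behind that estimate is simple to adjust for your own object counts and metadata sizes; the following snippet just restates the example numbers from this paragraph.

# Rough storage estimate using the example numbers from this section
objects = 10000000        # expected number of S3 objects
metadata_kb = 1           # metadata size per object, in KB
replicas = 1              # one replica copy in addition to the primary

primary_gb = objects * metadata_kb / (1000.0 * 1000.0)   # about 10 GB (decimal units)
total_gb = primary_gb * (1 + replicas)                   # about 20 GB

print("Provision at least {:.0f} GB of storage".format(total_gb))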


Next, set the access policy. I chose to make mine wide open in order to simplify testing, but don’t do this for your cluster. I could also have used one of the IP-based or user-based templates in the wizard to create a more restrictive policy. For more details on controlling access to your cluster, see this blog post.


Finally, review the settings and choose Confirm and create. That’s it! The cluster will be created in a few minutes.


Creating the Lambda function
Now comes the main code that will actually push the metadata coming from every trigger generated by object creation events. Remember that Lambda has been configured with an execution role that has read-only permissions to read from S3. At a high level, the Python code does the following:

  • Reads the metadata from S3 event
  • Connects to the Amazon ES domain endpoint
  • Creates an index if one has not already been created
  • Writes the metadata into Amazon ES

To connect to Amazon ES, the Python code uses a few specific libraries such as Elasticsearch, RequestsHttpConnection, and urllib. Because we are going to upload the code to the Lambda function as a package, download these libraries into a specific folder by using the following commands. But first, make sure pip is installed—find steps to do this on the pip website. Note that the sample code available for download includes all the required libraries, so this step is optional and given here mainly for your understanding:

pip install requests -t /path/to/project-dir
pip install Elasticsearch -t /path/to/project-dir
pip install urllib3 -t /path/to/project-dir

Make sure these libraries are now available in the current directory. Now we are ready to look at the code.

The following function connects to Amazon ES:

from elasticsearch import Elasticsearch, RequestsHttpConnection

def connectES(esEndPoint):
    print('Connecting to the ES Endpoint {0}'.format(esEndPoint))
    try:
        esClient = Elasticsearch(
            hosts=[{'host': esEndPoint, 'port': 443}],
            use_ssl=True,
            verify_certs=True,
            connection_class=RequestsHttpConnection)
        return esClient
    except Exception as E:
        print("Unable to connect to {0}".format(esEndPoint))
        print(E)
        exit(3)

This function takes the domain endpoint as an argument and returns the Elasticsearch client instance. Be sure to use your domain’s endpoint to declare esClient:
esClient = connectES("search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com")
The following function creates an Amazon ES index:

def createIndex(esClient):
    try:
        res = esClient.indices.exists('metadata-store')
        print("Index Exists ... {}".format(res))
        if res is False:
            esClient.indices.create('metadata-store', body=indexDoc)
            return 1
    except Exception as E:
        print("Unable to Create Index {0}".format("metadata-store"))
        print(E)
        exit(4)

Note that this function takes esClient as an instance of the Elasticsearch client returned by the connectES function. Also note that ‘metadata-store’ and ‘indexDoc’ are the name and mapping of the index we are trying to create. The ‘indexDoc’ mapping is defined following:

indexDoc = {
    "mappings": {
        "dataRecord": {
            "properties": {
                "createdDate": {
                    "type": "date",
                    "format": "dateOptionalTime"
                },
                "objectKey": {
                    "type": "string"
                },
                "content_type": {
                    "type": "string"
                },
                "content_length": {
                    "type": "long"
                },
                "metadata": {
                    "type": "string"
                }
            }
        }
    },
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}

We are storing five fields:

  • createdDate
  • objectKey
  • content_type
  • content_length
  • metadata

As part of this, there are a couple of important points to consider.

First, it’s important to plan your shards. The best number of primary and replica shards depends on multiple things, such as instance sizes, amount of data, the frequency with which new data is generated and old data is purged, query types, and so on. For example, for time-series data (such as log files) you can maintain different indexes per hour, per day, and per week depending on the speed of data being generated—we recommend daily indexes in most cases. Because older logs are less likely to be queried, you can re-index those to lower primary shard numbers or drop the entire index. A more detailed discussion is provided in the Elasticsearch documentation.

Also, consider using bulk indexing. The preceding code sample works fine for many use cases with low to moderate traffic—for example, up to 100 PUTs per second on S3 with 1 KB of metadata. However, for higher traffic volumes we recommend using larger instances and, instead of indexing every document individually, using the _bulk index API call to efficiently load the data into an Elasticsearch cluster. In a follow-up blog post, we will give architectural patterns and recommendations on how to do _bulk indexing efficiently and cost-effectively.
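
As a rough illustration of that approach (and not the code used in the rest of this post), the Python Elasticsearch client includes a helpers.bulk function that sends many documents in one request. In the sketch below, metadata_records is a placeholder for a batch of metadata dictionaries shaped like the document body used in indexDocElement later in this post.

from elasticsearch import helpers

def bulkIndexMetadata(esClient, metadata_records):
    # metadata_records is assumed to be a list of dicts shaped like the
    # single-document body used in indexDocElement below
    actions = [
        {
            '_index': 'metadata-store',
            '_type': 'images',
            '_source': record,
        }
        for record in metadata_records
    ]
    # One HTTP round trip for the whole batch instead of one request per document
    helpers.bulk(esClient, actions)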

For a detailed explanation about shard settings as part of the cluster planning, refer to the Elasticsearch documentation.

Following is the function that actually writes metadata into Elasticsearch:

def indexDocElement(esClient, key, response):
    try:
        indexObjectKey = key
        indexcreatedDate = response['LastModified']
        indexcontent_length = response['ContentLength']
        indexcontent_type = response['ContentType']
        indexmetadata = json.dumps(response['Metadata'])
        retval = esClient.index(index='metadata-store', doc_type='images', body={
            'createdDate': indexcreatedDate,
            'objectKey': indexObjectKey,
            'content_type': indexcontent_type,
            'content_length': indexcontent_length,
            'metadata': indexmetadata
        })
    except Exception as E:
        print("Doc not indexed")
        print("Error: ", E)
        exit(5)

This function takes esClient, an S3 object key, and the complete response of the S3.get_object function. This response contains the actual metadata. The elements in response are indexed by calling esClient.index. The document ID is autogenerated by Elasticsearch. You can see all the index options in the Elasticsearch documentation.

Finally, following is the main Lambda handler code that calls all these functions at the invocation when it is triggered:

import boto3
import json
import urllib

s3 = boto3.client('s3')

def lambda_handler(event, context):
    esClient = connectES("search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com")
    createIndex(esClient)

    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        print(response)
        print("KEY: " + key)
        print("CONTENT TYPE: " + response['ContentType'])
        print("Metadata : " + json.dumps(response['Metadata']))
        print("Custom 1: " + response['ResponseMetadata']['HTTPHeaders']['x-amz-meta-custom1'])
        print("Custom 2: " + response['ResponseMetadata']['HTTPHeaders']['x-amz-meta-custom2'])
        indexDocElement(esClient, key, response)
        return response['ContentType']
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

You can also download the entire handler code for index creation from here.

To verify that the metadata has been entered into Elasticsearch, you can use Kibana and search using the standard Elasticsearch API calls and queries. For example, you can query by object name as shown following:

bash# curl -XGET https://search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com/metadata-store/images/_search?pretty\&q=objectKey:YOURFILENAME

  {
    "took" : 1,
    "timed_out" : false,
    "_shards" : {
      "total" : 1,
      "successful" : 1,
      "failed" : 0
   },
   "hits" : {
     "total" : 1,
     "max_score" : 9.516893,
     "hits" : [ {
       "_index" : "metadata-store",
       "_type" : "images",
       "_id" : "AVgGSFxdQ43eQcLduwj9",
       "_score" : 9.516893,
       "_source" : {
         "content_length" : 61194,
         "objectKey" : "YOURFILENAME",
         "metadata" : "{\"custom1\": \"banana\", \"custom2\": \"shake\"}",
         "content_type" : "application/octet-stream",
         "createdDate" : "2016-10-27T13:15:54+00:00"
       }
    } ]
  }
}

Following is a screenshot of Kibana after indexing a few documents:

Deleting metadata when an S3 object is deleted
To delete the related metadata when you delete an S3 object, follow the same steps as listed preceding—except that at event type selection, choose Object Removed Event as shown following:


The rest of the steps remain the same. Create an additional trigger for object removal for a total of two triggers and two Lambda functions for two different types of events—object PUT, COPY, or POST and object DELETE.

Following is the main handler code:

def lambda_handler(event, context):
    esClient = connectES("search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com")

    # Get the object key from the event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    try:
        clearMetaData(esClient, key)
        return 'Removed metadata for ' + key
    except Exception as e:
        print(e)
        print('Error removing object metadata from the Elasticsearch domain.')
        raise e

The clearMetaData function is defined as following:

def clearMetaData(esClient, key):
    try:
        retval = esClient.search(index='metadata-store', doc_type='images',
                                 q='objectKey:' + key, fielddata_fields='_id')
        total = retval['hits']['total']
        count = 0
        while (count < total):
            docId = retval['hits']['hits'][count]['_id']
            print("Deleting: " + docId)
            removeDocElement(esClient, docId)
            count = count + 1
        return 1
    except Exception as E:
        print("Removing metadata failed")
        print("Error: ", E)
        exit(5)

This function searches the domain for the given S3 object name and calls another function, removeDocElement, passing the document ID (which is unique in the domain) as an argument. The removeDocElement function is defined as follows:

def removeDocElement(esClient, docId):
    try:
        retval = esClient.delete(index='metadata-store', doc_type='images', id=docId)
        print("Deleted: " + docId)
        return 1
    except Exception as E:
        print("DocId delete command failed at Elasticsearch.")
        print("Error: ", E)
        exit(5)

This code deletes all the references to that S3 key by using the unique document ID. You can confirm the deletion from the Elasticsearch index by using the following command:

bash# curl -XGET https://search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com/metadata-store/images/_search?pretty\&q=objectKey:train.csv
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Following is the Amazon CloudWatch monitoring snapshot for the Elasticsearch cluster—you can see a number of metrics such as searchable documents, free storage space, cluster health, and so on. These metrics can help you decide how to scale the cluster from both a compute and a storage perspective. For example, by monitoring FreeStorageSpace or CPUUtilization, you can decide whether to scale out or scale up the Elasticsearch cluster nodes.


You can also download the entire handler code for index deletion from here.

Thanks for exploring these technologies with me. For deeper information, take a look at Amazon Elasticsearch Service and AWS Lambda. Let me know in the comments below how this post works for you!

A Large-Scale Cross-Regional Migration of NoSQL DB Clusters

by Andrey Zaychikov | in EC2

Andrey Zaychikov is a solutions architect at Amazon Web Services.

In this blog post, I will share experiences from a cross-regional migration of large-scale NoSQL database clusters (from Europe to the U.S.). The migration was a joint project implemented by the Ops team of our reference customer, FunCorp, and AWS Solutions Architects.

“Our flagship product, iFunny, is one of the most popular entertaining apps for American youth,” said Roman Skvazh, CTO of FunCorp. “For the last five years, the service has gained a considerable audience. Our mobile apps are utilized by more than 3.5 million active users per day and are operated in all time zones. A back end of the app is implemented on Amazon Web Services and its scope is impressive.”

What Should We Migrate?

The iFunny product uses the following database services running on AWS:

• Apache Cassandra cluster – 25 nodes, 18 TB of data
• Apache Cassandra cluster – 18 nodes, 16 TB of data
• MongoDB cluster – 5 TB of data, distributed across 8 shards; each shard is a replica set with one master and two slaves
• MongoDB cluster – 150 GB of data, distributed across 4 shards; each shard is a replica set with one master and two slaves
• Elasticsearch cluster – search index of 1 TB
• Redis – one master and two slaves, with 15 GB, 10 GB, and 1 GB of data and very high write speeds

“Through the efforts of our DevOps and Back End teams, and together with AWS SA, we were able to move all of our infrastructure from one AWS region to another with no downtime and no major changes in the application.”

Why Should We Migrate?

If you’ve ever asked yourself the question, “Why migrate everything?” consider FunCorp’s reason: In the beginning, FunCorp selected an AWS region based on the Russian market, but the iFunny app quickly gained enormous popularity in the United States. Up to a certain point, it was convenient to have a backend located on a different continent. However, the high networking latency began to affect the user experience. For this reason, a migration became one of the integral parts of FunCorp’s global UX improvement program.

FunCorp had some constraints and conditions:

  • Zero downtime
  • No changes in the application and database structure
  • Minimal use of third-party tools

How Should We Migrate?
At first, we planned to implement the migration project using the built-in geo-distributed replication capabilities of each database involved (Cassandra, MongoDB, Amazon Elasticsearch Service, and Redis), but we faced several significant problems unique to each database.

MongoDB

The first issue had to do with mongos, the routing service for sharded MongoDB configurations. The replication between the European cluster and the U.S. cluster ran smoothly and was implemented by using the standard features of MongoDB, but after switching users to the new back end, the performance of mongos in the U.S. decreased considerably. The problem was that a geo-distributed MongoDB cluster supports only one set of configuration servers, which were located in Europe at the time. The following diagram illustrates the initial setup.


Mongos had to call the configuration servers, which were located overseas, for virtually every operation. The good news is that mongos keeps its configuration cached. However, it’s not possible to increase the number of chunks or to reorganize the data between them by relying only on caches, without connecting to the configuration servers. Therefore, the configuration server migration had to be done quickly.

The final migration algorithm looked like this:

  • We created some instances in the target region beforehand, with a configuration identical to the current one, and then we incorporated them into the existing replica set of the initial clusters. Then we waited until the end of the data replication.
  • Just before the start of the migration process, we switched off mongos balancing. Thus, the mongos processes in the target region used a cached version of the configuration.
  • During the migration, we removed all of the data-bearing instances in the source region, one by one.
  • In the replica set, there were only instances in the target region left, and new primary servers were elected from among these.
  • Finally, we migrated the replica set configuration to the new region.

Cassandra

To migrate the Cassandra clusters, we created an additional Cassandra data center, with a standard EC2Snitch, in each geo-distributed cluster in the target region and connected it to the existing cluster through a software VPN tunnel. After that, we started the nodes and began the replication process. The VPN tunnel was essential because it allowed us to enable interaction between the geo-distributed data centers without having to change the snitch from EC2Snitch to EC2MultiRegionSnitch. In our case, that change would have meant not only changing the snitch itself, but also manual cluster scaling, maintaining address lists in security groups, node-to-node interaction over public IPs, and many other things. We thought most of the problems were solved.

Unfortunately, just after we started the replication process, the performance of the original Cassandra cluster dropped significantly.


The problem was that Cassandra started replicating data to the new DC from all of the nodes in the source region to all of the nodes in the destination region simultaneously, in accordance with our replication factor and consistency level. To solve the problem of performance degradation, we stopped the clusters’ replication processes and started rebuilding the nodes of the target cluster piece by piece, two to three nodes at a time.

By taking this approach, we replicated both clusters without interrupting the work process and with almost no significant performance degradation.

After the replication process was complete, we were able to repair all of the nodes in the target cluster. After that, we switched the applications from the source DC to the new DC. Finally, we excluded the source DC from the cluster, and then terminated that DC.

Redis

We took a similar approach for the Redis migration. We created some new clusters, connected them to the existing one in the source region, and then started replication. The volume of data to be migrated was not a factor, but the data was changing at a very high speed. As a result, the data failed to be replicated within the time defined by the cluster replication window, and all of the data in the target cluster was invalidated.

After several weeks of research, we found a very simple solution to this problem with Redis replication. We created an SSH tunnel with compression and with ports forwarded to localhost:

ssh -C -L 6280:localhost:6379 $MASTER_REDIS

Next, we told the slaves to synchronize with localhost instead of the master:

redis-cli slaveof localhost 6280

Done! Now the Redis replication was running successfully. Because of compression, the replication lag did not increase and never approached a critical threshold in the replication window.

ElasticSearch
Our initial plan was to avoid the migration of the ElasticSearch search index and to instead re-create it in the destination region. Unfortunately, the procedure of index creation was too complicated. It required engagement from our development teams, which was outside the constraints and conditions of the project.

Fortunately, we found a perfect ElasticSearch plugin that allowed us to do backups, including incremental backups. You can find a version of the plugin here. (To find the right version, see the README.md.)

The use of the plugin greatly simplified the migration (a REST-level sketch of the snapshot and restore calls follows this list):

  • Creating an ElasticSearch cluster in a destination region.
  • Creating an Amazon S3 bucket, with versioning and cross-region replication enabled.
  • Creating a snapshot of the main data set and writing it to S3. The snapshot was transferred to the new region automatically.
  • Restoring data from the snapshot in the target region.
  • Repeating the preceding steps for the incremental backups. (In our case, the process of applying the incremental backups took 12 minutes.)
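
As referenced above, the snapshot and restore steps are plain Elasticsearch REST calls. The following Python sketch shows the general shape of those calls only; the endpoints, repository name, and bucket are placeholders, the exact repository settings depend on the S3 plugin version you install, and in the actual migration the snapshot bucket was replicated across regions rather than shared.

import requests

SOURCE = 'http://source-es-endpoint:9200'   # placeholder endpoints
TARGET = 'http://target-es-endpoint:9200'

repo_body = {
    'type': 's3',
    # The exact settings keys depend on the S3 repository plugin version;
    # bucket and region are the typical ones
    'settings': {'bucket': 'my-es-snapshots', 'region': 'us-east-1'},
}

# Register the S3-backed repository on both clusters
requests.put(SOURCE + '/_snapshot/migration_repo', json=repo_body)
requests.put(TARGET + '/_snapshot/migration_repo', json=repo_body)

# Snapshot on the source cluster; snapshots after the first one are incremental
requests.put(SOURCE + '/_snapshot/migration_repo/snapshot_1?wait_for_completion=true')

# Restore that snapshot on the target cluster
requests.post(TARGET + '/_snapshot/migration_repo/snapshot_1/_restore')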

The results of the project
The project was successfully implemented and the back-end migration reduced latency by at least 40% for end-users when accessing the iFunny mobile back-end API.

However, it took us more than two months of experimenting and a month and a half for the actual migration.

Conclusion
Any database migration is a complex process. Ambiguous situations, such as the non-standard behavior of a database engine or a process that places a significant load on the network, can always occur.

The best and safest way to migrate, free of data loss or similar surprises, is to test with real data and real configurations – and only afterward to proceed with transferring the production database to a new location. Even if your application is not cloud-native, the use of cloud services allows you to experiment and to reproduce the existing use case in a geographically distributed environment, at the real scale of your clusters and data.

Cross-Engine Database Replication Using AWS Schema Conversion Tool and AWS Database Migration Service

by Prahlad Rao | in DMS, Schema Conversion Tool (SCT)

Prahlad Rao is a solutions architect at Amazon Web Services.

Customers often replicate databases hosted on-premises to the AWS Cloud as a means to migrate database workloads to AWS, or to enable continuous replication for database backup and disaster recovery. You can both migrate and perform continuous replication on the AWS Cloud with AWS Database Migration Service (AWS DMS).

Although moving from on-premises to the cloud using the same database engine is a common pattern, increasingly customers also use different engines between source and target when replicating. This pattern especially occurs where the source is a commercial database engine and the target is open source. Such an approach can save on licensing costs. In other use cases, the target database might be used as read-only for a reporting application that is compatible with a certain database engine, or the application team is comfortable using an open source database for their needs. Or the customer might be migrating from a commercial database to an open source database on the AWS Cloud for cost and flexibility purposes. This blog post will walk you through the process of replicating a Microsoft SQL Server database to PostgreSQL using AWS Schema Conversion Tool (AWS SCT) and AWS DMS.

AWS DMS helps you migrate databases to AWS with relative ease and security. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. The service supports homogenous migrations (like to like databases) and also heterogeneous migrations (between database platforms such as Oracle to Amazon Aurora or SQL Server to MySQL). AWS DMS can also be used for continuous data replication with high availability.

AWS SCT makes heterogeneous database migrations easier by automatically converting source database schema. AWS SCT also converts the majority of custom code, including views and functions, to a format compatible with the target database.

Heterogeneous database replication is a two-step process:

  1. Use AWS SCT to convert the source database schema (from SQL Server) to a format compatible with target database, in this case PostgreSQL.
  2. Replicate data between source and target using AWS DMS.

Step 1: Convert schema from source to target database

AWS Schema Conversion Tool is a stand-alone application that provides a project-based user interface and is available for Microsoft Windows, Mac OS X, and Linux. You can find detailed installation instructions in the AWS documentation, so I won’t go into all the details, but the high-level steps are outlined following:

  1. Download and install the AWS SCT tool for your operating system.
  2. Install the required JDBC drivers for source and target database engines on the system that has AWS SCT installed.
  3. Once required JDBC drivers have been downloaded, provide the location of driver files to SCT as shown following, under Settings, Global Settings, Drivers.
  4. Create an SCT project and connect to source and target databases. SCT needs access to both source and target database engines. You can use an existing user on the database that already has the required permissions or create a new user for SCT to access the schema. The user should have necessary permissions to read the schema as required by SCT. The permissions should also provide read-only access to the views and system tables.
  • Choose New Project from the File menu. Under source and target database engine, select Microsoft SQL Server and PostgreSQL respectively.
  • Connect to the source and target database engines. Our source database SQL Server 2014 is on an Amazon EC2 instance, and the target database is Amazon RDS for PostgreSQL. Although we’ve used an EC2 instance as a source for Microsoft SQL database in this blog post to demonstrate functionality, you can use the same procedure to replicate a source database that is hosted in your own data center. Make sure the target database is created ahead of time. Choose Connect for Microsoft SQL Server, fill in the required details, and choose Test Connection.
  • For the target database, choose Connect to PostgreSQL, fill in the required details, and choose Test Connection.

Now that the source and target databases are accessible from AWS SCT, we can begin converting the schema into a format that is compatible with the target database. Note that the schema on the source database isn’t modified; rather, a new schema with a compatible structure is created on the target database. At the same time, users can continue to access the source database without any interruption.


Select the source database on the left and right-click it to refresh, ensuring that we are connected to the latest schema set. In our case, it’s AdventureWorks2014. You also notice the schemas listed in the database. With SCT, you can replicate specific schemas or an entire database, giving you the flexibility to migrate or replicate part of the dataset from source to target. This flexibility is useful if you want to fan out from one source database into multiple target databases. We will replicate the HumanResources schema and the dataset associated with that schema here.

Select the HumanResources schema, right-click, and choose Convert Schema.


The schema will now be converted. Notice the new schema on the target PostgreSQL database on the right, following.


Expand source and target database schema, and notice SCT has used the appropriate format at the target database. Although the source SQL Server schema lists Tables, Views, and Procedures, the target PostgreSQL schema lists Tables, Views and Functions.


For schema that were not converted automatically using SCT, you can manually create appropriate schema on the target. This approach avoids the need to manually create the entire schema structure on the target database.

The database migration assessment report provides important information about schema conversion. It summarizes all conversion tasks and details the action items for the schema that can’t be converted to the target DB engine. The report can also help you analyze requirements for migrating to the cloud and for changing license type. You can find more details on creating this assessment report in the AWS documentation, and also in this nice blog post.

At this point, the schema is only reflected on the target database and has not been applied yet. Let us go ahead and apply it to the target database. To do this, select the schema on the target database, right-click, and choose Apply to database.


Let’s confirm that the schema has been applied on the target by logging into the target RDS PostgreSQL database. I’m using a psql client on my Mac to access the target database:

psql -h <database host name RDS endpoint> -p 5432 -U <username> <database name>


Notice the adventureworks2014_humanresources schema, created by SCT. Let’s also verify the tables as part of the schema and look at the structure of a table:


Now that we’ve migrated schema from source to target, it’s time to use DMS to actually replicate data from the source to the target database.

Step 2: Replicate data from source to target database

To replicate databases using AWS Database Migration Service, we’ll need to provision a replication instance to perform the tasks and replicate data from the source to the target database. The required size of the instance varies depending on the amount of data to replicate or migrate. You can find additional information about choosing a replication instance in the AWS documentation. We’ll use a dms.t2.medium instance for our replication.

On the DMS console, choose Replication Instances and Create replication instance as follows. The replication instance should be able to connect to both source and target databases.


When the replication instance has been created, create source and target endpoints to connect to the source and target database respectively. The source is SQL Server on an EC2 instance:


Similarly, set target endpoint details to our RDS PostgreSQL database:


Make sure to test the connection for both the endpoints. The replication instance should be able to connect successfully to both source and target database endpoints for the replication to be successful.
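
If you script this part of the setup, the same connection test is available from the AWS DMS API. The following boto3 sketch is illustrative; all ARNs and the region are placeholders.

import boto3

dms = boto3.client('dms', region_name='us-west-2')   # example region

# Placeholder ARNs for the replication instance and the two endpoints
replication_instance_arn = 'arn:aws:dms:us-west-2:123456789012:rep:EXAMPLE'
endpoint_arns = [
    'arn:aws:dms:us-west-2:123456789012:endpoint:SOURCE-EXAMPLE',
    'arn:aws:dms:us-west-2:123456789012:endpoint:TARGET-EXAMPLE',
]

for endpoint_arn in endpoint_arns:
    dms.test_connection(
        ReplicationInstanceArn=replication_instance_arn,
        EndpointArn=endpoint_arn,
    )

# The tests run asynchronously; poll until Status is 'successful' for both endpoints
result = dms.describe_connections(
    Filters=[{'Name': 'replication-instance-arn',
              'Values': [replication_instance_arn]}]
)
for conn in result['Connections']:
    print(conn['EndpointIdentifier'], conn['Status'])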

Finally, we can create a replication task to replicate data from the source to target database, and also to replicate changes as they happen on the source database.

With AWS, you can provision services not only from the console but also with the AWS Command Line Interface (AWS CLI). Let me show you how easy it is to use the CLI to create the replication task and script the migration process.

Because we’ve already migrated the schema to the target database, we need to make sure to specify the same schema that was created on the PostgreSQL database. Before we run the create-replication-task AWS CLI command, we need to specify the task settings and table mappings files using JSON, as shown following. The task_settings.json file specifies the target schema to be used when replicating data, along with other parameters that determine how large objects are handled. Because we’re replicating data here and not just migrating it, we need to ensure ApplyChangesEnabled is set to true, which ensures the task keeps target tables up-to-date by applying changes using change data capture (CDC). The other important setting is TargetTablePrepMode, set to DO_NOTHING, which means that the data and metadata of the existing target tables are not affected. The task_settings.json file for our replication task follows:

{
  "TargetMetadata": {
    "TargetSchema": "adventureworks2014_humanresources",
    "SupportLobs": true,
    "FullLobMode": false,
    "LobChunkSize": 64,
    "LimitedSizeLobMode": true,
    "LobMaxSize": 32
  },
  "FullLoadSettings": {
    "FullLoadEnabled": false,
    "ApplyChangesEnabled": true,
    "TargetTablePrepMode": "DO_NOTHING",
    "CreatePkAfterFullLoad": false,
    "StopTaskCachedChangesApplied": false,
    "StopTaskCachedChangesNotApplied": false,
    "ResumeEnabled": false,
    "ResumeMinTableSize": 100000,
    "ResumeOnlyClusteredPKTables": true,
    "MaxFullLoadSubTasks": 8,
    "TransactionConsistencyTimeout": 600,
    "CommitRate": 10000
  },
  "Logging": {
    "EnableLogging": false
  },
  "ControlTablesSettings": {
    "ControlSchema": "",
    "HistoryTimeslotInMinutes": 5,
    "HistoryTableEnabled": false,
    "SuspendedTablesTableEnabled": false,
    "StatusTableEnabled": false
  },
  "StreamBufferSettings": {
    "StreamBufferCount": 3,
    "StreamBufferSizeInMB": 8
  },
  "ChangeProcessingDdlHandlingPolicy": {
    "HandleSourceTableDropped": true,
    "HandleSourceTableTruncated": true,
    "HandleSourceTableAltered": true
  }
}

You can find additional details on the task_settings.json file in the AWS documentation.

The table_mappings.json file specifies tables from a particular source schema that you want to replicate to the target endpoint. Table mappings offer a lot of flexibility to migrate specific tables and datasets to a target. Migration can also be rule-based on datasets that you want to filter when replicating data. These approaches provide powerful capabilities to granularly replicate datasets across database engines. For additional details, see the AWS documentation.

In our case, because we’re only replicating HumanResources data, we will specify that in our table_mappings.json file. The table-name value % tells DMS to replicate all tables for that particular schema.

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "1",
      "object-locator": {
        "schema-name": "HumanResources",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}

Now that we have the required settings, we can use create-replication-task to begin replicating data. Because we’re replicating data as changes happen on the source database, full-load-and-cdc should be selected as the migration type in our command. Doing this ensures that both the initial load and ongoing changes are replicated.

aws dms create-replication-task --replication-task-identifier sql-replicate --source-endpoint-arn arn:aws:dms:us-west-2:904672585901:endpoint:JT6WECWJJ4YSO7AKSISBJEJD6A --target-endpoint-arn arn:aws:dms:us-west-2:904672585901:endpoint:CA2UW3DRV5KI3FG45JW45ZEZBE --replication-instance-arn arn:aws:dms:us-west-2:904672585901:rep:LNQ6KQIT52DIAWLFQU5TCNXY4E --migration-type full-load-and-cdc --replication-task-settings 'file://task_settings.json' --table-mappings 'file://table_mappings.json'

Let’s verify task creation on the AWS Management Console:


The replication task is ready; let’s start it.

Once the task is run, the initial load will be completed, followed by any changes as they happen on the source database. On the task tab, choose Table statistics to verify tables and rows that were replicated to the target database with additional details, in our case HumanResources data.


Let’s verify data on the target PostgreSQL database:



Now that the source and target databases are being replicated, we’ll go ahead and insert a record into our source SQL Server database. Let’s insert a row with Operations as the Department and Group Name. Then we’ll verify that the change is propagated to the target PostgreSQL database:


On the DMS console under Tasks, Table Statistics, we immediately notice an entry in the Inserts column for the Department table, as follows:

Stats

Let's verify that the data is in fact on the target database. The new row has been replicated; notice Operations in the last row.

PG3
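
If you are checking from a SQL client rather than a GUI, a query along the following lines returns the new row. The identifier casing here is an assumption; it depends on how the target schema was created.

-- On the PostgreSQL target; quote the identifiers instead if mixed case was preserved
SELECT * FROM humanresources.department ORDER BY departmentid;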

Now let’s modify the table to add a new column to Department table on the source SQL Server database to verify data definition language (DDL) updates are also being replicated. We’ll add a new column, GroupNumber:

Properties
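
The change made through the table designer is roughly equivalent to the following T-SQL; the column's data type and nullability here are assumptions:

-- Add the new column on the source SQL Server database
ALTER TABLE HumanResources.Department ADD GroupNumber INT NULL;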

Let’s confirm the DDL update that we just made on the DMS console. Notice the DDL column has incremented to 1:

Query1

Finally, let’s confirm on the target PostgreSQL database:

PGDB

The target database has a new column, GroupNumber, added to the Department table.

Summary
By using the AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS), you can not only migrate databases across heterogeneous engines more easily but also perform continuous data replication. Because you can mix and match database engines for your source and target, you gain flexibility and can lower the cost of running your favorite database engines on the AWS Cloud, without complex migration tools or manual schema conversion.

 

Amazon DynamoDB at AWS re:Invent 2016—Wrap-Up

by Vrutik Ghai | on | in DynamoDB | | Comments

Vrutik Ghai is a product manager at Amazon Web Services.

We wrapped up an exciting AWS re:Invent. It was great to interact with current and future Amazon DynamoDB customers and hear their feedback and suggestions.

Multiple re:Invent breakout sessions highlighted DynamoDB. These sessions consisted of deep dives, best practices, and customer talks with real-life examples from industries like gaming, adtech, IoT, and others.

In case you missed a session, the following list provides links to the session recordings, along with the session abstracts, to give you an idea of what each session covers. We hope you find these videos useful as you take advantage of the performance and flexibility of DynamoDB in your applications.

  1. AWS Database State of the Union (DAT320)

    Speaker:
    Raju Gulabani, VP Database Services, AWS
    Session Abstract:
    In this session, Raju Gulabani, vice president of AWS Database Services (AWS), discusses the evolution of database services on AWS and the new database services and features we launched this year, and shares our vision for continued innovation in this space. We are witnessing an unprecedented growth in the amount of data collected, in many different shapes and forms. Storage, management, and analysis of this data requires database services that scale and perform in ways not possible before. AWS offers a collection of such database and other data services like Amazon Aurora, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon Kinesis, and Amazon EMR to process, store, manage, and analyze data. In this session, we provide an overview of AWS database services and discuss how our customers are using these services today.
  2. Introduction to Managed Database Services on AWS (DAT307)

    Speakers:
    Steve Hunt, Director of Infrastructure, FanDuel
    Alan Murray, Director of Architecture, FanDuel
    Robin Spira, CTO, FanDuel
    Darin Briskman, AWS Database Services
    Session Abstract:
    In this session, we look at questions such as: Which database is best suited for your use case? Should you choose a relational database or NoSQL or a data warehouse for your workload? Would a managed service like Amazon RDS, Amazon DynamoDB, or Amazon Redshift work better for you, or would it be better to run your own database on Amazon EC2? FanDuel has been running its fantasy sports service on Amazon Web Services (AWS) since 2012. We learn best practices and insights from FanDuel’s successful migrations from self-managed databases on EC2 to fully managed database services.
  3. Deep Dive on Amazon DynamoDB (DAT304)

    Speaker:
    Rick Houlihan, Principal TPM, DBS NoSQL
    Session Abstract:
    In this session, we explore Amazon DynamoDB capabilities and benefits in detail and learn how to get the most out of your DynamoDB database. We go over best practices for schema design with DynamoDB across multiple use cases, including gaming, AdTech, IoT, and others. We explore designing efficient indexes, scanning, and querying, and go into detail on a number of recently released features, including JSON document support, DynamoDB Streams, and more. We also provide lessons learned from operating DynamoDB at scale, including provisioning DynamoDB for IoT.
  4. Migrating from RDBMS to NoSQL: How PlayStation Network Moved from MySQL to Amazon DynamoDB (DAT318)

    Speakers:
    Nate Slater, Senior Manager, AWS Solutions Architecture
    Benedikt Neuenfeldt, Architect, SIE Inc.
    Aki Kusumoto, VP of NPS Development Department, SIE Inc.
    Session Abstract:
    In this session, we talk about the key differences between a relational database management service (RDBMS) and nonrelational (NoSQL) databases like Amazon DynamoDB. You will learn about suitable and unsuitable use cases for NoSQL databases. You’ll learn strategies for migrating from an RDBMS to DynamoDB through a five-phase, iterative approach. See how Sony migrated an on-premises MySQL database to the cloud with Amazon DynamoDB, and see the results of this migration.
  5. Migrating a Highly Available and Scalable Database from Oracle to Amazon DynamoDB (ARC404)

    Speaker:
    Shreekant Mandke, Software Development Manager, Amazon Marketplace
    Session Abstract:
    In this session, we share how an Amazon.com team that owns a document management platform that manages billions of critical customer documents for Amazon.com migrated from a relational to a nonrelational database. Initially, the service was built as an Oracle database. As it grew, the team discovered the limits of the relational model and decided to migrate to a nonrelational database. They chose Amazon DynamoDB for its built-in resilience, scalability, and predictability. We provide a template that customers can use to migrate from a relational data store to DynamoDB. We also provided details about the entire process: design patterns for moving from a SQL schema to a NoSQL schema; mechanisms used to transition from an ACID (Atomicity, Consistency, Isolation, Durability) model to an eventually consistent model; migration alternatives considered; pitfalls in common migration strategies; and how to ensure service availability and consistency during migration.
  6. How Toyota Racing Development Makes Racing Decisions in Real Time with AWS (DAT311)

    Speakers:
    Jason Chambers, Toyota Racing Development
    Philip Loh, Toyota Racing Development
    Martin Sirull, AWS
    Session Abstract:
    In this session, we learn how Toyota Racing Development (TRD) developed a robust and highly performant real-time data analysis tool for professional racing. Learn how TRD structured a reliable, maintainable, decoupled architecture built around Amazon DynamoDB as both a streaming mechanism and a long-term persistent data store. In racing, milliseconds matter and even moments of downtime can cost a race. We see how TRD used DynamoDB together with Amazon Kinesis and Amazon Kinesis Firehose to build a real-time streaming data analysis tool for competitive racing.
  7. Streaming ETL for RDS and DynamoDB (DAT315)

    Speakers:
    Greg Brandt, Liyin Tang, Airbnb
    Session Abstract:
    In this session Greg Brandt and Liyin Tang, Data Infrastructure engineers from Airbnb, discuss the design and architecture of Airbnb’s streaming ETL infrastructure, which exports data from RDS for MySQL and DynamoDB into Airbnb’s data warehouse, using a system called SpinalTap. We also discuss how we leverage Apache Spark Streaming to compute derived data from tracking topics and/or database tables, and HBase to provide immediate data access and generate cleanly time-partitioned Hive tables.
  8. How DataXu Scaled Its Attribution System to Handle Billions of Events per Day with Amazon DynamoDB (DAT312)

    Speakers:
    Padma Malligarjunan, AWS
    Yekesa Kosuru, DataXu
    Rohit Dialani, DataXu
    Session Abstract:
    “Attribution” is the marketing term of art for allocating full or partial credit to individual advertisements that eventually lead to a purchase, sign up, download, or other desired consumer interaction. In this session, DataXu shares how they used DynamoDB at the core of their attribution system to store terabytes of advertising history data. The system is cost-effective and dynamically scales from 0 to 300K requests per second on demand with predictable performance and low operational overhead.
  9. Cross-Region Replication with Amazon DynamoDB Streams (DAT201)

    Speakers:
    Carl Youngblood, Lead Engineer, Under Armour
    Prahlad Rao, Solutions Architect, AWS
    Session Abstract:
    In this session, Carl Youngblood, Lead Engineer of Under Armour, shares the keys to success as Under Armour implemented cross-region replication with Amazon DynamoDB Streams. The session also includes a quick recap of DynamoDB and its features.
  10. Building Real-Time Campaign Analytics Using AWS Services (DAT310)

    Speakers:
    Radhika Ravirala, Solutions Architect, AWS
    Nabil Zaman, Software Engineer, Quantcast
    Session Abstract:
    In this session, we talk about how Quantcast used AWS services including DynamoDB to implement real-time campaign analytics. Quantcast provides its advertising clients the ability to run targeted ad campaigns reaching millions of online users. The real-time bidding for campaigns runs on thousands of machines across the world. When Quantcast wanted to collect and analyze campaign metrics in real-time, they turned to AWS to rapidly build a scalable, resilient, and extensible framework. Quantcast used Amazon Kinesis streams to stage data, Amazon EC2 instances to shuffle and aggregate the data, and Amazon DynamoDB and Amazon ElastiCache for building scalable time-series databases. With Elastic Load Balancing and Auto Scaling groups, they are able to set up distributed microservices with minimal operation overhead. This session discusses their use case, how they architected the application with AWS technologies integrated with their existing home-grown stack, and the lessons they learned.
  11. How Fulfillment by Amazon (FBA) and Scopely Improved Results and Reduced Costs with a Serverless Architecture (DAT309)

    Speakers:
    Vlad Vlasceanu, Ganesh Subramaniam & Brandon Cuff, AWS
    Session Abstract:
    In this session, we share an overview of leveraging serverless architectures to support high performance data intensive applications. Fulfillment by Amazon (FBA) built the Seller Inventory Authority Platform (IAP) using Amazon DynamoDB Streams, AWS Lambda functions, Amazon Elasticsearch Service, and Amazon Redshift to improve results and reduce costs.  Scopely shares how they used a flexible logging system built on Amazon Kinesis, Lambda, and Amazon Elasticsearch Service to provide high-fidelity reporting on hotkeys in Memcached and DynamoDB, and drastically reduce the incidence of hotkeys. Both of these customers are using managed services and serverless architecture to build scalable systems that can meet the projected business growth without a corresponding increase in operational costs.
  12. 6 Million New Registrations in 30 Days: How the Chick-fil-A One App Scaled with AWS (DAT313)

    Speakers:
    Chris Taylor, Director, Customer Experience Architecture, Chick-fil-A
    Andrew Baird, Solutions Architect, AWS
    Session Abstract:
    In this session, Chris Taylor from Chick-fil-A shares how they managed to scale using AWS services. Chris leads the team providing back-end services for the massively popular Chick-fil-A One mobile app that launched in June 2016. Chick-fil-A follows AWS best practices for web services and leverages numerous AWS services, including AWS Elastic Beanstalk, Amazon DynamoDB, AWS Lambda, and Amazon S3. This was the largest technology-dependent promotion in Chick-fil-A history. To ensure their architecture would perform at unknown and massive scale, Chris worked with AWS Support through an AWS Infrastructure Event Management (IEM) engagement and leaned on automated operations to enable load testing before launch.
  13. How Telltale Games Migrated Its Story Analytics from Apache CouchDB to Amazon DynamoDB (DAT316)

    Speakers:
    Zac Litton, VP of Engineering, Telltale Games
    Greg McConnel, Solutions Architect, AWS
    Session Abstract:
    In this session, you’ll learn about Telltale Games’ migration from Apache CouchDB to Amazon DynamoDB, the challenges of adjusting capacity to handling spikes in database activity, and how Telltale Games streamlined its analytics storage to provide new perspectives of player interaction to improve its games. Every choice made in Telltale Games titles influences how your character develops and how the world responds to you. With millions of users making thousands of choices in a single episode, Telltale Games tracks this data and leverages it to build more relevant stories in real time as the season is developed.
  14. Capturing Windows of Opportunity: Real-Time Analytics for Less Than $1000? (DAT208)

    Speakers:
    Craig Stires, Head of Big Data & Analytics APAC, AWS
    Session Abstract:
    In this session, we look at how some AWS customers are using real-time analytics to capture windows of opportunity: a telco with a major promotion, an advertising retargeter with global demands, and a personal IoT provider with a lifestyle solution. We dig deeper into their architecture and look for common patterns that can be used to build a real-time analytics platform in a cost-optimized way. We even see how a light-load, real-time analytics system can be built for less than $1000.

Modifying Tasks and Table Mapping in the Console in AWS Database Migration Service

by Eran Schitzer | on | in DMS | | Comments

Eran Schitzer is a product manager in Amazon Web Services.

We’ve recently added two features in AWS Database Migration Service (AWS DMS)—modifying tasks and table mapping through the AWS Management Console.

These two new features represent a new step in our long journey to improve and enhance our migration and replication service based on the feedback we receive from our customers.

Let’s outline some highlights of modifying tasks and table mapping through the console.

Modifying tasks
In the past, to change a task's settings you had to use the AWS CLI, or delete and recreate the migration task in the AWS Management Console. Now you can modify most settings in the console while the task is stopped: simply select the task and choose Modify. You can also use the AWS CLI or the AWS DMS ModifyReplicationTask API operation.

This new capability gives you the flexibility to make the necessary adjustments to migration or replication projects without the need to delete and recreate a task. For example, you can now change the log level of a DMS task or change the table mapping between source and target databases.
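
As a sketch of the CLI route, a stopped task's settings or table mappings can be updated with modify-replication-task; the ARN and file names below are placeholders:

aws dms modify-replication-task --replication-task-arn <replication-task-arn> --replication-task-settings 'file://task_settings.json' --table-mappings 'file://table_mappings.json'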

For more information about this new feature, see the DMS documentation.

Table mapping through the console
With table mapping, you can specify the tables from a source database that you want to migrate to a target database. The simplest mapping is to migrate all the tables from a source database to a target without any modifications. You can also use table mapping to specify individual tables in the source database to migrate and the schema to use for the migration. In addition, you can use filters to specify what data from a given table column you want replicated to the target database.

Until this feature's addition, you performed table mapping by creating and modifying a JSON file. Now you can use the console to perform table mapping, including specifying table selection and transformations. In the console user interface, you use the Where section to specify the schema, table, and action (include or exclude). In addition, you can use transformations from the console to rename a schema or table, add a prefix or suffix to a schema or table, or remove a table column. (A sketch of the equivalent JSON appears after the procedure below.)

How to set up table-mapping selection rules for a table
The following procedure shows how to set up selection rules for a table called Customers in a schema called EntertainmentAgencySample. You create selection rules and transformations on the Guided tab, which only appears when you have a source endpoint that has schema and table information.

  1. Sign in to the AWS Management Console and choose AWS DMS.
  2. On the Dashboard page, choose Tasks.
  3. Choose Create Task.
  4. Type the task information, including Task name, Replication instance, Source endpoint, Target endpoint, and Migration type. Choose the Guided tab in the Table mappings section.
    CreateTask
  5. In the Table mapping section, choose the schema name and table name. You can use “%” as a wildcard value when specifying the table name. Then specify the action to take: include or exclude the data defined by the filter.
    StartTask
  6. Specify filter information using the Add column filter and the Add condition links. First, choose Add column filter to specify a column and conditions. Then, choose Add condition to add additional conditions. The following example shows a filter for the Customers table that includes AgencyID values between 01 and 85.
    SelectionRule
  7. When you have created the selection rules you want, choose Add selection rule.
  8. After you have created at least one selection rule, you can add a transformation to the task. To do so, choose add transformation rule.
    TableMappings
  9. Choose the target you want to transform and enter the additional information requested. The following example shows a transformation that deletes the AgencyStatus column from the Customers table.
    Rules
  10. Choose Add transformation rule.
  11. You can add additional selection rules or transformations by selecting add selection rule or add transformation rule. When you are finished, choose Create task.
    Guided
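
The selection and transformation rules built in this procedure correspond roughly to the following JSON. This is a sketch; the rule IDs and exact formatting may differ from what the console generates.

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "1",
      "object-locator": {
        "schema-name": "EntertainmentAgencySample",
        "table-name": "Customers"
      },
      "rule-action": "include",
      "filters": [
        {
          "filter-type": "source",
          "column-name": "AgencyID",
          "filter-conditions": [
            {
              "filter-operator": "between",
              "start-value": "01",
              "end-value": "85"
            }
          ]
        }
      ]
    },
    {
      "rule-type": "transformation",
      "rule-id": "2",
      "rule-name": "2",
      "rule-target": "column",
      "object-locator": {
        "schema-name": "EntertainmentAgencySample",
        "table-name": "Customers",
        "column-name": "AgencyStatus"
      },
      "rule-action": "remove-column"
    }
  ]
}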

For more information about table mapping using the console, see the DMS documentation.

For more information about AWS Database Migration Service in general, see our website.

Installing JD Edwards EnterpriseOne on Amazon RDS for Oracle

by Marc Teichtahl | on | in RDS Oracle | | Comments

Marc Teichtahl is a solutions architect at Amazon Web Services.

Amazon Relational Database Service (Amazon RDS) is a web service that makes it easier to set up, operate, and scale a relational database in the cloud. It provides cost-efficient, resizable capacity for an industry-standard relational database and manages common database administration tasks. Amazon RDS offers all the benefits of high availability, security, and scalability, while eliminating the need to manage infrastructure, upgrades, and updates, and removing the complexity of operating a highly available environment.

Recently, we have been asked by many enterprises to provide guidance on how to install Oracle JD Edwards EnterpriseOne on Amazon RDS. This blog post provides an overview of installation, highlighting the modifications required to ensure success. Accompanying this post is a detailed whitepaper, providing a step-by-step guide for this process.

Key elements of the installation process for JD Edwards EnterpriseOne on Amazon RDS for Oracle include creation of an Amazon RDS Oracle instance, configuration of the SQL Plus client, installation of the platform pack, and modification of the original installation scripts. Much of this process is straightforward. However, significant changes to the original installation scripts are required. This blog post describes those changes.

You can follow the standard JD Edwards installation process found in the JD Edwards EnterpriseOne Applications Installation Guide until section 6.8, “Understanding the Oracle Installation.” At this point, the installation process varies when using Amazon RDS for Oracle.

The standard installation process creates a series of installation scripts. However, the process assumes these scripts will be executed on the database server. In this case, we are targeting Amazon RDS, and we can’t execute installation scripts on the Amazon RDS instance itself.

For the functionality we require, we need to modify the standard installation scripts in four key ways:

  1. Move the data pump files from various directories on the deployment server installation media to the DATA_PUMP_DIR directory on the RDS instance using DBMS_FILE_TRANSFER.PUT_FILE. For more information, see the Amazon RDS documentation. (A SQL sketch of this and the other changes appears after this list.)
    DataFiles
  2. Amazon RDS only supports Oracle Managed Files (OMF) for data files, log files, and control files. Thus, when creating data files and log files, you can’t specify physical file names (for more information, see the Amazon RDS documentation).
    Make syntax changes to the “CREATE TABLESPACE” statements accordingly. Take the following example.
    CreateTableSpace
    This code must be modified as shown following.
    BigFileTableSpace
  3. Change the name of the pristine data dump file, and also update the import data scripts for the TEST and pristine environments. The standard scripts change the import directory; here, we change the file name instead. Make these changes to ORCL\InstallOracleDatabase.BAT, as shown following.
    Bat1
    Bat2
  4. Change the database grants to remove “create any directory” because this grant doesn’t work on Amazon RDS for Oracle. Amazon RDS for Oracle doesn’t support creating directories on the RDS instance.
    For example, take the following Create_dir.sql file statement.
    CreateSEssion
    Here, you change this statement to the statement following.
    CreateSEssion2
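
To make these changes concrete, the following is a hedged sketch of the kinds of statements involved. The file names, database link, tablespace name, sizes, and grant list are illustrative assumptions, not the exact contents of the JD Edwards installation scripts.

-- 1. Copy a dump file to DATA_PUMP_DIR on the RDS instance over a database link
BEGIN
  DBMS_FILE_TRANSFER.PUT_FILE(
    source_directory_object      => 'DATA_PUMP_DIR',
    source_file_name             => 'jde_pristine.dmp',
    destination_directory_object => 'DATA_PUMP_DIR',
    destination_file_name        => 'jde_pristine.dmp',
    destination_database         => 'rds_db_link');
END;
/

-- 2. Create tablespaces with Oracle Managed Files (no physical file names)
CREATE BIGFILE TABLESPACE jdedata DATAFILE SIZE 2G AUTOEXTEND ON MAXSIZE UNLIMITED;

-- 3. Grant privileges without CREATE ANY DIRECTORY, which is not available on Amazon RDS for Oracle
GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW TO jde_user;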

The preceding guidelines should give you enough insight into adapting JD Edwards EnterpriseOne installation to take advantage of the benefits of Amazon RDS for Oracle. For more granular step-by-step instructions, see the whitepaper.

Introducing the Aurora Storage Engine

by Steve Abraham | on | in Aurora | | Comments

Steve Abraham is a solutions architect at Amazon Web Services.

What Is Amazon Aurora?
Amazon Aurora is a MySQL-compatible relational database service that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. Amazon Aurora provides up to five times better performance than MySQL with the security, availability, and reliability of a commercial database at one tenth the cost. Aurora also has an upcoming PostgreSQL-compatible edition that is available in preview.

In this post, we are going to dive deep into one of the innovations that allows Aurora to deliver this performance, availability and reliability: the storage layer.

The Evolution of Database Storage
Many customers use directly attached storage (DAS) to house their database data. Larger enterprise customers use storage area networks (SAN). These approaches introduce a number of issues:

  • SANs are expensive. Many smaller customers cannot afford the physical SAN infrastructure or storage expertise required to manage a SAN.
  • Disk storage is complex to scale out for performance. Even when done well, there are still limitations to how far DAS and SAN can scale.
  • Both DAS and SAN infrastructure exist in a single physical data center. If this data center is compromised by a power outage or network interruption, the database server is unavailable.
  • If a flood, earthquake, or other natural disaster destroys the data center, the data might be lost. If a backup of the data has been kept offsite, it can take anywhere from minutes to days to get the database server back online in a new location.

Introducing the Amazon Aurora Storage Engine
Amazon Aurora was designed to take advantage of the cloud.

Conceptually, the Amazon Aurora storage engine is a distributed SAN that spans multiple AWS Availability Zones (AZs) in a region. An AZ is a logical data center made up of one or more physical data centers. Each AZ is isolated from the others, except for low-latency links that allow rapid communication with the other AZs in the region. The distributed, low-latency storage engine at the center of Amazon Aurora depends on this AZ architecture.

Storage Allocation
Amazon Aurora currently builds its storage volumes in 10 GB logical blocks called protection groups. The data in each protection group is replicated across six storage nodes. Those storage nodes are then allocated across three AZs in the region in which the Amazon Aurora cluster resides.

When the cluster is created, it consumes very little storage. As the volume of data increases and exceeds the currently allocated storage, Aurora seamlessly expands the volume to meet the demand and adds new protection groups, as necessary. Amazon Aurora continues to scale in this way until it reaches its current limit of 64 TB.
StorageAllocation

How Amazon Aurora Handles Writes
When data is written to Amazon Aurora, it is sent to six storage nodes in parallel. Those storage nodes are spread across three Availability Zones, which significantly improves durability and availability.

On each storage node, the records first enter an in-memory queue, where log records are deduplicated. For example, if the master node successfully writes to the storage node but the connection between the two is interrupted, the master node retransmits the log record, and the storage node discards the duplicate. The records that are kept are then hardened to disk in the hot log.

After the records are persisted, the log record is written to an in-memory structure called the update queue. In the update queue, log records are first coalesced and then used to create data pages. If one or more log sequence numbers (LSNs) are determined to be missing, the storage nodes employ a protocol to retrieve the missing LSN(s) from other nodes in the volume. After the data pages are updated, the log record is backed up and then marked for garbage collection. Pages are then asynchronously backed up to Amazon S3.

When the write becomes durable by being written to the hot log, the storage node will acknowledge receipt of the data. After at least four of the six storage nodes acknowledge receipt, the write is considered successful, and acknowledgement is returned to the client application.

One of the reasons Amazon Aurora can write so much faster than other engines is that it only sends log records to the storage nodes and those writes are done in parallel. In fact, on average, compared to MySQL, Amazon Aurora requires approximately 1/8 the IOPS for a similar workload, even though data is being written to six different nodes.

All of the steps in this process are asynchronous. Writes are performed in parallel using group commit to reduce the latency and improve throughput.

Given the low write latency and the reduced I/O footprint, Amazon Aurora is ideally suited for write-intensive applications.

Fault Tolerance
The following diagram shows data stored in three AZs. The replicated data is continuously backed up to S3, which is designed to provide 99.999999999% durability.

FaultTolerance

This design makes it possible for Amazon Aurora to withstand the loss of an entire AZ or two storage nodes without any availability impact to the client applications.

ReadWriteAvailability

During recovery, Aurora is designed to sustain the loss of an AZ and yet another storage node in a protection group without any loss of data. This is because a write is made durable in Aurora only when it has been persisted to four storage nodes in a protection group. Even if three storage nodes that received the write go down, Aurora can still recover the write from the fourth storage node. To achieve read quorum during recovery, Aurora makes sure that three storage nodes are caught up to the same LSN. However, in order to open the volume for writes, Aurora has to wait for four storage nodes to recover so that it can achieve write quorum for future writes.

ReadAvailability

For reads, Aurora tries to find the closest storage node to fulfill the read. Each read request is associated with a timestamp, namely an LSN. A storage node can fulfill the read if it is caught up to the LSN (in other words, if it has received all of the updates up to that LSN).

If one or more storage nodes are down or are otherwise unable to communicate with the other storage nodes, those nodes use the gossip protocol to resynchronize themselves when they come back online. If a storage node is lost and a new node is created automatically to take its place, it is also synchronized through the gossip protocol.

With Amazon Aurora, compute and storage are decoupled. This allows read replicas to act as compute interfaces to the storage layer without keeping any of the data on the replica itself, so they can start serving traffic as soon as the instance comes online, without having to synchronize any data. Likewise, the loss of a read replica has no impact on the underlying data, and when a read replica is promoted to master, there is no data loss. With support for up to 15 read replicas and built-in load balancing across those replicas, Amazon Aurora is ideally suited for highly available, read-intensive workloads.
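
Because replicas share the cluster's storage volume, adding one is simply a matter of adding another instance to the cluster. The following AWS CLI call is a sketch; the identifiers and instance class are placeholders:

# Add an Aurora Replica to an existing cluster; it reads from the shared storage volume
aws rds create-db-instance --db-instance-identifier my-aurora-replica --db-cluster-identifier my-aurora-cluster --engine aurora --db-instance-class db.r3.large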

Backup and Recovery
With Amazon Aurora, data is continuously backed up to S3 in real time, with no performance impact to the end user. This eliminates the need for backup windows and automated backup scripts. It also allows users to restore to any point in the user-defined backup retention period. In addition, because all backups are stored in S3, which is highly durable storage, replicated across multiple data centers and designed to provide 99.999999999% durability, the risk of data loss is reduced significantly.

Sometimes users would like to take a snapshot of the data in their Aurora cluster at a specific point in time. Amazon Aurora allows users to do this with no impact on performance.
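
As a sketch of what this looks like from the AWS CLI, you can take a manual cluster snapshot or restore a cluster to a point in time within the retention period. The identifiers and timestamp below are placeholders, and a restored cluster still needs a DB instance added before you can connect to it.

# Take a manual snapshot of the cluster
aws rds create-db-cluster-snapshot --db-cluster-identifier my-aurora-cluster --db-cluster-snapshot-identifier my-aurora-snapshot

# Restore the cluster to a specific point in time as a new cluster
aws rds restore-db-cluster-to-point-in-time --source-db-cluster-identifier my-aurora-cluster --db-cluster-identifier my-aurora-cluster-restored --restore-to-time 2017-01-15T09:45:00Z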

When restoring from an Amazon Aurora backup, each 10 GB protection group restores in parallel. This means that the time it takes to restore a 64 TB backup is similar to the time it takes to restore a 10 GB backup. Additionally, after the protection groups have been restored, there is no need to reapply logs. This means that Amazon Aurora can work at peak performance as soon as the protection groups are restored.

Security
Amazon Aurora allows users to encrypt all data at rest using industry standard AES-256 encryption. Users can also manage keys by using the AWS Key Management Service (AWS KMS). Data in flight can be secured through a TLS connection to the Amazon Aurora cluster. In addition to these features, Amazon Aurora also holds multiple certifications/attestations, including SOC 1, SOC 2, SOC 3, ISO 27001/9001, ISO 27017/27018, and PCI.
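
As a sketch, encryption at rest is enabled when the cluster is created. The following AWS CLI call uses placeholder identifiers, credentials, and KMS key, and is only one of several ways to enable these features:

# Create an Aurora cluster with storage encryption using a KMS key
aws rds create-db-cluster --db-cluster-identifier my-encrypted-aurora --engine aurora --master-username myadmin --master-user-password <password> --storage-encrypted --kms-key-id <kms-key-arn>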

Summary
Amazon Aurora was designed for the cloud. By distributing data across multiple Availability Zones, Amazon Aurora provides an extremely durable and highly available database engine. Thanks to innovations in how and where data is stored, both read and write throughput are significantly improved over other relational database engines, as is the speed with which database restore operations complete. Finally, because Amazon Aurora is a managed service, AWS handles the undifferentiated heavy lifting of backups, patching, and storage management.

Get started with Amazon Aurora today: https://console.aws.amazon.com/rds/home?region=us-east-1#launch-dbinstance:ct=gettingStarted. For a deeper dive into Amazon Aurora, including some of the latest features, check out the Amazon Aurora webinar at https://m.youtube.com/watch?v=VrWH518sObk