Amazon Redshift’s launch in 2012 was one of the “wow!” moments in my experience with AWS. Here was a massively parallel database system that could be rented for 25 cents per node-hour. By comparison, in the early 90s I worked with a similar system that had 64 nodes, a then-astronomical 512 GB of disk, and cost three million dollars. And the financial services company that I worked for at the time thought it was a bargain, because it could run analysis queries that no contemporary Oracle or Sybase system could even attempt.

Here we are in 2021, and AWS has just announced Redshift Serverless, in which you pay for the compute and storage that you use, rather than a fixed monthly cost for a fixed number of nodes with a fixed amount of storage. And for a lot of use-cases, I think that’s a great idea.

Unfortunately, there’s not a lot of information out there. AWS released a blog post that announced the service, and if you’re signed up for re:Invent virtual, the ANT216 session goes into a little more detail, including comparisons with (the existing) Provisioned Redshift. There’s also a new section in the Cluster Management guide, published after I started writing this post.

As I said, I think it’s a great idea, and one that could be useful to several of Chariot’s clients, so I decided to spend some of the $500 in “new user” credits that AWS provides and kick the tires. I don’t currently have access to a production-scale dataset, so my performance numbers are based on dummy data and should be taken with a grain of salt. Instead, I look at the user experience, from the perspective of a person who’s been working with Redshift for many years.

## Decision Support versus OLTP, or “Why Redshift”

Before I get started, let’s set some context. If you’re looking at Redshift with the background of an online transaction processing (OLTP) database developer, it may seem very strange: where are the indexes?

An OLTP application typically “touches” only a few rows for each transaction. For example, if you have an eCommerce application and a customer adds an item to their cart, this translates into a single insert operation: add a row to the USER_CART table. That row is probably very small: it contains the user’s ID, the ID of a row in the PRODUCT table, and a quantity. Your database will be set up with foreign-key constraints to ensure that your code can’t insert a row with an invalid user or product ID.

In a decision support database, by comparison, you tend to write queries that summarize entire tables, perhaps with joins to other tables. For example, “select total revenue by month for the last year, stratified by how long the user had been a customer at the time of their first transaction.”
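To make the contrast concrete, here’s a minimal SQL sketch of the two access patterns. The USER_CART and PRODUCT names come from the example above; the USERS table, the ORDERS table, and all column names are assumptions made for illustration. (The DDL is written for an OLTP database that enforces foreign keys; Redshift accepts REFERENCES clauses but treats them as informational only.)

```sql
-- OLTP access pattern: a cart addition is one small insert, validated
-- by foreign keys. Column names here are assumptions for this sketch.
CREATE TABLE USER_CART (
    user_id     BIGINT  NOT NULL REFERENCES USERS (user_id),
    product_id  BIGINT  NOT NULL REFERENCES PRODUCT (product_id),
    quantity    INTEGER NOT NULL
);

INSERT INTO USER_CART (user_id, product_id, quantity)
VALUES (12345, 67890, 2);

-- Decision support access pattern: one plausible rendering of the
-- "revenue by month, stratified by customer tenure" query, against an
-- assumed ORDERS table (user_id, order_ts, total).
SELECT DATE_TRUNC('month', o.order_ts)          AS order_month,
       DATEDIFF(month, f.first_ts, o.order_ts)  AS months_a_customer,
       SUM(o.total)                             AS revenue
FROM   orders o
JOIN  (SELECT user_id, MIN(order_ts) AS first_ts
       FROM   orders
       GROUP BY user_id) f USING (user_id)
WHERE  o.order_ts >= DATEADD(year, -1, CURRENT_DATE)
GROUP  BY 1, 2
ORDER  BY 1, 2;
```

Note that the analytical query has no predicate an index could usefully satisfy: it has to look at every order from the past year.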
These sorts of queries require a completely different execution engine than that used for OLTP. For one thing, most decision support queries start with a table-scan, something that’s anathema to OLTP developers. Joins are typically merges, rather than the index-based “nested loops” of an OLTP database: a decision support database takes two large rowsets, orders them by the join column, and then merges the rows where that column matches. To make this performant, decision support databases are built on two main abstractions: columnar data storage and massively parallel processing.

All relational databases deal in terms of rows and columns. An OLTP database stores all of the columns for a single row together, while a decision support database stores each column separately, as a distinct list of values. If you need two columns from the same table, you look at the same position in two lists. This has big benefits for both performance and storage efficiency.

Most decision support queries access only a few columns in a table, and benefit by not reading data for other columns. For example, you might want to retrieve a year’s worth of orders based on their timestamp. A typical row-based OLTP database would recognize that an index doesn’t help that query, so would perform a table-scan instead. But each row in that OLTP table might be a few hundred bytes, which translates to reading tens of gigabytes from disk for that scan. By comparison, a columnar storage database keeps all of the timestamps together, at a cost of roughly eight bytes per value. It can scan all of the timestamps in the table by reading a few tens of megabytes from disk; the sketch below puts rough numbers on this.

Even with columnar data storage, it takes time to scan large tables.
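To put numbers behind that comparison, here’s a hedged back-of-the-envelope sketch. The orders table name, the 50-million-row count, and the ~400-byte row width are all assumptions chosen to match the magnitudes above; the compression behavior is standard Redshift column encoding.

```sql
-- Hypothetical: 50 million orders, each row a few hundred bytes wide.
--
-- Row-oriented storage must read whole rows to inspect the timestamp:
--   50,000,000 rows x ~400 bytes/row ≈ 20 GB scanned
--
-- Columnar storage reads only the timestamp column:
--   50,000,000 values x 8 bytes ≈ 400 MB raw, and a compressed
--   timestamp column (for example, delta-encoded along the sort key)
--   can shrink that to a few tens of megabytes.
SELECT COUNT(*)
FROM   orders
WHERE  order_ts BETWEEN '2021-01-01' AND '2021-12-31 23:59:59';
```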