8/26/2023

Redshift distribution keys

There are a couple of themes when looking at query performance on Redshift.

Think about table distribution keys and sort keys and how they affect queries. How a table is architected determines which query plans Redshift can use, so keeping that in mind helps you both define tables and query them effectively. Knowing what your distribution keys and sort keys are when joining tables can help you write better queries. The grander point is to keep access patterns in mind when architecting tables. Before making optimization decisions, think about the data volume (result set size), the query frequency, and the downstream impact of optimizing towards those operations. If you don't have a clear vision of all access patterns, start with DISTSTYLE EVEN and build a baseline of access patterns on your cluster to optimize against.

Just because your table is giant doesn't necessarily mean that the results you are extracting (or the data being scanned) are giant. Often in EDWs (enterprise data warehouses) we are actually just pulling out the latest week's or month's data. In terms of query optimization, think in terms of how much data you are reading, not the total size of the table. Keep sort keys and the columnar architecture in mind, since they determine how efficiently you can read data from tables.

When possible, filter data as early as possible in your query. If you only need a month of data, don't pass the entire table around until the very end. While this takes more effort upfront in the query-writing process, habits such as this will help ensure you are using the cluster more efficiently.

In our sales dashboard, we like to focus on the most recent 12 months of orders, so let's add an order date filter, run the query, and check how it's executed:

-> XN Hash Join DS_DIST_ALL_NONE (cost = 14923.

Next, make the order date the sort key:

alter table orders alter COMPOUND sortkey (o_orderdate);

Re-run the query, and now Redshift scans much less data.
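As a rough sketch of the sort key and early-filtering advice, the statements below reuse the table and column names that appear in this post's plan fragments (orders, customer, customer_address, o_orderdate, o_custkey, c_customer_sk, c_current_addr_sk, ca_address_sk, ca_country). The shape of the dashboard query is an assumption, since the original query is not shown, and o_orderdate is assumed to be a date or timestamp column.

```sql
-- Sort the orders table by order date so range filters on o_orderdate
-- can skip blocks that fall outside the requested window.
ALTER TABLE orders ALTER COMPOUND SORTKEY (o_orderdate);

-- Hypothetical dashboard query: restrict orders to the last 12 months
-- before joining, and ask for the plan instead of running the query.
EXPLAIN
WITH recent_orders AS (
    SELECT o_custkey
    FROM orders
    WHERE o_orderdate >= DATEADD(month, -12, CURRENT_DATE)
)
SELECT d.ca_country, COUNT(*) AS order_count
FROM recent_orders o
JOIN customer c ON o.o_custkey = c.c_customer_sk
JOIN customer_address d ON c.c_current_addr_sk = d.ca_address_sk
WHERE d.ca_country = 'United States'
GROUP BY d.ca_country;
```

Comparing the row estimates on the orders scan step before and after the sort key change is one way to confirm that less data is being read. The planner may well push the date predicate down on its own, so the CTE mainly serves to make the "filter first" intent explicit.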
When specifying DISTSTYLE KEY, the data is distributed by the values in the DISTKEY column. For more information, see Choosing a Data Distribution Style.

The general rule for Amazon Redshift is:

Set the DISTKEY to the column most commonly used in JOIN
Set the SORTKEY to the column most commonly used in WHERE

A table with 12 million rows is not very big for Redshift, so even if you get the keys wrong, it will still run very well. I would recommend a DISTKEY of recordid (since it seems to be often JOINed).

Nowadays you can use ALTER TABLE tablename ALTER DISTSTYLE KEY DISTKEY columnname, and the table will be redistributed on the cluster. It should be followed up by VACUUM SORT ONLY tablename.

Notice the join strategy in the query plan: DS_BCAST_INNER (the inner table is broadcast to every node) and DS_DIST_OUTER (the outer table is redistributed) indicate that a lot of data shuffling happened.

-> XN Hash Join DS_DIST_OUTER (cost = 14923.
Hash Cond: ("outer".o_custkey = ("inner".c_customer_sk)::bigint)
-> XN Hash Join DS_DIST_NONE (cost = 84157.
Hash Cond: ("outer".c_current_addr_sk = "inner".ca_address_sk)
-> XN Seq Scan on customer_address d (cost = 0.
00 rows = 969354 width = 10)
Filter: ((ca_country)::text = 'United States'::text)

With this Advisor update, Amazon Redshift can now determine the appropriate distribution key by constructing a graph representation of the SQL join history and optimizing for the data transferred across nodes when joins occur.
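A minimal sketch of applying the DISTKEY rule to the tables named in the plan fragments, assuming orders and customer are most often joined on o_custkey = c_customer_sk and that customer_address is small enough to copy to every node. Neither assumption is confirmed by the post, so check both against your own workload before changing anything.

```sql
-- Co-locate rows that join on the customer key (assumed join pattern).
ALTER TABLE orders   ALTER DISTSTYLE KEY DISTKEY o_custkey;
ALTER TABLE customer ALTER DISTSTYLE KEY DISTKEY c_customer_sk;

-- Assumption: customer_address is small, so keep a full copy on every node.
ALTER TABLE customer_address ALTER DISTSTYLE ALL;

-- Follow up with a sort-only vacuum after changing the distribution key.
VACUUM SORT ONLY orders;
VACUUM SORT ONLY customer;

-- Check the resulting distribution style, first sort key, and row skew.
SELECT "table", diststyle, sortkey1, skew_rows
FROM svv_table_info
WHERE "table" IN ('orders', 'customer', 'customer_address');
```

Re-running EXPLAIN on the join afterwards should show whether the redistribution labels (DS_BCAST_INNER, DS_DIST_OUTER) have given way to DS_DIST_NONE or DS_DIST_ALL_NONE. A high skew_rows value would suggest the chosen key spreads rows unevenly across slices.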