
Forcing Smart Scans on Exadata – is the _serial_direct_read parameter safe to use in production?


One of the most common Exadata performance problems I see is that direct path reads (and thus also Smart Scans) sometimes don't kick in when running full scans in serial sessions. This is because from Oracle 11g onwards, the serial full segment scan IO path decision is made dynamically, at runtime, for every SQL execution – and for every segment (partition) separately. Whether you get a direct path read & smart scan depends on the current buffer cache size, how big the segment you're about to scan is and how much of that segment is currently cached. Note that the automatic IO path decision for index fast full scans is slightly different from that for table scans.

This dynamic decision can unfortunately cause unexpected surprises and variance in your report/batch job runtimes. Additionally, it looks like the SELECT part of your UPDATE/DELETE statements (the part that finds the rows to update/delete) never automatically gets direct path read/smart scan chosen – by design! So, while your SELECT statement may use a smart scan and be really fast, the same select operation in an INSERT SELECT (or UPDATE/DELETE) context will not end up using smart scans by default. There's even a bug about this – closed as "not a bug" (Exadata Smartscan Is Not Being Used On Insert As Select [Article ID 1348116.1]).

To work around these problems and force a direct path read/smart scan, you can either:

  1. Run your query in parallel, as parallel full segment scans use direct path reads – unless parallel_degree_policy = AUTO, in which case you may still get buffered reads thanks to the dynamic in-memory parallel execution decision of Oracle 11.2
  2. Run your query in serial, but force serial direct path reads by setting _serial_direct_read = TRUE (or ALWAYS in 11.2.0.2+) – see the session-level example right after this list
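For example, a minimal session-level sketch (note that, as an underscore parameter, the name has to be double-quoted):

-- force direct path reads (and thus smart scan eligibility) for serial
-- full segment scans in this session only; AUTO returns to the default decision
ALTER SESSION SET "_serial_direct_read" = ALWAYS;

-- on 11.2.0.1 and earlier, TRUE is the equivalent setting (see the comments later in this post)
-- ALTER SESSION SET "_serial_direct_read" = TRUE;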

Here are the valid options for this parameter in 11.2.0.2+

SQL> @pvalid _serial_direct_read

Display valid values for multioption parameters matching "_serial_direct_read"...

  PAR# PARAMETER                                                 ORD VALUE                          DEFAULT
------ -------------------------------------------------- ---------- ------------------------------ -------
  1993 _serial_direct_read                                         1 ALWAYS
       _serial_direct_read                                         2 AUTO
       _serial_direct_read                                         3 NEVER
       _serial_direct_read                                         4 TRUE
       _serial_direct_read                                         5 FALSE

And this leads to the question – as _serial_direct_read is an undocumented, hidden parameter – is it safe to use it in production?

In my mind, there are 3 kinds of parameters:

  1. Documented parameters – they should be safe and should work. If they don’t, it’s a bug and should get fixed by Oracle
  2. Undocumented parameters which nobody uses and knows much about
  3. Undocumented parameters which are documented in My Oracle Support (and are widely used in practice)

You shouldn't use category #2 parameters in production without a written blessing from Oracle Support – and even then I'd want to see some justification for why the recommended parameter ought to help.

The _serial_direct_read parameter, however, belongs to category #3 – it's an undocumented parameter, but widely used in practice and, more importantly (formally), documented in My Oracle Support. If you search MOS for _serial_direct_read, you'll find plenty of notes recommending it as a workaround – the best of them is MOS note Best Practices for OLTP on the Sun Oracle Database Machine [Article ID 1269706.1], which says this:

Direct reads bypass the buffer cache and go directly into the process PGA. Cell offload operations occur for direct reads only. Parallel query processes always use direct reads and therefore offload any eligible operations (ex: scans). For normal foreground processes running serially, the RDBMS decides whether direct or buffered reads are the best option based on both the state of the buffer cache and the size of the table. In the event that the RDBMS does not make the correct choice to use direct reads/offload processing, you can set _serial_direct_read=TRUE (available at session level). Keep in mind that it is possible that setting this parameter will make things worse if the app is better off with buffered reads, so make sure you know that you want direct reads/offload processing before setting it.

Comments:

  1. The abovementioned MOS note is actually incorrect in stating that "Parallel query processes always use direct reads" – in-memory parallel execution changes this, as explained above.
  2. The note recommends setting _serial_direct_read = TRUE, but I've used ALWAYS for clarity. Before 11.2.0.2, TRUE really meant "AUTO" (the dynamic decision) and FALSE meant NEVER. Starting from Oracle 11.2.0.2, TRUE = ALWAYS, FALSE = NEVER and AUTO means the dynamic decision (what TRUE used to mean on 11.2.0.1 and before)
  3. This parameter forces direct reads only for full segment scans (full table scan and fast full index scan variations); your "random" index lookups and range scans etc. will still read via the buffer cache regardless of this parameter – which is great for OLTP and mixed-workload systems
  4. The _serial_direct_read parameter controls the direct path read decision both for table and index segment scans

I'm not a fan of setting such parameters at the system level, but in the past I have created logon triggers in mixed-workload environments, where the OLTP sessions never do a direct path read (or I leave the decision automatic), while the reporting and batch sessions in the same database have forced direct path reads for their full segment scans.
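A minimal sketch of such a logon trigger is below – the service name REPORTING_SVC and the trigger name are made-up placeholders, so adapt the condition (service, username, module etc.) to your own environment before considering anything like this:

-- hypothetical example: force serial direct path reads only for sessions
-- connecting through a dedicated reporting/batch service
CREATE OR REPLACE TRIGGER force_dpr_for_reporting
AFTER LOGON ON DATABASE
BEGIN
  IF SYS_CONTEXT('USERENV', 'SERVICE_NAME') = 'REPORTING_SVC' THEN
    EXECUTE IMMEDIATE 'ALTER SESSION SET "_serial_direct_read" = ALWAYS';
  END IF;
END;
/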

This article served two purposes: talking about direct path reads and also evaluating whether such a cool undocumented parameter should be used in production at all. First, you must be able to argue why this parameter would help at all – and second, check whether it is at least documented as a valid workaround in some MOS article. If unsure, raise an SR and get Support's official blessing before taking the risk.


Getting the Most Out of ASH online seminar


Just a little reminder – next week (10-11th June) I'll be delivering my last training session before autumn – a short 1-day (2 x 0.5 days, actually) seminar about Getting the Most Out of Oracle's Active Session History. In the future it will act as a sort of prequel (or preparation) for my Advanced Oracle Troubleshooting class, as the latter deliberately goes very deep. The ASH seminar's 1st half is mostly about the GUI way of troubleshooting the usual performance problems (EM/Grid Control) and the 2nd half is about all my ASH scripts for diagnosing more complex stuff.

P.S. I'll also have several pieces of very cool news in a few months ;-)

Oracle Database 12c R1 (12.1.0.1.0) is finally released!


You can download it from http://edelivery.oracle.com

And now I (and other beta testers) can finally start writing about the new cool features in it! :)

Looks like only the Linux x86-64 and Solaris x86-64 + SPARC ports are available at first (as usual).

(just a screenshot below, you’ll need to go to http://edelivery.oracle.com to download the files)

[Screenshot: the Oracle 12c download files listed on edelivery.oracle.com]

 

Update: The Oracle 12c Documentation is now also available on the OTN website.

 

Oracle 12c: Scalar Subquery Unnesting transformation


I promised to write about Oracle 12c new features quite a while ago (when 12c got officially released), but I was actually on (a long) vacation then, and so many cool 12c-related white papers and blog entries started popping up that I took it easy for a while. I plan to focus on the less known low-level internal details anyway, as you can see from this blog entry.

As far as I can remember, Oracle has been able to unnest regular subqueries since 8i and merge views since Oracle 8.0.

First, a little terminology:

Update: I have changed the terminology section below a bit, thanks to Jason Bucata's correction. A scalar subquery can indeed also be used in the WHERE clause (as you can see in the comments). So, I clarified below that this blog post compares "Scalar Subqueries in the SELECT projection list" to "regular non-scalar Subqueries in the WHERE clause". I am writing a Part 2 to explain scalar subqueries in the WHERE clause.

  1. A “regular” subquery in Oracle terminology is a query block which doesn’t return rows back to the select projection list, but is only used in the WHERE clause for determining whether to return any rows from the parent query block, based on your specified conditions, like WHERE v IN (SELECT x FROM q) or WHERE EXISTS (SELECT x FROM inner WHERE inner.v=outer.v) etc.
    The subquery can return zero to many rows for the parent query block for evaluation.
  2. A scalar subquery is a subquery which can only return a single value (single row, single column) back to the parent block. It can be used either in the WHERE clause of the parent query or right in the SELECT list of the parent query in place of a column name. (In this post I am discussing only the case of a scalar subquery in the SELECT clause.) Whatever the scalar subquery returns will be put in the "fake" column in the query result set. If it returns more than one row, you will get the ORA-01427: single-row subquery returns more than one row error; if it returns no rows for the given lookup, the result will be NULL. An example is below.

I crafted a very simple SQL with a scalar subquery for demo purposes (the tables are just copies of all_users and all_objects views):

SELECT
    u.username
  , (SELECT MAX(created) FROM test_objects o WHERE o.owner = u.username)
FROM
    test_users u
WHERE
    username LIKE 'S%'

Now, why you would want to write the query this way is a different story. I think it's actually pretty rare that you need to use a scalar subquery; you can usually get away with an outer join. I have used scalar subqueries for populating some return values in cases where adding yet another (outer) join to the query would complicate the query too much for my brain (as there are some limitations on how you can arrange outer joins). I have only done this when I know that the query result set (on which the scalar subquery is executed once for every row returned, unless the subquery caching kicks in!) will only return a handful of rows and the extra effort of running the scalar subquery once for each row is acceptable.
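For comparison, here is a rough sketch of the equivalent outer join rewrite of the demo query above (assuming usernames are unique in TEST_USERS – this is close to what the 12c transformation produces automatically, as shown later in this post):

-- manual outer-join equivalent of the scalar subquery version
SELECT
    u.username
  , MAX(o.created) max_created
FROM
    test_users u
  LEFT OUTER JOIN
    test_objects o ON o.owner = u.username
WHERE
    u.username LIKE 'S%'
GROUP BY
    u.username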

Nevertheless, inexperienced SQL developers (who come from the procedural coding world) write lots of scalar subqueries, even to the point of having every single column populated by a scalar subquery! And this can be a serious problem, as it breaks the query into separate non-mergeable chunks, which means that the CBO isn't as free to move things around – resulting in suboptimal plans.

So, starting from Oracle 12c (and maybe even 11.2.0.4?), the CBO transformation engine can unnest some types of the scalar subqueries and convert these to outer joins internally.

The following example is executed on Oracle 12.1.0.1.0 with explicitly disabling the scalar subquery unnesting (using the NO_UNNEST hint):

SELECT /*+ GATHER_PLAN_STATISTICS NO_UNNEST(@ssq) */
    u.username
  , (SELECT /*+ QB_NAME(ssq) */ MAX(created) FROM test_objects o WHERE o.owner = u.username)
FROM
    test_users u
WHERE
    username LIKE 'S%'
/

--------------------------------------------------------------------------------------------
| Id  | Operation                  | Name         | Starts | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |              |      1 |      5 |00:00:00.01 |       4 |
|   1 |  SORT AGGREGATE            |              |      5 |      5 |00:00:00.01 |    1365 |
|*  2 |   TABLE ACCESS STORAGE FULL| TEST_OBJECTS |      5 |  12538 |00:00:00.01 |    1365 |
|*  3 |  TABLE ACCESS STORAGE FULL | TEST_USERS   |      1 |      5 |00:00:00.01 |       4 |
--------------------------------------------------------------------------------------------

This somewhat funky-looking plan above (where you have multiple children directly under the SELECT STATEMENT rowsource operator) indicates that there’s some scalar subquerying going on in the SQL statement. From the Starts column we see that the outer TEST_USERS table has been scanned only once, but the TEST_OBJECTS table referenced inside the scalar subquery has been scanned 5 times, once for each row returned from the outer query (A-rows). So the scalar subquerying behaves somewhat like a FILTER loop, where it executes the subquery once for each row passed to it by the parent query.

So, let’s remove the NO_UNNEST subquery hint and hope for the 12c improvement to kick in:

SELECT /*+ GATHER_PLAN_STATISTICS */
    u.username
  , (SELECT MAX(created) FROM test_objects o WHERE o.owner = u.username)
FROM
    test_users u
WHERE
    username LIKE 'S%'
/

The resulting plan shows that indeed the transformation got done – the scalar subquery was unnested from being a loop under SELECT STATEMENT and the whole query got converted into an equivalent outer join followed by a GROUP BY:

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Starts | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |      1 |      5 |00:00:00.01 |     276 |       |       |          |
|   1 |  HASH GROUP BY              |              |      1 |      5 |00:00:00.01 |     276 |  1079K|  1079K|  874K (0)|
|*  2 |   HASH JOIN OUTER           |              |      1 |  12541 |00:00:00.03 |     276 |  1421K|  1421K| 1145K (0)|
|*  3 |    TABLE ACCESS STORAGE FULL| TEST_USERS   |      1 |      5 |00:00:00.01 |       3 |  1025K|  1025K|          |
|*  4 |    TABLE ACCESS STORAGE FULL| TEST_OBJECTS |      1 |  12538 |00:00:00.01 |     273 |  1025K|  1025K|          |
------------------------------------------------------------------------------------------------------------------------

Both tables have been scanned through only once and the total number of buffer gets has dropped from 1365 to 276 (but the new group by & hash join steps do likely use some extra CPU plus workarea memory). Nevertheless, in a real-life case, flattening and de-compartmentalizing a big query will open up more join order and data access optimization opportunities for the CBO, resulting in better plans.

Whenever the scalar subquery transformation is done, the following entry shows up in the optimizer trace:

*****************************
Cost-Based Subquery Unnesting
*****************************
SU: Unnesting query blocks in query block SEL$1 (#1) that are valid to unnest.
Subquery Unnesting on query block SEL$1 (#1)
SU: Performing unnesting that does not require costing.
SU: Considering subquery unnest on query block SEL$1 (#1).
SU:   Unnesting  scalar subquery query block SEL$2 (#2)
Registered qb: SEL$683B0107 0x637de5a0 (SUBQ INTO VIEW FOR COMPLEX UNNEST SEL$2)

And possibly later in the tracefile, this:

SU:   Unnesting  scalar subquery query block SEL$2 (#2)

When searching for the final query form in the CBO trace, we see it's just an outer join with a group by. The GROUP BY clause has U.ROWID in it too, possibly because I didn't have a primary key on the users table (you're welcome to rerun a modified test case and report the results).

Final query after transformations:******* UNPARSED QUERY IS *******

SELECT "U"."USERNAME" "USERNAME",
  MAX("O"."CREATED") "(SELECTMAX(CREATED)FROMTEST_OB"
FROM "TANEL"."TEST_OBJECTS" "O",
  "TANEL"."TEST_USERS" "U"
WHERE "U"."USERNAME" LIKE 'S%'
AND "O"."OWNER"(+)="U"."USERNAME"
AND "O"."OWNER"(+) LIKE 'S%'
GROUP BY "O"."OWNER",
  "U".ROWID,
  "U"."USERNAME"

Note that my query uses MAX(created) for fetching something from the scalar subquery. When I use COUNT(*) instead of MAX(created), the scalar subquery unnesting does not kick in, with the following message in the CBO tracefile:

SU: bypassed: Scalar subquery has null-mutating select item.

It looks like the transformation would not be (guaranteed to be) valid in this case – presumably because a scalar COUNT(*) returns 0 when no rows match, while the outer join + group by rewrite would still produce a NULL-extended row to count – so this query with COUNT(*) didn't get converted to the outer join + group by combo.
Also, when your scalar subquery can possibly return more than one row for a lookup value (which would raise an error during execution anyway), the transformation does not kick in either, with the following message in the CBO tracefile:

SU: bypassed: Scalar subquery may return more than one row.

This transformation is controlled by the _optimizer_unnest_scalar_sq parameter.
A small test case is here http://blog.tanelpoder.com/files/scripts/12c/scalar_subquery_unnesting.sql.
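If you want to compare plans with and without the transformation on a test system, a session-level toggle looks roughly like this (it's an undocumented parameter, so the usual caveats about production use from my earlier post apply):

-- disable scalar subquery unnesting for this session to compare plans
ALTER SESSION SET "_optimizer_unnest_scalar_sq" = FALSE;

-- re-enable the default 12c behavior afterwards
ALTER SESSION SET "_optimizer_unnest_scalar_sq" = TRUE;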

And that’s it, expect more cool stuff soon! ;-)

C-MOS: How to get rid of the annoying “The Page has Expired” dialog in My Oracle Support


So, how many of you do hate the dialog below?

[Screenshot: the My Oracle Support "Page Expired" dialog]

Good news – there is a fix! (or well, a hack around it ;)


The fix is actually super simple. The page expiration dialog that grays out the browser screen is just an HTML DIV with the ID DhtmlZOrderManagerLayerContainer, overlaying the useful content. If you want it to disappear, you need to delete or hide that DIV in the HTML DOM tree. The JavaScript code is below (I just updated it so it should work properly on IE too):

javascript:(function(){document.getElementById('DhtmlZOrderManagerLayerContainer').style.display='none';})()

Install the fix as a bookmarklet

For convenience, you can "install" this JavaScript code as a bookmarklet. Just drag this C-MOS link to the bookmarks bar (don't click on it here, drag it into place).

So, next time you see a MOS page expired in one of your many open browser tabs, you just click on the C-MOS bookmarklet in the bookmarks bar (instead of the OK button) and the grayed out dialog box disappears – without trying to reload the whole page (and failing). So you’ll still be able to read, copy and scroll the content.

Note 1: I didn’t try to prevent this (client-side) expiration from happening as you might still want to maintain your web session with MOS servers by clicking OK in the dialog.

Note 2: If you have some ad-blockers or javascript blockers enabled in your browser, the C-MOS link may be re-written by them to “javascript:void(0);” – so the bookmarklet wouldn’t do anything. In this case either temporarily disable the blocker and refresh this page (and then drag the link to a bookmarklet position) or just add it manually with the code above.

Note 3: As I updated the bookmarklet to support IE too, the new code is a bit different. The old code, for those who are interested, is here: javascript:$(DhtmlZOrderManagerLayerContainer).remove();

If you wonder why I picked the name C-MOS for this little script, just pronounce it and you'll see ;-)

Enkitec is a finalist for the UKOUG Engineered Systems Partner of the Year Award


Enkitec has made it to the shortlist of UKOUG Partner of the Year Awards, in the Engineered Systems category. So if you like what we have done in the Exadata and Engineered Systems space, please cast your vote! :-)

Note that you need to be an Oracle user – and use your company email address – in order to vote (the rules are explained here).

Thanks!!!

Scalar Subqueries in Oracle SQL WHERE clauses (and a little bit of Exadata stuff too)


My previous post was about Oracle 12c SQL scalar subquery transformations. Actually, I need to clarify its scope a bit: the previous post was about scalar subqueries inside a SELECT projection list only (meaning that to populate a field in the query resultset, a subquery gets executed once for each row returned to the caller, instead of returning a "real" column value passed up from a child rowsource).

I did not cover another use case in my previous post – it is possible to use scalar subqueries in the WHERE clause too, for filtering the resultset, so let's see what happens in that case!

Note that the tests below are run on an Oracle 11.2.0.3 database (not 12c as in the previous post), because I want to add a few Exadata details to this post – and as of now, 18th August 2013, Smart Scans don't work with Oracle 12c on Exadata. This will of course change once the first Oracle 12c patchset is released, but that will probably happen sometime next year.

So, let's look into the following simple query. The bold red part is the scalar subquery (well, as long as it returns 0 or 1 rows – if it returns more, you'll get an error during query execution). I'm searching for "objects" from a test_objects_100m table (with 100 million rows in it), but I only want to process the rows where the object's owner name is whatever the subquery on the test_users table returns. I have also disabled Smart Scans for this query so that the database behaves more like a regular non-Exadata DB for now:

SELECT /*+ MONITOR OPT_PARAM('cell_offload_processing', 'false') */
    SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner))
FROM
    test_objects_100m o
WHERE
    o.owner = (SELECT u.username FROM test_users u WHERE user_id = 13)
/

Note the equals (=) sign above – I'm simply looking for a single, noncorrelated value from the subquery; it's not a more complex (and unpredictable!) IN or EXISTS subquery. Let's see the execution plan, paying attention to the table names and execution order below:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |     1 |    33 |   405K  (1)| 00:52:52 |
|   1 |  SORT AGGREGATE                        |                   |     1 |    33 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL            | TEST_OBJECTS_100M |  7692K|   242M|   405K  (1)| 00:52:52 |
|*  3 |    TABLE ACCESS STORAGE FULL FIRST ROWS| TEST_USERS        |     1 |    12 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

   2 - filter("O"."OWNER"= (SELECT "U"."USERNAME" FROM "TEST_USERS" "U" WHERE "USER_ID"=13))
   3 - filter("USER_ID"=13)

That sure is a weird-looking execution plan, right?

The TABLE ACCESS FULL at line #2 has a child rowsource which also happens to be a TABLE ACCESS FULL feeding rows to the parent? Well, this is what happens when the query transformation engine pushes the scalar subquery "closer to the data" it's supposed to be filtering, so that the WHERE user_id = 13 subquery result (line #3) gets evaluated once, first in the SQL execution data flow pipeline. Actually it's slightly more complex: before evaluating the subquery at #3, Oracle makes sure that there's at least one row to be retrieved (and surviving any simple filters) from the table in #2. In other words, thanks to evaluating the scalar subquery first, Oracle can filter rows based on the subquery output value at the earliest possible point – right after extracting the rows from the data blocks (using the codepath in the data layer). And as you'll see later, it can even push the resulting value to the Exadata storage cells for even earlier filtering in there.

The first place where I usually look, when checking whether some transformation magic was applied to the query, is the outline hints section of the execution plan (which you can get with the ADVANCED or +OUTLINE options of DBMS_XPLAN):

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('11.2.0.3')
      DB_VERSION('11.2.0.3')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$2")
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "O"@"SEL$1")
      PUSH_SUBQ(@"SEL$2")
      FULL(@"SEL$2" "U"@"SEL$2")
      END_OUTLINE_DATA
  */
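One way to get output like the above is with a DBMS_XPLAN call along these lines (a sketch – this displays the last statement executed in the current session):

-- show the full plan of the last executed statement, including the outline hints
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ADVANCED'));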

Indeed, there's a PUSH_SUBQ hint in the outline section. As subqueries in a WHERE clause exist solely for producing data for filtering the parent query block's rows, PUSH_SUBQ means that we push the subquery evaluation deeper in the plan – deeper than the parent query block's data access path, between the access path itself and the data layer, which extracts the data from the data blocks of tables and indexes. This should allow us to filter earlier, reducing the row counts at an earlier stage in the plan, thus not having to pass so many rows around in the plan tree – hopefully saving time and resources.

So, let's see what kind of plan we get if we disable that particular subquery pushing transformation, by changing the PUSH_SUBQ hint to NO_PUSH_SUBQ (most of the hints in Oracle 11g+ have NO_ counterparts, which are useful for experimenting and even as fixes/workarounds for optimizer problems):

SELECT /*+ MONITOR NO_PUSH_SUBQ(@"SEL$2") OPT_PARAM('cell_offload_processing', 'false') test3b */
    SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner))
FROM
    test_objects_100m o
WHERE
    o.owner = (SELECT u.username FROM test_users u WHERE user_id = 13)
/

And here’s the plan:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |     1 |    33 |   406K  (1)| 00:52:55 |
|   1 |  SORT AGGREGATE                        |                   |     1 |    33 |            |          |
|*  2 |   FILTER                               |                   |       |       |            |          |
|   3 |    TABLE ACCESS STORAGE FULL           | TEST_OBJECTS_100M |   100M|  3147M|   406K  (1)| 00:52:55 |
|*  4 |    TABLE ACCESS STORAGE FULL FIRST ROWS| TEST_USERS        |     1 |    12 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

   2 - filter("O"."OWNER"= (SELECT /*+ NO_PUSH_SUBQ */ "U"."USERNAME" FROM "TEST_USERS" "U" WHERE
              "USER_ID"=13))
   4 - filter("USER_ID"=13)

Now both tables are at the same level in the execution plan tree and it's the parent FILTER loop operation's task to fetch (all the) rows from its children and perform the comparison for filtering. See how the estimated row count from the larger table is now 100 million, as opposed to roughly 7 million in the previous plan. Note that these are just the optimizer's estimates, so let's look into the SQL Monitoring details for the real figures.

The original plan with the subquery pushing transformation used 16 seconds worth of CPU time – and roughly all of that CPU time was spent in execution plan line #2. Opening & "parsing" data block contents and comparing rows for filtering sure takes noticeable CPU time when done on a big enough dataset. The full table scan on the TEST_OBJECTS_100M table (#2) returned 43280 rows, after filtering with the help of the pushed subquery (#3) result value, to its parent operation (#1):

===============================================================================
| Elapsed |   Cpu   |    IO    | Application | Fetch | Buffer | Read  | Read  |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs  | Bytes |
===============================================================================
|      47 |      16 |       31 |        0.00 |     1 |     1M | 11682 |  11GB |
===============================================================================

SQL Plan Monitoring Details (Plan Hash Value=3981286852)
===========================================================================================================================================
| Id |                Operation                 |       Name        | Execs |   Rows   | Read  | Read  | Activity |    Activity Detail    |
|    |                                          |                   |       | (Actual) | Reqs  | Bytes |   (%)    |      (# samples)      |
===========================================================================================================================================
|  0 | SELECT STATEMENT                         |                   |     1 |        1 |       |       |          |                       |
|  1 |   SORT AGGREGATE                         |                   |     1 |        1 |       |       |          |                       |
|  2 |    TABLE ACCESS STORAGE FULL             | TEST_OBJECTS_100M |     1 |    43280 | 11682 |  11GB |   100.00 | Cpu (16)              |
|    |                                          |                   |       |          |       |       |          | direct path read (31) |
|  3 |     TABLE ACCESS STORAGE FULL FIRST ROWS | TEST_USERS        |     1 |        1 |       |       |          |                       |
===========================================================================================================================================

Unfortunately the standard SQL rowsource-level metrics don't tell us how many rows were retrieved from the tables during the full table scan (the table scan rows gotten V$SESSTAT metric would help there somewhat). Nevertheless, we happen to know that this scanned table contains 100M rows – and as I've disabled the Smart Scan offloading, we can assume that all these rows/blocks were scanned through.
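If you do want that extra detail, a quick sanity check is to query the current session's counter before and after running the statement – a minimal sketch (statistic names can vary slightly between versions):

-- rows returned by full table scans in this session so far
SELECT sn.name, ms.value
FROM   v$mystat ms, v$statname sn
WHERE  ms.statistic# = sn.statistic#
AND    sn.name = 'table scan rows gotten';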

Below are the metrics for the query with the NO_PUSH_SUBQ hint, so that the FILTER operation (#2) is responsible for comparing data and filtering the rows. Note that the full table scan at line #3 returns 100 million rows to the parent FILTER operation – which throws most of them away. Still, roughly 16 seconds of CPU time were spent in the full table scan operation (#3), while the FILTER operation (#2), where the actual filtering now takes place, used 2 seconds of CPU. This indicates that the actual data comparison / filtering code takes a small amount of CPU time compared to the cycles needed for extracting the data from the blocks (plus all kinds of checks, like calculating and verifying block checksums, as we are doing physical IOs here):

===========================================================================================================================================
| Id |                Operation                 |       Name        | Execs |   Rows   | Read  | Read  | Activity |    Activity Detail    |  
|    |                                          |                   |       | (Actual) | Reqs  | Bytes |   (%)    |      (# samples)      |
===========================================================================================================================================
|  0 | SELECT STATEMENT                         |                   |     1 |        1 |       |       |          |                       |  
|  1 |   SORT AGGREGATE                         |                   |     1 |        1 |       |       |          |                       |  
|  2 |    FILTER                                |                   |     1 |    43280 |       |       |     4.35 | Cpu (2)               |  
|  3 |     TABLE ACCESS STORAGE FULL            | TEST_OBJECTS_100M |     1 |     100M | 11682 |  11GB |    95.65 | Cpu (16)              |
|    |                                          |                   |       |          |       |       |          | direct path read (28) |  
|  4 |     TABLE ACCESS STORAGE FULL FIRST ROWS | TEST_USERS        |     1 |        1 |       |       |          |                       |  
===========================================================================================================================================

So far there's a noticeable, but not radical, difference in query runtime and CPU usage; nevertheless, the subquery pushing in the earlier example did help a little. And it sure looks better when the execution plan does not pass hundreds of millions of rows around (even if there are some optimizations to pass data structures by reference within a process internally). Note that these measurements are not very precise for short queries, as ASH samples session state data only once per second (you could rerun these tests with a 10 billion row table if you'd like more stable figures :)

Anyway, things get much more interesting when repeated with Exadata Smart Scan offloading enabled!

 

Runtime Difference of Scalar Subquery Filtering with Exadata Smart Scans

I’m running the same query, with both subquery pushing and smart scans enabled:

SELECT /*+ MONITOR PUSH_SUBQ(@"SEL$2") OPT_PARAM('cell_offload_processing', 'true') test4a */ 
    SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner)) 
FROM 
    test_objects_100m o 
WHERE 
    o.owner = (SELECT u.username FROM test_users u WHERE user_id = 13) 
/

Let’s see the stats:

=========================================================================================
| Elapsed |   Cpu   |    IO    | Application | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs  | Bytes | Offload |
=========================================================================================
|    2.64 |    0.15 |     2.49 |        0.00 |     1 |     1M | 11682 |  11GB |  99.96% |
=========================================================================================

SQL Plan Monitoring Details (Plan Hash Value=3981286852)
=========================================================================================================================================================
| Id |                Operation                 |       Name        | Execs |   Rows   | Read  | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                          |                   |       | (Actual) | Reqs  | Bytes | Offload |   (%)    |        (# samples)        |
=========================================================================================================================================================
|  0 | SELECT STATEMENT                         |                   |     1 |        1 |       |       |         |          |                           |
|  1 |   SORT AGGREGATE                         |                   |     1 |        1 |       |       |         |          |                           |
|  2 |    TABLE ACCESS STORAGE FULL             | TEST_OBJECTS_100M |     1 |    43280 | 11682 |  11GB |  99.96% |   100.00 | cell smart table scan (3) |
|  3 |     TABLE ACCESS STORAGE FULL FIRST ROWS | TEST_USERS        |     1 |        1 |       |       |         |          |                           |
=========================================================================================================================================================

Wow – it's the same table, same server, the same query, but with Smart Scans enabled it takes only 2.6 seconds to run the query (compared to the previous 47 seconds). The database level CPU usage has dropped over 100x – from 16+ seconds to only 0.15 seconds! This is because the offloading kicked in, so the storage cells spent their CPU time opening blocks, checksumming them and extracting the contents – and filtering in the storage cell, of course (storage cell CPU time is not accounted for in the DB level V$SQL views).

The Cell Offload Efficiency for the line #2 in the above plan is 99.96%, which means that only 0.04% worth of bytes, compared to the scanned segment size (~11GB), was sent back by the smart scan in storage cells. 0.04% of 11GB is about 4.5 MB. Thanks to the offloading, most of the filtering was done in the storage cells (in parallel), so the database layer did not end up spending much CPU on final processing (summing the lengths of columns specified in SELECT list together).

See the storage predicate offloading below:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |     1 |    33 |   405K  (1)| 00:52:52 |
|   1 |  SORT AGGREGATE                        |                   |     1 |    33 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL            | TEST_OBJECTS_100M |  7692K|   242M|   405K  (1)| 00:52:52 |
|*  3 |    TABLE ACCESS STORAGE FULL FIRST ROWS| TEST_USERS        |     1 |    12 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

   2 - storage("O"."OWNER"= (SELECT /*+ PUSH_SUBQ */ "U"."USERNAME" FROM "TEST_USERS" "U" WHERE
              "USER_ID"=13))
       filter("O"."OWNER"= (SELECT /*+ PUSH_SUBQ */ "U"."USERNAME" FROM "TEST_USERS" "U" WHERE
              "USER_ID"=13))
   3 - storage("USER_ID"=13)
       filter("USER_ID"=13)

So, somehow the whole complex predicate got offloaded to the storage cells! The catch here is that this is a simple scalar subquery in the WHERE clause – with an equality sign – and is not using a more complex IN / EXISTS construct. So it looks like the scalar subquery (#3) got executed first and its result value got sent to the storage cells, just like a regular constant predicate. In other words, it's not the subquery itself that got sent to the storage cells (that's impossible with the current architecture anyway) – it's the result of that subquery, executed in the DB layer, that got used in an offloaded predicate. In my case USER_ID = 13 resolved to the username "OUTLN" and the storage predicate on line #2 ended up being something like WHERE o.owner = 'OUTLN'.

And finally, let's check the FILTER-based (NO_PUSH_SUBQ) approach with smart scan enabled, to see what we lose if the subquery pushing doesn't kick in:

SELECT /*+ MONITOR NO_PUSH_SUBQ(@"SEL$2") OPT_PARAM('cell_offload_processing', 'true') test4b */
    SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner))
FROM
    test_objects_100m o
WHERE
    o.owner = (SELECT u.username FROM test_users u WHERE user_id = 13)
/

The query now takes over 8 seconds of CPU time in the database layer (despite some sampling inaccuracies in the Activity Detail, which comes from ASH data):

Global Stats
=========================================================================================
| Elapsed |   Cpu   |    IO    | Application | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs  | Bytes | Offload |
=========================================================================================
|      11 |    8.46 |     2.65 |        0.00 |     1 |     1M | 13679 |  11GB |  68.85% |
=========================================================================================

SQL Plan Monitoring Details (Plan Hash Value=3231668261)
=========================================================================================================================================================
| Id |                Operation                 |       Name        | Execs |   Rows   | Read  | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                          |                   |       | (Actual) | Reqs  | Bytes | Offload |   (%)    |        (# samples)        |
=========================================================================================================================================================
|  0 | SELECT STATEMENT                         |                   |     1 |        1 |       |       |         |          |                           |
|  1 |   SORT AGGREGATE                         |                   |     1 |        1 |       |       |         |          |                           |
|  2 |    FILTER                                |                   |     1 |    43280 |       |       |         |          |                           |
|  3 |     TABLE ACCESS STORAGE FULL            | TEST_OBJECTS_100M |     1 |     100M | 13679 |  11GB |  68.85% |   100.00 | Cpu (4)                   |
|    |                                          |                   |       |          |       |       |         |          | cell smart table scan (7) |
|  4 |     TABLE ACCESS STORAGE FULL FIRST ROWS | TEST_USERS        |     1 |        1 |       |       |         |          |                           |
=========================================================================================================================================================

See how the Cell Offload % has dropped too – as the Cell Offload Efficiency for line #3 is 68.85%, the smart scan returned about 11 GB * 31.15% = 3.4 GB of data this time! The difference is entirely because now we must send back (the requested columns of) all rows in the table. This is also visible in the Actual Rows column above: the full table scan at line #3 sends 100M rows to its parent (FILTER), which then throws most of them away. It's actually surprising that we don't see any CPU activity samples for the FILTER (as comparing and filtering 100M rows does take some CPU), but that's probably again ASH sampling luck. Run the query with a 10-100x bigger dataset and you should definitely see such ASH samples caught.

When we look into the predicate section of this inferior execution plan, we see no storage() predicate on the biggest table – actually there's no predicate whatsoever on line #3; the filtering happens later, in the parent FILTER step:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |     1 |    33 |   406K  (1)| 00:52:55 |
|   1 |  SORT AGGREGATE                        |                   |     1 |    33 |            |          |
|*  2 |   FILTER                               |                   |       |       |            |          |
|   3 |    TABLE ACCESS STORAGE FULL           | TEST_OBJECTS_100M |   100M|  3147M|   406K  (1)| 00:52:55 |
|*  4 |    TABLE ACCESS STORAGE FULL FIRST ROWS| TEST_USERS        |     1 |    12 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

   2 - filter("O"."OWNER"= (SELECT /*+ NO_PUSH_SUBQ */ "U"."USERNAME" FROM "TEST_USERS" "U" WHERE
              "USER_ID"=13))
   4 - storage("USER_ID"=13)
       filter("USER_ID"=13)

So, in conclusion – scalar subqueries in WHERE clauses do provide a little benefit (reduced CPU usage) in all Oracle databases, but on Exadata they may have a much bigger positive impact. Just make sure you do see the storage() predicate on the relevant plan lines scanning the big tables. And always keep in mind that the existence of a storage() predicate doesn't automatically mean that a smart scan did kick in for your query execution – always check the Exadata-specific execution metrics when running your query.
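One simple sanity check is to look at the smart scan related session counters right after running the query – a sketch only (exact statistic names may differ slightly between versions):

-- if smart scan really kicked in, these byte counters should be non-zero
SELECT sn.name, ms.value
FROM   v$mystat ms, v$statname sn
WHERE  ms.statistic# = sn.statistic#
AND    sn.name IN (
         'cell physical IO bytes eligible for predicate offload'
       , 'cell physical IO interconnect bytes returned by smart scan'
       );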

Oracle Performance & Troubleshooting Online Seminars in 2013


In case you haven’t noticed, I will be delivering my Advanced Oracle Troubleshooting and Advanced Oracle Exadata Performance: Troubleshooting and Optimization classes again in Oct/Nov 2013 (AOT) and December 2013 (Exadata).

I have stretched the Exadata class to 5 half-days, as 4 half-days wasn't nearly enough to deliver the amount of detail in the material (and I think it's still going to be a pretty intensive pace).

And that’s all for this year (I will write about conferences and other public appearances in a separate post).

 


I will be speaking at Oracle OpenWorld and Strata + HadoopWorld NY


I will be speaking at a few more conferences this year and thought to add some comments about my plans here too. Here’s the list of my upcoming presentations:

Oracle OpenWorld, 22-26 September 2013, San Francisco

  • Session: Moving Data Between Oracle Exadata and Hadoop. Fast.
  • When: Wednesday, Sep 25. 3:30pm
  • Where: Moscone South 305
  • What: I have been doing quite a lot of work on optimal Oracle/Exadata <-> Hadoop connectivity and data migration lately, so I thought it's worth sharing. The fat InfiniBand pipe between an Exadata box and a Hadoop cluster running on the Oracle Big Data Appliance gives pretty interesting results.

OakTableWorld, 23-24 September 2013, San Francisco

  • Session: Hacking Oracle Database
  • When: Tuesday, Sep 24. 1:00pm
  • Where: Imagination Lab @ Creativity Museum near Moscone
  • What: The "secret" OakTableWorld event (once called Oracle Closed World :) is an unofficial, fun (but also serious) satellite event during OOW. Lots of great technical speakers and topics. Might also have free beer :) I will deliver yet another hacking session without much structure – I'll just show how I research and explore Oracle's new low-level features and which tools & approaches I use, etc. Should be fun.

Strata Conference + HadoopWorld NY, 28-30 October 2013, New York City, NY

I have also submitted abstracts to RMOUG Training Days 2014 and will deliver the Training Day at Hotsos Symposium 2014!

So see you at any of these conferences!

Advanced Oracle Troubleshooting Guide – Part 11: Complex Wait Chain Signature Analysis with ash_wait_chains.sql


Here's a treat for the hard-core Oracle performance geeks out there – I'm releasing a cool, but still experimental, script for ASH (or poor man's ASH) based wait event analysis, which should add a whole new dimension to ASH based performance analysis. It doesn't replace any of the existing ASH analysis techniques, but should bring the relationships between Oracle sessions in complex wait chains out into bright daylight much more easily than before.

You all are familiar with the AWR/Statspack timed event summary below:

[Screenshot: AWR "Top Timed Events" section]
A similar breakdown can be obtained by just aggregating ASH samples by wait event:

SQL> @ash/dashtop session_state,event 1=1 "TIMESTAMP'2013-09-09 21:00:00'" "TIMESTAMP'2013-09-09 22:00:00'"

%This  SESSION EVENT                                                            TotalSeconds        CPU   User I/O Application Concurrency     Commit Configuration    Cluster       Idle    Network System I/O  Scheduler Administrative   Queueing      Other MIN(SAMPLE_TIME)                                                            MAX(SAMPLE_TIME)
------ ------- ---------------------------------------------------------------- ------------ ---------- ---------- ----------- ----------- ---------- ------------- ---------- ---------- ---------- ---------- ---------- -------------- ---------- ---------- --------------------------------------------------------------------------- ---------------------------------------------------------------------------
  68%  ON CPU                                                                          25610      25610          0           0           0          0             0          0          0          0          0          0              0          0          0 09-SEP-13 09.00.01.468 PM                                                   09-SEP-13 09.59.58.059 PM
  14%  WAITING SQL*Net more data from client                                            5380          0          0           0           0          0             0          0          0       5380          0          0              0          0          0 09-SEP-13 09.00.01.468 PM                                                   09-SEP-13 09.59.58.059 PM
   6%  WAITING enq: HW - contention                                                     2260          0          0           0           0          0          2260          0          0          0          0          0              0          0          0 09-SEP-13 09.04.41.893 PM                                                   09-SEP-13 09.56.07.626 PM
   3%  WAITING log file parallel write                                                  1090          0          0           0           0          0             0          0          0          0       1090          0              0          0          0 09-SEP-13 09.00.11.478 PM                                                   09-SEP-13 09.59.58.059 PM
   2%  WAITING db file parallel write                                                    730          0          0           0           0          0             0          0          0          0        730          0              0          0          0 09-SEP-13 09.01.11.568 PM                                                   09-SEP-13 09.59.48.049 PM
   2%  WAITING enq: TX - contention                                                      600          0          0           0           0          0             0          0          0          0          0          0              0          0        600 09-SEP-13 09.04.41.893 PM                                                   09-SEP-13 09.48.16.695 PM
   1%  WAITING buffer busy waits                                                         560          0          0           0         560          0             0          0          0          0          0          0              0          0          0 09-SEP-13 09.10.02.492 PM                                                   09-SEP-13 09.56.07.626 PM
   1%  WAITING log file switch completion                                                420          0          0           0           0          0           420          0          0          0          0          0              0          0          0 09-SEP-13 09.47.16.562 PM                                                   09-SEP-13 09.47.16.562 PM
   1%  WAITING latch: redo allocation                                                    330          0          0           0           0          0             0          0          0          0          0          0              0          0        330 09-SEP-13 09.04.41.893 PM                                                   09-SEP-13 09.53.27.307 PM
...

The above output has one shortcoming in a multiuser (database) system – not all wait events are simple ones, where a session waits for the OS to complete some self-contained operation (like an IO request). Often a session waits for another session (which holds some lock) or for some background process that needs to complete some task before our session can continue. That other session may itself be waiting for yet another session due to some other lock, or thanks to some buffer pin (buffer busy wait). The session holding the buffer pin may in turn be waiting for LGWR, who may in turn wait for DBWR etc… You get the point – sometimes we have a bunch of sessions waiting for each other in a chain.

The V$WAIT_CHAINS view introduced in Oracle 11g is capable of showing such chains of waiting sessions – however, it is designed to diagnose relatively long-lasting hangs, not performance problems and short (but non-trivial) contention. Usually the DIAG process, which is responsible for walking through the chains and populating V$WAIT_CHAINS, doesn't kick in until session waits have been ongoing for more than just a few seconds, so the V$WAIT_CHAINS view may be mostly empty – we need something different for performance analysis.

Now, it is possible to pick one of the waiting sessions and use the blocking_session column to look up the blocking SID – and then see what that session was doing, and so on. I used to do this somewhat manually, but realized that a simple CONNECT BY loop with some ASH sample iteration trickery can easily give us complex wait chain signature & hierarchy information.
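The core idea, greatly simplified, is a hierarchical query over the ASH rows of each sample, walking from the ultimate blockers down to their waiters – a minimal sketch only, not the actual ash_wait_chains.sql logic (which handles the grouping, percentages and more):

-- sketch: walk blocking_session chains within each ASH sample (last hour)
SELECT LEVEL               AS chain_depth
     , session_id
     , blocking_session
     , NVL(event, 'ON CPU') AS event
FROM   v$active_session_history
WHERE  sample_time > SYSDATE - 1/24
START WITH blocking_session IS NULL
CONNECT BY PRIOR session_id = blocking_session
       AND PRIOR sample_id  = sample_id;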

So, I am presenting to you my new ash_wait_chains.sql script! (Woo-hoo!) It's actually completely experimental right now – I'm not fully sure whether it returns correct results and some of the planned syntax isn't implemented yet :) EXPERIMENTAL, not even beta :) And no RAC or DBA_HIST ASH support yet.

In the example below, I am running the script with the parameter event2, which is really just a pseudocolumn on the ASH view (look in the script). So it shows just the wait events of the sessions involved in a chain. You can use any ASH column there for extra information, like program, module, sql_opname etc:

SQL> @ash/ash_wait_chains event2 1=1 sysdate-1 sysdate

-- ASH Wait Chain Signature script v0.1 EXPERIMENTAL by Tanel Poder ( http://blog.tanelpoder.com )

%This     SECONDS WAIT_CHAIN
------ ---------- ------------------------------------------------------------------------------------------
  60%       77995 -> ON CPU
  10%       12642 -> cell single block physical read
   9%       11515 -> SQL*Net more data from client
   2%        3081 -> log file parallel write
   2%        3073 -> enq: HW - contention  -> cell smart file creation
   2%        2723 -> enq: HW - contention  -> buffer busy waits  -> cell smart file creation
   2%        2098 -> cell smart table scan
   1%        1817 -> db file parallel write
   1%        1375 -> latch: redo allocation  -> ON CPU
   1%        1023 -> enq: TX - contention  -> buffer busy waits  -> cell smart file creation
   1%         868 -> block change tracking buffer space
   1%         780 -> enq: TX - contention  -> buffer busy waits  -> ASM file metadata operation
   1%         773 -> latch: redo allocation
   1%         698 -> enq: TX - contention  -> buffer busy waits  -> DFS lock handle
   0%         529 -> enq: TX - contention  -> buffer busy waits  -> control file parallel write
   0%         418 -> enq: HW - contention  -> buffer busy waits  -> DFS lock handle

What does the above output tell us? I have highlighted 3 rows above; let's say we want to get more insight into the enq: HW – contention wait event. The ash_wait_chains script breaks the wait events down by the "complex" wait chain signature (who's waiting for whom), instead of just the single wait event name! You read the wait chain information from left to right. For example, 2% of the total response time (3073 seconds) in the analysed ASH dataset was spent by a session that was waiting for a HW enqueue, which was held by another session, which itself happened to be waiting on the cell smart file creation wait event. The rightmost wait event is the "ultimate blocker". Unfortunately ASH doesn't capture idle sessions by default (an idle session may still be holding a lock and blocking others), so in the current version (0.1) of this script the idle blockers are missing from the output. You can still manually detect a missing / idle session, as we'll see later on.

So, this script allows you to break down the ASH wait data not just by a single, “scalar” wait event, but all wait events (and other attributes) of a whole chain of waiting sessions – revealing hierarchy and dependency information in your bottleneck! Especially when looking into the chains involving various background processes, we gain interesting insight into the process flow in an Oracle instance.

As you can add any ASH column to the output (check the script, it's not too long), let's add the program info too, so we know what kind of sessions were involved in the complex waits. I have removed a bunch of less interesting lines from the output and have highlighted some interesting ones (again, read the chain of waiters from left to right):

SQL> @ash/ash_wait_chains program2||event2 1=1 sysdate-1 sysdate

-- ASH Wait Chain Signature script v0.1 EXPERIMENTAL by Tanel Poder ( http://blog.tanelpoder.com )

%This     SECONDS WAIT_CHAIN
------ ---------- -----------------------------------------------------------------------------------------------------------------------------------------------
  56%       73427 -> (JDBC Thin Client) ON CPU
   9%       11513 -> (JDBC Thin Client) SQL*Net more data from client
   9%       11402 -> (oracle@enkxndbnn.enkitec.com (Jnnn)) cell single block physical read
   2%        3081 -> (LGWR) log file parallel write
   2%        3073 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) cell smart file creation
   2%        2300 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) buffer busy waits  -> (JDBC Thin Client) cell smart file creation
...
   1%        1356 -> (JDBC Thin Client) latch: redo allocation  -> (JDBC Thin Client) ON CPU
   1%        1199 -> (JDBC Thin Client) cell single block physical read
   1%        1023 -> (JDBC Thin Client) enq: TX - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) cell smart file creation
   1%         881 -> (CTWR) ON CPU
   1%         858 -> (JDBC Thin Client) block change tracking buffer space
   1%         780 -> (JDBC Thin Client) enq: TX - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) ASM file metadata operation
   1%         766 -> (JDBC Thin Client) latch: redo allocation
   1%         698 -> (JDBC Thin Client) enq: TX - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) DFS lock handle
   0%         529 -> (JDBC Thin Client) enq: TX - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) control file parallel write
   0%         423 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) cell smart file creation
   0%         418 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) ASM file metadata operation
   0%         418 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) DFS lock handle
...
   0%          25 -> (JDBC Thin Client) gcs drm freeze in enter server mode  -> (LMON) ges lms sync during dynamic remastering and reconfig  -> (LMSn) ON CPU
   0%          25 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) DFS lock handle  -> (DBWn) db file parallel write
...
   0%          18 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) CSS operation: action
   0%          18 -> (LMON) control file sequential read
   0%          17 -> (LMSn) ON CPU
   0%          17 -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) cell smart file creation
   0%          16 -> (JDBC Thin Client) enq: FB - contention  -> (JDBC Thin Client) gcs drm freeze in enter server mode  -> (LMON) ges lms sync during dynamic remastering and reconfig  -> (LMSn) ON CPU
...
   0%           3 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) buffer busy waits  -> (JDBC Thin Client) CSS operation: action
   0%           3 -> (JDBC Thin Client) buffer busy waits  -> (JDBC Thin Client) latch: redo allocation  -> (JDBC Thin Client) ON CPU
...
   0%           1 -> (JDBC Thin Client) enq: TX - contention  -> (JDBC Thin Client) enq: TX - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) ASM file metadata operation
   0%           1 -> (Wnnn) buffer busy waits  -> (JDBC Thin Client) block change tracking buffer space
...

In a different example, when adding the sql_opname column (11.2+) we can also display the type of SQL command that ended up waiting (like UPDATE, DELETE, LOCK etc) which is useful for lock contention analysis:

SQL> @ash/ash_wait_chains sql_opname||':'||event2 1=1 sysdate-1/24/60 sysdate

-- Display ASH Wait Chain Signatures script v0.1 EXPERIMENTAL by Tanel Poder ( http://blog.tanelpoder.com )

%This     SECONDS WAIT_CHAIN
------ ---------- ------------------------------------------------------------------------------------------------------------------------------------------------------
  18%          45 -> LOCK TABLE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention
  18%          45 -> LOCK TABLE:enq: TM - contention
  17%          42 -> DELETE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention
  10%          25 -> SELECT:direct path read
   6%          14 -> TRUNCATE TABLE:enq: TM - contention
   5%          13 -> SELECT:ON CPU
   2%           5 -> LOCK TABLE:enq: TM - contention  -> SELECT:ON CPU
   2%           5 -> TRUNCATE TABLE:enq: TM - contention  -> SELECT:ON CPU
   2%           5 -> SELECT:db file scattered read
   2%           5 -> DELETE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention  -> SELECT:db file scattered read
   2%           5 -> DELETE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention  -> SELECT:ON CPU
   2%           5 -> LOCK TABLE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention  -> SELECT:db file scattered read
   2%           5 -> TRUNCATE TABLE:enq: TM - contention  -> SELECT:db file scattered read
   2%           5 -> LOCK TABLE:enq: TM - contention  -> SELECT:db file scattered read
   2%           5 -> LOCK TABLE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention  -> SELECT:ON CPU
   2%           4 -> TRUNCATE TABLE:enq: TM - contention  -> SELECT:direct path read
   2%           4 -> LOCK TABLE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention  -> SELECT:direct path read
   2%           4 -> DELETE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention  -> SELECT:direct path read
   2%           4 -> LOCK TABLE:enq: TM - contention  -> SELECT:direct path read

This is an experimental script and has some issues and shortcomings right now. For example, if the final blocking session itself was idle during the sampling, like the blue highlighted line above, then the blocking session is not shown there – as ASH doesn’t capture idle session samples. Also it doesn’t work for RAC right now (and this might be problematic as in RAC the blocking_session info for global wait events may not immediately be resolved either, unlike in the instance-local cases). And I will try to find better example cases illustrating the use of this technique (you’re welcome to send me your output from busy databases for analysis :)

Note that I will be doing my Advanced Oracle Troubleshooting class again in Oct/Nov 2013, so now you know where to learn more cool (and even more useful) Oracle troubleshooting stuff ;-)


Why doesn’t ALTER SYSTEM SET EVENTS set the events or tracing immediately?


I received a question about ALTER SYSTEM in the comments section of another blog post recently.

Basically the question was that while ALTER SESSION SET EVENTS ’10046 … ‘ enabled the SQL Trace for the current session immediately, ALTER SYSTEM on the other hand didn’t seem to do anything at all for other sessions in the instance.

There’s an important difference in the behavior of ALTER SYSTEM when changing paramters vs. setting events.

For example, ALTER SYSTEM SET optimizer_mode = CHOOSE would change the value of this parameter immediately, for:

  1. Your own session
  2. All new sessions that will log in will pick up the new parameter value
  3. All other existing sessions

However, when you issue an ALTER SYSTEM SET EVENTS '10046 TRACE NAME CONTEXT FOREVER, LEVEL 12', only changes #1 and #2 will happen:

  1. Your own session
  2. All new sessions that will log in will pick up the new event settings

This means that the existing, already logged in sessions, will not pick up any of the events set via ALTER SYSTEM!

This hopefully explains why sometimes the debug events don’t seem to work. But more importantly, this also means that when you disable an event (by setting it to “OFF” or to level 0) with ALTER SYSTEM, it does not affect the existing sessions who have this event enabled! So, you think you’re turning the tracing off for all sessions and go home, but really some sessions keep on tracing – until the filesystem is full (and you’ll get a phone call at 3am).

So, to be safe, you should use DBMS_MONITOR for your SQL tracing needs – it doesn't have the above-mentioned problems. For other events you should use DBMS_SYSTEM.SET_EV/READ_EV (or ORADEBUG EVENT/SESSION_EVENT and ORADEBUG EVENTS/EVENTDUMP) together with ALTER SYSTEM, to make sure you actually enable/disable the events for all existing sessions too. Or better yet, stay away from undocumented events ;-)
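
For example, something along these lines (the SID, serial# and OS PID values here are made up for illustration):

-- enable extended SQL trace for one existing session via DBMS_MONITOR
SQL> EXEC DBMS_MONITOR.session_trace_enable(session_id => 123, serial_num => 4567, waits => TRUE, binds => TRUE);

-- ... and make sure to disable it when done
SQL> EXEC DBMS_MONITOR.session_trace_disable(session_id => 123, serial_num => 4567);

-- or attach to an existing process with ORADEBUG and set/clear an event there
SQL> ORADEBUG SETOSPID 12345
SQL> ORADEBUG EVENT 10046 TRACE NAME CONTEXT FOREVER, LEVEL 12
SQL> ORADEBUG EVENT 10046 TRACE NAME CONTEXT OFF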

If you wonder what/where the "system event array" is – it's just a memory structure in the shared pool. It doesn't seem to be explicitly visible in V$SGASTAT in Oracle 10g, but in 11.2.0.3 you get this:

No system-wide events set:

SQL> @sgastat event

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  DBWR event stats array            216
shared pool  KSQ event description            8460
shared pool  Wait event pointers               192
shared pool  dbgdInitEventGrp: eventGr         136
shared pool  event classes                    1552
shared pool  event descriptor table          32360
shared pool  event list array to post           36
shared pool  event list to post commit         108
shared pool  event statistics per sess     2840096
shared pool  event statistics ptr arra         992
shared pool  event-class map                  4608
shared pool  ksws service events             57260
shared pool  latch wait-event table           2212
shared pool  standby event stats              1216
shared pool  sys event stats                539136
shared pool  sys event stats for Other       32256
shared pool  trace events array              72000

17 rows selected.
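
(For reference, the @sgastat script used above is essentially just a filter on V$SGASTAT – a minimal equivalent, in case you don't have the script handy, would be:)

SELECT pool, name, bytes
FROM   v$sgastat
WHERE  LOWER(name) LIKE '%event%'
ORDER BY pool, name;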

Let’s set a system-wide event:

SQL> ALTER SYSTEM SET events = '942 TRACE NAME ERRORSTACK LEVEL 3'; 

System altered.

And check V$SGASTAT again:

SQL> @sgastat event

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  DBWR event stats array            216
shared pool  KSQ event description            8460
shared pool  Wait event pointers               192
shared pool  dbgdInitEventG                   4740
shared pool  dbgdInitEventGrp: eventGr         340
shared pool  dbgdInitEventGrp: subHeap          80
shared pool  event classes                    1552
shared pool  event descriptor table          32360
shared pool  event list array to post           36
shared pool  event list to post commit         108
shared pool  event statistics per sess     2840096
shared pool  event statistics ptr arra         992
shared pool  event-class map                  4608
shared pool  ksws service events             57260
shared pool  latch wait-event table           2212
shared pool  standby event stats              1216
shared pool  sys event stats                539136
shared pool  sys event stats for Other       32256
shared pool  trace events array              72000

19 rows selected.

So, the "system event array" lives in the shared pool, as a few memory allocations with names like "dbgdInitEventG%". Note that this naming was different in 10g – the dbgd module showed up in Oracle 11g, when Oracle re-engineered the whole diagnostics event infrastructure, making it much more powerful, for example allowing you to enable dumps and traces only for a specific SQL_ID.
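
For example, with the 11g+ diagnostic event syntax you can do something like this (the SQL_ID below is just a placeholder):

SQL> ALTER SESSION SET EVENTS 'sql_trace [sql: 7q729nhdgtsqq] bind=true, wait=true';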

Ok, that’s enough for today – I’ll just remind you that this year’s last Advanced Oracle Troubleshooting online seminar starts in 2 weeks (and I will be talking more about event setting there ;-)

Enkitec Rocks Again!!!


Yes, the title of this post is also a reference to the Machete Kills Again (…in Space) movie that I'm sure I'll check out once it's out ;-)

Enkitec won the UKOUG Engineered Systems Partner of the Year award last Thursday and I’m really proud of that. We got seriously started in Europe a bit over 2 years ago, have put in a lot of effort and by now it’s paying off. In addition to happy customers, interesting projects and now this award, we have awesome people in 4 countries in Europe already – and more to come. Also, you did know that Frits Hoogland and Martin Bach have joined us in Europe, right? I actually learn stuff from these guys! :-) Expect more awesome announcements soon! ;-)

Thank you, UKOUG team!

Also, if you didn’t follow all the Oracle OpenWorld action, we also got two major awards at OOW this year! (least year “only” one ;-)). We received two 2013 Oracle Excellence Awards for Specialized Partner of the Year (North America) for our work in both Financial Services and Energy/Utilities industries.

Thanks, Oracle folks! :)

Check out this Oracle Partner Network’s video interview with Martin Paynter where he explains where we come from and how we get things done:

Oh, by the way, Enkitec is hiring ;-)

SGA bigger than the amount of HugePages configured (Linux – 11.2.0.3)


I just learned something new yesterday when demoing large page use on Linux during my AOT seminar.

I had 512 x 2MB hugepages configured in Linux (1024 MB). So I set USE_LARGE_PAGES = TRUE (it is the default anyway in 11.2.0.2+). This allows the use of large pages but doesn't force it – the ONLY option would force the use of hugepages, otherwise the instance wouldn't start up. Anyway, the previous behavior with hugepages was that if Oracle was not able to allocate the entire SGA from the hugepages area, it would silently allocate the entire SGA from small pages – it was all or nothing. But to my surprise, when I set my SGA_MAX_SIZE bigger than the amount of allocated hugepages in my testing, the instance started up and the hugepages got allocated too!

It’s just that the remaining part was allocated as small pages, as mentioned in the alert log entry below (and in the latest documentation too – see the link above):

Thu Oct 24 20:58:47 2013
ALTER SYSTEM SET sga_max_size='1200M' SCOPE=SPFILE;
Thu Oct 24 20:58:54 2013
Shutting down instance (abort)
License high water mark = 19
USER (ospid: 18166): terminating the instance
Instance terminated by USER, pid = 18166
Thu Oct 24 20:58:55 2013
Instance shutdown complete
Thu Oct 24 20:59:52 2013
Adjusting the default value of parameter parallel_max_servers
from 160 to 135 due to the value of parameter processes (150)
Starting ORACLE instance (normal)
****************** Large Pages Information *****************

Total Shared Global Region in Large Pages = 1024 MB (85%)

Large Pages used by this instance: 512 (1024 MB)
Large Pages unused system wide = 0 (0 KB) (alloc incr 16 MB)
Large Pages configured system wide = 512 (1024 MB)
Large Page size = 2048 KB

RECOMMENDATION:
  Total Shared Global Region size is 1202 MB. For optimal performance,
  prior to the next instance restart increase the number
  of unused Large Pages by atleast 89 2048 KB Large Pages (178 MB)
  system wide to get 100% of the Shared
  Global Region allocated with Large pages
***********************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0

The ipcs -m command confirmed this (multiple separate shared memory segments had been created).

Note that despite what the documentation says, there's a 4th option called AUTO for the USE_LARGE_PAGES parameter too (in 11.2.0.3+ I think), which can ask the OS to increase the number of hugepages when the instance starts up – but I would always try to pre-allocate the right number of hugepages from the start (ideally right after the OS reboot – via sysctl.conf), to avoid potential kernel CPU usage spikes due to the search and defragmentation effort for "building" large consecutive pages.
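
A minimal sketch of the pre-allocation – the page count below is just an example, size it to cover your entire SGA:

# /etc/sysctl.conf - reserve 600 x 2MB hugepages at boot (example value only)
vm.nr_hugepages = 600

# apply the setting (may only partially succeed on a fragmented, long-running system)
# and verify what actually got allocated:
sysctl -p
grep HugePages /proc/meminfo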

Martin Bach has written about the AUTO option here.

Diagnosing buffer busy waits with the ash_wait_chains.sql script (v0.2)


In my previous post ( Advanced Oracle Troubleshooting Guide – Part 11: Complex Wait Chain Signature Analysis with ash_wait_chains.sql ) I introduced an experimental script for analysing performance from the "top ASH wait chains" perspective. The early version (0.1) of the script didn't have the ability to select a specific user (session), SQL or module/action performance data for analysis – I just hadn't figured out how to write the SQL for this (as the blocking sessions could come from any user). It turns out it was just a matter of using "START WITH <condition>" in the CONNECT BY loop. Now parameter 2 of this script (whose activity to measure) actually works.

Download the latest version of:

  1. ash_wait_chains.sql (ASH wait chain analysis with the V$ACTIVE_SESSION_HISTORY data)
  2. dash_wait_chains.sql (ASH wait chain analysis with the DBA_HIST_ACTIVE_SESS_HISTORY data)

So, now parameter #2 is actually used. For example, the username='TANEL' syntax below means that we will list only Tanel's sessions as top-level waiters in the report – but of course, as Tanel's session may be blocked by any other user's session, this where clause doesn't restrict displaying any of the blockers, regardless of their username:

@ash_wait_chains program2||event2 username='TANEL' sysdate-1/24 sysdate

I have added one more improvement, which you’ll see in a moment. So here’s a problem case. I was performance testing parallel loading of a 1TB “table” from Hadoop to Oracle (on Exadata). I was using external tables (with the Hadoop SQL Connector) and here’s the SQL Monitor report’s activity chart:

[Image: SQL Monitor activity chart of the data load – a large share of the activity is buffer busy waits]

A large part of the response time was spent waiting for buffer busy waits! So, normally the next step here would be to check:

  1. The type (and location) of the block involved in the contention – and also whether there's a single very "hot" block involved or many different "warm" blocks that add up. Note that I didn't say "block causing contention" here, as a block is just a data structure – it doesn't cause contention, it's the sessions that lock this block that do.
  2. Who’s the session holding this lock (pin) on the buffer – and is there a single blocking session causing all this or many different sessions that add up to the problem.
  3. What are the blocking sessions themselves doing (e.g. are they stuck waiting for something else themselves?)

Let’s use the “traditional” approach first. As I know the SQL ID and this SQL runtime (from the SQL Monitoring report for example), I can query ASH records with my ashtop.sql script (warning, wide output):

SQL> @ash/ashtop session_state,event sql_id='3rtbs9vqukc71' "timestamp'2013-10-05 01:00:00'" "timestamp'2013-10-05 03:00:00'"

%This  SESSION EVENT                                                            TotalSeconds        CPU   User I/O Application Concurrency     Commit Configuration    Cluster       Idle    Network System I/O  Scheduler Administrative   Queueing      Other MIN(SAMPLE_TIME)                                                            MAX(SAMPLE_TIME)
------ ------- ---------------------------------------------------------------- ------------ ---------- ---------- ----------- ----------- ---------- ------------- ---------- ---------- ---------- ---------- ---------- -------------- ---------- ---------- --------------------------------------------------------------------------- ---------------------------------------------------------------------------
  57%  WAITING buffer busy waits                                                       71962          0          0           0       71962          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.09.923 AM                                                   05-OCT-13 02.45.54.106 AM
  35%  ON CPU                                                                          43735      43735          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.34.55.903 AM                                                   05-OCT-13 02.47.28.232 AM
   6%  WAITING direct path write                                                        6959          0       6959           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.07.923 AM                                                   05-OCT-13 02.47.21.232 AM
   1%  WAITING external table read                                                      1756          0       1756           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.02.913 AM                                                   05-OCT-13 02.47.15.222 AM
   0%  WAITING local write wait                                                          350          0        350           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 02.02.40.034 AM                                                   05-OCT-13 02.46.59.202 AM
   0%  WAITING control file parallel write                                               231          0          0           0           0          0             0          0          0          0        231          0              0          0          0 05-OCT-13 01.35.22.953 AM                                                   05-OCT-13 02.47.15.222 AM
   0%  WAITING cell smart file creation                                                  228          0        228           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.09.923 AM                                                   05-OCT-13 02.47.26.232 AM
   0%  WAITING DFS lock handle                                                           194          0          0           0           0          0             0          0          0          0          0          0              0          0        194 05-OCT-13 01.35.15.933 AM                                                   05-OCT-13 02.47.14.222 AM
   0%  WAITING cell single block physical read                                           146          0        146           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.12.933 AM                                                   05-OCT-13 02.47.09.212 AM
   0%  WAITING control file sequential read                                               63          0          0           0           0          0             0          0          0          0         63          0              0          0          0 05-OCT-13 01.35.17.953 AM                                                   05-OCT-13 02.46.56.192 AM
   0%  WAITING change tracking file synchronous read                                      57          0          0           0           0          0             0          0          0          0          0          0              0          0         57 05-OCT-13 01.35.26.963 AM                                                   05-OCT-13 02.40.32.677 AM
   0%  WAITING db file single write                                                       48          0         48           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.38.21.317 AM                                                   05-OCT-13 02.41.55.794 AM
   0%  WAITING gc current grant 2-way                                                     19          0          0           0           0          0             0         19          0          0          0          0              0          0          0 05-OCT-13 01.35.06.923 AM                                                   05-OCT-13 02.45.46.096 AM
   0%  WAITING kfk: async disk IO                                                         13          0          0           0           0          0             0          0          0          0         13          0              0          0          0 05-OCT-13 01.42.34.791 AM                                                   05-OCT-13 02.38.19.485 AM
   0%  WAITING resmgr:cpu quantum                                                          9          0          0           0           0          0             0          0          0          0          0          9              0          0          0 05-OCT-13 01.36.09.085 AM                                                   05-OCT-13 01.59.08.635 AM
   0%  WAITING enq: CR - block range reuse ckpt                                            7          0          0           0           0          0             0          0          0          0          0          0              0          0          7 05-OCT-13 02.12.42.069 AM                                                   05-OCT-13 02.40.46.687 AM
   0%  WAITING latch: redo allocation                                                      3          0          0           0           0          0             0          0          0          0          0          0              0          0          3 05-OCT-13 02.10.01.807 AM                                                   05-OCT-13 02.10.01.807 AM
   0%  WAITING Disk file operations I/O                                                    2          0          2           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.41.13.639 AM                                                   05-OCT-13 01.43.50.951 AM
   0%  WAITING enq: XL - fault extent map                                                  2          0          0           0           0          0             0          0          0          0          0          0              0          0          2 05-OCT-13 01.35.34.983 AM                                                   05-OCT-13 01.35.34.983 AM
   0%  WAITING external table open                                                         2          0          2           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.02.913 AM                                                   05-OCT-13 01.35.02.913 AM

57% of the DB time was spent waiting for buffer busy waits. So let’s check the P2/P3 to see which block# (in a datafile) and block class# we are dealing with:

SQL> @ash/ashtop session_state,event,p2text,p2,p3text,p3 sql_id='3rtbs9vqukc71' "timestamp'2013-10-05 01:00:00'" "timestamp'2013-10-05 03:00:00'"

%This  SESSION EVENT                                                            P2TEXT                                 P2 P3TEXT                                 P3 TotalSeconds        CPU   User I/O Application Concurrency     Commit Configuration    Cluster       Idle    Network System I/O  Scheduler Administrative   Queueing      Other MIN(SAMPLE_TIME)                                                            MAX(SAMPLE_TIME)
------ ------- ---------------------------------------------------------------- ------------------------------ ---------- ------------------------------ ---------- ------------ ---------- ---------- ----------- ----------- ---------- ------------- ---------- ---------- ---------- ---------- ---------- -------------- ---------- ---------- --------------------------------------------------------------------------- ---------------------------------------------------------------------------
  57%  WAITING buffer busy waits                                                block#                                  2 class#                                 13        71962          0          0           0       71962          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.09.923 AM                                                   05-OCT-13 02.45.54.106 AM
  31%  ON CPU                                                                   file#                                   0 size                               524288        38495      38495          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.05.923 AM                                                   05-OCT-13 02.47.25.232 AM
   1%  WAITING external table read                                              file#                                   0 size                               524288         1756          0       1756           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.02.913 AM                                                   05-OCT-13 02.47.15.222 AM
   1%  ON CPU                                                                   block#                                  2 class#                                 13          945        945          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.16.943 AM                                                   05-OCT-13 02.45.10.056 AM
   0%  ON CPU                                                                   consumer group id                   12573                                         0          353        353          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.34.56.903 AM                                                   05-OCT-13 01.59.59.739 AM
   0%  WAITING cell smart file creation                                                                                 0                                         0          228          0        228           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.09.923 AM                                                   05-OCT-13 02.47.26.232 AM
   0%  WAITING DFS lock handle                                                  id1                                     3 id2                                     2          193          0          0           0           0          0             0          0          0          0          0          0              0          0        193 05-OCT-13 01.35.15.933 AM                                                   05-OCT-13 02.47.14.222 AM
   0%  ON CPU                                                                   file#                                  41 size                                   41          118        118          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.34.56.903 AM                                                   05-OCT-13 01.35.02.913 AM
   0%  WAITING cell single block physical read                                  diskhash#                      4004695794 bytes                                8192           85          0         85           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.12.933 AM                                                   05-OCT-13 02.47.09.212 AM
   0%  WAITING control file parallel write                                      block#                                  1 requests                                2           81          0          0           0           0          0             0          0          0          0         81          0              0          0          0 05-OCT-13 01.35.22.953 AM                                                   05-OCT-13 02.41.54.794 AM
   0%  WAITING control file parallel write                                      block#                                 41 requests                                2           74          0          0           0           0          0             0          0          0          0         74          0              0          0          0 05-OCT-13 01.35.31.983 AM                                                   05-OCT-13 02.47.15.222 AM
   0%  WAITING change tracking file synchronous read                            blocks                                  1                                         0           57          0          0           0           0          0             0          0          0          0          0          0              0          0         57 05-OCT-13 01.35.26.963 AM                                                   05-OCT-13 02.40.32.677 AM
   0%  WAITING control file parallel write                                      block#                                 42 requests                                2           51          0          0           0           0          0             0          0          0          0         51          0              0          0          0 05-OCT-13 01.35.23.953 AM                                                   05-OCT-13 02.47.10.212 AM
   0%  WAITING db file single write                                             block#                                  1 blocks                                  1           48          0         48           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.38.21.317 AM                                                   05-OCT-13 02.41.55.794 AM
   0%  ON CPU                                                                                                           0                                         0           31         31          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.19.953 AM                                                   05-OCT-13 02.44.32.006 AM
   0%  WAITING control file parallel write                                      block#                                 39 requests                                2           21          0          0           0           0          0             0          0          0          0         21          0              0          0          0 05-OCT-13 01.36.35.125 AM                                                   05-OCT-13 02.39.30.575 AM
   0%  WAITING control file sequential read                                     block#                                  1 blocks                                  1           20          0          0           0           0          0             0          0          0          0         20          0              0          0          0 05-OCT-13 01.35.17.953 AM                                                   05-OCT-13 02.46.56.192 AM
   0%  ON CPU                                                                   locn                                    0                                         0           19         19          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.34.983 AM                                                   05-OCT-13 02.30.34.786 AM
   0%  ON CPU                                                                   fileno                                  0 filetype                                2           16         16          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.36.08.075 AM                                                   05-OCT-13 02.44.26.996 AM
   0%  WAITING kfk: async disk IO                                               intr                                    0 timeout                        4294967295           13          0          0           0           0          0             0          0          0          0         13          0              0          0          0 05-OCT-13 01.42.34.791 AM                                                   05-OCT-13 02.38.19.485 AM

Buffer busy waits on block #2 of a datafile? Seems familiar … But instead of guessing or dumping the block (to see what type it is) we can just check what the block class# 13 is:

SQL> @bclass 13

CLASS              UNDO_SEGMENT_ID
------------------ ---------------
file header block
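
(The bclass script is essentially just a positional lookup against V$WAITSTAT, where the block classes happen to be listed in class# order – a rough sketch below; the real script also maps higher class numbers to undo segments:)

SELECT class
FROM   (SELECT class, ROWNUM r FROM v$waitstat)
WHERE  r = 13;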

So, block #2 is something called a "file header block". Don't confuse it with block #1, which is the real datafile header block (as we know it from the concepts guide) – block #2 is actually the LMT tablespace space management bitmap header block. From the P1 value I saw that the file# was 16, so I dumped the block with the ALTER SYSTEM DUMP DATAFILE 16 BLOCK 2 command:

Start dump data blocks tsn: 21 file#:16 minblk 2 maxblk 2
Block dump from cache:
Dump of buffer cache at level 4 for tsn=21 rdba=2
BH (0x1e1f25718) file#: 16 rdba: 0x00000002 (1024/2) class: 13 ba: 0x1e1876000
  set: 71 pool: 3 bsz: 8192 bsi: 0 sflg: 1 pwc: 2,22
  dbwrid: 2 obj: -1 objn: 1 tsn: 21 afn: 16 hint: f
  hash: [0x24f5a3008,0x24f5a3008] lru: [0x16bf3c188,0x163eee488]
  ckptq: [NULL] fileq: [NULL] objq: [0x173f10230,0x175ece830] objaq: [0x22ee19ba8,0x16df2f2c0]
  st: SCURRENT md: NULL fpin: 'ktfbwh00: ktfbhfmt' tch: 162 le: 0xcefb4af8
  flags: foreground_waiting block_written_once redo_since_read
  LRBA: [0x0.0.0] LSCN: [0x0.0] HSCN: [0xffff.ffffffff] HSUB: [2]
Block dump from disk:
buffer tsn: 21 rdba: 0x00000002 (1024/2)
scn: 0x0001.13988a3a seq: 0x02 flg: 0x04 tail: 0x8a3a1d02
frmt: 0x02 chkval: 0xee04 type: 0x1d=KTFB Bitmapped File Space Header
Hex dump of block: st=0, typ_found=1
Dump of memory from 0x00007F6708C71800 to 0x00007F6708C73800
7F6708C71800 0000A21D 00000002 13988A3A 04020001  [........:.......]
7F6708C71810 0000EE04 00000400 00000008 08CEEE00  [................]

So, the block number 2 is actually the first LMT space bitmap block. Every time you need to manage LMT space (allocate extents, search for space in the file or release extents), you will need to pin the block #2 of the datafile – and everyone else who tries to do the same has to wait.

Now the next question should be – who’s blocking us – who’s holding this block pinned so much then? And what is the blocker itself doing? This is where the ASH wait chains script comes into play – instead of me manually looking up the “blocking_session” column value, I’m just using a CONNECT BY to look up what the blocker itself was doing:

SQL> @ash/ash_wait_chains event2 sql_id='3rtbs9vqukc71' "timestamp'2013-10-05 01:00:00'" "timestamp'2013-10-05 03:00:00'"

-- Display ASH Wait Chain Signatures script v0.2 BETA by Tanel Poder ( http://blog.tanelpoder.com )

%This     SECONDS WAIT_CHAIN
------ ---------- ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  35%       43735 -> ON CPU
  15%       18531 -> buffer busy waits [file header block]  -> ON CPU
   7%        9266 -> buffer busy waits [file header block]  -> control file parallel write
   7%        8261 -> buffer busy waits [file header block]  -> cell smart file creation
   6%        6959 -> direct path write
   5%        6707 -> buffer busy waits [file header block]  -> DFS lock handle
   4%        4658 -> buffer busy waits [file header block]  -> local write wait
   4%        4610 -> buffer busy waits [file header block]  -> cell single block physical read
   3%        4282 -> buffer busy waits [file header block]  -> local write wait  -> db file parallel write
   2%        2801 -> buffer busy waits [file header block]
   2%        2676 -> buffer busy waits [file header block]  -> ASM file metadata operation
   2%        2092 -> buffer busy waits [file header block]  -> change tracking file synchronous read
   2%        2050 -> buffer busy waits [file header block]  -> control file sequential read

If you follow the arrows, 15% of the response time of this SQL was spent waiting for buffer busy waits [on the LMT space header block], while the blocker (we don't know who yet) was itself on CPU – doing something. Another 7% + 7% was spent on buffer busy waits while the blocker itself was waiting for either "control file parallel write" or "cell smart file creation" – the offloaded datafile extension wait event.

Let’s see who blocked us (which username and which program):

SQL> @ash/ash_wait_chains username||':'||program2||event2 sql_id='3rtbs9vqukc71' "timestamp'2013-10-05 01:00:00'" "timestamp'2013-10-05 03:00:00'"

-- Display ASH Wait Chain Signatures script v0.2 BETA by Tanel Poder ( http://blog.tanelpoder.com )

%This     SECONDS WAIT_CHAIN
------ ---------- ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  35%       43732 -> TANEL:(Pnnn) ON CPU
  13%       16908 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) ON CPU
   6%        6959 -> TANEL:(Pnnn) direct path write
   4%        4838 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> SYS:(Wnnn) control file parallel write
   4%        4428 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) control file parallel write
   3%        4166 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) cell smart file creation
   3%        4095 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> SYS:(Wnnn) cell smart file creation
   3%        3607 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> SYS:(Wnnn) DFS lock handle
   3%        3147 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) local write wait
   2%        3117 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) local write wait  -> SYS:(DBWn) db file parallel write
   2%        3100 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) DFS lock handle
   2%        2801 -> TANEL:(Pnnn) buffer busy waits [file header block]
   2%        2764 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) cell single block physical read
   2%        2676 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> SYS:(Wnnn) ASM file metadata operation
   1%        1825 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> SYS:(Wnnn) cell single block physical read

Now we also see who was blocking us. In the line with 13% of response time, we were blocked by another TANEL user running a PX slave (I'm replacing digits with "n" in the background process names, like P000, to collapse them all into a single line). In line #4 above, with 4% of the wait time, TANEL's PX slave was blocked by SYS running a Wnnn process – one of the workers for space management tasks like asynchronous datafile pre-extension. So, it looks like the datafile extension bottleneck has a role to play in this! I could pre-extend the datafile in advance myself (it's a bigfile tablespace) when anticipating a big data load that has to happen fast – but the first thing I would do is check whether the tablespace is an LMT tablespace with a UNIFORM extent sizing policy and whether the minimum extent size is big enough for my data load. In other words, your biggest fact tables with heavy data loads should reside in an LMT tablespace with a UNIFORM extent size of multiple MB (I've used between 32MB and 256MB extent sizes). AUTOALLOCATE doesn't really save you much – with autoallocate there's an LMT space management bitmap bit per each 64kB chunk of the file, so there's much more overhead when searching for and allocating a big extent consisting of many small LMT "chunks" vs a big extent which consists of only a single big LMT chunk.
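
For illustration, something along these lines – the tablespace name, sizes and the file number are made up for this sketch:

-- create the fact table tablespace as LMT with a big UNIFORM extent size
CREATE BIGFILE TABLESPACE fact_data
  DATAFILE '+DATA' SIZE 100G
  EXTENT MANAGEMENT LOCAL UNIFORM SIZE 64M;

-- pre-extend the datafile ahead of a known big data load, so the loading
-- sessions don't have to wait for (offloaded) datafile extension
ALTER DATABASE DATAFILE 16 RESIZE 500G;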

I will come back to the topic of buffer busy waits on block #2 in a follow-up post – this post aims to show that you can use pretty much any ASH column you like to get more details about who was doing what in the entire wait chain.

Another topic for the future would be the wait events which do not populate the blocking session info properly (depending on the version – most latches and mutexes, for example). The manual approach would still be needed there – although with mutexes it is possible to extract the (exclusive) blocker's SID from the wait event's PARAMETER2 value. Maybe in ash_wait_chains v0.3 :-)

 


When do Oracle Parallel Execution Slaves issue buffered physical reads – Part 1?


This post applies both to non-Exadata and Exadata systems.

Before Oracle 11.2 came out, it was true to say that Oracle Parallel Execution slaves always do direct path reads (bypassing the buffer cache) when doing full segment scans. This should not be taken simplistically though. Even when you were doing full table scans, then yes, the scanning was done with direct path multiblock reads – but if you had to visit other, additional blocks out of the scanning sequence, then these extra IOs were done with regular buffered reads. For example, next row piece fetching of chained rows or undo block access for CR reads was done with buffered single block reads, or even buffered multiblock reads if some form of prefetching kicked in.

In addition to that, random table/index accesses like index range scans and the following table block fetches are always done in a buffered way both in serial and parallel execution cases.

Starting from Oracle 11.2 though, Oracle parallel execution slaves can also do the parallel full segment scans via the buffer cache. The feature is called In-Memory Parallel Execution, not to be confused with the Oracle 12c upcoming In-Memory Option (which gives you a columnar, compressed, in-memory cache of your on-disk data).

The 11.2 in-memory parallel execution doesn’t introduce any new data formats, but just allows you to cache your hottest tables across buffer caches of multiple RAC nodes (different nodes hold different “fragments” of a segment) and allow PX slaves to avoid physical disk reads and even work entirely from the local buffer cache of the RAC node the PX slave is running on (data locality!). You should use Oracle 11.2.0.3+ for in-mem PX as it has some improvements and bugfixes for node- and NUMA-affinity for PX slaves.

So, this is a great feature – if disk scanning / data retrieval IO is your bottleneck (and you have plenty of memory). But on Exadata, the storage cells give you awesome disk scanning speeds and data filtering/projection/decompression offloading anyway. If you use buffered reads, then you won't use Smart Scans – as smart scans need direct path reads as a prerequisite. And if you don't use smart scans, the Exadata storage cells will just act as block IO servers for you – and even if you have all the data cached in RAM, your DB nodes (compute nodes) would be used for all the filtering and decompression of billions of rows. Also, there's no storage indexing in memory (well, unless you use zone maps, which do a similar thing at a higher level in 12c).

So, long story short, you likely do not want to use In-Memory PX on Exadata – and even on non-Exadata, you probably do not want it to kick in automatically at “random” times without you controlling this. So, this leads to the question of when does the in-memory PX kick in and how it is controlled?

Long story short – the in-memory PX can kick in when either of the following conditions is true:

  1. When _parallel_cluster_cache_policy = CACHED (this will be set so from default when parallel_degree_policy = AUTO)
  2. When the segment (or partition) is marked as CACHE or KEEP, for example:
    • ALTER TABLE t STORAGE (BUFFER_POOL KEEP)
    • ALTER TABLE t CACHE

So, while #1 is relatively well known behavior, #2 is not. This means the In-Memory PX can kick in even if your parallel_degree_policy = MANUAL!

The default value for _parallel_cluster_cache_policy is ADAPTIVE – and that provides a clue. I read it so that when set to ADAPTIVE, the in-memory PX caching decision is "adaptively" made based on whether the segment is marked as CACHE/KEEP in the data dictionary or not. My brief tests showed that when marking an object for the KEEP buffer pool, Oracle (11.2.0.3) tries to do the in-memory PX even if the combined KEEP pool size in the RAC cluster is smaller than the segment itself. Therefore, all scans ended up re-reading all blocks from disk into the buffer cache again (and throwing "earlier" blocks of the segment out). When the KEEP pool was not configured, the DEFAULT buffer cache was used (even if the object was marked to be in the KEEP pool). So, when marking your objects for KEEP or CACHE, make sure you have enough buffer cache allocated for this.

Note that with parallel_degree_policy=AUTO & _parallel_cluster_cache_policy=CACHED, Oracle tries to be more intelligent about this, allowing only up to _parallel_cluster_cache_pct (default value 80) percent of total buffer cache in the RAC cluster to be used for in-memory PX. I haven’t tested this deep enough to say I know how the algorithm works. So far I have preferred the manual CACHE/KEEP approach for the latest/hottest partitions (in a few rare cases that I’ve used it).
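
(If you want to see what these underscore parameters are currently set to, the usual X$ query – run as SYS – does the job, for example:)

SELECT i.ksppinm  parameter
     , v.ksppstvl value
FROM   x$ksppi  i
     , x$ksppcv v
WHERE  i.indx = v.indx
AND    i.ksppinm LIKE '\_parallel_cluster%' ESCAPE '\';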

So, as long as you don’t mark your tables/indexes/partitions as CACHE or KEEP and use parallel_degree_policy=MANUAL or LIMITED, you should not get the in-mem PX to kick in and all parallel full table scans should get nicely offloaded on Exadata (unless you hit any of the other limitations that block a smart scan from happening).

Update 1: I just wrote Part 2 for this article too.

Update 2: Frits Hoogland pointed me to a bug/MOS Note which may be interesting to you if you use KEEP pool:

KEEP BUFFER POOL Does Not Work for Large Objects on 11g (Doc ID 1081553.1):

  • Due to this bug, tables with size >10% of cache size, were being treated as ‘large tables’ for their reads and this resulted in execution of a new SERIAL_DIRECT_READ path in 11g.
  • With the bug fix applied, any object in the KEEP buffer pool, whose size is less than DB_KEEP_CACHE_SIZE, is considered as a small or medium sized object. This will cache the read blocks and avoid subsequent direct read for these objects .

When do Oracle Parallel Execution Slaves issue buffered physical reads – Part 2?


In the previous post about in-memory parallel execution I described in which cases the in-mem PX can kick in for your parallel queries.

A few years ago (around Oracle 11.2.0.2 and Exadata X2 release time) I was helping a customer with their migration to Exadata X2. Many of the queries ran way slower on Exadata compared to their old HP Superdome. The Exadata system was configured according to the Oracle’s “best practices”, that included setting the parallel_degree_policy = AUTO.

As there were thousands of reports (most of them with performance issues) and we couldn't extract SQL Monitoring reports (due to another issue), I just looked into the ASH data for a general overview. A SQL Monitoring report takes its execution plan line time breakdown from ASH anyway.

First I ran a simple ASH query which counted the ASH samples (seconds spent) and grouped the results by the rowsource type (I was using a custom script then, but you could achieve the same with running @ash/ashtop "sql_plan_operation||' '||sql_plan_options" session_type='FOREGROUND' sysdate-1/24 sysdate for example):

SQL> @ash_custom_report

PLAN_LINE                                            COUNT(*)        PCT
-------------------------------------------------- ---------- ----------
TABLE ACCESS STORAGE FULL                              305073       47.6
                                                        99330       15.5
LOAD AS SELECT                                          86802       13.6
HASH JOIN                                               37086        5.8
TABLE ACCESS BY INDEX ROWID                             20341        3.2
REMOTE                                                  13981        2.2
HASH JOIN OUTER                                          8914        1.4
MAT_VIEW ACCESS STORAGE FULL                             7807        1.2
TABLE ACCESS STORAGE FULL FIRST ROWS                     6348          1
INDEX RANGE SCAN                                         4906         .8
HASH JOIN BUFFERED                                       4537         .7
PX RECEIVE                                               4201         .7
INSERT STATEMENT                                         3601         .6
PX SEND HASH                                             3118         .5
PX SEND BROADCAST                                        3079         .5
SORT AGGREGATE                                           3074         .5
BUFFER SORT                                              2266         .4
SELECT STATEMENT                                         2259         .4
TABLE ACCESS STORAGE SAMPLE                              2136         .3
INDEX UNIQUE SCAN                                        2090         .3

The above output shows that indeed over 47% of Database Time in ASH history has been spent in TABLE ACCESS STORAGE FULL operations (regardless of which SQL_ID). The blank 15.5% is activity by non-SQL plan execution stuff, like parsing, PL/SQL, background process activity, logins/connection management etc. But only about 3.2+0.8+0.3 = 4.3% of the total DB Time was spent in index related row sources. So, the execution plans for our reports were doing what they are supposed to be doing on an Exadata box – doing full table scans.

However, why were they so slow? Shouldn’t Exadata be doing full table scans really fast?! The answer is obviously yes, but only when the smart scans actually kick in!

As I wanted a high level overview, I added a few more columns to my script (rowsource_events.sql) – namely IS_PARALLEL, which reports whether the active session was a serial session or a PX slave, and the wait event (what that session was actually doing).
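
(Conceptually, the script boils down to something like the query below – just a sketch: the real script takes the rowsource filter and time range as parameters, and detecting PX slaves via QC_SESSION_ID is a simplification:)

SELECT sql_plan_operation || ' ' || sql_plan_options                          plan_line
     , CASE WHEN qc_session_id IS NOT NULL THEN 'PARALLEL' ELSE 'SERIAL' END  is_parallel
     , session_state
     , wait_class
     , event
     , COUNT(*)                                                               seconds
FROM   v$active_session_history
WHERE  sql_plan_operation = 'TABLE ACCESS'
AND    sql_plan_options LIKE 'STORAGE%FULL'
GROUP BY sql_plan_operation || ' ' || sql_plan_options
     , CASE WHEN qc_session_id IS NOT NULL THEN 'PARALLEL' ELSE 'SERIAL' END
     , session_state, wait_class, event
ORDER BY seconds DESC;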

As the earlier output had shown that close to 50% of DB time was spent on table scans – I drilled down to TABLE ACCESS STORAGE FULL data retrieval operations only, excluding all the other data processing stuff like joins, loads and aggregations:

SQL> @ash/rowsource_events TABLE% STORAGE%FULL

PLAN_LINE                                IS_PARAL SESSION WAIT_CLASS      EVENT                                      COUNT(*)        PCT
---------------------------------------- -------- ------- --------------- ---------------------------------------- ---------- ----------
TABLE ACCESS STORAGE FULL                PARALLEL WAITING User I/O        cell multiblock physical read                139756         47
TABLE ACCESS STORAGE FULL                PARALLEL ON CPU                                                                64899       21.8
TABLE ACCESS STORAGE FULL                SERIAL   WAITING User I/O        cell multiblock physical read                 24133        8.1
TABLE ACCESS STORAGE FULL                PARALLEL WAITING User I/O        cell single block physical read               16430        5.5
TABLE ACCESS STORAGE FULL                PARALLEL WAITING User I/O        read by other session                         12141        4.1
TABLE ACCESS STORAGE FULL                PARALLEL WAITING Cluster         gc buffer busy acquire                        10771        3.6
TABLE ACCESS STORAGE FULL                PARALLEL WAITING User I/O        cell smart table scan                          7573        2.5
TABLE ACCESS STORAGE FULL                SERIAL   WAITING Cluster         gc cr multi block request                      7158        2.4
TABLE ACCESS STORAGE FULL                SERIAL   ON CPU                                                                 6872        2.3
TABLE ACCESS STORAGE FULL                PARALLEL WAITING Cluster         gc cr multi block request                      2610         .9
TABLE ACCESS STORAGE FULL                PARALLEL WAITING User I/O        cell list of blocks physical read              1763         .6
TABLE ACCESS STORAGE FULL                SERIAL   WAITING User I/O        cell single block physical read                1744         .6
TABLE ACCESS STORAGE FULL                SERIAL   WAITING User I/O        cell list of blocks physical read               667         .2
TABLE ACCESS STORAGE FULL                SERIAL   WAITING User I/O        cell smart table scan                           143          0
TABLE ACCESS STORAGE FULL                SERIAL   WAITING Cluster         gc cr disk read                                 122          0
TABLE ACCESS STORAGE FULL                PARALLEL WAITING Cluster         gc current grant busy                            97          0
TABLE ACCESS STORAGE FULL                PARALLEL WAITING Cluster         gc current block 3-way                           85          0
TABLE ACCESS STORAGE FULL                PARALLEL WAITING Cluster         gc cr grant 2-way                                68          0
TABLE ACCESS STORAGE FULL                SERIAL   WAITING User I/O        direct path read                                 66          0
TABLE ACCESS STORAGE FULL                SERIAL   WAITING Cluster         gc current grant 2-way                           52          0

Indeed, over 60% of all my full table scan operations were waiting on buffered-read-related wait events (both buffered IO waits and RAC global cache waits, which you wouldn’t get with direct path reads – well, unless some read consistency cloning & rollback is needed). So, Smart Scans were definitely not kicking in for some of those full table scans (despite the support analyst’s outdated claim about PX slaves “never” using buffered IOs).

Also, when you are not offloading the IOs, low-level data processing, filtering etc. to the storage cells, you will need to use your DB node CPUs for all that work (while the cell CPUs sit almost idle if nothing gets offloaded). So, when enabling smart scans, both the DB IO wait times and the DB CPU usage per unit of work done should go down.
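A quick (system-wide, cumulative) way to sanity-check how much IO is even eligible for offloading versus what actually comes back via smart scans is to look at a few V$SYSSTAT counters – deltas over a measurement interval are more meaningful than the absolute values:

-- Offload-related system statistics (cumulative since instance startup).
SELECT name, value
FROM   v$sysstat
WHERE  name IN ('cell physical IO bytes eligible for predicate offload',
                'cell physical IO interconnect bytes returned by smart scan',
                'cell physical IO interconnect bytes');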

Anyway – after looking deeper into the issue – I noticed that they had parallel_degree_policy set to AUTO, so in-memory PX had also kicked in as a side effect. Once we set this to LIMITED, the problem went away: instead of lots of buffered IO wait events we got a much smaller amount of cell smart table scan wait events – and higher CPU utilization (yes, higher), as a big bottleneck was now removed and we got much more work done every second.
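For reference, the change itself is a one-liner (shown here as a system-wide change – test the impact before doing this on a busy production system):

-- Check the current setting, then switch from AUTO to LIMITED; LIMITED keeps
-- (restricted) automatic DOP but disables parallel statement queuing and
-- in-memory parallel execution.
SHOW PARAMETER parallel_degree_policy

ALTER SYSTEM SET parallel_degree_policy = LIMITED;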

This was on Oracle 11.2.0.2 and I didn’t have a chance to look any deeper, but I suspected that the problem came from the fact that this database had many different workloads in it – both smart-scanning ones and also regular, index-based accesses – so possibly the “in-memory PX automatic memory distributor”, or whatever controls what gets cached and what not, wasn’t accounting for all the non-PX-scan activity. Apparently this should be improved in Oracle 11.2.0.3 – but I haven’t tested this myself – I feel safer manually controlling the CACHE/KEEP attributes.
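By manually controlling caching I mean something along these lines (the table names are made up for illustration – which segments, if any, deserve this treatment is very workload-specific):

-- Hypothetical example: decide explicitly which segments should stay buffered
-- instead of relying on the automatic in-memory PX caching decision.
ALTER TABLE small_dim_table STORAGE (BUFFER_POOL KEEP);   -- keep it in the KEEP pool
ALTER TABLE big_fact_table  NOCACHE;                      -- don't favor caching its full-scan blocks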

Hard Drive Predictive Failures on Exadata


This post also applies to non-Exadata systems as hard drives work the same way in other storage arrays too – just the commands you would use for extracting the disk-level metrics would be different.

I just noticed that one of our Exadatas had a disk put into “predictive failure” mode and thought I’d show how to measure why the disk is in that mode (as opposed to just replacing it without really understanding the issue ;-)

SQL> @exadata/cellpd
Show Exadata cell versions from V$CELL_CONFIG....

DISKTYPE             CELLNAME             STATUS                 TOTAL_GB     AVG_GB  NUM_DISKS   PREDFAIL   POORPERF WTCACHEPROB   PEERFAIL   CRITICAL
-------------------- -------------------- -------------------- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------
FlashDisk            192.168.12.3         normal                      183         23          8
FlashDisk            192.168.12.3         not present                 183         23          8                     3
FlashDisk            192.168.12.4         normal                      366         23         16
FlashDisk            192.168.12.5         normal                      366         23         16
HardDisk             192.168.12.3         normal                    20489       1863         11
HardDisk             192.168.12.3         warning - predictive       1863       1863          1          1
HardDisk             192.168.12.4         normal                    22352       1863         12
HardDisk             192.168.12.5         normal                    22352       1863         12

So, one of the disks in storage cell with IP 192.168.12.3 has been put into predictive failure mode. Let’s find out why!

To find out exactly which disk it was, I ran one of my scripts for displaying the Exadata disk topology (partial output below):

SQL> @exadata/exadisktopo2
Showing Exadata disk topology from V$ASM_DISK and V$CELL_CONFIG....

CELLNAME             LUN_DEVICENAME       PHYSDISK                       PHYSDISK_STATUS                                                                  CELLDISK                       CD_DEVICEPART                                                                    GRIDDISK                       ASM_DISK                       ASM_DISKGROUP                  LUNWRITECACHEMODE
-------------------- -------------------- ------------------------------ -------------------------------------------------------------------------------- ------------------------------ -------------------------------------------------------------------------------- ------------------------------ ------------------------------ ------------------------------ ----------------------------------------------------------------------------------------------------
192.168.12.3         /dev/sda             35:0                           normal                                                                           CD_00_enkcel01                 /dev/sda3                                                                        DATA_CD_00_enkcel01            DATA_CD_00_ENKCEL01            DATA                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sda             35:0                           normal                                                                           CD_00_enkcel01                 /dev/sda3                                                                        RECO_CD_00_enkcel01            RECO_CD_00_ENKCEL01            RECO                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdb             35:1                           normal                                                                           CD_01_enkcel01                 /dev/sdb3                                                                        DATA_CD_01_enkcel01            DATA_CD_01_ENKCEL01            DATA                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdb             35:1                           normal                                                                           CD_01_enkcel01                 /dev/sdb3                                                                        RECO_CD_01_enkcel01            RECO_CD_01_ENKCEL01            RECO                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdc             35:2                           normal                                                                           CD_02_enkcel01                 /dev/sdc                                                                         DATA_CD_02_enkcel01            DATA_CD_02_ENKCEL01            DATA                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdc             35:2                           normal                                                                           CD_02_enkcel01                 /dev/sdc                                                                         DBFS_DG_CD_02_enkcel01         DBFS_DG_CD_02_ENKCEL01         DBFS_DG                        "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdc             35:2                           normal                                                                           CD_02_enkcel01                 /dev/sdc                                                                         RECO_CD_02_enkcel01            RECO_CD_02_ENKCEL01            RECO                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdd             35:3                           warning - predictive                                                             CD_03_enkcel01                 /dev/sdd                                                                         DATA_CD_03_enkcel01                                                                          "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdd             35:3                           warning - predictive                                                             CD_03_enkcel01                 /dev/sdd                                                                         DBFS_DG_CD_03_enkcel01                                                                       "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdd             35:3                           warning - predictive                                                             CD_03_enkcel01                 /dev/sdd                                                                         RECO_CD_03_enkcel01                                                                          "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"

Ok, looks like /dev/sdd (with address 35:3) is the “failed” one.

When listing the alerts from the storage cell, we indeed see that a failure has been predicted, a warning raised and even handled – the XDMG process gets notified and the ASM disks get dropped from the failing grid disks (as you can see in the exadisktopo output above if you scroll right).

CellCLI> LIST ALERTHISTORY WHERE alertSequenceID = 456 DETAIL;
	 name:              	 456_1
	 alertDescription:  	 "Data hard disk entered predictive failure status"
	 alertMessage:      	 "Data hard disk entered predictive failure status.  Status        : WARNING - PREDICTIVE FAILURE  Manufacturer  : HITACHI  Model Number  : H7220AA30SUN2.0T  Size          : 2.0TB  Serial Number : 1016M7JX2Z  Firmware      : JKAOA28A  Slot Number   : 3  Cell Disk     : CD_03_enkcel01  Grid Disk     : DBFS_DG_CD_03_enkcel01, DATA_CD_03_enkcel01, RECO_CD_03_enkcel01"
	 alertSequenceID:   	 456
	 alertShortName:    	 Hardware
	 alertType:         	 Stateful
	 beginTime:         	 2013-11-27T07:48:03-06:00
	 endTime:           	 2013-11-27T07:55:52-06:00
	 examinedBy:        	 
	 metricObjectName:  	 35:3
	 notificationState: 	 1
	 sequenceBeginTime: 	 2013-11-27T07:48:03-06:00
	 severity:          	 critical
	 alertAction:       	 "The data hard disk has entered predictive failure status. A white cell locator LED has been turned on to help locate the affected cell, and an amber service action LED has been lit on the drive to help locate the affected drive. The data from the disk will be automatically rebalanced by Oracle ASM to other disks. Another alert will be sent and a blue OK-to-Remove LED will be lit on the drive when rebalance completes. Please wait until rebalance has completed before replacing the disk. Detailed information on this problem can be found at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1112995.1  "

	 name:              	 456_2
	 alertDescription:  	 "Hard disk can be replaced now"
	 alertMessage:      	 "Hard disk can be replaced now.  Status        : WARNING - PREDICTIVE FAILURE  Manufacturer  : HITACHI  Model Number  : H7220AA30SUN2.0T  Size          : 2.0TB  Serial Number : 1016M7JX2Z  Firmware      : JKAOA28A  Slot Number   : 3  Cell Disk     : CD_03_enkcel01  Grid Disk     : DBFS_DG_CD_03_enkcel01, DATA_CD_03_enkcel01, RECO_CD_03_enkcel01 "
	 alertSequenceID:   	 456
	 alertShortName:    	 Hardware
	 alertType:         	 Stateful
	 beginTime:         	 2013-11-27T07:55:52-06:00
	 examinedBy:        	 
	 metricObjectName:  	 35:3
	 notificationState: 	 1
	 sequenceBeginTime: 	 2013-11-27T07:48:03-06:00
	 severity:          	 critical
	 alertAction:       	 "The data on this disk has been successfully rebalanced by Oracle ASM to other disks. A blue OK-to-Remove LED has been lit on the drive. Please replace the drive."

The two alerts show that we first detected a (soon) failing disk (event 456_1) and then ASM kicked in and dropped the ASM disks from the failing disk and rebalanced the data elsewhere (event 456_2).

But we still do not know why we are expecting the disk to fail! And the alert info and CellCLI command output do not have this detail. This is where S.M.A.R.T. monitoring comes in. Major hard drive manufacturers support the SMART standard for both reactive and predictive monitoring of the hard disk’s internal workings, and there are commands for querying these metrics.

Let’s find the failed disk info at the cell level with CELLCLI:

CellCLI> LIST PHYSICALDISK;
	 35:0     	 JK11D1YAJTXVMZ	 normal
	 35:1     	 JK11D1YAJB4V0Z	 normal
	 35:2     	 JK11D1YAJAZMMZ	 normal
	 35:3     	 JK11D1YAJ7JX2Z	 warning - predictive failure
	 35:4     	 JK11D1YAJB3J1Z	 normal
	 35:5     	 JK11D1YAJB4J8Z	 normal
	 35:6     	 JK11D1YAJ7JXGZ	 normal
	 35:7     	 JK11D1YAJB4E5Z	 normal
	 35:8     	 JK11D1YAJ8TY3Z	 normal
	 35:9     	 JK11D1YAJ8TXKZ	 normal
	 35:10    	 JK11D1YAJM5X9Z	 normal
	 35:11    	 JK11D1YAJAZNKZ	 normal
	 FLASH_1_0	 1014M02JC3    	 not present
	 FLASH_1_1	 1014M02JYG    	 not present
	 FLASH_1_2	 1014M02JV9    	 not present
	 FLASH_1_3	 1014M02J93    	 not present
	 FLASH_2_0	 1014M02JFK    	 not present
	 FLASH_2_1	 1014M02JFL    	 not present
	 FLASH_2_2	 1014M02JF7    	 not present
	 FLASH_2_3	 1014M02JF8    	 not present
	 FLASH_4_0	 1014M02HP5    	 normal
	 FLASH_4_1	 1014M02HNN    	 normal
	 FLASH_4_2	 1014M02HP2    	 normal
	 FLASH_4_3	 1014M02HP4    	 normal
	 FLASH_5_0	 1014M02JUD    	 normal
	 FLASH_5_1	 1014M02JVF    	 normal
	 FLASH_5_2	 1014M02JAP    	 normal
	 FLASH_5_3	 1014M02JVH    	 normal

Ok, let’s look into the details, as we also need the deviceId for querying the SMART info:

CellCLI> LIST PHYSICALDISK 35:3 DETAIL;
	 name:              	 35:3
	 deviceId:          	 26
	 diskType:          	 HardDisk
	 enclosureDeviceId: 	 35
	 errMediaCount:     	 0
	 errOtherCount:     	 0
	 foreignState:      	 false
	 luns:              	 0_3
	 makeModel:         	 "HITACHI H7220AA30SUN2.0T"
	 physicalFirmware:  	 JKAOA28A
	 physicalInsertTime:	 2010-05-15T21:10:49-05:00
	 physicalInterface: 	 sata
	 physicalSerial:    	 JK11D1YAJ7JX2Z
	 physicalSize:      	 1862.6559999994934G
	 slotNumber:        	 3
	 status:            	 warning - predictive failure

Ok, the disk device was /dev/sdd, the disk name is 35:3 and the device ID is 26. And it’s a SATA disk. So I will run smartctl with the sat+megaraid device type option to query the disk SMART metrics – via the SCSI controller the disks are attached to. Note that the ,26 at the end is the deviceId reported by the LIST PHYSICALDISK command. There’s quite a lot of output – the important attributes to look at are Raw_Read_Error_Rate, Reallocated_Sector_Ct and Current_Pending_Sector:

> smartctl  -a  /dev/sdd -d sat+megaraid,26
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.32-400.11.1.el5uek] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     HITACHI H7220AA30SUN2.0T 1016M7JX2Z
Serial Number:    JK11D1YAJ7JX2Z
LU WWN Device Id: 5 000cca 221df9d11
Firmware Version: JKAOA28A
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Nov 28 06:28:13 2013 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(22330) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   028   028   016    Pre-fail  Always       -       430833663
  2 Throughput_Performance  0x0005   132   132   054    Pre-fail  Offline      -       103
  3 Spin_Up_Time            0x0007   117   117   024    Pre-fail  Always       -       614 (Average 624)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       69
  5 Reallocated_Sector_Ct   0x0033   058   058   005    Pre-fail  Always       -       743
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   112   112   020    Pre-fail  Offline      -       39
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       30754
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       80
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       80
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       23 (Min/Max 17/48)
196 Reallocated_Event_Count 0x0032   064   064   000    Old_age   Always       -       827
197 Current_Pending_Sector  0x0022   089   089   000    Old_age   Always       -       364
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     30754         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So, from the output above we see that the Raw_Read_Error_Rate indicator for this hard drive (28) is pretty close to the threshold of 16. The SMART metrics really are just “health indicators” following the defined standards (read the Wikipedia article on S.M.A.R.T.). The health indicator value can range from 0 to 253. For the Raw_Read_Error_Rate metric (which measures the physical read success rate from the disk surface), a bigger value is better (apparently 100 is the max, at least with these Hitachi disks – most of the other disks were showing around 90-100). So whenever there are media read errors the metric will drop – the more errors, the more it drops.

Apparently some read errors are inevitable (and detected by various checks like ECC), especially in high-density disks. The errors will be corrected or worked around, sometimes via ECC, sometimes by a re-read. So, yes, your hard drive performance can get worse as disks age or are about to fail. If the metric ever goes below the defined threshold of 16, the disk apparently (and consistently over some period of time) isn’t working so great, so it had better be replaced.

Note that the RAW_VALUE column does not necessarily show the number of failed reads from the disk platter. It may represent the number of sectors that failed to be read, or it may be just a bitmap – or both combined into the low- and high-order bytes of this value. For example, when converting the raw value of 430833663 to hex, we get 0x19ADFFFF. Perhaps the low-order FFFF is some sort of a bitmap and the high-order 0x19AD is the number of failed sectors or reads. There’s some more info available about Seagate disks, but our V2 has Hitachi ones and I couldn’t find anything about how to decode the RAW_VALUE for their disks. So, we just have to trust that the “normalized” SMART health indicators for the different metrics tell us when there’s a problem.
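The hex conversion itself is easy to double-check from SQL*Plus:

-- 430833663 in hexadecimal is 19ADFFFF
SELECT TO_CHAR(430833663, 'XXXXXXXXX') AS raw_value_hex FROM dual;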

Even though I did not see the actual value (nor the worst value) crossing the threshold when I ran the smartctl command, 28 is still pretty close to the threshold of 16, considering that normally the indicator should be close to 100. So my guess is that the indicator actually did cross the threshold – that is when the alert got raised – and by the time I logged in and ran my diagnostics commands the disk was working better again. It looks like the “worst” values are not remembered properly by the disks (or some SMART tool resets them every now and then). Note that we would see SMART alerts with the actual problem metric values in the Linux /var/log/messages file if the smartd service were enabled in the storage cell Linux OS – but apparently it’s disabled and probably one of Oracle’s own daemons in the cell is doing that monitoring.

So what does this info tell us? A low “health indicator” for Raw_Read_Error_Rate means that there are problems with physically reading the sequences of bits from the disk platter – in other words bad sectors or weak sectors (which will probably soon become bad sectors). Had we seen a bad health state for UDMA_CRC_Error_Count, for example, it would have indicated a data transfer issue over the SATA cable. So it looks like the reason this disk is in the predictive failure state is simply that it has had too many read errors from the physical disk platter.

If you look at the other important metrics above – Reallocated_Sector_Ct and Current_Pending_Sector – you see there are hundreds of disk sectors (743 and 364) that have had IO issues; eventually the reads finished ok and the sectors were migrated (remapped) to a spare disk area. As these disks have a 512B sector size, this means that some Oracle block-size IOs covering a single logical sector range may actually have to read part of the data from the original location and then seek to some other location on the disk for the rest (the remapped sector). So, again, your disk performance may get worse when your disk is about to fail or is simply having quality issues.

For reference, here’s an example from another, healthier disk in this Exadata storage cell:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   086   086   016    Pre-fail  Always       -       2687039
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       99
  3 Spin_Up_Time            0x0007   119   119   024    Pre-fail  Always       -       601 (Average 610)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       68
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   114   114   020    Pre-fail  Offline      -       38
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       30778
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       68
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       81
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       81
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       21 (Min/Max 16/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

The Raw_Read_Error_Rate indicator still shows 86 (percent from ideal?), but it’s much farther away from the threshold of 16. Many of the other disks even showed 99 or 100, and apparently this metric value changes as the disk behavior changes – some disks with a value of 88 jumped to 100 and an hour later were at 95, and so on. So the VALUE column allows near real-time monitoring of these internal disk metrics; the only thing that doesn’t make sense right now is why the WORST column gets reset over time.

For the better behaving disk, the Reallocated_Sector_Ct and Current_Pending_Sector metrics show zero, so this disk doesn’t seem to have bad or weak sectors (yet).

I hope this post is another example that it is possible to dig deeper – but only when the piece of software or hardware is properly instrumented, of course. Without such instrumentation it would be way harder (you would have to take a stethoscope and record the noise of the hard drive for analysis, or open the drive in a dustless clean room and see what it’s doing yourself ;-)

Note that I will be talking about systematic Exadata performance troubleshooting (and optimization) in my Advanced Exadata Performance online seminar on December 16-20.

cell flash cache read hits vs. cell writes to flash cache statistics on Exadata


When the Smart Flash Cache was introduced in Exadata, it cached reads only. So there were only read “optimization” statistics like cell flash cache read hits and physical read requests/bytes optimized in V$SESSTAT and V$SYSSTAT (the former accounts for the read IO requests that got their data from the flash cache, the latter ones account for the disk IOs avoided thanks to both the flash cache and storage indexes). So if you wanted to measure the benefit of the flash cache only, you had to use the cell flash cache read hits metric.

This was all fine until you enabled the write-back flash cache in a newer version of cellsrv. We still had only the “read hits” statistic in the V$ views! And when investigating closer, it turned out that both the read hits and write hits were accumulated in the same read hits statistic! (I can’t reproduce this anymore on our patched 11.2.0.3 with the latest cellsrv, but it was definitely the behavior earlier, as I demoed in various places.)

Side-note: this is likely because it’s not so easy to just add more statistics to the Oracle code within a single small patch. The statistic counters are referenced by other modules using macros with their direct numeric IDs (and memory offsets into the v$sesstat array), and the IDs & addresses would change when more statistics get added. So you can pretty much add new statistic counters only with full patchsets, like 11.2.0.4. It’s the same with instance parameters, by the way – that’s why the “spare” statistics and spare parameters exist; they are placeholders for temporary use until the new parameter or statistic gets added permanently with a full patchset update.
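You can see these placeholders for yourself – at least on the versions I’ve looked at, V$STATNAME contains a handful of “spare statistic” entries reserved for exactly this purpose:

-- List the placeholder statistics reserved for future use (the spare instance
-- parameters, like _first_spare_parameter, exist for the same reason).
SELECT statistic#, name
FROM   v$statname
WHERE  name LIKE 'spare statistic%'
ORDER  BY statistic#;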

So this is probably the reason why both the flash cache read and write hits initially got accumulated under the cell flash cache read hits statistic; later on this seemed to get “fixed”, so that the read hits statistic only showed read hits and the flash write hits were not accounted anywhere. You can test this easily by measuring your DBWR’s V$SESSTAT metrics (with Snapper, for example): if you get way more cell flash cache read hits than physical read total IO requests, then you’re probably accumulating both read and write hits in the same metric.
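If you don’t have Snapper at hand, a plain V$SESSTAT query gives the same quick check (a sketch – compare deltas over a time interval rather than the absolute values):

-- Compare the DBWR's flash cache read hits with its physical read requests; if
-- "read hits" is much larger, both read and write hits are probably being
-- accumulated under the same statistic (the older behavior described above).
SELECT s.sid, sn.name, st.value
FROM   v$session s, v$sesstat st, v$statname sn
WHERE  st.sid = s.sid
AND    sn.statistic# = st.statistic#
AND    s.program LIKE '%DBW%'
AND    sn.name IN ('cell flash cache read hits',
                   'physical read total IO requests')
ORDER  BY s.sid, sn.name;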

Let’s look into a few different database versions:

SQL> @i

USERNAME             INST_NAME    HOST_NAME                 SID   SERIAL#  VERSION    STARTED 
-------------------- ------------ ------------------------- ----- -------- ---------- --------
SYS                  db12c1       enkdb03.enkitec.com       1497  20671    12.1.0.1.0 20131127

SQL> @sys cell%flash

NAME                                                                                  VALUE
---------------------------------------------------------------- --------------------------
cell flash cache read hits                                                          1874361

In the 12.1.0.1 database above we still have only the read hits metric. But in the Oracle 11.2.0.4 output below we finally have the flash cache IOs broken down into reads and writes, plus a few special metrics indicating whether the block written to already existed in the flash cache (cell overwrites in flash cache) and whether the block range written was only partially cached in flash when the DB issued the write (cell partial writes in flash cache):

SQL> @i

USERNAME             INST_NAME    HOST_NAME                 SID   SERIAL#  VERSION    STARTED 
-------------------- ------------ ------------------------- ----- -------- ---------- --------
SYS                  dbm012       enkdb02.enkitec.com       199   607      11.2.0.4.0 20131201

SQL> @sys cell%flash

NAME                                                                                  VALUE
---------------------------------------------------------------- --------------------------
cell writes to flash cache                                                           711439
cell overwrites in flash cache                                                       696661
cell partial writes in flash cache                                                        9
cell flash cache read hits                                                           699240

So this probably means that the upcoming Oracle 12.1.0.2 will have the flash cache write hit metrics in it too, and in newer versions there’s no need to get creative when estimating the write-back flash cache hits in our performance scripts (the Exadata Snapper currently tries to derive this value from other metrics, relying on the bug where both read and write hits were accumulated under the same metric, so I will need to update it based on the DB version we are running on).

So, when I look into one of the DBWR processes in a 11.2.0.4 DB on Exadata, I see the breakdown of flash read vs write hits:

SQL> @i

USERNAME             INST_NAME    HOST_NAME                 SID   SERIAL#  VERSION    STARTED 
-------------------- ------------ ------------------------- ----- -------- ---------- --------
SYS                  dbm012       enkdb02.enkitec.com       199   607      11.2.0.4.0 20131201

SQL> @exadata/cellver
Show Exadata cell versions from V$CELL_CONFIG....

CELL_PATH            CELL_NAME            CELLSRV_VERSION      FLASH_CACHE_MODE     CPU_COUNT 
-------------------- -------------------- -------------------- -------------------- ----------
192.168.12.3         enkcel01             11.2.3.2.1           WriteBack            16        
192.168.12.4         enkcel02             11.2.3.2.1           WriteBack            16        
192.168.12.5         enkcel03             11.2.3.2.1           WriteBack            16        

SQL> @ses2 "select sid from v$session where program like '%DBW0%'" flash

       SID NAME                                                                  VALUE
---------- ---------------------------------------------------------------- ----------
       296 cell writes to flash cache                                            50522
       296 cell overwrites in flash cache                                        43998
       296 cell flash cache read hits                                               36

SQL> @ses2 "select sid from v$session where program like '%DBW0%'" optimized

       SID NAME                                                                  VALUE
---------- ---------------------------------------------------------------- ----------
       296 physical read requests optimized                                         36
       296 physical read total bytes optimized                                  491520
       296 physical write requests optimized                                     25565
       296 physical write total bytes optimized                              279920640

If you are wondering why the cell writes to flash cache metric is roughly 2x bigger than physical write requests optimized, it’s because of the ASM double mirroring we use. The physical write metrics are counted at the database-scope IO layer (KSFD), but the ASM mirroring is done at a lower layer in the Oracle process codepath (KFIO). So when the DBWR issues a 1 MB write, the v$sesstat metrics record a 1 MB IO for it, but the ASM layer below actually does 2 or 3x more IO due to double or triple mirroring. As the cell writes to flash cache metric is sent back from all storage cells involved in the actual (ASM-mirrored) write IOs, we will see roughly 2-3x more storage flash write hits than physical writes issued at the database level (depending on which mirroring level you use). Another way of saying this is that the “physical writes” metrics are measured at a higher level, “above” the ASM mirroring, and the “flash hits” metrics are measured at a lower level, “below” the ASM mirroring in the IO stack.
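If you want to double-check which mirroring level is in play, the ASM diskgroup type tells you (NORMAL redundancy = double mirroring, HIGH = triple):

-- Each extent is written twice with NORMAL redundancy and three times with
-- HIGH, which explains the ~2-3x gap between DB-level and cell-level write counts.
SELECT name, type FROM v$asm_diskgroup;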

Oracle X$ tables – Part 1 – Where do they get their data from?


It’s long-time public knowledge that X$ fixed tables in Oracle are just “windows” into Oracle’s memory. So whenever you query an X$ table, the FIXED TABLE rowsource function in your SQL execution plan will just read some memory structure, parse its output and show you the results in tabular form. This is correct, but not the whole truth.
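A quick way to see that row source for yourself (X$ tables are only visible when connected as SYS / SYSDBA):

-- The execution plan of any X$ query shows the FIXED TABLE row source.
EXPLAIN PLAN FOR SELECT * FROM x$dual;
SELECT * FROM TABLE(dbms_xplan.display);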

Check this example. Let’s query the X$KSUSE table, which is used by V$SESSION:

SQL> SELECT addr, indx, ksuudnam FROM x$ksuse WHERE rownum <= 5;

ADDR           INDX KSUUDNAM
-------- ---------- ------------------------------
391513C4          1 SYS
3914E710          2 SYS
3914BA5C          3 SYS
39148DA8          4 SYS
391460F4          5 SYS

Now let’s check in which Oracle memory region this memory address resides (SGA, PGA, UGA etc). I’m using my script fcha for this (Find CHunk Address). You should probably not run this script in busy production systems as it uses the potentially dangerous X$KSMSP fixed table:

SQL> @fcha 391513C4
Find in which heap (UGA, PGA or Shared Pool) the memory address 391513C4 resides...

WARNING!!! This script will query X$KSMSP, which will cause heavy shared pool latch contention
in systems under load and with large shared pool. This may even completely hang
your instance until the query has finished! You probably do not want to run this in production!

Press ENTER to continue, CTRL+C to cancel...

LOC KSMCHPTR   KSMCHIDX   KSMCHDUR KSMCHCOM           KSMCHSIZ KSMCHCLS   KSMCHTYP KSMCHPAR
--- -------- ---------- ---------- ---------------- ---------- -------- ---------- --------
SGA 39034000          1          1 permanent memor     3977316 perm              0 00

SQL>

Ok, these X$KSUSE (V$SESSION) records reside in a permanent allocation in SGA and my X$ query apparently just parsed & presented the information from there.

Now, let’s query something else, for example the “Soviet Union” view X$KCCCP:

SQL> SELECT addr, indx, inst_id, cptno FROM x$kcccp WHERE rownum <= 5;

ADDR           INDX    INST_ID      CPTNO
-------- ---------- ---------- ----------
F692347C          0          1          1
F692347C          1          1          2
F692347C          2          1          3
F692347C          3          1          4
F692347C          4          1          5

Ok, let’s see where these records reside:

SQL> @fcha F692347C
Find in which heap (UGA, PGA or Shared Pool) the memory address F692347C resides...

WARNING!!! This script will query X$KSMSP, which will cause heavy shared pool latch contention
in systems under load and with large shared pool. This may even completely hang
your instance until the query has finished! You probably do not want to run this in production!

Press ENTER to continue, CTRL+C to cancel...

LOC KSMCHPTR   KSMCHIDX   KSMCHDUR KSMCHCOM           KSMCHSIZ KSMCHCLS   KSMCHTYP KSMCHPAR
--- -------- ---------- ---------- ---------------- ---------- -------- ---------- --------
UGA F6922EE8                       kxsFrame4kPage         4124 freeabl           0 00

SQL>

Wow, why does the X$KCCCP data reside in my session’s UGA? This is where the extra complication (and sophistication) of X$ fixed tables comes into play!

Some X$ tables do not simply read whatever is in some memory location – they have helper functions associated with them (something like the fixed packages that the ASM instance uses internally). So, whenever you query such an X$, a helper function is called first, which retrieves the source data from wherever it needs to, copies it into your UGA in the format corresponding to this X$, and then the normal X$ memory-location parsing & presentation code kicks in.

If you trace what the X$KCCCP access does, you’d see a bunch of control file parallel read wait events every time you query the X$ table (to retrieve the checkpoint progress records). So this X$ is not just doing a passive, read-only presentation of some memory structure (array). The helper function first does some real work: it allocates some runtime memory for the session (the kxsFrame4kPage chunk in UGA) and copies the results of its work into this UGA area – so that the X$ array & offset parsing code can read and present it back to the query engine.
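A minimal way to verify this yourself is to trace the session and look for the control file reads in the resulting trace file (a sketch – run as SYSDBA, since X$ tables are only visible to SYS):

-- Enable extended SQL trace, query the X$ and then check the trace file for
-- "control file parallel read" wait events.
ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';
SELECT COUNT(*) FROM x$kcccp;
ALTER SESSION SET EVENTS '10046 trace name context off';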

In other words, the ADDR column in X$ tables does not necessarily show where the data it displays ultimately lives, but just where the final array that got parsed for presentation happened to be. Sometimes the parsed data structure is the ultimate source, sometimes a helper function needs to do a bunch of work first (like taking latches and walking linked lists for X$KSMSP, or even doing physical disk reads from the controlfiles for X$KCCCP access).

And more, let’s run the same query against X$KCCCP twice:

SQL> SELECT addr, indx, inst_id, cptno FROM x$kcccp WHERE rownum <= 5;

ADDR           INDX    INST_ID      CPTNO
-------- ---------- ---------- ----------
F69254B4          0          1          1
F69254B4          1          1          2
F69254B4          2          1          3
F69254B4          3          1          4
F69254B4          4          1          5

And once more:

SQL> SELECT addr, indx, inst_id, cptno FROM x$kcccp WHERE rownum <= 5;

ADDR           INDX    INST_ID      CPTNO
-------- ---------- ---------- ----------
F692B508          0          1          1
F692B508          1          1          2
F692B508          2          1          3
F692B508          3          1          4
F692B508          4          1          5

See how the ADDR column has changed between executions even though we are querying the same data! This is not because the controlfiles or the source data have somehow relocated – it’s just that the temporary cursor execution scratch area where the final data structure was put for presentation (the kxsFrame4kPage chunk in UGA) happened to be allocated from different locations for the two executions.

There may be exceptions, but as long as the ADDR resides in the SGA, I’d say it’s the actual location where the data lives – but when it’s in the UGA/PGA, it may just be the temporary cursor scratch area, and the source data was taken from somewhere else (especially when the ADDR constantly changes or alternates between 2-3 different values when you repeatedly run your X$ query). Note that there are X$ tables which intentionally read data from arrays in your UGA (the actual source data lives in the UGA or PGA itself), but more about that in the future.
