Hotsos Symposium 2014

After missing last year’s Hotsos Symposium (trying to cut down my travel, as you know :), I will present at and deliver the full-day Training Day at this year’s Hotsos Symposium! This will be my 10th time attending (and speaking at) this awesome conference. So I guess this means more beer than usual. Or maybe less, as I’m getting old. Let’s make it as usual, then :o)

I have (finally) sent the abstract and the TOC of the Training Day to the Hotsos folks and they’ve been uploaded. So, check out the conference sessions and the training day contents here. I aim to keep my training day very practical – I’ll just be showing how I troubleshoot most issues that I hit, with plenty of examples. It will be suitable both for developers and DBAs. In the last part of the training day I will talk about some Oracle 12c internals and dive a bit deeper into the lower levels of troubleshooting, so we can have some fun too.

Looks like we’ll be having a good time!


Where does the Exadata storage() predicate come from?

On Exadata (or when setting cell_offload_plan_display = always on non-Exadata) you may see the storage() predicate in addition to the usual access() and filter() predicates in an execution plan:

SQL> SELECT * FROM dual WHERE dummy = 'X';

D
-
X

Check the plan:

SQL> @x
Display execution plan for last statement for this session from library cache...

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID  dtjs9v7q7zj1g, child number 0
-------------------------------------
SELECT * FROM dual WHERE dummy = 'X'

Plan hash value: 272002086

------------------------------------------------------------------------
| Id  | Operation                 | Name | E-Rows |E-Bytes| Cost (%CPU)|
------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |        |       |     2 (100)|
|*  1 |  TABLE ACCESS STORAGE FULL| DUAL |      1 |     2 |     2   (0)|
------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("DUMMY"='X')
       filter("DUMMY"='X')

The access() and filter() predicates come from the corresponding ACCESS_PREDICATES and FILTER_PREDICATES columns in V$SQL_PLAN. But there’s no STORAGE_PREDICATES column there!

SQL> @desc v$sql_plan
           Name                            Null?    Type
           ------------------------------- -------- ----------------------------
    1      ADDRESS                                  RAW(4)
    2      HASH_VALUE                               NUMBER
    3      SQL_ID                                   VARCHAR2(13)
  ...
   33      TEMP_SPACE                               NUMBER
   34      ACCESS_PREDICATES                        VARCHAR2(4000)
   35      FILTER_PREDICATES                        VARCHAR2(4000)
   36      PROJECTION                               VARCHAR2(4000)
  ...
   40      OTHER_XML                                CLOB

So where does the storage predicate come from then?

The answer is that there is no storage() predicate column in any V$ view. The storage() predicate actually comes from the ACCESS_PREDICATES column – the DBMS_XPLAN.DISPLAY functions just have extra logic in them: if the execution plan line (the OPTIONS column in V$SQL_PLAN) contains the string STORAGE, then any access() predicates for that line are displayed as storage() predicates instead!

SQL> SELECT id, access_predicates,filter_predicates FROM v$sql_plan WHERE sql_id = 'dtjs9v7q7zj1g' AND child_number = 0;

        ID ACCESS_PREDICATES    FILTER_PREDICATES
---------- -------------------- --------------------
         0
         1 "DUMMY"='X'          "DUMMY"='X'

This actually makes sense, as the filter() predicates are the “dumb brute-force” predicates that are not able to pass any information (about what values they are looking for) into the access path row source they are filtering. In other words, a filter() function fetches all the rows from its row source and throws away everything that doesn’t match the filter condition.

The access() predicate, on the other hand, is able to pass in the value (or range) it’s looking for inside its row source. For example, when doing an index unique lookup, the access() predicate can send the value your query is looking for right into the index traversing code, so you only retrieve the rows you want as opposed to retrieving everything and throwing the non-wanted rows away.

So the access() predicate traditionally showed up for index access paths and hash join row sources, but never for full table scans. Now, with Exadata, even full table scans can work in a smart way (allowing you to pass the values you’re looking for down to the storage layer), so some of the full scanning row sources support the access() predicate too – with the catch that if the OPTIONS column in V$SQL_PLAN contains “STORAGE”, the access() predicates are shown as storage().

Note that the SQL Monitor reports (to my knowledge) still don’t support this display logic, so you would see row sources like TABLE ACCESS STORAGE FULL with filter() and access() predicates on them – the access() on these STORAGE row sources really means storage().

Slides of my previous presentations

Here are the slides of some of my previous presentations (that I haven’t made public yet, other than delivering these at conferences and training sessions):

Scripts and Tools That Make Your Life Easier and Help to Troubleshoot Better:

  • I delivered this presentation at the Hotsos Symposium Training Day in 2010:

[Embedded slide deck]

Troubleshooting Complex Performance Issues – Part1:

[Embedded slide deck]

Troubleshooting Complex Performance Issues – Part2

[Embedded slide deck]

Oracle Memory Troubleshooting, Part 4: Drilling down into PGA memory usage with V$PROCESS_MEMORY_DETAIL

If you haven’t read them – here are the previous articles in the Oracle memory troubleshooting series: Part 1, Part 2, Part 3.

Let’s say you have noticed that one of your Oracle processes is consuming a lot of private memory. The V$PROCESS view has the PGA_USED_MEM / PGA_ALLOC_MEM columns for this. Note that this view tells you what Oracle thinks it’s using – how much of the allocated/freed bytes it has kept track of. While this doesn’t always reflect the true memory usage of a process, as non-Oracle-heap allocation routines and OS libraries may allocate (and leak) memory of their own, it’s a good starting point and usually enough.
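For example, a quick instance-wide look at the top private memory consumers straight from V$PROCESS could be done with something like this (a simple starting-point sketch, amounts in MB):

SQL> SELECT spid, program,
  2         ROUND(pga_used_mem /1048576) used_mb,
  3         ROUND(pga_alloc_mem/1048576) alloc_mb,
  4         ROUND(pga_max_mem  /1048576) max_mb
  5    FROM v$process
  6   ORDER BY pga_alloc_mem DESC;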

Then, the V$PROCESS_MEMORY view allows you to see a basic breakdown of that process’s memory usage – whether it’s for SQL, PL/SQL, Java, unused (Freeable) or “Other” reasons. You can use either the smem.sql or pmem.sql script for this (they report v$process_memory for a SID or OS PID):

SQL> @smem 198
Display session 198 memory usage from v$process_memory....

       SID        PID    SERIAL# CATEGORY         ALLOCATED       USED MAX_ALLOCATED
---------- ---------- ---------- --------------- ---------- ---------- -------------
       198         43         17 Freeable           1572864          0
       198         43         17 Other              5481102                  5481102
       198         43         17 PL/SQL                2024        136          2024
       198         43         17 SQL              117805736  117717824     118834536

From the above output we see that this session has allocated over 100MB of private memory for “SQL” reasons. This normally means SQL workareas, so we can break it down further by querying V$SQL_WORKAREA_ACTIVE, which shows all currently in-use cursor workareas in the instance. I’m using the wrka.sql script for convenience – and listing only my SID’s workareas:

SQL> @wrka sid=198
Show Active workarea memory usage for where sid=198...

   INST_ID        SID  QCINST_ID      QCSID SQL_ID        OPERATION_TYPE                  PLAN_LINE POLICY                   ACTIVE_SEC ACTUAL_MEM_USED MAX_MEM_USED WORK_AREA_SIZE NUMBER_PASSES TEMPSEG_SIZE TABLESPACE
---------- ---------- ---------- ---------- ------------- ------------------------------ ---------- ------------------------ ---------- --------------- ------------ -------------- ------------- ------------ ------------------------------
         1        198                       ff8v9qhv21pm5 SORT (v2)                               1 AUTO                           14.6        64741376    104879104       97623040             0   2253389824 TEMP
         1        198                       ff8v9qhv21pm5 HASH-JOIN                               6 AUTO                           14.8         1370112      1370112        2387968             0
         1        198                       ff8v9qhv21pm5 BUFFER                                 25 AUTO                           14.8        11272192     11272192       11272192             0

The ACTUAL_MEM_USED column above shows the memory currently used by this workarea (which happens to be a SORT (v2) operation in that cursor’s execution plan line #1). It was only about 64MB at the time I got to query this view, but MAX_MEM_USED shows it was about 100MB at its peak. This can happen with multipass operations, where the merge phase may use less memory than the sort phase – or because once the sort had completed and the row source was ready to start sending sorted rows back, not much memory was needed anymore for just buffering the blocks read from TEMP (the sort_area_size vs. sort_area_retained_size thing from the past).

For completeness, I also have a script called wrkasum.sql that summarizes the workarea memory usage of all sessions in an instance – so if you’re not interested in a single session, but rather in a summary of which operation types tend to consume the most memory, you can use that:

SQL> @wrkasum
Top allocation reason by PGA memory usage

OPERATION_TYPE      POLICY      ACTUAL_PGA_MB ALLOWED_PGA_MB    TEMP_MB NUM_PASSES     NUM_QC NUM_SESSIONS 
------------------- ----------- ------------- -------------- ---------- ---------- ---------- ------------ 
SORT (v2)           AUTO                   58            100       1525          0          1            1            
BUFFER              AUTO                   11             11                     0          1            1            
HASH-JOIN           AUTO                    1              2                     0          1            1

You may want to modify the script to change the GROUP BY to SQL_ID if you want to list the top workarea-memory-consuming SQL statements across the whole instance (or use any other column of interest – like QCINST_ID/QCSID).
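For instance, a minimal sketch of such a per-SQL_ID aggregation (without all of wrkasum.sql’s formatting) could look like this:

SQL> SELECT sql_id,
  2         ROUND(SUM(actual_mem_used)/1048576) actual_pga_mb,
  3         ROUND(SUM(work_area_size) /1048576) allowed_pga_mb,
  4         ROUND(SUM(tempseg_size)   /1048576) temp_mb,
  5         MAX(number_passes)                  num_passes,
  6         COUNT(DISTINCT sid)                 num_sessions
  7    FROM v$sql_workarea_active
  8   GROUP BY sql_id
  9   ORDER BY 2 DESC NULLS LAST;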

But what about the following example:

SQL> @pmem 27199
Display process memory usage for SPID 27199...

       SID SPID                            PID    SERIAL# CATEGORY         ALLOCATED       USED MAX_ALLOCATED     CON_ID
---------- ------------------------ ---------- ---------- --------------- ---------- ---------- ------------- ----------
      1516 27199                           120        198 Freeable            786432          0                        0
      1516 27199                           120        198 Other            842807461                842807461          0
      1516 27199                           120        198 PL/SQL              421064      77296        572344          0
      1516 27199                           120        198 SQL                2203848      50168       2348040          0

Most of the memory (over 800MB) is consumed by the “Other” category?! Not that helpful, huh? V$SQL_WORKAREA_ACTIVE didn’t show anything either, as it deals only with SQL workareas and not all the other possible reasons why an Oracle process might allocate memory.

So we need a way to drill down into the Other category and see which allocation reasons have taken most of this memory. Historically this was only doable with a PGA/UGA memory heapdump and by aggregating the resulting dumpfile. You have to use ORADEBUG to get the target process to dump its own private memory breakdown, as it is private memory that other processes cannot just read directly. I have written about this in Part 1 of the Oracle memory troubleshooting series.

Update: an alternative to ORADEBUG is to use ALTER SESSION SET EVENTS ‘immediate trace name pga_detail_get level N’ where N is the Oracle PID of the process. 
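For example, to target the same process as in the ORADEBUG demo below (Oracle PID 49):

SQL> ALTER SESSION SET EVENTS 'immediate trace name pga_detail_get level 49';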

However, starting from Oracle 10.2 you can get a similarly detailed breakdown by querying V$PROCESS_MEMORY_DETAIL – no need for post-processing tracefiles! When you just query it, though, the view does not return any rows:

SQL> SELECT * FROM v$process_memory_detail;

no rows selected

Again, this is for the abovementioned reasons – your current process cannot just read the contents of some other process’s private memory; the OS ensures that. You will have to ask the target process to populate V$PROCESS_MEMORY_DETAIL with its memory allocation breakdown. You can do this by using the ORADEBUG DUMP PGA_DETAIL_GET command:

SQL> ORADEBUG SETMYPID
Statement processed.
SQL> ORADEBUG DUMP PGA_DETAIL_GET 49
Statement processed.

The number 49 above is the Oracle PID (v$process.pid) of the target process I want to examine. The ORADEBUG PGA_DETAIL_GET command will not immediately make the target process report its usage – it merely sets a flag somewhere, which the target process checks when it is active. In other words, if the target process is idle or sleeping for a long time (due to some lock, for example), it won’t populate the V$ view with the required data. In my test environment, V$PROCESS_MEMORY_DETAIL got populated only after I ran another dummy command in the target session. This shouldn’t be an issue if you are examining a process that’s actively doing something (and not idle/sleeping for a long time).

The output below is from another dummy demo session that wasn’t using much of memory:

SQL> SELECT * FROM v$process_memory_detail ORDER BY pid, bytes DESC;

       PID    SERIAL# CATEGORY        NAME                       HEAP_NAME            BYTES ALLOCATION_COUNT HEAP_DES PARENT_H
---------- ---------- --------------- -------------------------- --------------- ---------- ---------------- -------- --------
        49          5 Other           permanent memory           pga heap            162004               19 11B602C0 00
        49          5 SQL             QERHJ Bit vector           QERHJ hash-joi      131168                8 F691EF4C F68F6F7C
        49          5 Other           kxsFrame4kPage             session heap         57736               14 F68E7134 11B64780
        49          5 SQL             free memory                QERHJ hash-joi       54272                5 F691EF4C F68F6F7C
        49          5 Other           free memory                pga heap             41924                8 11B602C0 00
        49          5 Other           miscellaneous                                   39980              123 00       00
        49          5 Other           Fixed Uga                  Fixed UGA heap       36584                1 F6AA44B0 11B602C0
        49          5 Other           permanent memory           top call heap        32804                2 11B64660 00
        49          5 Other           permanent memory           session heap         32224                2 F68E7134 11B64780
        49          5 Other           free memory                top call heap        31692                1 11B64660 00
        49          5 Other           kgh stack                  pga heap             17012                1 11B602C0 00
        49          5 Other           kxsFrame16kPage            session heap         16412                1 F68E7134 11B64780
        49          5 Other           dbgeInitProcessCtx:InvCtx  diag pga             15096                2 F75A8630 11B602C0
...

The BYTES column shows the sum of memory allocated from the private memory heap HEAP_NAME for the reason shown in the NAME column. If you want to know the average allocation (chunk) size in the heap, divide BYTES by ALLOCATION_COUNT.
For example, the top PGA memory user in that process is an allocation called “permanent memory” – 162004 bytes taken straight from the top-level “pga heap”. It probably contains all kinds of low-level runtime allocations that the process needs for its own purposes. It may be possible to drill down into the subheaps inside that allocation with the Oracle memory top-5 subheap dumping I have written about before.

The 2nd biggest memory user is in the SQL category – the “QERHJ Bit vector” allocation: 131168 bytes allocated in 8 chunks of ~16kB each (on average). QERHJ should mean Query Execution Row-source Hash-Join, and the hash join bit vector is a hash join optimization (somewhat like a bloom filter on hash buckets) – Jonathan Lewis has written about this in his CBO book.
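For example, a quick way to eyeball the average chunk sizes straight from the view (the NULLIF just guards against division by zero):

SQL> SELECT category, name, heap_name, bytes, allocation_count,
  2         ROUND(bytes / NULLIF(allocation_count, 0)) avg_chunk_bytes
  3    FROM v$process_memory_detail
  4   WHERE pid = 49
  5   ORDER BY bytes DESC;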

I do have a couple of scripts which automate running the ORADEBUG command, waiting for a second so that the target process has a chance to publish its data in V$PROCESS_MEMORY_DETAIL, and then querying it. Check out smem_detail.sql and pmem_detail.sql.

Now, let’s look into a real example from a problem case – a stress test environment on Oracle 12c:

SQL> @smem 1516
Display session 1516 memory usage from v$process_memory....

       SID        PID    SERIAL# CATEGORY         ALLOCATED       USED MAX_ALLOCATED     CON_ID
---------- ---------- ---------- --------------- ---------- ---------- ------------- ----------
      1516        120        198 Freeable            786432          0                        0
      1516        120        198 Other            844733773                844733773          0
      1516        120        198 PL/SQL              421064      77296        572344          0
      1516        120        198 SQL                 277536      45904       2348040          0

The Other memory usage of a session has grown to over 800MB!

Let’s drill down deeper. The script warns that it’s experimental and asks you to press enter to continue as it’s using ORADEBUG. I haven’t seen any problems with it, but use it at your own risk (and stay away from critical background processes on production systems)!

SQL> @smem_detail 1516

WARNING! About to run an undocumented ORADEBUG command
for getting heap details.
This script is EXPERIMENTAL, use at your own risk!

Press ENTER to continue, or CTRL+C to cancel

PL/SQL procedure successfully completed.

STATUS
----------
COMPLETE

If the status above is not COMPLETE then you need to wait
for the target process to do some work and re-run the
v$process_memory_detail query in this script manually
(or just take a heapdump level 29 to get heap breakdown
in a tracefile)

       SID CATEGORY        NAME                       HEAP_NAME            BYTES ALLOCATION_COUNT
---------- --------------- -------------------------- --------------- ---------- ----------------
      1516 Other           permanent memory           qmxlu subheap    779697376           203700
      1516 Other           free memory                qmxlu subheap     25960784           202133
      1516 Other           XVM Storage                XVM subheap of     5708032               51
      1516 Other           free memory                session heap       2722944              598
      1516 Other           permanent memory           pga heap            681992               36
      1516 Other           qmushtCreate               qmtmInit            590256                9
      1516 Other           free memory                top uga heap        449024              208
      1516 Other           qmtmltAlloc                qmtmInit            389680             1777
      1516 Other           permanent memory           kolarsCreateCt      316960               15
      1516 Other           free memory                pga heap            306416               17
      1516 Other           miscellaneous                                  297120              105
      1516 Other           permanent memory           qmxtgCreateBuf      279536               73
      1516 Other           free memory                koh dur heap d      239312              134
      1516 Other           kxsFrame4kPage             session heap        232512               56
      1516 Other           permanent memory           qmcxdDecodeIni      228672               21
      1516 Other           permanent memory           qmxtigcp:heap       215936              730
      1516 Other           permanent memory           session heap        189472               28
      1516 Other           free memory                lpxHeap subhea      182760               32
      1516 Other           kfioRqTracer               pga heap            131104                1
      1516 Other           free memory                top call heap       129312                4
      1516 PL/SQL          recursive addr reg file    koh-kghu sessi      110592               10
      1516 Other           free memory                callheap            109856                4
      1516 Other           koh-kghu session heap      session heap         88272               36
      1516 Other           Fixed Uga                  pga heap             72144                1
      1516 PL/SQL          PL/SQL STACK               PLS PGA hp           68256                4
...

Well, there you go – the power of measuring & profiling. Most of that big memory usage comes from something called the qmxlu subheap. Now, while this name is cryptic and we don’t know what it means, we are already halfway there – we at least know what to focus on. We can ignore all the other hundreds of cryptic memory allocations in the output and just try to figure out what “qmxlu subheap” is. A quick MOS search might tell you right away – and if there are known bugs related to this memory leak, you might find what’s affecting you immediately (as Oracle support analysts may have pasted symptoms, patch info and workarounds into the bug note):

[Screenshot: MOS search results for “qmxlu subheap”]

Indeed, there are plenty of results in MOS, and when browsing through them to find the one matching our symptoms and environment the closest, I looked into this one: ORA-4030 With High Allocation Of “qmxdpls_subheap” (Doc ID 1509914.1). It came up in the search because the support analyst had pasted a recursive subheap dump containing our symptom – “qmxlu subheap” – into it:

Summary of subheaps at depth 2
5277 MB total:
 5277 MB commented, 128 KB permanent
 174 KB free (110 KB in empty extents),
   2803 MB, 1542119496 heaps:   "               "
   1302 MB, 420677 heaps:   "qmxlu subheap  "
    408 MB, 10096248 chunks:  "qmxdplsArrayGetNI1        " 2 KB free held
    385 MB, 10096248 chunks:  "qmxdplsArrayNI0           " 2 KB free held

In this note, the referenced bug had been closed as “not a bug”, hinting that it may be an application issue (an application “object” leak) rather than an internal memory leak causing this memory usage growth.

Cause:

The cause of this problem has been identified in:
unpublished Bug:8918821 – MEMORY LEAK IN DBMS_XMLPARSER IN QMXDPLS_SUBHEAP
closed as “not a bug”. The problem is caused by the fact that the XML document is created with XMLDOM.CREATEELEMENT, but after creation XMLDOM.FREEDOCUMENT is not called. This causes the XML used heaps to remain allocated. Every new call to XMLDOM.CREATEELEMENT will then allocate a new heap, causing process memory to grow over time, and hence cause the ORA-4030 error to occur in the end.

Solution:

To implement a solution for this issue, use XMLDOM.FREEDOCUMENT to explicitly free any explicitly or implicitly created XML document, so the memory associated with that document can be released for reuse.

And indeed, in our case it turned out to be an application issue – the application did not free the XMLDOM documents after use, slowly accumulating more and more open document memory structures, using more memory and also more CPU time (judging by the ALLOCATION_COUNT figure in the smem_detail output above, the internal array used for managing the open document structures had grown to 203700 entries). Once the application object leak was fixed, the performance and memory usage problem went away.

Summary:

V$PROCESS_MEMORY_DETAIL allows you to conveniently dig deeper into process PGA memory usage. The alternative is to use Oracle heapdumps. A few more useful comments about it are in an old Oracle-L post.

Normally my process memory troubleshooting & drilldown sequence goes like this (usually only steps 1-2 are enough; 3-4 are rarely needed):

  1. v$process / v$process_memory / top / ps
  2. v$sql_workarea_active
  3. v$process_memory_detail or heapdump_analyzer
  4. pmap -x at OS level

Steps #1-3 above can show you “session” level memory usage (assuming that you are using dedicated servers, with a 1:1 relationship between a session and a process), and #4 can show you a different view into the real process memory usage, from the OS perspective.

Even though you may see cryptic allocation reason names in the output, if reason X causes 95% of your problem, you’ll need to focus on finding out what X means and don’t need to waste time on anything else. If there’s an Oracle bug involved, a MOS search by top memory consumer names would likely point you to the relevant bug right away.

Oracle troubleshooting is fun!

Note that this year’s only Advanced Oracle Troubleshooting class takes place at the end of April/May 2014, so sign up now if you plan to attend this year!

What the heck are the /dev/shm/JOXSHM_EXT_x files on Linux?

There was an interesting question on Oracle-L about the JOXSHM_EXT_* files in the /dev/shm directory on Linux. Basically something like this:

$ ls -l /dev/shm/* | head
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:16 /dev/shm/JOXSHM_EXT_0_LIN112_1409029
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:16 /dev/shm/JOXSHM_EXT_100_LIN112_1409029
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:16 /dev/shm/JOXSHM_EXT_101_LIN112_1409029
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:23 /dev/shm/JOXSHM_EXT_102_LIN112_1409029
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:23 /dev/shm/JOXSHM_EXT_103_LIN112_1409029
-rwxrwx--- 1 oracle dba 36864 Apr 18 10:23 /dev/shm/JOXSHM_EXT_104_LIN112_1409029
...

There are a few interesting MOS articles about these files and how/when to get rid of them (don’t remove any files before reading the notes!), but none of these articles explain why the JOXSHM (and PESHM) files are needed at all:

  • /dev/shm Filled Up With Files In Format JOXSHM_EXT_xxx_SID_xxx (Doc ID 752899.1)
  • Stale Native Code Files Are Being Cached with File Names Such as: JOXSHM_EXT*, PESHM_EXT*, PESLD* or SHMDJOXSHM_EXT* (Doc ID 1120143.1)
  • Ora-7445 [Ioc_pin_shared_executable_object()] (Doc ID 1316906.1)

Here’s an explanation – a somewhat more elaborate version of what I already posted on Oracle-L:

The JOX files are related to Oracle’s in-database JVM JIT compilation. Instead of interpreting the JVM bytecode at runtime, Oracle compiles it to architecture-specific native binary code – just like compiling C code with something like gcc would do. So the CPUs can execute that binary code directly, without any interpretation layers in between.

Now the question is: how do we load that binary code into our own process address space, so that the CPUs can execute this stuff directly?

This is why the JOX files exist. When JIT compilation is enabled (it’s on by default in Oracle 11g), the Java code you access in the database is compiled to machine code and saved into the JOX files. Each Java class or method gets its own file (I haven’t checked which it is exactly). Your Oracle process then maps those files into its address space with an mmap() system call. So, any time this compiled Java code has to be executed, your Oracle process can just jump to the compiled method’s address (and return later).

Let’s do a little test:

SQL> SHOW PARAMETER jit

PARAMETER_NAME                                               TYPE        VALUE
------------------------------------------------------------ ----------- -----
java_jit_enabled                                             boolean     TRUE

Java just-in-time compilation is enabled. Where the JOX files are placed by default is OS-specific, but on Linux they go to /dev/shm (the in-memory filesystem) unless you specify some other directory with the _ncomp_shared_objects_dir parameter (and you’re not hitting one of the related bugs).
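For example, something like the below should relocate them (a sketch only – the directory path is made up, and as with any underscore parameter, check with Oracle Support before setting it):

SQL> ALTER SYSTEM SET "_ncomp_shared_objects_dir" = '/u01/app/oracle/ncomp' SCOPE = SPFILE;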

So let’s run some Java code in the Database:

SQL> SELECT DBMS_JAVA.GETVERSION FROM dual;

GETVERSION
--------------------------------------------
11.2.0.4.0

After this execution a bunch of JOX files showed up in /dev/shm:

$ ls -l /dev/shm | head
total 77860
-rwxrwx--- 1 oracle dba         4096 May  9 22:13 JOXSHM_EXT_0_LIN112_229381
-rwxrwx--- 1 oracle dba        12288 May  9 22:13 JOXSHM_EXT_10_LIN112_229381
-rwxrwx--- 1 oracle dba         4096 May  9 22:13 JOXSHM_EXT_11_LIN112_229381
-rwxrwx--- 1 oracle dba         8192 May  9 22:13 JOXSHM_EXT_12_LIN112_229381
-rwxrwx--- 1 oracle dba         4096 May  9 22:13 JOXSHM_EXT_13_LIN112_229381
-rwxrwx--- 1 oracle dba         4096 May  9 22:13 JOXSHM_EXT_14_LIN112_229381
...

When I check that process’s address space with pmap, I see that some of these JOX files are indeed mapped into it:

oracle@oel6:~$ sudo pmap -x 33390
33390:   oracleLIN112 (LOCAL=NO)
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000008048000  157584   21268       0 r-x--  oracle
0000000011a2c000    1236     372      56 rw---  oracle
0000000011b61000     256     164     164 rw---    [ anon ]
0000000013463000     400     276     276 rw---    [ anon ]
0000000020000000    8192       0       0 rw-s-  SYSV00000000 (deleted)
0000000020800000  413696       0       0 rw-s-  SYSV00000000 (deleted)
0000000039c00000    2048       0       0 rw-s-  SYSV16117e54 (deleted)
00000000420db000     120     104       0 r-x--  ld-2.12.so
00000000420f9000       4       4       4 r----  ld-2.12.so
00000000420fa000       4       4       4 rw---  ld-2.12.so
00000000420fd000    1604     568       0 r-x--  libc-2.12.so
000000004228e000       8       8       8 r----  libc-2.12.so
0000000042290000       4       4       4 rw---  libc-2.12.so
0000000042291000      12      12      12 rw---    [ anon ]
0000000042296000      92      52       0 r-x--  libpthread-2.12.so
00000000422ad000       4       4       4 r----  libpthread-2.12.so
00000000422ae000       4       4       4 rw---  libpthread-2.12.so
00000000422af000       8       4       4 rw---    [ anon ]
00000000422b3000      12       8       0 r-x--  libdl-2.12.so
00000000422b6000       4       4       4 r----  libdl-2.12.so
00000000422b7000       4       4       4 rw---  libdl-2.12.so
00000000f63b9000       4       4       4 rwxs-  JOXSHM_EXT_88_LIN112_229381
00000000f63ba000      16      16      16 rwxs-  JOXSHM_EXT_91_LIN112_229381
00000000f63be000       4       4       4 rwxs-  JOXSHM_EXT_90_LIN112_229381
00000000f63bf000       4       4       4 rwxs-  JOXSHM_EXT_89_LIN112_229381

...

Note the x and s bits (in rwxs-) for the mapped JOX segments: x means the Linux virtual memory manager allows the contents of these mapped files to be executed directly by the CPU, and s means it’s a shared mapping (other processes can map this binary code into their address spaces as well).

Oracle can also load binary libraries into its address space with the dynamic loader’s dlopen() call, but I verified using strace that the JOX files are “loaded” into the address space with just an mmap() syscall:

33390 open("/dev/shm/JOXSHM_EXT_85_LIN112_229381", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = 8
33390 mmap2(NULL, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, 8, 0) = 0xfffffffff63c2000
33390 close(8)                          = 0
33390 open("/dev/shm/JOXSHM_EXT_87_LIN112_229381", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = 8
33390 mmap2(NULL, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, 8, 0) = 0xfffffffff63c0000
33390 close(8)                          = 0
33390 open("/dev/shm/JOXSHM_EXT_89_LIN112_229381", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = 8
33390 mmap2(NULL, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, 8, 0) = 0xfffffffff63bf000
33390 close(8)                          = 0
...

Just to check whether these JOX files really contain compiled machine-code instructions:

$ file /dev/shm/JOXSHM_EXT_88_LIN112_229381
/dev/shm/JOXSHM_EXT_88_LIN112_229381: data
$

Oops – the file command doesn’t recognize any specific file format from its contents, as it’s a bare slice of machine code and doesn’t contain the normal object module structures that .o files would have… so let’s try to disassemble this file and see if it contains sensible instructions:

$ objdump -b binary -m i386 -D /dev/shm/JOXSHM_EXT_88_LIN112_229381 | head -30

/dev/shm/JOXSHM_EXT_88_LIN112_229381:     file format binary


Disassembly of section .data:

00000000 :
       0:	f7 15 1e 33 00 00    	notl   0x331e
       6:	00 00                	add    %al,(%eax)
       8:	55                   	push   %ebp
       9:	89 e5                	mov    %esp,%ebp
       b:	83 c4 c8             	add    $0xffffffc8,%esp
       e:	8b 45 10             	mov    0x10(%ebp),%eax
      11:	89 c1                	mov    %eax,%ecx
      13:	83 e1 f8             	and    $0xfffffff8,%ecx
      16:	89 4d e0             	mov    %ecx,-0x20(%ebp)
      19:	8b 4d 08             	mov    0x8(%ebp),%ecx
      1c:	89 45 f0             	mov    %eax,-0x10(%ebp)
      1f:	8b 45 0c             	mov    0xc(%ebp),%eax
      22:	89 5d ec             	mov    %ebx,-0x14(%ebp)
      25:	89 7d e8             	mov    %edi,-0x18(%ebp)
      28:	89 75 e4             	mov    %esi,-0x1c(%ebp)
      2b:	c7 45 f8 02 00 00 40 	movl   $0x40000002,-0x8(%ebp)
      32:	8b 91 1d 02 00 00    	mov    0x21d(%ecx),%edx
      38:	8b 7d 14             	mov    0x14(%ebp),%edi
      3b:	89 45 f4             	mov    %eax,-0xc(%ebp)
      3e:	8d 45 f0             	lea    -0x10(%ebp),%eax
      41:	83 c2 04             	add    $0x4,%edx
      44:	89 91 1d 02 00 00    	mov    %edx,0x21d(%ecx)
      4a:	39 d0                	cmp    %edx,%eax
...

This file contains machine code indeed!

So this is how Oracle approaches native JIT compilation for the in-database JVM. In 11g it works similarly to PL/SQL native compilation (you’d see various PESHM_ files in /dev/shm). Before 11g, Oracle actually generated intermediate C code for your PL/SQL and then invoked an OS C compiler on it, but in 11g it’s all self-contained in the database code. Pretty cool!
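If you want to see the PL/SQL native compilation counterpart in action, recompiling a procedure with native code generation enabled should produce those PESHM_ files – a sketch, where my_proc is just a hypothetical procedure name:

SQL> ALTER SESSION SET plsql_code_type = 'NATIVE';
SQL> ALTER PROCEDURE my_proc COMPILE;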


Enkitec + Accenture = Even More Awesomeness!

Enkitec is the best consulting firm for hands-on implementation, running and troubleshooting of your Oracle-based systems, especially engineered systems like Exadata. We have a truly awesome group of people here; many are the best in their field (just look at the list!!!).

This is why I am here.

This is also why Accenture approached us some time ago – and you may already have seen today’s announcement that Enkitec got bought!

We are all now part of Accenture, and this opens up a whole lot of new opportunities. I think this is BIG, and I will explain how I see the future (sorry, no Oracle Database internals in this post ;-)

In my opinion the single most important detail of this transaction is that both Enkitec and the folks at Accenture realize that the reason Enkitec is so awesome is that awesome techies want to work here. And we don’t just want to keep it that way – we must keep it that way!

The Enkitec group will not be dissolved into Accenture. If it were, we would disappear like a drop in the ocean and Accenture would have lost its investment. Instead, we will remain an island in the ocean, continuing to provide expert help for our existing and new customers – and in the long term helping Accenture build additional capability for the massive projects of their customers.

We will not have ten thousand people in our group. Instead we will continue hiring (and retaining) people exactly the way we have been – organic growth by taking on only the best, like-minded people. The main difference is that now, with Accenture behind us, we can hire the best people globally, as we’ll have operations in over 50 countries. I understand that we won’t likely even double in size in the next few years – as we plan to stick to hiring only the best.

I think we will have a much, much wider reach now, showing how to do Oracle technology “our way” all around the world. With Accenture behind us, we will be navigating through even larger projects in larger businesses, influencing things earlier and more. And on a more personal note, I’m looking forward to all those 10 rack Exadata and 100TB In-Memory DB Option performance projects ;-)

See you at Enkitec E4 in June!


Combining Bloom Filter Offloading and Storage Indexes on Exadata

Here’s a little-known feature of Exadata – you can use a Bloom filter computed from the join column of one table to skip disk I/Os against another table it is joined to. This is not the same as the Bloom filtering of datablock contents inside the Exadata storage cells, but rather avoiding reading some storage regions from the disks completely.

So, you can use storage indexes to skip I/Os against your large fact table, based on a bloom filter calculated from a small dimension table!

This is especially useful for dimensional star schemas, as your SQL statements might not have direct predicates on the large fact tables at all – all results are determined by looking up the relevant dimension records and then performing a hash join to the fact table (whether you should have some direct predicates against the fact tables for performance reasons is a separate topic for some other day :-)

Let me show an example using the SwingBench Order Entry schema. The first output is from Oracle 11.2.0.3 BP21 on Cellsrv 12.1.1.1.0:

SQL> ALTER SESSION SET "_serial_direct_read"=ALWAYS;

Session altered.

SQL> SELECT
  2      /*+ LEADING(c)
  3          NO_SWAP_JOIN_INPUTS(o)
  4          INDEX_RS_ASC(c(cust_email))
  5          FULL(o)
  6          MONITOR
  7      */
  8      *
  9  FROM
 10      soe.customers c
 11    , soe.orders o
 12  WHERE
 13      o.customer_id = c.customer_id
 14  AND c.cust_email = 'florencio@ivtboge.com'
 15  /

CUSTOMER_ID CUST_FIRST_NAME                CUST_LAST_NAME  ...
----------- ------------------------------ --------------
  399999199 brooks                         laxton        

Elapsed: 00:00:55.81

You can ignore the hints in the query; I just used them to get the plan I wanted for my demo. I also forced serial full segment scans to use direct path reads (and thus attempt Smart Scans on Exadata) using the _serial_direct_read parameter.

Note that while I do have a direct filter predicate on the “small” CUSTOMERS table, I don’t have any predicates on the “large” ORDERS table, so there’s no filter predicate to offload to the storage layer for the ORDERS table (though column projection and HCC decompression can still be offloaded for that table). Anyway, the query took over 55 seconds.

Let’s run this now with PARALLEL degree 2 and compare the results:

SQL> ALTER SESSION FORCE PARALLEL QUERY PARALLEL 2;

Session altered.

SQL> SELECT
  2      /*+ LEADING(c)
  3          NO_SWAP_JOIN_INPUTS(o)
  4          INDEX_RS_ASC(c(cust_email))
  5          FULL(o)
  6          MONITOR
  7      */
  8      *
  9  FROM
 10      soe.customers c
 11    , soe.orders o
 12  WHERE
 13      o.customer_id = c.customer_id
 14  AND c.cust_email = 'florencio@ivtboge.com'
 15  /

CUSTOMER_ID CUST_FIRST_NAME                CUST_LAST_NAME       
----------- ------------------------------ ---------------------
  399999199 brooks                         laxton               

Elapsed: 00:00:03.80

Now the query ran in less than 4 seconds. How did the same query run close to 15 times faster than in serial mode? Parallel degree 2 for this simple query should give me at most 4 slaves doing the work… a 15x speedup indicates that something else has changed as well.

The first thing I would normally suspect is that direct path reads were not attempted for the serial query (so Smart Scans did not kick in). That would sure explain the big performance difference… but I did explicitly force serial direct path reads in my session. Anyway, let’s stop guessing and instead know for sure by measuring these experiments!

SQL Monitoring reports are a good starting point (I added the MONITOR hint to the query so that SQL Monitoring would kick in immediately even for the serial query).

Here’s the slow serial query:

Global Stats
====================================================================================================
| Elapsed |   Cpu   |    IO    | Application |  Other   | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs  | Bytes | Offload |
====================================================================================================
|      56 |      49 |     6.64 |        0.01 |     0.01 |     2 |     4M | 53927 |  29GB |  14.53% |
====================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=2205859845)
==========================================================================================================================================
| Id |               Operation               |     Name      | Execs |   Rows   | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                       |               |       | (Actual) | Bytes | Offload |   (%)    |        (# samples)        |
==========================================================================================================================================
|  0 | SELECT STATEMENT                      |               |     1 |        1 |       |         |          |                           |
|  1 |   HASH JOIN                           |               |     1 |        1 |       |         |    42.86 | Cpu (24)                  |
|  2 |    TABLE ACCESS BY GLOBAL INDEX ROWID | CUSTOMERS     |     1 |        1 |       |         |          |                           |
|  3 |     INDEX RANGE SCAN                  | CUST_EMAIL_IX |     1 |        1 |       |         |          |                           |
|  4 |    PARTITION HASH ALL                 |               |     1 |     458M |       |         |          |                           |
|  5 |     TABLE ACCESS STORAGE FULL         | ORDERS        |    64 |     458M |  29GB |  14.53% |    57.14 | Cpu (29)                  |
|    |                                       |               |       |          |       |         |          | cell smart table scan (3) |
==========================================================================================================================================

So, the serial query issued 29 GB worth of I/O to the Exadata storage cells, but only 14.53% less data was sent back – not that great a reduction in storage interconnect traffic. Also, it looks like all 458 million ORDERS rows were sent back from the storage cells (as the ORDERS table didn’t have any direct SQL predicates on it). Those rows were then fed to the HASH JOIN parent row source (DB CPU usage!), which threw away all of the rows but one. Talk about inefficiency!

Ok, this is the fast parallel query:

Global Stats
==============================================================================================================
| Elapsed | Queuing |   Cpu   |    IO    | Application |  Other   | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs  | Bytes | Offload |
==============================================================================================================
|    4.88 |    0.00 |    0.84 |     4.03 |        0.00 |     0.01 |     2 |     4M | 30056 |  29GB |  99.99% |
==============================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=2121763430)
===============================================================================================================================================
| Id |                 Operation                  |     Name      | Execs |   Rows   | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                            |               |       | (Actual) | Bytes | Offload |   (%)    |        (# samples)        |       
===============================================================================================================================================
|  0 | SELECT STATEMENT                           |               |     3 |        1 |       |         |          |                           |       
|  1 |   PX COORDINATOR                           |               |     3 |        1 |       |         |          |                           |       
|  2 |    PX SEND QC (RANDOM)                     | :TQ10001      |     2 |        1 |       |         |          |                           |       
|  3 |     HASH JOIN                              |               |     2 |        1 |       |         |          |                           |       
|  4 |      BUFFER SORT                           |               |     2 |        2 |       |         |          |                           |       
|  5 |       PX RECEIVE                           |               |     2 |        2 |       |         |          |                           |       
|  6 |        PX SEND BROADCAST                   | :TQ10000      |     1 |        2 |       |         |          |                           |       
|  7 |         TABLE ACCESS BY GLOBAL INDEX ROWID | CUSTOMERS     |     1 |        1 |       |         |          |                           |       
|  8 |          INDEX RANGE SCAN                  | CUST_EMAIL_IX |     1 |        1 |       |         |          |                           |       
|  9 |      PX PARTITION HASH ALL                 |               |     2 |     4488 |       |         |          |                           |       
| 10 |       TABLE ACCESS STORAGE FULL            | ORDERS        |    64 |     4488 |  29GB |  99.99% |   100.00 | cell smart table scan (4) |
===============================================================================================================================================

The fast parallel query still issued 29 GB of I/O, but only 0.01% of that (compared to the amount of I/O issued) was sent back from the storage cells, so the offload efficiency for that table scan is 99.99%. Also, only a few thousand rows were returned from the full table scan (so the throw-away by the hash join later in the plan is much smaller).

So, where does the big difference in runtime and offload efficiency metrics come from? It’s the same data and the same query (looking for the same values), so why do we get such different Offload Efficiency percentages?

The problem with looking at a single metric showing some percentage (of what?!) is that it hides a lot of detail. So, let’s get systematic and look into some detailed metrics :-)

I’ll use the Exadata Snapper for this purpose, although I could just list the V$SESSTAT metrics it uses as its source. ExaSnapper lists the various Exadata I/O metrics of the monitored session all in one “chart”, in the same unit (MB) and scale – so it is easy to compare these different cases. If you do not know what Exadata Snapper is, check out this article and video.

Here’s the slow serial query (you may need to scroll the output right to see the MB numbers):

SQL> SELECT * FROM TABLE(exasnap.display_snap(:t1,:t2));

NAME
-------------------------------------------------------------------------------------------------------------------------------------------
-- ExaSnapper v0.81 BETA by Tanel Poder @ Enkitec - The Exadata Experts ( http://www.enkitec.com )
-------------------------------------------------------------------------------------------------------------------------------------------
DB_LAYER_IO                    DB_PHYSIO_BYTES               |##################################################|    29291 MB    365 MB/sec
DB_LAYER_IO                    DB_PHYSRD_BYTES               |##################################################|    29291 MB    365 MB/sec
DB_LAYER_IO                    DB_PHYSWR_BYTES               |                                                  |        0 MB      0 MB/sec
AVOID_DISK_IO                  PHYRD_FLASH_RD_BYTES          |################################################# |    29134 MB    363 MB/sec
AVOID_DISK_IO                  PHYRD_STORIDX_SAVED_BYTES     |                                                  |        0 MB      0 MB/sec
REAL_DISK_IO                   SPIN_DISK_IO_BYTES            |                                                  |      157 MB      2 MB/sec
REAL_DISK_IO                   SPIN_DISK_RD_BYTES            |                                                  |      157 MB      2 MB/sec
REAL_DISK_IO                   SPIN_DISK_WR_BYTES            |                                                  |        0 MB      0 MB/sec
REDUCE_INTERCONNECT            PRED_OFFLOADABLE_BYTES        |##################################################|    29291 MB    365 MB/sec
REDUCE_INTERCONNECT            TOTAL_IC_BYTES                |##########################################        |    24993 MB    312 MB/sec
REDUCE_INTERCONNECT            SMART_SCAN_RET_BYTES          |##########################################        |    24993 MB    312 MB/sec
REDUCE_INTERCONNECT            NON_SMART_SCAN_BYTES          |                                                  |        0 MB      0 MB/sec
CELL_PROC_DEPTH                CELL_PROC_DATA_BYTES          |##################################################|    29478 MB    368 MB/sec
CELL_PROC_DEPTH                CELL_PROC_INDEX_BYTES         |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_TO_CLIENT_BYTES           |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_FROM_CLIENT_BYTES         |                                                  |        0 MB      0 MB/sec

DB_PHYSRD_BYTES says this session submitted 29291 MB of read I/O, as far as the DB layer sees it. PRED_OFFLOADABLE_BYTES says that all of this 29291 MB of I/O was attempted in a smart way (smart scan). SMART_SCAN_RET_BYTES tells us that 24993 MB worth of data was sent back from the storage cells as a result of the smart scan (about 14.5% less than the amount of I/O issued).

However, PHYRD_STORIDX_SAVED_BYTES shows zero – we couldn’t avoid (skip) any I/Os when scanning the ORDERS table. We couldn’t use the storage indexes, as we did not have any direct filter predicates on the ORDERS table.

Anyway, the story is different for the fast parallel query:

SQL> SELECT * FROM TABLE(exasnap.display_snap(:t1,:t2));     

NAME
-------------------------------------------------------------------------------------------------------------------------------------------
-- ExaSnapper v0.81 BETA by Tanel Poder @ Enkitec - The Exadata Experts ( http://www.enkitec.com )              
-------------------------------------------------------------------------------------------------------------------------------------------
DB_LAYER_IO                    DB_PHYSIO_BYTES               |##################################################|    29291 MB   2212 MB/sec
DB_LAYER_IO                    DB_PHYSRD_BYTES               |##################################################|    29291 MB   2212 MB/sec
DB_LAYER_IO                    DB_PHYSWR_BYTES               |                                                  |        0 MB      0 MB/sec
AVOID_DISK_IO                  PHYRD_FLASH_RD_BYTES          |#######################                           |    13321 MB   1006 MB/sec
AVOID_DISK_IO                  PHYRD_STORIDX_SAVED_BYTES     |###########################                       |    15700 MB   1186 MB/sec
REAL_DISK_IO                   SPIN_DISK_IO_BYTES            |                                                  |      270 MB     20 MB/sec
REAL_DISK_IO                   SPIN_DISK_RD_BYTES            |                                                  |      270 MB     20 MB/sec
REAL_DISK_IO                   SPIN_DISK_WR_BYTES            |                                                  |        0 MB      0 MB/sec
REDUCE_INTERCONNECT            PRED_OFFLOADABLE_BYTES        |##################################################|    29291 MB   2212 MB/sec
REDUCE_INTERCONNECT            TOTAL_IC_BYTES                |                                                  |        4 MB      0 MB/sec
REDUCE_INTERCONNECT            SMART_SCAN_RET_BYTES          |                                                  |        4 MB      0 MB/sec
REDUCE_INTERCONNECT            NON_SMART_SCAN_BYTES          |                                                  |        0 MB      0 MB/sec
CELL_PROC_DEPTH                CELL_PROC_DATA_BYTES          |#######################                           |    13591 MB   1027 MB/sec
CELL_PROC_DEPTH                CELL_PROC_INDEX_BYTES         |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_TO_CLIENT_BYTES           |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_FROM_CLIENT_BYTES         |                                                  |        0 MB      0 MB/sec

DB_PHYSIO_BYTES is still the same (29GB) as in the previous case – this database-layer metric knows only about the amount of I/O it has requested, not what actually gets (or doesn’t get) done inside the storage cells. SMART_SCAN_RET_BYTES is only 4MB (compared to almost 25GB previously), so evidently there must be some early filtering going on in the lower layers.

And now to the main topic of this article – the PHYRD_STORIDX_SAVED_BYTES metric (which comes from the cell physical IO bytes saved by storage index statistic in V$SESSTAT) shows that we managed to avoid 15700 MB worth of I/O completely, thanks to the storage indexes. And this without having any direct SQL filter predicates on the large ORDERS table! How is this possible? Keep reading :-)

Before going deeper, let’s look into the predicate section of these cursors.

The slow serial query:

--------------------------------------------------------------------------------------
| Id  | Operation                           | Name          | E-Rows | Pstart| Pstop |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |               |        |       |       |
|*  1 |  HASH JOIN                          |               |      1 |       |       |
|   2 |   TABLE ACCESS BY GLOBAL INDEX ROWID| CUSTOMERS     |      1 | ROWID | ROWID |
|*  3 |    INDEX RANGE SCAN                 | CUST_EMAIL_IX |      1 |       |       |
|   4 |   PARTITION HASH ALL                |               |    506M|     1 |    64 |
|   5 |    TABLE ACCESS STORAGE FULL        | ORDERS        |    506M|     1 |    64 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   3 - access("C"."CUST_EMAIL"='florencio@ivtboge.com')

The fast parallel query:

-------------------------------------------------------------------------------------------
| Id  | Operation                                | Name          | E-Rows | Pstart| Pstop |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |               |        |       |       |
|   1 |  PX COORDINATOR                          |               |        |       |       |
|   2 |   PX SEND QC (RANDOM)                    | :TQ10001      |      1 |       |       |
|*  3 |    HASH JOIN                             |               |      1 |       |       |
|   4 |     BUFFER SORT                          |               |        |       |       |
|   5 |      PX RECEIVE                          |               |      1 |       |       |
|   6 |       PX SEND BROADCAST                  | :TQ10000      |      1 |       |       |
|   7 |        TABLE ACCESS BY GLOBAL INDEX ROWID| CUSTOMERS     |      1 | ROWID | ROWID |
|*  8 |         INDEX RANGE SCAN                 | CUST_EMAIL_IX |      1 |       |       |
|   9 |     PX PARTITION HASH ALL                |               |    506M|     1 |    64 |
|* 10 |      TABLE ACCESS STORAGE FULL           | ORDERS        |    506M|     1 |    64 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   8 - access("C"."CUST_EMAIL"='florencio@ivtboge.com')
  10 - storage(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))

See how the parallel plan has the storage() predicate with SYS_OP_BLOOM_FILTER in it – the Bloom filter :BF0000 will be compared to the hashed CUSTOMER_ID column values when scanning the ORDERS table in the storage cells. So, this shows that your session attempts to push the Bloom filter down to the storage layer.

However, merely seeing the Bloom filter storage() predicate in the plan doesn't tell you whether any IOs were actually avoided thanks to the storage indexes and Bloom filters. The only way to know is to run the query and look at the cell physical IO bytes saved by storage index metric (or PHYRD_STORIDX_SAVED_BYTES in Exadata Snapper). Unfortunately it will not be easy to measure this at the rowsource level when you have multiple smart scans happening in different locations of your execution plan (more about this in the next blog entry :-)
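
For example, a quick way to check this metric for your own session is below (a minimal sketch – run it before and after your query and compare the deltas):

-- cumulative storage index savings for the current session since logon
SELECT sn.name, ROUND(ms.value/1024/1024) MB
FROM   v$mystat ms, v$statname sn
WHERE  ms.statistic# = sn.statistic#
AND    sn.name = 'cell physical IO bytes saved by storage index';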

Anyway, how is Oracle able to avoid reading in blocks of one table based on the output of another table involved in the join?

The answer lies in the parameter _bloom_minmax_enabled. Its description says “enable or disable bloom min max filtering”. This parameter is TRUE by default and on Exadata it means that in addition to computing the Bloom filter bitmap (based on hashed values of the join column), Oracle also keeps track of the smallest and biggest value retrieved from the driving table of the join.
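
If you want to look up this parameter's value and description yourself, the classic X$ query below works (a sketch – it reads undocumented X$ tables, so it has to be run as SYS):

-- show the hidden parameter's current value and description (run as SYS)
SELECT a.ksppinm name, b.ksppstvl value, a.ksppdesc description
FROM   x$ksppi a, x$ksppsv b
WHERE  a.indx = b.indx
AND    a.ksppinm = '_bloom_minmax_enabled';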

In our example, both the Bloom filter bitmap of all matching CUSTOMER_IDs retrieved from the CUSTOMERS table (after any direct predicate filtering) and the smallest and biggest of those CUSTOMER_IDs would be sent to the storage cells.

In our hand-crafted example, where only a single customer was taken from the driving table, only one bit in the Bloom filter would be set and both the MIN and MAX CUSTOMER_ID of interest (in the joined ORDERS table) were 399999199 (I had about 400 million customers generated in my benchmark dataset).

So, knowing that we were looking only for CUSTOMER_IDs in the single-value “range” of 399999199, the smart scan could start skipping IOs thanks to the storage indexes in memory! You can think of this as runtime BETWEEN predicate generation from the driving table of a join, which then gets offloaded to the storage cells and applied when scanning the other table of the join. IOs can be avoided thanks to knowing the range of values we are looking for, and further row-level filtering can then be done using the usual Bloom filter bitmap comparison.
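
Conceptually – and this is just an illustration, not what the cursor actually contains (schema, table and column names below are illustrative) – the IO-skipping part of the offloaded scan behaved as if a runtime range predicate like this had been added to the ORDERS scan, with the Bloom filter bitmap then doing the finer row-level filtering:

-- illustrative only: the MIN/MAX values derived from the driving table
-- act like a runtime-generated range predicate on the probed table
SELECT SUM(o.order_total)
FROM   soe.orders o
WHERE  o.customer_id BETWEEN 399999199 AND 399999199;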

 

Note that the above examples are from Oracle 11.2.0.3 – in this version the Bloom filtering features should kick in only for parallel queries (although Jonathan Lewis has blogged about a case where this happened for a serial query on 11.2.0.3 too).

And this is the ultimate reason why the serial execution was so much slower and less efficient than the parallel run – in my demos, on 11.2.0.3 the Bloom filter usage kicked in only when running the query in parallel.

However, this has changed in Oracle 11.2.0.4 – now serial queries also frequently compute Bloom filters and push them to the storage cells, both for the usual Bloom filtering reasons and for IO skipping by comparing the Bloom filter min/max values against the Exadata storage index memory structures.

This is an example from Oracle 11.2.0.4, the plan is slightly different because in this database my test tables are not partitioned (and I didn’t have any indexes on the tables):

------------------------------------------------------------------
| Id  | Operation                   | Name      | E-Rows |E-Bytes|
------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |        |       |
|*  1 |  HASH JOIN                  |           |      1 |   114 |
|   2 |   JOIN FILTER CREATE        | :BF0000   |      1 |    64 |
|*  3 |    TABLE ACCESS STORAGE FULL| CUSTOMERS |      1 |    64 |
|   4 |   JOIN FILTER USE           | :BF0000   |   4581K|   218M|
|*  5 |    TABLE ACCESS STORAGE FULL| ORDERS    |   4581K|   218M|
------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   3 - storage("C"."CUST_EMAIL"='florencio@ivtboge.com')
       filter("C"."CUST_EMAIL"='florencio@ivtboge.com')
   5 - storage(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))

The Bloom bitmap (and the MIN/MAX values) would be computed in step #2 and then used by the full table scan in step #5 – with the filtering info pushed all the way down to the storage layer (assuming that the smart scan did kick in).

If you look into the Outline hints section of this plan, you'll see that a PX_JOIN_FILTER hint has shown up there. It instructs the optimizer to set up and use the Bloom filters (and despite the name, it can now be used for serial queries too):

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('11.2.0.4')
      DB_VERSION('11.2.0.4')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "C"@"SEL$1")
      FULL(@"SEL$1" "O"@"SEL$1")
      LEADING(@"SEL$1" "C"@"SEL$1" "O"@"SEL$1")
      USE_HASH(@"SEL$1" "O"@"SEL$1")
      PX_JOIN_FILTER(@"SEL$1" "O"@"SEL$1")
      END_OUTLINE_DATA
  */

So if the CBO doesn’t choose it automatically, you can use this hint for testing whether your queries could potentially benefit from this feature (note that it controls the whole join filter propagation logic, not only the Exadata stuff). You can also use the _bloom_serial_filter parameter to disable this behavior on Exadata (not that you should).
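
For example, something along these lines would request a join filter on the probe side of the hash join (a sketch reusing this article's query shape – the schema and column names are illustrative):

-- request Bloom (join) filter creation for the probe table of the hash join
SELECT /*+ LEADING(c) USE_HASH(o) PX_JOIN_FILTER(o) */
       SUM(o.order_total)
FROM   soe.customers c, soe.orders o
WHERE  o.customer_id = c.customer_id
AND    c.cust_email  = 'florencio@ivtboge.com';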

Note that I usually treat the storage index IO savings as just an added performance benefit at the system level – something that keeps the disk devices a little less busy and therefore yields higher overall throughput. I do not make my applications' performance overly dependent on storage indexes, as it's not possible to easily control which columns end up in the storage index memory structures, and that may give you unpredictable results.
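
If you want to measure how dependent a query is on the storage index savings, one way is to temporarily disable them for your session with the undocumented parameter below (a sketch – undocumented parameters should only be used on test systems):

-- undocumented: disable storage index usage for smart scans in this session
ALTER SESSION SET "_kcfis_storageidx_disabled" = TRUE;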

That’s all – hopefully this shed some light on this cool combination of multiple different features (Smart Scans + Hash Joins + Bloom Filters + Storage Indexes).

 

Our take on the Oracle Database 12c In-Memory Option


Enkitec folks have been beta testing the Oracle Database 12c In-Memory Option over the past months and recently the Oracle guys interviewed Kerry Osborne, Cary Millsap and me to get our opinions. In short, this thing rocks!

We can’t talk much about the technical details before Oracle 12.1.0.2 is officially out in July, but here’s the recorded interview that got published on the Oracle website as part of the In-Memory launch today:

Alternatively, go to the Oracle webpage and scroll down to the Overview section that says: Video: Database Industry Experts Discuss Oracle Database In-Memory (11:10).

I might actually be even more excited about the In-Memory Option than I was about Exadata years ago. The In-Memory Option is not just a performance feature, it’s a simplifying feature too. So, now it’s ok to kill your performance problem with hardware, as long as you use it in a smart way :-)


About index range scans, disk re-reads and how your new car can go 600 miles per hour!


Despite the title, this is actually a technical post about Oracle, disk I/O and Exadata & Oracle In-Memory Database Option performance. Read on :)

If a car dealer tells you that this fancy new car on display goes 10 times (or 100 or 1000) faster than any of your previous ones, then either the salesman is lying or this new car is doing something radically different from all the old ones. You don’t just get orders of magnitude performance improvements by making small changes.

Perhaps the car bends space around it instead of moving – or perhaps it has a jet engine built on it (like the one below :-)

Anyway, this blog entry is a prelude to my upcoming Oracle In-Memory Database Option series and here I’ll explain one of the radical differences between the old way of thinking and modern (In-Memory / Smart Scan) thinking that allow such performance improvements.

To set the scope and clarify what I mean by the “old way of thinking”: I am talking about reporting, analytics and batch workloads here – and the decades-old mantra “if you want more speed, use more indexes”.

I’m actually not going to talk about the In-Memory DB option here – but I am going to walk you through the performance numbers of one index range scan. It’s a deliberately simple and synthetic example executed on my laptop, but it should be enough to demonstrate one important point.

Let’s say we have a report that requires me to visit 20% of rows in an orders table and I’m using an index range scan to retrieve these rows (let’s not discuss whether that’s wise or not just yet). First, I’ll give you some background information about the table and index involved.

My test server’s buffer cache is currently about 650 MB:

SQL> show sga

Total System Global Area 2147483648 bytes
Fixed Size                  2926472 bytes
Variable Size             369100920 bytes
Database Buffers          687865856 bytes
Redo Buffers               13848576 bytes
In-Memory Area           1073741824 bytes

The table I am accessing is a bit less than 800 MB in size, about 100k blocks:

SQL> @seg soe.orders

    SEG_MB OWNER  SEGMENT_NAME   SEGMENT_TYPE    BLOCKS 
---------- ------ -------------  ------------- -------- 
       793 SOE    ORDERS         TABLE           101504 

I have removed some irrelevant output from the output below, I will be using the ORD_WAREHOUSE_IX index for my demo:

SQL> @ind soe.orders
Display indexes where table or index name matches %soe.orders%...

TABLE_OWNER  TABLE_NAME  INDEX_NAME         POS# COLUMN_NAME     DSC
------------ ----------- ------------------ ---- --------------- ----
SOE          ORDERS      ORDER_PK              1 ORDER_ID
                         ORD_WAREHOUSE_IX      1 WAREHOUSE_ID
                                               2 ORDER_STATUS

INDEX_OWNER  TABLE_NAME  INDEX_NAME        IDXTYPE    UNIQ STATUS   PART TEMP  H  LFBLKS       NDK   NUM_ROWS      CLUF LAST_ANALYZED     DEGREE VISIBILIT
------------ ----------- ----------------- ---------- ---- -------- ---- ---- -- ------- --------- ---------- --------- ----------------- ------ ---------
SOE          ORDERS      ORDER_PK          NORMAL/REV YES  VALID    NO   N     3   15801   7148950    7148950   7148948 20140913 16:17:29 16     VISIBLE
             ORDERS      ORD_WAREHOUSE_IX  NORMAL     NO   VALID    NO   N     3   17860      8685    7148950   7082149 20140913 16:18:03 16     VISIBLE

I am going to do an index range scan on the WAREHOUSE_ID column:

SQL> @descxx soe.orders

Col# Column Name                    Null?      Type                      NUM_DISTINCT        Density  NUM_NULLS HISTOGRAM       NUM_BUCKETS Low Value                        High Value
---- ------------------------------ ---------- ------------------------- ------------ -------------- ---------- --------------- ----------- -------------------------------- --------------------------------
   1 ORDER_ID                       NOT NULL   NUMBER(12,0)                   7148950   .00000013988          0                           1 1                                7148950
...
   9 WAREHOUSE_ID                              NUMBER(6,0)                        999   .00100100100          0                           1 1                                999
...

Also, I enabled SQL trace and event 10298 – “ORA-10298: ksfd i/o tracing”, more about that later:

SQL> ALTER SESSION SET EVENTS '10298 trace name context forever, level 1';

Session altered.

SQL> EXEC SYS.DBMS_MONITOR.SESSION_TRACE_ENABLE(waits=>TRUE);

PL/SQL procedure successfully completed.

SQL> SET AUTOTRACE ON STAT

Ok, now we are ready to run the query! (It’s slightly formatted):

SQL> SELECT /*+ MONITOR INDEX(o, o(warehouse_id)) */ 
         SUM(order_total) 
     FROM 
         soe.orders o 
     WHERE 
         warehouse_id BETWEEN 400 AND 599;

Let’s check the basic autotrace figures:

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
    1423335  consistent gets
     351950  physical reads
          0  redo size
        347  bytes sent via SQL*Net to client
        357  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

What?! We have done 351950 physical reads?! That’s 351950 blocks read via physical read operations – about 2.7 GB worth of IO done just for this query! Our entire table size was under 800 MB and the index size under 150 MB. Shouldn’t indexes allow us to visit fewer blocks than the table size?!

Let’s dig deeper – by breaking down this IO number by execution plan line (using a SQL Monitoring report in this case):

Global Stats
================================================================
| Elapsed |   Cpu   |    IO    | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
================================================================
|      48 |      25 |       23 |     1 |     1M | 352K |   3GB |
================================================================

SQL Plan Monitoring Details (Plan Hash Value=16715356)
=============================================================================================================================================
| Id |               Operation                |       Name       | Execs |   Rows   | Read | Read  | Activity |       Activity Detail       |
|    |                                        |                  |       | (Actual) | Reqs | Bytes |   (%)    |         (# samples)         |
=============================================================================================================================================
|  0 | SELECT STATEMENT                       |                  |     1 |        1 |      |       |          |                             |
|  1 |   SORT AGGREGATE                       |                  |     1 |        1 |      |       |          |                             |
|  2 |    TABLE ACCESS BY INDEX ROWID BATCHED | ORDERS           |     1 |       1M | 348K |   3GB |    96.30 | Cpu (1)                     |
|    |                                        |                  |       |          |      |       |          | db file parallel read (25)  |
|  3 |     INDEX RANGE SCAN                   | ORD_WAREHOUSE_IX |     1 |       1M | 3600 |  28MB |     3.70 | db file sequential read (1) |
=============================================================================================================================================

So, most of these IOs come from accessing the table (after fetching the relevant ROWIDs from the index). 96% of this query’s response time was also spent in that table access line. We have done ~348,000 IO requests for fetching blocks from this table – that’s over 3x more blocks than the entire table size! So we must be re-reading some blocks from disk again and again for some reason.

Let’s confirm whether we really are getting re-reads – this is why I enabled SQL trace and event 10298. I can just post-process the tracefile and see whether IO operations with the same file# and block# combination show up.

However, using just SQL trace isn’t enough: multiblock read wait events don’t list all the blocks read (you’d have to infer them from the starting block# and count#), and the “db file parallel read” event doesn’t show any block#/file# info at all in SQL trace (as this “vector read” wait event encompasses multiple different block reads under a single wait event).

The classic single block read has the file#/block# info:

WAIT #139789045903344: nam='db file sequential read' ela= 448 file#=2 block#=1182073 blocks=1 obj#=93732 tim=156953721029

The parallel read wait events don’t have individual file#/block# info (just total number of files/blocks involved):

WAIT #139789045903344: nam='db file parallel read' ela= 7558 files=1 blocks=127 requests=127 obj#=93696 tim=156953729450

Anyway, because we had plenty of db file parallel read waits that don’t show all the detail in SQL trace, I also enabled event 10298, which gives us the following details (only a tiny excerpt below):

...
ksfd_osdrqfil:fob=0xce726160 bufp=0xbd2be000 blkno=1119019 nbyt=8192 flags=0x4
ksfdbio:rq=0x7f232c4edb00 fob=0xce726160 aiopend=126
ksfd_osdrqfil:fob=0xce726160 bufp=0x9e61a000 blkno=1120039 nbyt=8192 flags=0x4
ksfdbio:rq=0x7f232c4edd80 fob=0xce726160 aiopend=127
ksfdwtio:count=127 aioflags=0x500 timeout=2147483647 posted=(nil)
...
ksfdchkio:ksfdrq=0x7f232c4edb00 completed=1
ksfdchkio:ksfdrq=0x7f232c4edd80 completed=0
WAIT #139789045903344: nam='db file parallel read' ela= 6872 files=1 blocks=127 requests=127 obj#=93696 tim=156953739197

So, on Oracle 12.1.0.2 on Linux x86_64, with an xfs filesystem, async IO enabled and filesystemio_options = SETALL, we get “ksfd_osdrqfil” trace entries that show the block# Oracle read from a datafile. They don’t show the file# itself, but they do show the accessed file state object address (FOB) in the SGA – and as it was always the same in the tracefile, I knew that duplicate block numbers listed in the trace were for the same datafile (and not for blocks with the same block# in some other datafile). The tablespace I used for my test had a single datafile anyway.

Anyway, I wrote a simple script to summarize whether there were any disk re-reads in this tracefile (of a select statement):

$ grep ^ksfd_osdrqfil LIN121_ora_11406.trc | awk '{ print $3 }' | sort | uniq -c | sort -nr | head -20
     10 blkno=348827
     10 blkno=317708
      9 blkno=90493
      9 blkno=90476
      9 blkno=85171
      9 blkno=82023
      9 blkno=81014
      9 blkno=80954
      9 blkno=74703
      9 blkno=65222
      9 blkno=63899
      9 blkno=62977
      9 blkno=62488
      9 blkno=59663
      9 blkno=557215
      9 blkno=556581
      9 blkno=555412
      9 blkno=555357
      9 blkno=554070
      9 blkno=551593
...

Indeed! The “worst” blocks have been read in 10 times – all that for a single query execution.

I only showed the top 20 blocks here, but even when I used “head -10000” and “head -50000” above, I still saw blocks that had been read into the buffer cache 8 and 4 times respectively.

Looking into earlier autotrace metrics, my simple index range scan query did read in over 3x more blocks than the total table and index size combined (~350k blocks read while the table had only 100k blocks)! Some blocks have gotten kicked out from buffer cache and have been re-read back into cache later, multiple times.

Hmm, let’s think further: We are accessing only about 20% of a 800 MB table + 150 MB index, so the “working set” of datablocks used by my query should be well less than my 650 MB buffer cache, right? And as I am the only user in this database, everything should nicely fit and stay in buffer cache, right?

Actually, both of the arguments above are flawed:

  1. Accessing 20% of the rows in a table doesn’t automatically mean that we need to visit only 20% of that table’s blocks! Maybe every one of the table’s blocks contains a few of the rows this index range scan needs? In that case we might need to visit all of the table’s blocks (or most of them) and extract only a few matching rows from each block. Nevertheless, the “working set” of required blocks for this query would include almost all of the table’s blocks, not only 20% – we must read all of them in at some point of the range scan. So, the matching rows in the table blocks are not tightly packed and physically in correspondence with the index range scan’s table access driving order, but are potentially “randomly” scattered all over the table. This means that an index range scan may come back and access some data block again and again to get yet another row from it, whenever the ROWID entries in the index leaf blocks point there. This is what I call buffer re-visits. (Now scroll back up and check that index’s clustering factor :-)

  2. So what – shouldn’t all the buffer re-visits be really fast, as the previously read block is going to be in the buffer cache? Well, not really. Especially when the working set of blocks read is bigger than the buffer cache. But even if it is smaller, the Oracle buffer cache isn’t managed using basic LRU replacement logic (since 8.1.6). New blocks that get read into the buffer cache are put into the middle of the “LRU” list and they work their way up to the “hot” end only if they are touched enough times before someone manages to flush them out. So even if you are the single user of the buffer cache, there’s a chance that some recently read blocks get aged out of the buffer cache – by the same query still running – before they get hot enough. And this means that your next buffer re-visit may turn into one of the disk block re-reads that we saw in the tracefiles. If you combine this with the reality of production systems, where a thousand more users are trying to do what you’re doing at the same time, it becomes clear that you’ll be able to use only a small portion of the total buffer cache for your needs. This is why people sometimes configure KEEP pools (a minimal sketch follows below) – not because the KEEP pool is somehow able to keep more blocks in memory for longer per GB of RAM, but simply for segregating the less important troublemakers from the more important… troublemakers :)
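
For reference, here is roughly what that KEEP pool segregation looks like (a minimal sketch – the pool size and segment name are made up):

-- carve out a separate KEEP buffer pool...
ALTER SYSTEM SET db_keep_cache_size = 2G;

-- ...and direct the chosen segment's blocks into it
ALTER TABLE soe.orders STORAGE (BUFFER_POOL KEEP);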

 

So what’s my point here – in the context of this blog post’s title?

Let’s start from Exadata – over the last years it has given many customers order(s) of magnitude better analytics, reporting and batch performance compared to their old systems – if done right, of course. In other words, instead of indexing even more and performing wide index range scans with millions of random block reads and re-reads, they ditched many indexes and started doing full table scans. Full table scans do not have the “scaling problems” of a wide index range scan (or a “wide” nested loop join driving access to another table). In addition, you get all the cool stuff that goes really well with full scans – multiblock reads, deep prefetching, partition-wise hash joins, partition pruning and, of course, all the throughput and Smart Scan magic on Exadata.

An untuned complex SQL statement on a complex schema with lots of non-ideal indexes may end up causing a lot of “waste IO” (I don’t have a better term) – and wasted CPU, similarly. And often it’s not simple to actually fix the query, as it may end up needing a significant schema adjustment/redesign that would also require changing the application code in many different places (ain’t gonna happen). By defaulting reporting to full table scans, you can actually eliminate a lot of such waste, assuming that you have a high-throughput – and ideally smart – IO subsystem. (Yes, there are always exceptions and special cases.)

We had a customer who had a reporting job that ran almost 2000x faster after moving to Exadata (from 17 hours to 30 seconds or something like that). Their first reaction was: “It didn’t run!” Indeed it did run and it ran correctly. Such radical improvement came from the fact that the new system – compared to the old system – was doing multiple things radically better. It wasn’t just an incremental tweak of adding a hint or a yet another index without daring to do more significant changes.

In this post I demoed just one of the problems plaguing many of the old-school Oracle DW and reporting systems. While favoring full table scans had always been counterintuitive for most Oracle shops out there, it was Exadata’s hardware, software and also the geek excitement surrounding it that allowed customers to take the leap and switch from the old mindset to the new. I expect the same from the Oracle In-Memory Database Option. More about this in a following post.

 

My presentations at OOW 2014 (See you there!)


Here’s where I will hang out (and in some cases speak) during the OOW:

Sunday, Sep 28 3:30pm – Moscone South – 310

Monday, Sep 29 8:30am – 4:00pm – Creativity Museum

  • I will mostly hang out at the OakTableWorld satellite event and listen to the awesome talks there.

Tuesday, Sep 30 10:00am – Creativity Museum

  • I will speak about Hacking Oracle 12c for an hour at OakTableWorld (random stuff about the first things I researched when Oracle 12c was released)
  • I also plan to hang out there for most of the day, so see you there!

Wednesday, Oct 1 – 3:00pm – Jillian’s

  • I’ll be at Enkitec’s “office” (read: we’ll have beer) in Jillian’s (on 4th St between Mission/Howard) from 3pm onwards on Wednesday, so come by for a chat.
  • Right after Enkitec’s office hours I’ll head to the adjacent room for the OTN Bloggers meetup and this probably means more beer & chat.

Thursday, Oct 2 – 10:45am – Moscone South – 104

  • Oracle In-Memory Database In Action
  • In this presentation Kerry and I will walk you through the performance differences when switching from an old DW/reporting system (on a crappy I/O subsystem) all the way to having your data cached in Oracle’s In-Memory Column Store – with all of Oracle 12.1.0.2’s performance bells and whistles enabled. It will be awesome – see you there! ;-)

 

Oracle In-Memory Column Store Internals – Part 1 – Which SIMD extensions are getting used?


This is the first entry in a series of random articles about some useful internals-to-know of the awesome Oracle Database In-Memory column store. I intend to write about Oracle’s IM stuff that’s not already covered somewhere else and also about some general CPU topics (that are well covered elsewhere, but not always so well known in the Oracle DBA/developer world).

Before going into further details, you might want to review the Part 0 of this series and also our recent Oracle Database In-Memory Option in Action presentation with some examples. And then read this doc by Intel if you want more info on how the SIMD registers and instructions get used.

There’s a lot of talk about the use of your CPUs’ SIMD vector processing capabilities in the Oracle in-memory module, so let’s start by checking whether it’s used in your database at all. We’ll look at Linux/Intel examples here.

The first generation of SIMD extensions in the Intel Pentium world was called MMX. It added 8 new MMn registers, 64 bits each. Over time the registers got widened, more registers were added and new features introduced. The extensions were called Streaming SIMD Extensions (SSE, SSE2, SSSE3, SSE4.1, SSE4.2) and Advanced Vector Extensions (AVX and AVX2).

The currently available AVX2 extensions provide 16 x 256-bit YMMn registers and AVX-512 in the upcoming Knights Landing microarchitecture (year 2015) will provide 32 x 512-bit ZMMn registers for vector processing.

So how do you check which extensions your CPU supports? On Linux, the “flags” field in /proc/cpuinfo easily provides this info.

Let’s check the Exadatas in our research lab:

Exadata V2:

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

So the highest SIMD extension support on this Exadata V2 is SSE4.2 (No AVX!)

Exadata X2:

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

Exadata X2 also has SSE4.2 but no AVX.

Exadata X3:

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
avx
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

The Exadata X3 supports the newer AVX too.

My laptop (Macbook Pro late 2013):
The Exadata X4 has not yet arrived at our lab, so I’m using my laptop as an example of the latest available CPU with AVX2:

Update: Jason Arneil commented that the X4 does not have AVX2 capable CPUs (but the X5 will)

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
avx
avx2
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

The Core-i7 generation supports everything up to the current AVX2 extension set.

So, which extensions is Oracle actually using? Let’s check!

As Oracle needs to run different binary code on CPUs with different capabilities, some of the In-Memory Data layer (kdm) code has been duplicated into separate external libraries – these get dynamically loaded into the Oracle executable’s address space as needed. You can run pmap on one of your Oracle server processes and grep for libshpk:

$ pmap 21401 | grep libshpk
00007f0368594000   1604K r-x--  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpksse4212.so
00007f0368725000   2044K -----  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpksse4212.so
00007f0368924000     72K rw---  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpksse4212.so

My (educated) guess is that the “shpk” in libshpk above stands for oS dependent High Performance [K]ompression. The “s” prefix normally means platform-dependent (OSD) code – and this low-level SIMD code sure is platform and CPU microarchitecture dependent stuff.

Anyway, the above output from an Exadata X2 shows that SSE4.2 SIMD HPK libraries are used on this platform (and indeed, X2 CPUs do support SSE4.2, but not AVX).

Let’s list similar files from $ORACLE_HOME/lib:

$ cd $ORACLE_HOME/lib
$ ls -l libshpk*.so
-rw-r--r-- 1 oracle oinstall 1818445 Jul  7 04:16 libshpkavx12.so
-rw-r--r-- 1 oracle oinstall    8813 Jul  7 04:16 libshpkavx212.so
-rw-r--r-- 1 oracle oinstall 1863576 Jul  7 04:16 libshpksse4212.so

So, there are libraries for AVX and AVX2 in the lib directory too (the “12” suffix for all file names just means Oracle version 12). The AVX2 library is almost empty though (and the nm/objdump commands don’t show any Oracle functions in it, unlike in the other files).

Let’s run pmap on a process in my new laptop (which supports AVX and AVX2 ) to see if the AVX2 library gets used:

$ pmap 18969 | grep libshpk     
00007f85741b1000   1560K r-x-- libshpkavx12.so
00007f8574337000   2044K ----- libshpkavx12.so
00007f8574536000     72K rw--- libshpkavx12.so

Despite my new laptop supporting AVX2, only the AVX library is used (the AVX2 library is named libshpkavx212.so). So it looks like the AVX2 extensions are not used yet in this version (it’s the first Oracle 12.1.0.2 GA release without any patches). I’m sure this will be added soon, along with more features and bugfixes.

To be continued …

Public Appearances 2015


Here’s where I’ll hang out in the following months:

11-12 Feb 2015: IOUG Exadata SIG Virtual Conference (free online event)

  • Presentation: Exadata Performance: Latest Improvements and Less Known Features
  • It’s a free online event, so sign up here

18-19 Feb 2015: RMOUG Training Days (in Denver)

  • I won’t speak there this year, but plan to hang out on Wednesday evening and drink beer
  • More info here

1-5 March 2015: Hotsos Symposium 2015

31 May – 2 June 2015: Enkitec E4

  • Even more awesome Exadata (and now also Hadoop) content there!
  • I plan to speak there again, about Exadata performance and/or integrating Oracle databases with Hadoop
  • More info here

Advanced Oracle Troubleshooting v3.0 training

  • One of the reasons why I’ve been so quiet in recent months is that I’ve been rebuilding my entire Advanced Oracle Troubleshooting training material from the ground up.
  • This new seminar focuses on systematic Oracle troubleshooting and internals of database versions all the way to Oracle 12c.
  • I will launch the AOT seminar v3.0 in early March – you can already register your interest here!

 

Oracle Exadata Performance: Latest Improvements and Less Known Features


Here are the slides of a presentation I did at the IOUG Virtual Exadata conference in February. I’m explaining the basics of some new Oracle 12c things related to Exadata, plus the latest cellsrv improvements like Columnar Flash Cache and IO skipping for Min/Max retrieval using Storage Indexes:

Note that Christian Antognini and Roger MacNicol have written separate articles about some new features:

Enjoy!

 

Sqlplus is my second home, part 8: Embedding multiple sqlplus arguments into one variable


I’ve updated some of my ASH scripts to use these 4 arguments in a standard way:

  1. What ASH columns to display (and aggregate by)
  2. Which ASH rows to use for the report (filter)
  3. Time range start
  4. Time range end

So this means whenever I run ashtop (or dashtop) for example, I need to type in all 4 parameters. The example below would show top SQL_IDs only for user SOE sessions from last hour of ASH samples:

SQL> @ashtop sql_id username='SOE' sysdate-1/24 sysdate

    Total
  Seconds     AAS %This   SQL_ID        FIRST_SEEN          LAST_SEEN           DIST_SQLEXEC_SEEN
--------- ------- ------- ------------- ------------------- ------------------- -----------------
     2271      .6   21% | 56pwkjspvmg3h 2015-03-29 13:13:16 2015-03-29 13:43:34               145
     2045      .6   19% | gkxxkghxubh1a 2015-03-29 13:13:16 2015-03-29 13:43:14               149
     1224      .3   11% | 29qp10usqkqh0 2015-03-29 13:13:25 2015-03-29 13:43:32               132
      959      .3    9% | c13sma6rkr27c 2015-03-29 13:13:19 2015-03-29 13:43:34               958
      758      .2    7% |               2015-03-29 13:13:16 2015-03-29 13:43:31                 1

When I want more control and specify a fixed time range, I can just use the ANSI TIMESTAMP (or TO_DATE) syntax:

SQL> @ashtop sql_id username='SOE' "TIMESTAMP'2015-03-29 13:00:00'" "TIMESTAMP'2015-03-29 13:15:00'"

    Total
  Seconds     AAS %This   SQL_ID        FIRST_SEEN          LAST_SEEN           DIST_SQLEXEC_SEEN
--------- ------- ------- ------------- ------------------- ------------------- -----------------
      153      .2   22% | 56pwkjspvmg3h 2015-03-29 13:13:29 2015-03-29 13:14:59                 9
      132      .1   19% | gkxxkghxubh1a 2015-03-29 13:13:29 2015-03-29 13:14:59                 8
       95      .1   14% | 29qp10usqkqh0 2015-03-29 13:13:29 2015-03-29 13:14:52                 7
       69      .1   10% | c13sma6rkr27c 2015-03-29 13:13:31 2015-03-29 13:14:58                69
       41      .0    6% |               2015-03-29 13:13:34 2015-03-29 13:14:59                 1

Note that the arguments 3 & 4 above are in double quotes as there’s a space within the timestamp value. Without the double quotes, sqlplus would think the script had a total of 6 arguments due to the spaces.

I don’t like to type too much though (every character counts!), so I was happy to see that the following sqlplus hack works. I just defined pairs of arguments as sqlplus DEFINE variables, as seen below (also in init.sql now):

  -- geeky shortcuts for producing date ranges for various ASH scripts
  define      min="sysdate-1/24/60 sysdate"
  define   minute="sysdate-1/24/60 sysdate"
  define     5min="sysdate-1/24/12 sysdate"
  define     hour="sysdate-1/24 sysdate"
  define   2hours="sysdate-1/12 sysdate"
  define  24hours="sysdate-1 sysdate"
  define      day="sysdate-1 sysdate"
  define    today="TRUNC(sysdate) sysdate"

And now I can type just 3 arguments instead of 4 when I run some of my scripts and want some predefined behavior like seeing last 5 minutes’ activity:

SQL> @ashtop sql_id username='SOE' &5min

    Total
  Seconds     AAS %This   SQL_ID        FIRST_SEEN          LAST_SEEN           DIST_SQLEXEC_SEEN
--------- ------- ------- ------------- ------------------- ------------------- -----------------
      368     1.2   23% | gkxxkghxubh1a 2015-03-29 13:39:34 2015-03-29 13:44:33                37
      241      .8   15% | 56pwkjspvmg3h 2015-03-29 13:40:05 2015-03-29 13:44:33                25
      185      .6   12% | 29qp10usqkqh0 2015-03-29 13:39:40 2015-03-29 13:44:33                24
      129      .4    8% | c13sma6rkr27c 2015-03-29 13:39:35 2015-03-29 13:44:32               129
      107      .4    7% |               2015-03-29 13:39:34 2015-03-29 13:44:33                 1

That’s it, I hope this hack helps :-)

By the way – if you’re a command line & sqlplus fan, check out the SQLCL command line “new sqlplus” tool from the SQL Developer team! (you can download it from the SQL Dev early adopter page for now).

 

Advanced Oracle Troubleshooting Guide – Part 12: control file reads causing enq: SQ – contention waits?


Vishal Desai systematically troubleshot an interesting case where the initial symptoms showed a spike of enq: SQ – contention waits, but he dug deeper – and found the root cause to be quite different. He followed the blockers of the waiting sessions manually to reach the root cause – and also used my @ash/ash_wait_chains.sql and @ash/event_hist.sql scripts to extract the same information more conveniently (note that he had modified the scripts to take AWR snap_ids as time range parameters instead of the usual date/timestamp):

Definitely worth a read if you’re into troubleshooting non-trivial performance problems :)


Old ventures and new adventures


I have some news, two items actually.

First, today (it’s still 18th June in California) is my blog’s 8th anniversary!

I wrote my first blog post, about Advanced Oracle Troubleshooting, exactly 8 years ago, on 18th June 2007 and have written 229 blog posts since. I had started writing and accumulating my TPT script collection a couple of years earlier and now it has over 1000 files in it! And no, I don’t remember what all of them do and even why I had written them. Also I haven’t yet created an index/documentation for all of them (maybe on the 10th anniversary? ;)

Thanks everyone for your support, reading, commenting and the ideas we’ve exchanged over all these years, it’s been awesome to learn something new every single day!

You may have noticed that I haven’t been too active in online forums nor blogging much in the last couple of years, which brings me to the second news item(s):

I’ve been heavily focusing on Hadoop. It is the future. It will win, for the same reasons Linux won. I moved to the US over a year ago and am currently in San Francisco, where the big data hype is the biggest. Except it’s not hype anymore – and Hadoop is getting enterprise-ready.

I am working on a new startup. I am the CEO who still occasionally troubleshoots stuff (must learn something new every day!). We officially incorporated some months ago, but our first developers in Dallas and London have been busy in the background for over a year. By now we are beta testing with our most progressive customers ;-) We are going to be close partners with old and new friends in modern data management space and especially the awesome folks in Accenture Enkitec Group.

The name is Gluent. We glue together the old and new worlds in enterprise IT. Relational databases vs. Hadoop. Legacy ETL vs. Spark. SAN storage vs. the cloud. Jungles of data feeds vs. a data lake. I’m not going to tell you any more as we are still in stealth mode ;-)

Now, where does this leave Oracle technology? Well, I think it still kicks ass and it ain’t going away! In fact we are betting on it. Hadoop is here to stay, but your existing systems aren’t going away any time soon.

I wouldn’t want to run my critical ERP or complex transactional systems on anything other than Oracle. Want real time in-memory reporting on your existing Oracle OLTP system – with immediate consistency, not a multi-second lag: Oracle. Oracle is the king of complex OLTP and I don’t see it changing soon.

So, thanks for reading all the way to the end – and expect to hear much more about Gluent in the future! You can follow @GluentInc Twitter handle to be the first to hear any further news :-)

 

The Hybrid World is Coming


Here’s the video of E4 keynote we delivered together with Kerry Osborne a few weeks ago.

It explains what we see is coming, at a high level, from long time Oracle database professionals’ viewpoint and using database terminology (as the E4 audience is all Oracle users like us).

However, this change is not really about Oracle database world, it’s about a much wider shift in enterprise computing: modern Hadoop data lakes and clouds are here to stay. They are already taking over many workloads traditionally executed on in-house RDBMS systems on SAN storage arrays – especially all kinds of reporting and analytics. Oracle is just one of the many vendors affected by all this and they’ve also jumped onto the Hadoop bandwagon.

However, it would be naive to think that Hadoop would somehow replace all your transactional or ERP systems, or existing application code with thousands of complex SQL reports. Many of the traditional systems aren’t going away any time soon.

But the hybrid world is coming. It’s been a very good idea for Oracle DBAs to additionally learn Linux over the last 5-10 years, now is pretty much the right time to start learning Hadoop too. More about this in a future article ;-)

Check out the keynote video here:

Enjoy :-)

RAM is the new disk – and how to measure its performance – Part 1 – Introduction


RAM is the new disk, at least in the In-Memory computing world.

No, I am not talking about Flash here, but Random Access Memory – RAM as in SDRAM. I’m far from the first one to say it – Jim Gray wrote this back in 2006: “Tape is dead, disk is tape, flash is disk, RAM locality is king” (presentation)

Also, I’m not going to talk about how RAM is faster than disk (everybody knows that), but in fact how RAM is the slow component of an in-memory processing engine.

I will use Oracle’s In-Memory column store and the hardware performance counters in modern CPUs for drilling down into the low-level hardware performance metrics about CPU efficiency and memory access.

But let’s first get started by looking a few years into past into the old-school disk IO and index based SQL performance bottlenecks :)

Have you ever optimized a SQL statement by adding all the columns it needs into a single index and then letting the database do a fast full scan on the “skinny” index as opposed to a full table scan on the “fat” table? The entire purpose of this optimization was to reduce disk IO and SAN interconnect traffic for your critical query (where the amount of data read would have made index range scans inefficient).

This special-purpose approach would have benefitted your full scan in two ways:

  1. In data warehouses, a fact table may contain hundreds of columns, so an index with “only” 10 columns would be much smaller. Full “table” scanning the entire skinny index would still generate much less IO traffic than the table scan, so it became a viable alternative to wide index range scans and some full table scans (and bitmap indexes with star transformations indeed benefitted from the “skinniness” of these indexes too).
  2. As the 10-column index segment is potentially 50x smaller than the 500-column fact table, it might even fit entirely into buffer cache, should you decide so.

This is all thanks to physically changing the on-disk data structure, to store a copy of only the data I need in one place (column pre-projection?) and store these elements close to each other (locality).

Note that I am not advocating this as a tuning technique here – I’m just explaining what was sometimes used to make a handful of critical queries fast, at the expense of the disk space, DML, redo and buffer cache usage overhead of having yet another index – and why it worked.
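
To make this concrete, the old-school technique looked roughly like the sketch below (hypothetical table, index and column names; the INDEX_FFS hint requests a fast full scan of the index):

-- a "skinny" covering index holding only the columns the report needs
CREATE INDEX sales_report_ix ON sales_fact (order_date, cust_id, amount_sold);

-- the query can now be answered by fast full scanning the small index
-- instead of full scanning the wide fact table
SELECT /*+ INDEX_FFS(f sales_report_ix) */ cust_id, SUM(amount_sold)
FROM   sales_fact f
WHERE  order_date >= DATE '2014-01-01'
GROUP  BY cust_id;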

Now, why would I worry about this at all in a properly warmed-up in-memory database, where disk IO is not on the critical path of data retrieval at all? Well, now that we have removed the disk IO bottleneck, we inevitably hit the next slowest component as a bottleneck – and that is … RAM.

Sequentially scanning RAM is slow. Randomly accessing RAM lines is even slower! Of course this slowness is all relative to the modern CPUs that are capable of processing billions of instructions per core every second.

Back to Oracle’s In-Memory column store example: Despite all the marketing talk about loop vectorization with CPU SIMD processing extensions, the most fundamental change required for “extreme performance” is simply about reducing the data traffic between RAM and CPUs.

This is why I said “SIMD would be useless if you waited on main memory all the time” at the Oracle Database In-Memory in Action presentation at Oracle OpenWorld (Oct 2014):

Oracle In-Memory in Action presentation

The “secret sauce” of Oracle’s in-memory scanning engine is the columnar storage of data, the ability to (de)compress it cheaply and accessing only the filtered columns’ memory first, before even touching any of the other projected columns required by the query. This greatly reduces the slow RAM traffic, just like building that skinny index reduced disk I/O traffic back in the on-disk database days. The SIMD instruction set extensions are just icing on the columnar cake.

So far this is just my opinion, but in the next part I will show you some numbers too!

 

We are hiring!


Gluent – where I’m a cofounder & CEO – is hiring awesome developers and (big data) infrastructure specialists in the US and UK!

We are still in stealth mode, so won’t be detailing publicly what exactly we are doing ;-)

However, it is evident that the modern data platforms (for example Hadoop) with their scalability, affordability-at-scale and freedom to use many different processing engines on open data formats are turning enterprise IT upside down.

This shift has already been going on for years in large internet & e-commerce companies and in small startups, but now the shockwave is arriving at all the traditional enterprises too. And every single one of them must accept it in order to stay afloat and win in the new world.

Do you want to be part of the new world?

 

 

RAM is the new disk – and how to measure its performance – Part 2 – Tools


In the previous article I explained that the main requirement for high-speed in-memory data scanning is column-oriented storage format for in-memory data. SIMD instruction processing is just icing on the cake. Let’s dig deeper. This is a long post, you’ve been warned.

Test Environment

I will cover full test results in the next article in this series. First, let’s look into the test setup, environment and what tools I used for peeking inside CPU hardware.

I was running the tests on a relatively old machine with 2 CPU sockets, with 6-core CPUs in each socket (2s12c24t):

$ egrep "MHz|^model name" /proc/cpuinfo | sort | uniq -c
     24 cpu MHz		: 2926.171
     24 model name	: Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz

The CPUs support SSE4.2 SIMD extensions (but not the newer AVX stuff):

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

Even though /proc/cpuinfo above shows the CPU clock frequency as 2.93 GHz, these CPUs have the Intel Turbo Boost feature that allows some cores to run at up to 3.33 GHz when not all cores are fully busy and the CPUs aren’t too hot.

Indeed, the turbostat command below shows that the CPU core executing my Oracle process was running at 3.19GHz frequency:

# turbostat -p sleep 1
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             6.43 3.02 2.93   0  93.57   0.00   0.00   59   0.00   0.00
 0   0   0   4.49 3.19 2.93   0  95.51   0.00   0.00   46   0.00   0.00
 0   1   1  10.05 3.19 2.93   0  89.95   0.00   0.00   50
 0   2   2   2.48 3.19 2.93   0  97.52   0.00   0.00   45
 0   8   3   2.05 3.19 2.93   0  97.95   0.00   0.00   44
 0   9   4   0.50 3.20 2.93   0  99.50   0.00   0.00   50
 0  10   5 100.00 3.19 2.93   0   0.00   0.00   0.00   59
 1   0   6   6.25 2.23 2.93   0  93.75   0.00   0.00   44   0.00   0.00
 1   1   7   3.93 2.04 2.93   0  96.07   0.00   0.00   43
 1   2   8   0.82 2.15 2.93   0  99.18   0.00   0.00   44
 1   8   9   0.41 2.48 2.93   0  99.59   0.00   0.00   41
 1   9  10   0.99 2.35 2.93   0  99.01   0.00   0.00   43
 1  10  11   0.76 2.36 2.93   0  99.24   0.00   0.00   44

I will come back to this CPU frequency turbo-boosting later when explaining some performance metrics.

I ran the experiments in Oct/Nov 2014, so used a relatively early Oracle 12.1.0.2.1 version with a bundle patch (19189240) for in-memory stuff.

The test was deliberately very simple as I was researching raw in-memory scanning and filtering speed and was not looking into join/aggregation performance. I was running the query below with different hints and parameters to change access path options:

SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0

I used the CUSTOMERS table of the Swingbench Sales History schema. I deliberately didn’t use COUNT(*), but COUNT(col) on an actual nullable column “cust_valid”, so the values in that column had to be accessed for correct counting.

Also, I picked the last column in the table as accessing columns in the physical “end” of a row (in row-oriented storage format) would cause more memory/cache accesses and CPU execution branch jumps due to the run-length encoded structure of a row in a datablock. Of course this depends on number of columns and width of the row too, plus hardware characteristics like cache line size (64 bytes on my machine).

Anyway, querying the last column helps to better illustrate what kind of overhead you may be suffering from when filtering that 500-column fact table using columns at the end of it.

SQL> @desc ssh.customers_nopart
           Name                            Null?    Type
           ------------------------------- -------- ----------------------------
    1      CUST_ID                         NOT NULL NUMBER
    2      CUST_FIRST_NAME                 NOT NULL VARCHAR2(20)
    3      CUST_LAST_NAME                  NOT NULL VARCHAR2(40)
...
   22      CUST_EFF_TO                              DATE
   23      CUST_VALID                               VARCHAR2(1)

The table has 69,642,625 rows in it and its segment size is 1613824 blocks / 12608 MB on disk (actual used space in it was slightly lower due to some unused blocks in the segment). I set the table PCTFREE to zero to use all space in the blocks. I also created a HCC-compressed copy of the same table for comparison reasons.

SQL> @seg tanel.customers_nopart

  SEG_MB OWNER  SEGMENT_NAME               SEGMENT_TYPE      BLOCKS
-------- ------ -------------------------  ------------- ----------
   12608 TANEL  CUSTOMERS_NOPART           TABLE            1613824
    6416 TANEL  CUSTOMERS_NOPART_HCC_QL    TABLE             821248

I made sure that the test tables were completely cached in Oracle buffer cache to eliminate any physical IO component from tests and also enabled in-memory columnar caching for the CUSTOMERS_NOPART table.
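
Getting there was along these lines (a sketch – the exact warm-up steps I used aren’t shown here, and _serial_direct_read is an undocumented parameter):

-- enable In-Memory column store population for the test table
ALTER TABLE tanel.customers_nopart INMEMORY PRIORITY HIGH;

-- avoid serial direct path reads so a warm-up full scan actually
-- populates the buffer cache (undocumented parameter, test systems only)
ALTER SESSION SET "_serial_direct_read" = NEVER;
SELECT /*+ FULL(c) NO_INMEMORY(c) */ COUNT(*) FROM tanel.customers_nopart c;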

SQL> @imseg %.%

    SEG_MB   INMEM_MB  %POP IMSEG_OWNER   IMSEG_SEGMENT_NAME  SEGMENT_TYPE  POP_ST
---------- ---------- ----- ------------- ------------------- ------------- ------
     12608       5913  100% TANEL         CUSTOMERS_NOPART    TABLE         COMPLE
---------- ----------
     12608       5913

CPU Activity Measurement Tools

In addition to the usual suspects (Oracle SQL Monitoring reports and Snapper), I used the awesome Linux tool called perf – but not in the typical way you might have used it in the past.

On Linux, perf can be used for profiling code executing on CPUs by sampling the instruction pointer and stack backtraces (perf top), but also for taking snapshots of internal CPU performance counters (perf stat). These CPU performance counters (CPC) tell us what happened inside the CPU during my experiments.

This way we can go way deeper than high-level tools like the top utility or the getrusage() syscall would ever allow us to go. We’ll be able to measure what physically happened inside the CPU – for example, for how many cycles the CPU core was actually doing useful work pushing instructions along the execution pipeline, versus being stalled, waiting for requested memory to arrive or for some other internal condition to come true. Also, we can estimate the amount of traffic between the CPU and main memory, plus CPU cache hits/misses at multiple cache levels.

Perf can do CPC snapshotting and accounting also at OS process level. This means you can measure the internal CPU/memory activity of a single OS process under examination and that was great for my experiment.

Note that these kinds of tools are nothing new, they’ve been around with CPU vendor code profilers ever since CPUs were instrumented with performance counters (but undocumented in the early days). Perf stat just makes this stuff easily accessible on Linux. For example, since Solaris 8, you could use cputrack for extracting similar process-level CPU counter “usage”, other platforms have their own tools.

I used the following command (-p specifies the target PID) for measuring internal CPU activity when running my queries:

perf stat -e task-clock,cycles,instructions,branches,branch-misses \
          -e stalled-cycles-frontend,stalled-cycles-backend \
          -e cache-references,cache-misses \
          -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses \
          -p 92106 sleep 30

In RHEL6 equivalents (and later) you can use perf stat -d option for getting similar detailed output without specifying all the counters separately – but I was on OEL5.8. Also, different CPU versions support different performance counters. Read the manuals and start from simpler stuff.

Below is an example output from one test run – where I ran a full table scan counting the last column of a regular row-oriented table (all cached in the buffer cache) and took a perf stat snapshot of the entire SQL execution. Note that even though the table was also cached in Oracle’s in-memory column store, I had disabled its use with the NO_INMEMORY hint, so this full table scan was done entirely via the traditional buffer cache (no physical IOs!):

 Performance counter stats for process id '34783':

      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz                     [33.33%]
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn [40.00%]
     7,386,220,210 branches                  #  269.828 M/sec                   [39.99%]
        22,056,397 branch-misses             #    0.30% of all branches         [40.00%]
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle    [40.00%]
    58,627,393,395 stalled-cycles-backend    #   67.83% backend  cycles idle    [40.00%]
       256,440,384 cache-references          #    9.368 M/sec                   [26.67%]
       222,036,981 cache-misses              #   86.584 % of all cache refs     [26.66%]
       234,361,189 LLC-loads                 #    8.562 M/sec                   [26.66%]
       218,570,294 LLC-load-misses           #   93.26% of all LL-cache hits    [ 6.67%]
        18,493,582 LLC-stores                #    0.676 M/sec                   [ 6.67%]
         3,233,231 LLC-store-misses          #    0.118 M/sec                   [ 6.67%]
     7,324,946,042 L1-dcache-loads           #  267.589 M/sec                   [13.33%]
       305,276,341 L1-dcache-load-misses     #    4.17% of all L1-dcache hits   [20.00%]
        36,890,302 L1-dcache-prefetches      #    1.348 M/sec                   [26.66%]

      30.000601214 seconds time elapsed

I ran perf for 30 seconds for the above experiment – I kicked it off just before executing the Oracle SQL, and it finished right after the SQL had completed.

Let’s go through some of the above metrics – top down. I’m explaining these metrics at a fairly high level and in the context of my experiment – fully measuring a single SQL execution in a single Oracle process:

Basic CPU Performance Counter Reference

  1. task-clock (~27373 milliseconds)
    – This is a software event and shows how much time the target Linux task (my Oracle process) spent running on CPU during the SQL execution, as far as the OS scheduler knows (roughly 27 seconds on CPU).
    – So while perf took a 30-second snapshot of my process, my test SQL completed a couple of seconds earlier (so the process didn't run on CPU for all 30 seconds). That explains the "0.912 CPUs utilized" derived metric above.
    .
  2. cycles – (86B cycles)
    – This hardware metric shows how many CPU cycles my process (running a SQL statement) consumed during the perf runtime.
    – Dividing 86B CPU cycles by ~27 CPU seconds shows that the CPU core must have operated at around 3.16 GHz (on average) during my SQL run.
    – Remember, earlier in this article I used turbostat to show how these 2.93 GHz CPU cores happened to be running at 3.19 GHz frequency thanks to turbo-boost!
    .
  3. instructions – (32B instructions)
    – This hardware metric shows how many instructions the CPU managed to successfully execute (and retire). This is where things get interesting:
    – It's worth mentioning that modern CPUs are superscalar and pipelined. They have multiple internal execution units, can have multiple instructions (decoded to µops) in flight in their pipelines, and can perform memory loads & stores concurrently and possibly out of order – instruction-level parallelism, data-level parallelism.
    – Dividing the 32B executed instructions by the 86B CPU cycles shows that we managed to execute only ~0.37 instructions per CPU cycle (IPC) on average!
    – Inverting this number gives 86B/32B = ~2.69 Cycles Per Instruction (CPI). So, on average, every CPU instruction took ~2.69 CPU cycles to execute! We'll get to the "why" part later.
    .
  4. branches – (7.3B branches)
    – This hardware metric shows how many branches the executed code took.
    – A branch is basically a jump (an unconditional JMP instruction or a conditional jump like JZ, JNZ and many more – this is how basic IF/THEN/ELSE, CASE and various LOOP statements work at the CPU level).
    – The more decision points in your code, the more branches it takes.
    – Branches are like speed bumps in a CPU execution pipeline, obstructing the execution flow and prefetching due to the uncertainty of which branch will be taken.
    – That's why features like branch prediction with speculative execution are built into modern CPUs to alleviate this problem.
    – Knowing that we scanned through roughly 70M rows in this table, this is over 100 branches taken per row scanned (and counted)!
    – Oracle's traditional block-format rows are stored in a length-prefixed format, where you know where the following column starts only after reading (and testing) the previous column's length byte(s). The more columns you need to traverse, the more branches you'll take per row scanned.
    .
  5. branch-misses – (22M, 0.3% of all branches)
    – This hardware metric shows how many times the CPU branch predictor (described above) mispredicted which branch would be taken, causing a pipeline stall.
    – Correctly predicting branches (where will the code execution jump next?) is good, as this allows the CPU to speculatively execute the upcoming instructions and prefetch the data required by them.
    – However, the branch predictor doesn't always predict the future correctly, despite various advancements in modern branch prediction, like branch history tables and branch target buffers etc.
    – In a branch misprediction case, the mispredicted branch's state has to be discarded, the pipeline flushed, and the correct branch's instructions fetched & fed into the start of the execution pipeline (in short: mispredictions waste CPU cycles). A small C sketch after this list lets you observe this effect with perf stat.
    .
  6. stalled-cycles-frontend – (~76.7B cycles, 88.7% of all cycles)
    – This hardware metric shows for how many cycles the front-end of the CPU was stalled, not feeding new µops into the pipeline for back-end execution.
    – The front-end of an Intel pipelined CPU is basically the unit that fetches the good old x86/x86_64 instructions from the L1 instruction cache (or RAM if needed), decodes them to RISC-like µops (newer CPUs also cache those µops) and puts them into the back-end instruction queue for execution.
    – The front-end also deals with remembering taken branches and branch prediction (decoding and sending the predicted branch's instructions into the back-end).
    – The front-end can stall for various reasons, like instruction cache misses (waiting for memory lines containing instructions to arrive from RAM or a lower-level cache), branch mispredictions, or simply because the back-end cannot accept more instructions into its pipeline due to some bottleneck there.
    .
  7. stalled-cycles-backend – (~58.6B cycles, 67.8% of all cycles)
    – This hardware metric shows for how many cycles the back-end of the CPU was stalled instead of advancing the pipeline.
    – The back-end of the CPU is where the actual computation on data happens – any computation referencing main memory (not only registers) will have to wait until the referenced memory locations have arrived in the CPU L1 cache.
    – A common reason for back-end stalls is waiting for a cache line to arrive from RAM (or a lower-level cache), although there are many other reasons.
    – To reduce memory-access related stalls, a program should perform fewer memory accesses or switch to more compact data structures, to avoid loading data it doesn't need.
    – Simpler, more predictable data structures also help, as the CPU hardware prefetcher may detect an "array scan" and start prefetching the required memory lines in advance.
    – In the context of this blog series: sequentially scanning and filtering a column of a table's data is good for reducing memory-access related CPU stalls. Walking through the random pointers of linked lists (cache buffers chains) and skipping through row pointers in blocks, plus many columns' length bytes, before getting to your single column of interest causes memory-access related stalls (the second C sketch after this list demonstrates exactly this sequential-scan vs. pointer-chasing difference).
    .
  8. cache-references – (256M references)
    – Now we get into a series of CPU cache traffic related metrics; some of these overlap.
    – This metric shows how many Last Level Cache accesses (both reads and writes) were done.
    – The memory location that the CPU tried to access was not in a higher-level (L1/L2) cache, thus the lowest cache, the Last Level Cache, was checked.
    – The Last Level Cache, also called LLC or Lower Level Cache or Longest Latency Cache, is usually the L3 cache on modern CPUs (although there are some hints that some perf versions still report the L2 cache as LLC). I need to read some more perf source code to figure this out, but for this experiment's purposes it doesn't matter much whether it's L2 or L3: if I scan through a multi-GB table, it won't fit into either cache level anyway.
    .
  9. cache-misses – (222M misses)
    – This metric shows how many times the cache reference could not be satisfied by the Last Level Cache and therefore RAM access was needed.
    .
  10. LLC-loads – (234M loads)
    – The following four metrics just break the above two down in more detail.
    – This metric shows how many times a cache line was requested from the LLC because it wasn't available (or valid) in a higher-level cache.
    .
  11. LLC-load-misses – (218M misses)
    – This metric shows how many LLC loads could not be satisfied from the LLC and therefore RAM access was needed.
    .
  12. LLC-stores – (18M stores)
    – This metric shows how many times a cache line was written into the LLC.
    .
  13. LLC-store-misses – (3M misses)
    – This metric shows how many times the cache line had to first be read into the LLC before the LLC write could complete.
    – This may happen due to partial writes (for example: the cache line size is 64 bytes, the line is not currently present in the LLC, and the CPU tries to write into only the first 8 bytes of it).
    – This metric may also get incremented for other cache-coherency related reasons, where the store fails because other CPU(s) currently own the memory line and have locked and modified it since it was loaded into the current CPU's cache.
    .
  14. L1-dcache-loads – (7300M loads)
    – The following three metrics are similar to the above, but for the small (and fast) L1 cache.
    – This metric shows how many times the CPU attempted to load data from the L1 data cache into a register.
    – The dcache in the metric name means data accesses to memory (icache would mean the instruction cache – memory lines containing instructions for execution, fetched via the L1I cache).
    – Note how the L1D cache loads metric is way higher than LLC-loads (7300M vs 234M), as many of the repeated tight loops over small internal memory structures can be satisfied from the L1 cache.
    .
  15. L1-dcache-load-misses – (305M misses)
    – This metric shows how many data loads from the L1D cache couldn't be satisfied from that cache, so the next (lower) cache level was needed.
    – If you are wondering why the L1D cache load misses figure is larger than LLC-loads (305M vs 234M – shouldn't they be equal?), one explanation is that as there's an L2 cache between L1 & L3, some of the memory accesses got satisfied in the L2 cache (and some more explanations illustrating the complexity of CPU cache metrics are here).
    .
  16. L1-dcache-prefetches – (37M prefetches)
    – This metric shows how many cache lines the CPU prefetched via the L1D cache hardware prefetcher (the DCU prefetcher).
    – Usually this simple prefetcher just fetches the cache line adjacent to the currently accessed one.
    – It would be interesting to know whether this prefetcher is smart enough to also prefetch the previous cache lines, as regular row-formatted Oracle data blocks are filled from the bottom up (this does not apply to the column-oriented stuff).
    – If the full table scan code walks the block's row directory so that it jumps to the bottom of the block first and works its way upwards, some memory accesses will look like a backwards scan – which may affect prefetching.
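
To make some of the above more tangible, here are the two C sketches promised in the branch-misses and stalled-cycles-backend items. Both are my own minimal illustrations (not Oracle code, and the file names are made up) that you can compile and measure with perf stat on your own hardware. The first shows how the very same conditional branch is nearly free when its outcome is predictable (sorted input) and costly when it isn't (random input):

/* branchdemo.c – the same conditional branch, predictable vs. unpredictable.
 * Compile and compare, for example:
 *   gcc -O1 -o branchdemo branchdemo.c
 *   perf stat -e task-clock,cycles,instructions,branches,branch-misses ./branchdemo random
 *   perf stat -e task-clock,cycles,instructions,branches,branch-misses ./branchdemo sorted
 * Caveats: the data generation and sorting are measured too, so compare the
 * branch-miss rates rather than absolute counts; also, an aggressive optimizer
 * may turn the IF below into a branchless conditional move, hiding the effect. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(int argc, char **argv)
{
    const int n = 50 * 1000 * 1000;
    int *data = malloc(n * sizeof(int));
    long matches = 0;
    int i;

    for (i = 0; i < n; i++)
        data[i] = rand() % 256;                /* unpredictable values 0..255 */

    if (argc > 1 && strcmp(argv[1], "sorted") == 0)
        qsort(data, n, sizeof(int), cmp_int);  /* makes the branch below predictable */

    for (i = 0; i < n; i++)
        if (data[i] >= 128)                    /* the conditional branch being measured */
            matches++;

    printf("matches: %ld\n", matches);         /* keep the loop from being optimized away */
    free(data);
    return 0;
}

On random input the branch above should mispredict close to 50% of the time; on sorted input the predictor learns the pattern, the branch-misses count collapses and the instructions-per-cycle figure jumps accordingly.

The second sketch contrasts a sequential scan over a 128 MB array with a dependent random pointer chase over the same memory. The pointer chase defeats the hardware prefetcher, so nearly every load stalls waiting for RAM – expect a much lower IPC and far higher stalled-cycles-backend and LLC-load-misses figures. This is, in miniature, the difference between scanning a contiguous column of data and walking the random pointers of cache buffers chains:

/* chasedemo.c – sequential scan vs. dependent random pointer chase.
 * Compile and compare, for example:
 *   gcc -O1 -o chasedemo chasedemo.c
 *   perf stat -e cycles,instructions,stalled-cycles-backend,LLC-loads,LLC-load-misses ./chasedemo seq
 *   perf stat -e cycles,instructions,stalled-cycles-backend,LLC-loads,LLC-load-misses ./chasedemo chase
 * (The shuffle setup below is measured too, so treat the numbers as indicative.) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N (16UL * 1024 * 1024)   /* 16M pointers x 8 bytes = 128 MB, bigger than any LLC */

int main(int argc, char **argv)
{
    size_t *next = malloc(N * sizeof(size_t));
    size_t sum = 0;
    size_t i;

    /* Sattolo's shuffle: builds one big random cycle, so in the chase below
     * every next[] lookup depends on the previous one – no prefetching possible */
    for (i = 0; i < N; i++)
        next[i] = i;
    for (i = N - 1; i > 0; i--) {
        size_t j = (size_t) rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    if (argc > 1 && strcmp(argv[1], "chase") == 0) {
        size_t pos = 0;                        /* dependent loads: latency-bound */
        for (i = 0; i < N; i++) {
            pos = next[pos];
            sum += pos;
        }
    } else {
        for (i = 0; i < N; i++)                /* sequential: the prefetcher hides RAM latency */
            sum += next[i];
    }

    printf("sum: %zu\n", sum);                 /* keep the loops from being optimized away */
    free(next);
    return 0;
}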


I hope that this is a useful reference when measuring what's going on inside a CPU. This is actually pretty basic stuff in the modern CPU world; there's much more that you can measure in CPUs (via raw performance counters, for example) and also different tools that you can use, like Intel VTune. It's not trivial though, as at such a low level even different CPU models by the same vendor may have different meanings (and numbering & flags) for their performance counters.

I won’t pretend to be a CPU & cache coherency expert, but these basic metrics and my understanding of them look correct enough for comparing different Oracle access paths and storage formats (more about this in the next parts of the series).

One bit of warning: it’s best to run these experiments on a bare-metal server, not in a virtual machine. This is a low-level measurement exercise, and in a VM you could suffer from all kinds of additional noise; also, some of the hardware counters would not be available to perf, as some hypervisors do not expose hardware performance counters to the guest OS by default. One interesting article (by Frits Hoogland) about running perf in VMs is here.

Ok, enough writing for today! I actually started this post more than a month ago and it got way longer than planned. In the next part of this series I will interpret this post’s full table scan CPU metrics using the above reference (and explain where the bottleneck/inefficiency is). And in Part 4 I’ll show you all the metrics from a series of experiments – testing the memory access efficiency of different Oracle data access paths (indexes vs. full table scan vs. HCC vs. in-memory column store).

Update: I have corrected a couple of typos – we had 86B CPU cycles and 32B instructions, instead of the 86M/32M I had mistakenly typed before. The instructions-per-cycle ratio calculation and the point remain the same though. Thanks to Mark Farnham for letting me know.
