Sunday, November 3, 2013

RAC Concepts - Part : D


Note:  Proofread any scripts before using. Always try scripts on a test instance first. This Blog is not responsible for any damage.

Part - D

41. What happens when ocssd fails? What is node eviction, and how does node eviction happen? (The answer to all three is essentially the same.)

Oracle RAC: node evictions – and why, in 11gR2, a node eviction means a restart of the cluster stack rather than a reboot of the node.

Cluster integrity and cluster membership are governed by ocssd (the Oracle Cluster Synchronization Services daemon), which monitors the nodes using 2 communication channels:
- Private Interconnect  aka Network Heartbeat
- Voting Disk based communication aka  Disk Heartbeat
Network heartbeat:-
Each node in the cluster is “pinged” every second
  • Nodes must respond in css_misscount time (defaults to 30 secs.)
          – Reducing the css_misscount time is generally not supported
  • Network heartbeat failures will lead to node evictions
  • CSSD-log:
        [date / time] [CSSD][1111902528]
        clssnmPollingThread: node mynodename (5) at 75% heartbeat fatal, removal  in 6.7 sec
Disk Heartbeat:-
Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
  • Nodes must receive a response in (long / short) diskTimeout time
            – IF I/O errors indicate clear accessibility problems, the timeout is irrelevant
  • Disk heartbeat failures will lead to node evictions
  • CSSD-log: …
         [CSSD] [1115699552] >TRACE: clssnmReadDskHeartbeat:node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
So we know that network or disk heartbeat failures can lead to node eviction; in more extreme cases a hung server, a stuck ocssd, or a kill request from another cluster member can also cause a node to be evicted.
Why should nodes be evicted?
Evicting (fencing) nodes is a preventive measure (it’s a good thing)!
  • Nodes are evicted to prevent consequences of a split brain:
        – Shared data must not be written by independently operating nodes
        – The easiest way to prevent this is to forcibly remove a node from the cluster
How are nodes evicted? – STONITH
Once it is determined that a node needs to be evicted,
  • A “kill request” is sent to the respective node(s)
        – Using all (remaining) communication channels
  • A node (CSSD) is requested to “kill itself” – “STONITH like”
        – “STONITH” foresees that a remote node kills the node to be evicted
    EXAMPLE: Voting Disk Failure
    Voting disks and heartbeat communication are used to determine which node(s) survive
  • In a 2 node cluster, the node with the lowest node number should survive
  • In an n-node cluster, the biggest sub-cluster should survive (votes based)
EXAMPLE: Network heartbeat failure
  • The network heartbeat between nodes has failed
          – It is determined which nodes can still talk to each other
          – A “kill request” is sent to the node(s) to be evicted
  • Using all (remaining) communication channels, i.e. the Voting Disk(s)
  • A node is requested to “kill itself”; executer: typically CSSD
EXAMPLE: What if CSSD is stuck or server itself is not responding?
A node is requested to “kill itself”
  • BUT CSSD is “stuck” or “sick” (does not execute) – e.g.:
          – CSSD failed for some reason
          – CSSD is not scheduled within a certain margin
    OCSSDMONITOR (was: oprocd) will take over and execute
EXAMPLE: A cluster member (RAC instance) can request that another member (RAC instance) be killed
A cluster member (RAC instance) can request that another member be killed in order to protect data integrity – for example, when the failing instance does not write its control file progress record properly. In such a case ocssd first tries to kill that member (the instance); if that is not possible, it tries to evict the node.

11gR2 Changes –> Important: from 11gR2 (11.2.0.2) onwards, fencing (eviction) does not necessarily mean a reboot.
  • Until Oracle Clusterware 11.2.0.2, fencing (eviction) meant “re-boot”
  • With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because:
         – Re-boots affect applications that might run on a node, but are not protected
         – Customer requirement: prevent a reboot, just stop the cluster – implemented…
How does this work?
With Oracle Clusterware 11.2.0.2, re-boots will be seen less: Instead of fast re-booting the node, a graceful shutdown of the cluster stack is attempted
  • It starts with a failure – e.g. network heartbeat or interconnect failure
  • Then IO issuing processes are killed; it is made sure that no IO process remains
         – For a RAC DB mainly the log writer and the database writer are of concern
  • Once all IO issuing processes are killed, remaining processes are stopped
         – IF the check for a successful kill of the IO processes fails → reboot
  • Once all remaining processes are stopped, the stack stops itself with a “restart flag”
  • OHASD will finally attempt to restart the stack after the graceful shutdown
  •    Exceptions to the above:-
  • IF the check for a successful kill of the IO processes fails → reboot
  • IF CSSD gets killed during the operation → reboot
  • IF cssdmonitor (oprocd replacement) is not scheduled → reboot
  • IF the stack cannot be shutdown in “short_disk_timeout”-seconds → reboot
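
As a side note, the CSS timeouts mentioned above can be checked with crsctl (a minimal sketch; run from the Grid Infrastructure home):

# network heartbeat timeout (misscount, default 30 seconds)
crsctl get css misscount

# disk heartbeat timeout (long disk timeout, default 200 seconds)
crsctl get css disktimeout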


42. What is a virtual IP (VIP) and how does it work?
Why do we have a Virtual IP (VIP) in Oracle RAC 10g or 11g? Why does it just return a dead connection when its primary node fails?

The goal is application availability with the shortest disruption time possible.
When a node fails, the VIP associated with it is automatically failed over to another node (the VIP is a CRS resource, e.g. ora.<nodename>.vip). When this occurs, the following things happen:
  1. The VIP detects the public network failure, which generates a FAN event. (A FAN event is a Fast Application Notification event; ONS captures it and publishes it to subscribers – in RAC, a subscriber can be a listener.)
  2. The new node re-ARPs to the world, indicating a new MAC address for the IP.
  3. Connected clients subscribing to FAN immediately receive an ORA-03113 error or equivalent. Those not subscribing to FAN will eventually time out.
  4. New connection requests rapidly traverse the tnsnames.ora address list, skipping over the dead nodes instead of having to wait on TCP/IP timeouts (which can default to around 10 minutes).
Without using VIPs or FAN, clients connected to a node that died will often wait for a TCP timeout period (which can be up to 10 min) before getting an error. As a result, you don’t really have a good HA solution without using VIPs and FAN. The easiest way to use FAN is to use an integrated client with Fast Connection Failover (FCF) such as JDBC, OCI, or ODP.NET.
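
To see how the VIPs are configured and where they are currently running, srvctl can be used (a hedged sketch; mynode is a placeholder node name):

# VIP configuration and status for a single node
srvctl config vip -n mynode
srvctl status vip -n mynode

# all node applications (VIPs, ONS, etc.) across the cluster
srvctl status nodeapps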

43. Describe some RAC wait events?

RAC Wait Event Naming Conventions
Events that start with “GCS%” and “gc%” are Cache Fusion-related waits. In other words, they have to do with waits experienced in shipping current or consistent-read versions of blocks across instances in a RAC cluster.

Events that start with “ges%” are related to Global Enqueue Services.

Block-Oriented RAC Waits
“gc current block 2-way” – Indicates that a current block was sent from the resource master, requiring 1 message and 1 block transfer.
“gc current block 3-way” – Indicates that a current block was forwarded to a 3rd node from which it was sent, requiring 2 messages and 1 block transfer.

“gc cr block 2-way” – Indicates that a CR block was sent from the resource master, requiring 1 message and 1 block transfer.
“gc cr block 3-way” – Indicates that a CR block was forwarded to a 3rd node from which it was sent, requiring 2 messages and 1 block transfer.
The block-oriented wait event statistics indicate that a block was received as either the result of a 2-way or a 3-way message; that is, the block was sent from the resource master (requiring 1 message and 1 transfer) or was forwarded to a third node from which it was sent (requiring 2 messages and 1 block transfer).

Message-Oriented RAC Waits
“gc current grant 2-way” – Indicates that no current block was received because it was not cached in any instance. Instead a global grant was given, enabling the requesting instance to read the block from disk or modify it.
“gc cr grant 2-way” – Indicates that no block was received because it was not cached in any instance. Instead a global grant was given, enabling the requesting instance to read the block from disk.
The message-oriented wait event statistics indicate that no block was received because it was not cached in any instance. Instead a global grant was given, enabling the requesting instance to read the block from disk or modify it.

Contention-Oriented RAC Waits
“gc current block busy” – Indicates that a local instance made a request for a current version of a block and did not immediately receive the block; the block shipping was delayed on the remote instance.
“gc cr block busy” – Indicates that a local instance made a request for a CR version of a block and did not immediately receive the block.
“gc buffer busy acquire” – Indicates that the local instance cannot grant access to data in the local buffer cache because a global operation on the buffer is pending and not yet completed.
“gc buffer busy release” – Indicates that the instance must wait for a remote instance to finish its work on the buffer before the block can be shipped to the requester.
The “gc buffer busy” events mean that there is block contention that is resulting from multiple local requests for the same block, and Oracle must queue these requests.

The contention-oriented wait event statistics indicate that a block was received which was pinned by a session on another node, was deferred because a change had not yet been flushed to disk or because of high concurrency, and therefore could not be shipped immediately. A buffer may also be busy locally when a session has already initiated a cache fusion operation and is waiting for its completion when another session on the same node is trying to read or modify the same data. High service times for blocks exchanged in the global cache may exacerbate the contention, which can be caused by frequent concurrent read and write accesses to the same data.

Load-Oriented Waits
“gc current block congested”
“gc cr block congested”

The load-oriented wait events indicate that a delay in processing has occurred in the GCS. This is usually caused by high load or CPU saturation and would have to be addressed with additional CPUs, load balancing, off-loading processing to different times, or a new cluster node. For the events mentioned, the wait time encompasses the entire round trip from the time a session starts to wait after initiating a block request until the block arrives.
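
Before drilling into individual sessions with the scripts below, a quick instance-wide summary of the global cache and global enqueue waits can be pulled from gv$system_event (a minimal sketch):

sqlplus -s / as sysdba <<'EOF'
set linesize 200 pagesize 50
column event format a40
select inst_id, event, total_waits,
       round(time_waited_micro/1e6,1) seconds_waited
from   gv$system_event
where  (event like 'gc%' or event like 'ges%')
and    event not like '%remote message'
order  by seconds_waited desc;
EOF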

Scripts to Measure RAC Wait events:

Measuring RAC Waits from V$SESSION_WAIT -

set linesize 200
set lines 300 pages 55

column sid format 9999999
column name format a36
column p1 format 99999999999 Head 'P1'
column program format a25
column p2 format 99999999999 Head 'P2'
column p3 format 99999999999 Head 'P3'
column pgm format a15 head 'What'
column state format a15
column wt format 999999999 head 'Wait|Time'
column WaitEvent format a38 head 'Wait Event'
column lc format 99999999999.99 head 'last call'

select A.sid,
       decode(A.event,'null event','CPU Exec',A.event) WaitEvent,
       A.p1,A.p2,A.p3,
       decode(A.state,'WAITING','WTG','WAITED UNKNOWN TIME','UNK','WAITED SHORT TIME','WST','WAITED KNOWN TIME','WKT') wait_type,
       decode(A.state,'WAITING',A.seconds_in_wait,'WAITED UNKNOWN TIME',-999,'WAITED SHORT TIME',A.wait_time,'WAITED KNOWN TIME',A.WAIT_TIME) wt,
       round((last_call_et/60),2) lc,
       substr(nvl(b.module,b.program),1,15) pgm
from v$session_wait A, v$session B
where 1=1
  and (A.event like 'gc%' or A.event like 'GC%' or A.event like 'ge%')
  and A.event not like '%remote message'
  and A.event not like '%sleep'
  and A.sid=B.sid
  and B.status='ACTIVE'
order by 1
/

Sample output: (screenshot)


Measuring RAC Waits from  V$ACTIVE_SESSION_HISTORY -

set lines 300 pages 55

column aas format 99999.99 head 'AAS'
column aas_event format 99999.99 head 'AAS per |Event'
column dbt format 99999999 head 'DB Time'
column event format a30
column time_waited format 99999999 head 'Time (ms)'
column wait_pct format 99999.99 head '% of DB Time'
COLUMN bt NEW_VALUE _bt NOPRINT
COLUMN et NEW_VALUE _et NOPRINT

select min(cast(sample_time as date)) bt, sysdate et from v$active_session_history;

with xtimes (xdate) as
(select to_date('&_bt') xdate from dual
union all
select xdate+(&&interval_mins/1440) from xtimes
where xdate+(&&interval_mins/1440) < sysdate)
select to_char(s1.xdate,'DD-MON-RR HH24:MI:SS') samplestart,
to_char(s1.xdate+(&&interval_mins/1440),'DD-MON-RR HH24:MI:SS') sampleend,
s2.event,.001*((sum(s2.time_waited))) time_waited,
(count(s2.sample_id)/(60*&&interval_mins)) aas_event,
(count(s3.sample_id)) dbt,
round(100*(sum(s2.time_waited)/1000000/count(s3.sample_id)),2) as wait_pct
from xtimes s1,
     v$active_session_history s2,
     v$active_session_history s3
where s2.sample_time between s1.xdate and s1.xdate+(&&interval_mins/1440)
and s3.sample_time between s1.xdate and s1.xdate+(&&interval_mins/1440)
and s2.sample_id=s3.sample_id
and (s2.event  like 'gc%'
or s2.event like 'GC%'
or s2.event like 'ge%')
and s2.event not like '%remote message'
and s2.event not like '%sleep'
group by s1.xdate,s2.event
order by s1.xdate
/
undefine interval_mins

Sample output: (screenshot)


Measuring RAC Waits from DBA_HIST_ACTIVE_SESS_HISTORY -

set lines 300 pages 55

column samplestart format a30 head 'Begin Time'
column sampleend format a30 head 'End Time'
column aas format 999.99 head 'AAS'
column aas_event format 999.99 head 'AAS per |Event'
column event format a30
column time_waited format 999999 head 'Time (ms)'
column wait_pct format 999.99 head 'Wait %'
COLUMN bt NEW_VALUE _bt NOPRINT
COLUMN et NEW_VALUE _et NOPRINT

select min(cast(sample_time as date)) bt, sysdate et from dba_hist_active_sess_history;

with xtimes (xdate) as
(select to_date('&_bt') xdate from dual
union all
select xdate+(&&interval_mins/1440) from xtimes
where xdate+(&&interval_mins/1440) < sysdate)
select to_char(s1.xdate,'DD-MON-RR HH24:MI:SS') samplestart,
to_char(s1.xdate+(&&interval_mins/1440),'DD-MON-RR HH24:MI:SS') sampleend,
s2.event,
.001*((sum(s2.time_waited))) time_waited,
10*(count(s2.sample_id)/(60*&&interval_mins)) aas_event,
10*(count(s3.sample_id)) dbt,
round(100*(sum(s2.time_waited)/1000000/(10*(count(s3.sample_id)))),2) as wait_pct
from xtimes s1,
     dba_hist_active_sess_history s2,
     dba_hist_active_sess_history s3
where s2.sample_time between s1.xdate and s1.xdate+(&&interval_mins/1440)
and s3.sample_time between s1.xdate and s1.xdate+(&&interval_mins/1440)
and s2.sample_id=s3.sample_id
and (s2.event  like 'gc%'
or s2.event like 'GC%'
or s2.event like 'ge%')
and s2.event not like '%remote message'
and s2.event not like '%sleep'
group by s1.xdate,s2.event
order by s1.xdate
/
undefine interval_mins

Sample output: (screenshot)



Added Notes -

1. Viewing Contents in OCR/Voting disks

 There are three possible ways to view the OCR contents:
 a. OCRDUMP, or
 b. crs_stat -p, or
 c. by running the strings command against the OCR file/device.
 Voting disk contents are not meant to be viewed directly, and there is normally no need to do so because they are continuously overwritten. If you still need to look at them, the strings command can be used.
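
For example, the OCR can be dumped and the voting disks listed with the following commands (a minimal sketch; run as root or the Grid Infrastructure owner):

# dump the OCR contents into a readable text file
ocrdump /tmp/ocr_dump.txt

# list the voting disks currently in use
crsctl query css votedisk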
 
2. Verifying Cluster Interconnect

 Cluster interconnects can be verified by:
 i. oifcfg getif
 ii. From AWR Report.
 iii. show parameter cluster_interconnect
 iv. srvctl config network
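
A quick illustration of (i), plus a related SQL check (a hedged sketch; the interconnect actually in use by each instance is visible in gv$cluster_interconnects):

# configured network interfaces and their roles (public / cluster_interconnect)
oifcfg getif

# interconnect in use by each running instance
sqlplus -s / as sysdba <<'EOF'
set linesize 150
column name format a12
column ip_address format a18
select inst_id, name, ip_address, is_public, source from gv$cluster_interconnects;
EOF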
 
3. Does scan IP required or we can disable it

The SCAN IP can be disabled if not required; however, the SCAN is mandatory during RAC installation.
Enabling/disabling the SCAN IP comes up mostly in Oracle Apps environments, because of the concurrent manager (a kind of job scheduler in Oracle Apps).
  
 To disable the SCAN IP,
 i. Do not use SCAN IP at the client end.
 ii. Stop scan listener
       srvctl stop scan_listener
 iii. Stop scan
       srvctl stop scan (this will stop the scan vip's)
 iv. Disable the SCAN and the SCAN listener (verification commands follow below)
      srvctl disable scan_listener
      srvctl disable scan
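
To confirm that the SCAN resources are stopped and disabled, they can be checked with srvctl (a quick sketch):

srvctl status scan_listener
srvctl status scan
srvctl config scan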
 
4. Migrating to a new Diskgroup - scenarios

a. Case 1: Migrating a disk group from one storage array to another, keeping the same disk group name
        1. Consider that the disk group is DATA.
        2. Create new disks for DATA pointing to the new storage (EMC).
                  a) Partitioning/provisioning is done by the storage team, who give you the device or multipath name, e.g. /dev/mapper/asakljdlas
        3. Add the new disk to disk group DATA
                  a) alter diskgroup data add disk '/dev/mapper/asakljdlas';
        4. Drop the old disks from DATA; rebalancing is done automatically.
           If you want, you can speed up the rebalance by raising asm_power_limit (e.g. alter system set asm_power_limit=11).
            alter diskgroup data drop disk OLD_DISK_NAME;
            Note: the old disk's ASM name and device path can be found in the NAME and PATH columns of v$asm_disk.
        5. Once the rebalance is complete, request the SAN team to detach the old storage (Hitachi). A consolidated SQL sketch follows this list.
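
A minimal sketch of steps 3–5 (device and disk names are placeholders; run against the ASM instance, e.g. ORACLE_SID=+ASM1, connected as SYSASM):

sqlplus -s / as sysasm <<'EOF'
-- add the new (EMC) disk; the device path is a placeholder
alter diskgroup data add disk '/dev/mapper/new_emc_disk1';

-- drop the old (Hitachi) disk by its ASM disk name; ASM rebalances automatically
alter diskgroup data drop disk DATA_0000 rebalance power 11;

-- wait until this returns no rows before asking the SAN team to detach the old storage
select operation, state, est_minutes from v$asm_operation;
EOF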

b. Case 2: Migrating the database from one disk group to another with a different disk group name.
        1) Create the disk group with the new name on the new storage.
        2) Create the spfile in the new disk group and change the relevant parameters (control_files, db_create_file_dest, etc.) with scope=spfile.
        3) Take a control file backup with format '+newdiskgroup'.
        4) Shut down the database, then STARTUP NOMOUNT.
        5) Restore the control file from the backup (the control file is now restored into the new disk group) and mount the database.
        6) Take an RMAN image-copy backup of the database with the new format (see the RMAN sketch after this list):
               RMAN> backup database as copy format '+newdiskgroup';
        7) RMAN> switch database to copy;
        8) Verify dba_data_files, dba_temp_files and v$log to confirm that all files point to the new disk group (online redo logs and temp files are not moved by the switch and may need to be recreated in the new disk group).
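
A hedged RMAN sketch of steps 6–7 (assumes the new disk group is called +NEWDG and a short outage is acceptable for the switch; copies taken while the database is open need recovery before it is reopened):

rman target / <<'EOF'
# image-copy the whole database into the new disk group
BACKUP AS COPY DATABASE FORMAT '+NEWDG';

# the switch requires the database to be mounted, not open
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
SWITCH DATABASE TO COPY;
RECOVER DATABASE;
ALTER DATABASE OPEN;
EOF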

c. Case 3: Migrating to a new disk group name when no additional storage or disk group is available (rename the disk group in place)
         1) Take an RMAN image-copy backup of the database(s) with a new format and place it on disk (as a safety net).
         2) Prepare rename commands from v$datafile, v$logfile, v$tempfile etc. using dynamic queries (see the query sketch after this list).
         3) Take a backup of the pfile and modify the following parameters to refer to the new disk group name:
                  control_files
                  db_create_file_dest
                  db_create_online_log_dest_1
                  db_create_online_log_dest_2
                  db_recovery_file_dest
         4) Stop the database.
         5) Dismount the disk group:
              asmcmd umount ORA_DATA
         6) Use the renamedg utility (11gR2 onwards) to rename it to the new disk group name:
              renamedg phase=both dgname=ORA_DATA newdgname=NEW_DATA verbose=true
         7) Mount the disk group:
              asmcmd mount NEW_DATA
         8) Start the database in MOUNT mode with the pfile modified in step 3.
         9) Run the rename-file scripts generated in step 2.
        10) Add the disk group to the cluster (if using RAC):
              srvctl modify database -d orcl -p +NEW_FRA/orcl/spfileorcl.ora
              srvctl modify database -d orcl -a "NEW_DATA"
              srvctl config database -d orcl
              srvctl start database -d orcl
        11) Delete the old disk group resource from the cluster:
              crsctl delete resource ora.ORA_DATA.dg
        12) Open the database.
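
A hedged example of the dynamic query from step 2, assuming the disk group is being renamed from ORA_DATA to NEW_DATA (spool and review the generated commands before running them in step 9):

sqlplus -s / as sysdba <<'EOF'
set heading off feedback off pagesize 0 linesize 500
spool rename_files.sql
select 'alter database rename file '''||name||''' to '''||
       replace(name,'+ORA_DATA','+NEW_DATA')||''';' from v$datafile
union all
select 'alter database rename file '''||member||''' to '''||
       replace(member,'+ORA_DATA','+NEW_DATA')||''';' from v$logfile
union all
select 'alter database rename file '''||name||''' to '''||
       replace(name,'+ORA_DATA','+NEW_DATA')||''';' from v$tempfile;
spool off
EOF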
 
5. Database rename in RAC – what would your checklist look like?

 a. Take the outputs of all the services that are running on the databases.
 b. set cluster_database=FALSE
 c. Drop all the services associated with the database.
 d. Stop the database
 e. Startup mount
 f. Use nid to change the DB name (a minimal invocation is sketched after this list).
    A common follow-up question: if you are using ASM, the usual datafile location would be something like '+DATA/datafile/OLDDBNAME/system01.dbf'.
    Does nid change this path too, to reflect the new DB name?
    Yes, it will: using the proper directory structure, ASM presents the files under the new name (e.g. '+DATA/datafile/NEWDBNAME/system01.dbf') via aliases that point back to the original directory structure.
    This has to be tested; we did not have a test bed, but thanks to Anji, who confirmed that it works.

 g. Change the parameters according to the new database name
 h. Change the password file.
 i. Stop the database.
 j. Mount the database
 k. Open the database with RESETLOGS.
 l. Create spfile from pfile.
 m. Add database to the cluster.
 n. Recreate the services that were dropped prior to the rename.
 o. Bounce the database.
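
A minimal sketch of step f (NEWDB is a placeholder; the database must be mounted, not open, and you are prompted for the SYS password):

# change both the DB name and the DBID; add SETNAME=YES to change only the name
nid TARGET=SYS DBNAME=NEWDB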
 
6. How do you find which database a particular service belongs to when a large number of databases run on the server and you cannot check them one by one manually?
Write a shell script that reads the database names from oratab and loops over them, passing each one to srvctl together with the service name supplied as input:
 

#!/bin/ksh
# Usage: ./find_service.ksh <service_name>
# Loops over every database listed in /etc/oratab and asks srvctl
# whether the given service is running there.
ORACLE_HOME=                               # set to your database home
PATH=$ORACLE_HOME/bin:$PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib
export ORACLE_HOME PATH LD_LIBRARY_PATH

for DB in `cat /etc/oratab | grep -v "^#" | cut -f1 -d: -s`
do
  export ORACLE_SID=$DB
  srvctl status service -d $DB -s $1 2>/dev/null | grep -i "is running"
done

 
7. Difference between OHAS and CRS
OHAS (Oracle High Availability Services) is the complete lower cluster stack, which includes some system-level tasks such as managing the network, time synchronization, disks, etc., whereas CRS manages resources such as databases, listeners and applications. With both of these, Oracle provides general high-availability clustering services rather than something tied only to databases.
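
A quick way to see the two layers side by side (a hedged sketch; run from the Grid Infrastructure home):

# resources managed by OHASD (the lower stack: ora.cssd, ora.ctssd, ora.asm, ...)
crsctl stat res -t -init

# resources managed by CRSD (databases, listeners, VIPs, services, ...)
crsctl stat res -t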


