Diagnosing an NBU Backup Error

mac2022-06-30 35

During a routine system check, I found that the daily backup had failed.

The error message was:

RMAN> backup incremental level 0 database;

Starting backup at 10-MAR-08
using target database controlfile instead of recovery catalog
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: sid=120 devtype=SBT_TAPE
channel ORA_SBT_TAPE_1: VERITAS NetBackup for Oracle - Release 5.0GA (2003103006)
channel ORA_SBT_TAPE_1: starting incremental level 0 datafile backupset
channel ORA_SBT_TAPE_1: specifying datafile(s) in backupset
input datafile fno=00001 name=/dev/vx/rdsk/maindbdg/lv_main00
input datafile fno=00008 name=/opt/oracle/oradata/oradata/bjdb01/users01.dbf
input datafile fno=00039 name=/opt/oracle/oradata/oradata/bjdb01/xdb02.dbf
input datafile fno=00009 name=/opt/oracle/oradata/oradata/bjdb01/xdb01.dbf
input datafile fno=00003 name=/opt/oracle/oradata/oradata/bjdb01/cwmlite01.dbf
input datafile fno=00004 name=/opt/oracle/oradata/oradata/bjdb01/drsys01.dbf
input datafile fno=00006 name=/opt/oracle/oradata/oradata/bjdb01/odm01.dbf
input datafile fno=00007 name=/opt/oracle/oradata/oradata/bjdb01/tools01.dbf
channel ORA_SBT_TAPE_1: starting piece 1 at 10-MAR-08
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ORA_SBT_TAPE_1 channel at 03/10/2008 11:31:12
ORA-19506: failed to create sequential file, name="tpjatl1b_1_1", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
VxBSACreateObject: Failed with error:
Server Status: unable to allocate new media for backup, storage unit has none available

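This error message suggests that the failure was caused by insufficient space. However, the subsequent backup failed with a different error: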

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ch00 channel at 03/10/2008 05:14:15
ORA-19502: write error on file "bk_26552_1_648968690", blockno 664577 (blocksize=512)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
VxBSASendData: Failed with error:
Server Status: Communication with the server has not been iniatated or the server status has not been retrieved from the server.

Judging from this error, the problem is not just a matter of space.
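When comparing two RMAN failures like these, it can help to reduce each stack to its error codes plus the free text that the media manager passes back after ORA-19511. The following is a minimal sketch, not part of the original diagnosis; the function name `parse_rman_stack` is hypothetical, and it assumes the stack has one message per line as RMAN normally prints it.

```python
import re

# Matches lines such as "RMAN-03009: ..." or "ORA-19511: ...".
ERROR_LINE = re.compile(r"^(RMAN|ORA)-(\d{5}): (.*)$")

# Banner lines that carry no diagnostic information.
BANNER_CODES = ("RMAN-00571", "RMAN-00569")


def parse_rman_stack(text):
    """Return (error codes, media manager text lines) from an RMAN error stack."""
    codes = []
    media_manager = []
    in_media_text = False
    for raw in text.splitlines():
        line = raw.strip()
        match = ERROR_LINE.match(line)
        if match:
            code = f"{match.group(1)}-{match.group(2)}"
            if code not in BANNER_CODES:
                codes.append(code)
            # ORA-19511 introduces free text from the media manager layer.
            in_media_text = code == "ORA-19511"
        elif in_media_text and line:
            media_manager.append(line)
    return codes, media_manager


# Second error stack from this post, one message per line.
SAMPLE = """\
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ch00 channel at 03/10/2008 05:14:15
ORA-19502: write error on file "bk_26552_1_648968690", blockno 664577 (blocksize=512)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
VxBSASendData: Failed with error:
Server Status: Communication with the server has not been iniatated or the server status has not been retrieved from the server.
"""

codes, media = parse_rman_stack(SAMPLE)
print(codes)
for line in media:
    print(line)
```

The Oracle codes only say that the SBT layer failed; the lines collected after ORA-19511 are the part that actually comes from NetBackup, which is why the diagnosis moves to the NetBackup side next.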

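In the jnbSA graphical interface, many of the administration options responded very slowly after being clicked and mostly returned no results. So I switched to bpadm on the command line and found the following under PROBLEM in the REPORT menu: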

03/11/2008 01:45:04 backupcenter240 bpexpdate Could not build host list: client hostname could not be found
03/11/2008 02:13:34 backupcenter240 bjdb01 cannot write image to media id 000013, drive index 0, I/O错误
03/11/2008 02:13:48 backupcenter240 bjdb01 backup by oracle on client bjdb01 using policy oracle: media write error
03/11/2008 02:14:04 backupcenter240 bjdb01 backup of client bjdb01 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 02:22:58 backupcenter240 bjdb01 cannot write image to media id 000013, drive index 0, I/O错误
03/11/2008 02:23:12 backupcenter240 bjdb01 backup by oracle on client bjdb01 using policy oracle: media write error
03/11/2008 02:23:19 backupcenter240 bjdb01 suspending further backup attempts for client bjdb01, policy oracle, schedule Cumulative-Inc because it has exceeded the configured number of tries
03/11/2008 02:23:19 backupcenter240 bjdb01 backup of client bjdb01 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 02:23:20 backupcenter240 - scheduler exiting - the backup failed to back up the requested files (6)
03/11/2008 09:32:42 backupcenter240 data03 cannot write image to media id 000016, drive index 0, I/O错误
03/11/2008 09:32:53 backupcenter240 data03 DOWN'ing drive index 0, it has had at least 3 errors in last 12 hour(s)
03/11/2008 09:32:55 backupcenter240 data03 backup by oracle on client data03 using policy bjdb03-ora: media write error
03/11/2008 09:33:02 backupcenter240 data03 backup of client data03 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 10:48:34 backupcenter240 data03 media manager terminated during mount of media id 000016, possible media mount timeout
03/11/2008 10:48:36 backupcenter240 data03 media manager terminated by parent process
03/11/2008 10:48:37 backupcenter240 data03 backup by oracle on client data03 using policy bjdb03-ora: the backup failed to back up the requested files
03/11/2008 10:48:38 backupcenter240 data03 suspending further backup attempts for client data03, policy bjdb03-ora, schedule diff because it has exceeded the configured number of tries
03/11/2008 10:48:38 backupcenter240 data03 backup of client data03 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 13:55:03 backupcenter240 bpexpdate Could not build host list: client hostname could not be found
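A report like this is easier to read once the write failures are tallied per media ID and drive index. The sketch below is my own illustration rather than a NetBackup tool: `count_write_failures` is a hypothetical helper, and the entry layout (date, time, server, client, message) is assumed from the report lines shown in this post.

```python
import re
from collections import Counter

# Assumed entry layout: "MM/DD/YYYY HH:MM:SS <server> <client> <message>".
ENTRY = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (\S+) (\S+) (.*)$")
WRITE_FAIL = re.compile(r"cannot write image to media id (\w+), drive index (\d+)")


def count_write_failures(report):
    """Count 'cannot write image' entries per (media id, drive index)."""
    counts = Counter()
    for line in report.splitlines():
        entry = ENTRY.match(line.strip())
        if entry is None:
            continue  # not a report entry
        hit = WRITE_FAIL.search(entry.group(4))
        if hit:
            counts[(hit.group(1), int(hit.group(2)))] += 1
    return counts


# A few entries copied from the problem report in this post.
SAMPLE_REPORT = """\
03/11/2008 01:45:04 backupcenter240 bpexpdate Could not build host list: client hostname could not be found
03/11/2008 02:13:34 backupcenter240 bjdb01 cannot write image to media id 000013, drive index 0, I/O错误
03/11/2008 02:22:58 backupcenter240 bjdb01 cannot write image to media id 000013, drive index 0, I/O错误
03/11/2008 09:32:42 backupcenter240 data03 cannot write image to media id 000016, drive index 0, I/O错误
03/11/2008 09:32:53 backupcenter240 data03 DOWN'ing drive index 0, it has had at least 3 errors in last 12 hour(s)
"""

failures = count_write_failures(SAMPLE_REPORT)
for (media_id, drive), n in sorted(failures.items()):
    print(f"media {media_id} on drive {drive}: {n} write failure(s)")
```

In this report the failures hit two different tapes (000013 and 000016) but always drive index 0. That is consistent with the "DOWN'ing drive index 0" entry, which appears right after the third write error.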

Looking further into the detailed log, I found a large number of errors:

03/11/2008 18:23:59 backupcenter240 - cleaning job DB
03/11/2008 18:23:59 backupcenter240 - all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:23:59 backupcenter240 - no drives up on storage unit <backupcenter240-hcart-robot-tld-0>
03/11/2008 18:24:00 bjdb01 - all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:24:00 backupcenter240 - no drives up on storage unit <bjdb01-hcart-robot-tld-0>
03/11/2008 18:24:31 backupcenter240 - all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:24:31 backupcenter240 - no drives up on storage unit <unit_99>
03/11/2008 18:24:32 backupcenter240 - all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:24:32 backupcenter240 - no drives up on storage unit <unit_data>
03/11/2008 18:24:32 backupcenter240 data03 skipping backup of client data03, policy bjdb03-ora, schedule diff because it has exceeded the configured number of tries
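To see how widespread the outage is, the storage units named in the "no drives up" messages can be pulled out of the log. This is a small sketch under the assumption that the message wording matches the log lines in this post; `units_without_drives` is a hypothetical name.

```python
import re

# Message wording taken from the log excerpt in this post.
NO_DRIVES = re.compile(r"no drives up on storage unit <([^>]+)>")


def units_without_drives(log):
    """Return the distinct storage units reported as having no drives up."""
    return sorted(set(NO_DRIVES.findall(log)))


SAMPLE_LOG = """\
03/11/2008 18:23:59 backupcenter240 - no drives up on storage unit <backupcenter240-hcart-robot-tld-0>
03/11/2008 18:24:00 backupcenter240 - no drives up on storage unit <bjdb01-hcart-robot-tld-0>
03/11/2008 18:24:31 backupcenter240 - no drives up on storage unit <unit_99>
03/11/2008 18:24:32 backupcenter240 - no drives up on storage unit <unit_data>
"""

affected = units_without_drives(SAMPLE_LOG)
for unit in affected:
    print(unit)
```

All four storage units are affected at once, and every message names the same robot (number 0, type TLD), which points at shared hardware rather than at a single policy or client.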

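Judging from these messages, the robot appears to be at fault. If that is really the case, it also explains why the two backup errors differ. When a tape filled up, the robot tried to swap in a new tape and failed at that point, so the backup running at the time could not write and reported that no space was available. Subsequent backups had no tape to write to because of the robot fault, so they reported that communication with NetBackup had not been initialized.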

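Continuing with the media report, the summary shows:

Number of ACTIVE media that, as of now:
There are no ACTIVE media present in the media database

This further confirms the diagnosis above: the robot fault prevented usable tapes from being loaded into the drive, so the system had no usable media.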

Check the robot status with tpconfig:

Index DriveName        DrivePath      Type   Shared  Status
***** *********        **********     ****   ******  ******
0     IBMULTRIUM-TD10  /dev/rmt/1cbn  hcart  Yes     DOWN
      TLD(0) Definition DRIVE=1

Currently defined robotics are:
TLD(0) robotic path = /dev/sg/c2t4l1,
volume database host = backupcenter240

tpconfig reports a status of DOWN, so the problem has essentially been pinpointed.
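If this check were repeated regularly, the drive table could be parsed and any drive that is not UP flagged automatically. The sketch below assumes the six-column layout shown in this post (Index, DriveName, DrivePath, Type, Shared, Status) and drive names without spaces; `parse_drive_table` is a hypothetical helper, not part of NetBackup, and the sample text is pasted rather than read from a live tpconfig run.

```python
def parse_drive_table(tpconfig_output):
    """Map drive index -> (name, path, status) from a tpconfig drive listing."""
    drives = {}
    for line in tpconfig_output.splitlines():
        fields = line.split()
        # Drive rows have six columns and start with a numeric index.
        if len(fields) == 6 and fields[0].isdigit():
            index, name, path, _dtype, _shared, status = fields
            drives[int(index)] = (name, path, status)
    return drives


# Drive listing as shown in this post (column spacing is an assumption).
SAMPLE_OUTPUT = """\
Index DriveName        DrivePath      Type   Shared  Status
***** *********        **********     ****   ******  ******
0     IBMULTRIUM-TD10  /dev/rmt/1cbn  hcart  Yes     DOWN
      TLD(0) Definition DRIVE=1
"""

drives = parse_drive_table(SAMPLE_OUTPUT)
not_up = [index for index, (_, _, status) in drives.items() if status != "UP"]
for index in not_up:
    name, path, status = drives[index]
    print(f"drive {index} ({name}, {path}) is {status}")
```

Note that the Status column belongs to the drive entry. With a single drive behind the robot, a DOWN drive is enough to make every storage unit unusable.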

Try checking the robot with robtest:

bash-2.03# robtest
Configured robots with local control supporting test utilities:
TLD(0) robotic path = /dev/sg/c2t4l1

Robot Selection
---------------
1) TLD 0
2) none/quit
Enter choice: 1

Robot selected: TLD(0) robotic path = /dev/sg/c2t4l1

Invoking robotic test utility:
/usr/openv/volmgr/bin/tldtest -r /dev/sg/c2t4l1 -d1 /dev/rmt/1cbn

Opening /dev/sg/c2t4l1
MODE_SENSE complete
Enter tld commands (? returns help information)
?

To exit the utility, type q or Q.

init - Initialize element status
initrange <d#|s#|p#|t> [#]- Init element status range
allow - Allow media removal
prevent - Prevent media removal
extend - Extend media access port
retract - Retract media access port
mode - Mode sense
m <from> <to> - Move medium
pos <to> - Position to drive or slot
s [d|p|t|s [n]] [raw] - Read element status
inquiry - Display vendor and product ID
rezero - Rezero unit
inport - Ready inport (media access port)
debug - Toggle debug mode for this utility
test_ready - Send a TEST UNIT READY to the device

<from> <to> specifies drive (d#), slot (s#), media access port (p#),
or transport (t#)
<d#|s#|p#|t#> is drive #, slot #, media access port #, or transport #
[#] is number of elements for d, s, p, or t
NOTE - drive # is 1 - Number of drives
slot # is 1 - Number of slots
media access port # is 1 - Number of media access port elements
transport # is 1 - Number of transports
<type> = (d)rive, (s)lot, media access (p)ort, or (t)ransport

unload <drive> - Issue SCSI unload
<drive> = d1 or 1, d2 or 2, d3 or 3 ... d648 or 648

inquiry
Inquiry_data: STK L40 0213
test_ready
Unit is ready
q

Robot Selection
---------------
1) TLD 0
2) none/quit
Enter choice:

After I issued the test_ready command and waited a while, the status returned to normal:

Index DriveName        DrivePath      Type   Shared  Status
***** *********        **********     ****   ******  ******
0     IBMULTRIUM-TD10  /dev/rmt/1cbn  hcart  Yes     UP
      TLD(0) Definition DRIVE=1

Currently defined robotics are:
TLD(0) robotic path = /dev/sg/c2t4l1,
volume database host = backupcenter240

Next, try a backup:

$ rman target /

Recovery Manager: Release 9.2.0.4.0 - 64bit Production

connected to target database: BJDB01 (DBID=3255963758)

RMAN> backup current controlfile;

Starting backup at 11-MAR-08
using target database controlfile instead of recovery catalog
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: sid=19 devtype=SBT_TAPE
channel ORA_SBT_TAPE_1: VERITAS NetBackup for Oracle - Release 5.0GA (2003103006)
channel ORA_SBT_TAPE_1: starting full datafile backupset
channel ORA_SBT_TAPE_1: specifying datafile(s) in backupset
including current controlfile in backupset
channel ORA_SBT_TAPE_1: starting piece 1 at 11-MAR-08
channel ORA_SBT_TAPE_1: finished piece 1 at 11-MAR-08
piece handle=ttjb17ur_1_1 comment=API Version 2.0,MMS Version 5.0.0.0
channel ORA_SBT_TAPE_1: backup set complete, elapsed time: 00:04:56
Finished backup at 11-MAR-08

Starting Control File Autobackup at 11-MAR-08
piece handle=c-3255963758-20080311-00 comment=API Version 2.0,MMS Version 5.0.0.0
Finished Control File Autobackup at 11-MAR-08

The backup attempt finally succeeded.

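Unfortunately, only small backups worked. As soon as a larger file was backed up, the same errors as before appeared again, and the NetBackup log filled up with I/O errors: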

03/12/2008 09:42:51 backupcenter240 bjdb01 cannot write image to media id 000016, drive index 0, I/O错误
03/12/2008 09:42:51 backupcenter240 bjdb01 FREEZING media id 000016, it has had at least 3 errors in the last 12 hour(s)
03/12/2008 09:43:08 backupcenter240 bjdb01 CLIENT bjdb01 POLICY oracle SCHED Default-Application-Backup EXIT STATUS 84 (media write error)
03/12/2008 09:43:08 backupcenter240 bjdb01 backup by oracle on client bjdb01: media write error
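Two things in this log are worth extracting when the same failure keeps recurring: which media NetBackup has frozen, and which exit status the job ended with. The following is a rough sketch that assumes the message wording seen in this post; `summarize_failures` is a hypothetical name.

```python
import re

# Patterns based on the wording of the log lines in this post.
FROZEN = re.compile(r"FREEZING media id (\w+)")
EXIT_STATUS = re.compile(r"EXIT STATUS (\d+) \(([^)]*)\)")


def summarize_failures(log):
    """Return (sorted frozen media ids, list of (status code, description))."""
    frozen = sorted(set(FROZEN.findall(log)))
    statuses = [(int(code), text) for code, text in EXIT_STATUS.findall(log)]
    return frozen, statuses


SAMPLE_LOG = """\
03/12/2008 09:42:51 backupcenter240 bjdb01 cannot write image to media id 000016, drive index 0, I/O错误
03/12/2008 09:42:51 backupcenter240 bjdb01 FREEZING media id 000016, it has had at least 3 errors in the last 12 hour(s)
03/12/2008 09:43:08 backupcenter240 bjdb01 CLIENT bjdb01 POLICY oracle SCHED Default-Application-Backup EXIT STATUS 84 (media write error)
03/12/2008 09:43:08 backupcenter240 bjdb01 backup by oracle on client bjdb01: media write error
"""

frozen, statuses = summarize_failures(SAMPLE_LOG)
print("frozen media:", frozen)
print("exit statuses:", statuses)
```

A frozen tape is no longer selected for new backups, so repeated write errors on a faulty drive can take healthy tapes out of rotation one after another. Once the hardware is fixed, any media frozen during the incident is worth reviewing.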

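So this was no longer just a software problem. The vendor eventually confirmed that the tape library's read/write head was faulty, and replacing the part resolved the issue.

Reposted from: https://www.cnblogs.com/myitworld/archive/2008/04/22/2214883.html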
