We had two SRX210s in an HA cluster that both accidentally lost power. When they came back up, both had lost ALL of their configuration (similar to a factory reset!). This was very strange and took us some time to fix. (This was a month ago.)
Later (this week), an SRX100 on a totally different network (on a different continent, too) died with a similar problem. When I looked at the configuration, it was completely gone, and I had to recover from a backup.
Has anyone seen such a problem? The devices were running quite old firmware (10.x). Is it a bug? An attack? A hardware problem?
Update: The SRX100 was replaced with a backup device (also an SRX100), and the faulty device was upgraded to the latest stable firmware and loaded with roughly the same config as before. It was then set up in a test network and stress-tested for a few days, and this weekend it died again. (It was showing a RED light on STATUS, and no traffic was passing through it.) The serial console was connected the whole time, and this is its output:
U-Boot 1.1.6-JNPR-2.7 (Build time: Nov 26 2013 - 19:04:49)
Initializing memory this may take some time...
Measured DDR clock 266.62 MHz
SRX_100_LOWMEM board revision major:0, minor:0, serial #: AT0112AF1168
OCTEON CN5020-SCP pass 1.1, Core clock: 500 MHz, DDR clock: 266 MHz (532 Mhz d)
DRAM: 512 MB
Starting Memory POST...
Checking datalines... OK
Checking address lines... OK
Checking 512K memory for U-Boot... OK.
Running U-Boot CRC Test... OK.
Flash: 4 MB
USB: scanning bus for devices... 4 USB Device(s) found
scanning bus for storage devices... 2 Storage Device(s) found
Clearing DRAM....... done
BIST check passed.
Boot Media: nand-flash usb
Net: pic init done (err = 0)octeth0
POST Passed
Press SPACE to abort autoboot in 1 seconds
ELF file is 32 bit
Loading .text @ 0x8f0000a0 (246560 bytes)
Loading .rodata @ 0x8f03c3c0 (14144 bytes)
Loading .reginfo @ 0x8f03fb00 (24 bytes)
Loading .rodata.str1.4 @ 0x8f03fb18 (16516 bytes)
Loading set_Xcommand_set @ 0x8f043b9c (96 bytes)
Loading .rodata.cst4 @ 0x8f043bfc (20 bytes)
Loading .data @ 0x8f044000 (5744 bytes)
Loading .data.rel.ro @ 0x8f045670 (120 bytes)
Loading .data.rel @ 0x8f0456e8 (136 bytes)
Clearing .bss @ 0x8f045770 (11600 bytes)
## Starting application at 0x8f0000a0 ...
Consoles: U-Boot console
Found compatible API, ver. 2.7
FreeBSD/MIPS U-Boot bootstrap loader, Revision 2.7
(ccheng@svl-junos-d081.juniper.net, Tue Nov 26 19:05:43 PST 2013)
Memory: 512MB
[0]Booting from nand-flash slice 2
Un-Protected 1 sectors
writing to flash...
Protected 1 sectors
Loading /boot/defaults/loader.conf
/kernel data=0xb0496c+0x1344a4 syms=[0x4+0x8a9e0+0x4+0xc8f47]
Hit [Enter] to boot immediately, or space bar for command prompt.
Booting [/kernel]...
Kernel entry at 0x801000e0 ...
init regular console
Primary ICache: Sets 64 Size 128 Asso 4
Primary DCache: Sets 1 Size 128 Asso 64
Secondary DCache: Sets 128 Size 128 Asso 8
GDB: debug ports: uart
GDB: current port: uart
KDB: debugger backends: ddb gdb
KDB: current backend: ddb
kld_map_v: 0x8ff80000, kld_map_p: 0x0
Copyright (c) 1996-2014, Juniper Networks, Inc.
All rights reserved.
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
JUNOS 12.1X44-D35.5 #0: 2014-05-19 21:36:43 UTC
builder@dagmath.juniper.net:/volume/build/junos/12.1/service/12.1X44-D35.5l
JUNOS 12.1X44-D35.5 #0: 2014-05-19 21:36:43 UTC
builder@dagmath.juniper.net:/volume/build/junos/12.1/service/12.1X44-D35.5l
real memory = 536870912 (512MB)
avail memory = 304193536 (290MB)
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
Security policy loaded: JUNOS MAC/pcap (mac_pcap)
Security policy loaded: JUNOS MAC/runasnonroot (mac_runasnonroot)
netisr_init: !debug_mpsafenet, forcing maxthreads from 2 to 1
cpu0 on motherboard
: CAVIUM's OCTEON 5020 CPU Rev. 0.1 with no FPU implemented
L1 Cache: I size 32kb(128 line), D size 8kb(128 line), sixty four way.
L2 Cache: Size 128kb, 8 way
obio0 on motherboard
uart0: <Octeon-16550 channel 0> on obio0
uart0: console (9600,n,8,1)
twsi0 on obio0
dwc0: <Synopsis DWC OTG Controller Driver> on obio0
usb0: <USB Bus for DWC OTG Controller> on dwc0
usb0: USB revision 2.0
uhub0: vendor 0x0000 DWC OTG root hub, class 9/0, rev 2.00/1.00, addr 1
uhub0: 1 port with 1 removable, self powered
uhub1: vendor 0x0409 product 0x005a, class 9/0, rev 2.00/1.00, addr 2
uhub1: single transaction translator
uhub1: 2 ports with 1 removable, self powered
umass0: STMicroelectronics ST72682 High Speed Mode, rev 2.00/2.10, addr 3
umass1: Kingston DT 101 G2, rev 2.00/1.00, addr 4
cpld0 on obio0
pcib0: <Cavium on-chip PCI bridge> on obio0
Disabling Octeon big bar support
PCI Status: PCI 32-bit: 0xc041b
pcib0: Initialized controller
pci0: <PCI bus> on pcib0
pci0: <serial bus, USB> at device 2.0 (no driver attached)
pci0: <serial bus, USB> at device 2.1 (no driver attached)
pci0: <serial bus, USB> at device 2.2 (no driver attached)
gblmem0 on obio0
octpkt0: <Octeon RGMII> on obio0
cfi0: <AMD/Fujitsu - 4MB> on obio0
Timecounter "mips" frequency 500000000 Hz quality 0
###PCB Group initialized for udppcbgroup
###PCB Group initialized for tcppcbgroup
da1 at umass-sim1 bus 1 target 0 lun 0
da1: <Kingston DT 101 G2 PMAP> Removable Direct Access SCSI-0 device
da1: 40.000MB/s transfers
da1: 15304MB (31342592 512 byte sectors: 255H 63S/T 1950C)
da0 at umass-sim0 bus 0 target 0 lun 0
da0: <ST ST72682 2.10> Removable Direct Access SCSI-2 device
da0: 40.000MB/s transfers
da0: 1000MB (2048000 512 byte sectors: 64H 32S/T 1000C)
Trying to mount root from ufs:/dev/da0s2a
WARNING: / was not properly dismounted
Attaching /cf/packages/junos via /dev/mdctl...
Mounted junos package on /dev/md0...
Media check on da0
Automatic reboot in progress...
** /dev/da0s2a
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
142 files, 75006 used, 75032 free (32 frags, 9375 blocks, 0.0% fragmentation)
***** FILE SYSTEM MARKED CLEAN *****
Verified junos signed by PackageProduction_12_1_0
Verified jboot signed by PackageProduction_12_1_0
Ignoring watchdog timeout during boot/reboot
veriexec: cannot verify /packages/junos-12.1X44-D35.5-domestic.sig: ERROR: Faic
** /dev/bo0s3e
** Last Mounted on /config
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
19 files, 50 used, 12388 free (36 frags, 1544 blocks, 0.3% fragmentation)
***** FILE SYSTEM MARKED CLEAN *****
** /dev/bo0s3f
** Last Mounted on /cf/var
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? yes
SUMMARY INFORMATION BAD
SALVAGE? yes
BLK(S) MISSING IN BIT MAPS
SALVAGE? yes
637 files, 10808 used, 164510 free (254 frags, 20532 blocks, 0.1% fragmentatio)
***** FILE SYSTEM MARKED CLEAN *****
***** FILE SYSTEM WAS MODIFIED *****
Loading configuration ...
vn_read_compressed: inflate of bytepos 86966272, offset in file = 51491159, er}
panic: bad inflate
cpuid = 0
KDB: stack backtrace:
SP 0: not in kernel
uart_z8530_class+0x0 (0,0,0,0) ra 0 sz 0
pid 54, process: md0
###Entering boot mastership relinquish phase
KDB: enter: panic
[thread pid 54 tid 100048 ]
Stopped at breakpoint+0x4: jr ra
db>
Please note the following:
- A USB drive was plugged into the device
- A serial cable was connected to the device
- The device was NOT on a UPS, and power instability might have occurred
- 4 different networks were created, and 3 of these were monitored
I hope someone can shed some light on what might have gone wrong.
Update 2:
Pressing the power button briefly did nothing, but pressing and holding it for 6+ seconds powered the device OFF. When I turned it ON again, it loaded the config as normal. So the device was NOT wiped this time, unlike the first time.
Best Answer
It sounds like a couple of different issues are going on here...
Older versions of JunOS were known to corrupt things during a power failure. Remember that JunOS is based on FreeBSD, and so there was an implicit assumption that you would do a proper shutdown before killing power.
To mitigate this, JunOS has a rescue config. If the regular config is corrupt or unreadable, the device loads the rescue config instead. Did you have your rescue config set? If not, you should have. My best practice is to update the rescue config on production systems after any config changes are finalized, tested, and approved. A missing rescue config could explain the "revert to defaults" issue you had with the SRX210s. (The second possibility is that your cluster was unhealthy and configs were not syncing between the nodes as expected. See the commands here to verify the cluster is working.)
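As a rough sketch of those two checks, the rescue config can be saved and inspected from operational mode, and cluster health can be checked on either node. The prompt user@srx is a placeholder, and output is omitted because it varies by release and platform:

```
user@srx> request system configuration rescue save
user@srx> show system configuration rescue
user@srx> show chassis cluster status
user@srx> show chassis cluster interfaces
```

The first command copies the currently committed configuration into the rescue slot, so it is best run only after a change has been tested.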
Additionally, it was possible to actually corrupt the root filesystem of an older JunOS device, so that it would fail to boot at all and would drop to a db> prompt for debugging.
In newer versions of JunOS, the concerns about corruption due to power failure are pretty much resolved. The addition of Resilient Dual Root Partitions helped a lot. Note that if you are upgrading from a much older version that predates these features, there are some extra boot loader / partition table changes that are required during the upgrade process. http://www.juniper.net/techpubs/en_US/junos11.4/information-products/topic-collections/security/software-all/initial-config/index.html?topic-56813.html
Make sure you are running the latest recommended JunOS build on both SRXes. If you upgraded from a much older version, make sure you followed the instructions and did not skip a step as you may miss out on the dual partition features.
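A minimal sketch of how to check whether a box actually ended up with dual root partitions, assuming a branch SRX such as the SRX100 or SRX210. The prompt and the package path are placeholders:

```
user@srx> show system storage partitions
user@srx> show system snapshot media internal
user@srx> request system snapshot slice alternate
```

The first two commands should show both an active and a backup root slice and the JunOS version on each. The third copies the running image to the alternate slice, so the backup does not stay on an older release. If no backup partition is reported, the install likely skipped the repartitioning step. On these platforms that step is normally requested with the partition option, which reformats the internal media, so back up the configuration first and check Juniper's upgrade notes for your release:

```
user@srx> request system software add /var/tmp/<junos-package>.tgz no-copy no-validate partition reboot
```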
The failure you saw on the SRX100 looks like a root filesystem corruption issue (panic: bad inflate is a big clue). However, it looks like you upgraded to a new enough JunOS version that a corrupt root FS should never happen. Also, if you rebooted and it magically started working again, that smells like the built-in flash storage is dying. I would open a ticket with JTAC for a replacement or buy a new one.
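Before opening the ticket, it may help to capture some state from the device while it is up. The commands below are only a suggested starting point, not a list JTAC requires, and the prompt and output file name are placeholders:

```
user@srx> show chassis alarms
user@srx> show system boot-messages
user@srx> show system core-dumps
user@srx> request support information | save /var/tmp/rsi.txt
```

Attaching the saved file together with the serial console capture from the panic gives support both the current state and the failure itself.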