Juniper SRX100 and SRX210 go to factory reset for no reason

juniper-srx

We had two SRX210s in an HA cluster that both accidentally lost power. When they came up again, both had lost ALL their setup (similar to a factory reset!). This was very strange and took us some time to fix. (This was a month ago.)

Later (this week), an SRX100 on a totally different network (on a different continent, too) died with a similar problem. When I looked at the setup, it was all totally gone, and I had to recover from a backup.

Has anyone seen such a problem? These devices were running quite old firmware (10.x). Is it a bug? An attack? A hardware problem?

Update: the SRX100 was replaced with a backup device (also an SRX100), and the faulty device was upgraded to the latest stable firmware and loaded with the same-ish config as before. It was then set up in a test network and stress-tested for a few days, and this weekend it died again (it was showing a RED light on STATUS, and no traffic was passing through it). The serial console was connected the whole time, and this is its content:

U-Boot 1.1.6-JNPR-2.7 (Build time: Nov 26 2013 - 19:04:49)                     

Initializing memory this may take some time...                             
Measured DDR clock 266.62 MHz                                                  
SRX_100_LOWMEM board revision major:0, minor:0, serial #: AT0112AF1168         
OCTEON CN5020-SCP pass 1.1, Core clock: 500 MHz, DDR clock: 266 MHz (532 Mhz d)
DRAM:  512 MB                                                                  
Starting Memory POST...                                                        
Checking datalines... OK                                                       
Checking address lines... OK                                                   
Checking 512K memory for U-Boot... OK.                                         
Running U-Boot CRC Test... OK.                                                 
Flash:  4 MB                                                                   
USB:   scanning bus for devices... 4 USB Device(s) found                       
       scanning bus for storage devices... 2 Storage Device(s) found           
Clearing DRAM....... done                                                      
BIST check passed.                                                             
Boot Media: nand-flash usb                                                     
Net:   pic init done (err = 0)octeth0                                          
POST Passed                                                                    
Press SPACE to abort autoboot in 1 seconds                                     
ELF file is 32 bit                                                             
Loading .text @ 0x8f0000a0 (246560 bytes)                                      
Loading .rodata @ 0x8f03c3c0 (14144 bytes)                                     
Loading .reginfo @ 0x8f03fb00 (24 bytes)                                       
Loading .rodata.str1.4 @ 0x8f03fb18 (16516 bytes)                              
Loading set_Xcommand_set @ 0x8f043b9c (96 bytes)                               
Loading .rodata.cst4 @ 0x8f043bfc (20 bytes)                                   
Loading .data @ 0x8f044000 (5744 bytes)                                        
Loading .data.rel.ro @ 0x8f045670 (120 bytes)                                  
Loading .data.rel @ 0x8f0456e8 (136 bytes)                                     
Clearing .bss @ 0x8f045770 (11600 bytes)                                       
## Starting application at 0x8f0000a0 ...                                      
Consoles: U-Boot console                                                       
Found compatible API, ver. 2.7                                                 

FreeBSD/MIPS U-Boot bootstrap loader, Revision 2.7                             
(ccheng@svl-junos-d081.juniper.net, Tue Nov 26 19:05:43 PST 2013)              
Memory: 512MB                                                                  
[0]Booting from nand-flash slice 2                                             
Un-Protected 1 sectors                                                         
writing to flash...                                                            
Protected 1 sectors                                                            
Loading /boot/defaults/loader.conf                                             
/kernel data=0xb0496c+0x1344a4 syms=[0x4+0x8a9e0+0x4+0xc8f47]                  


Hit [Enter] to boot immediately, or space bar for command prompt.              
Booting [/kernel]...                                                           
Kernel entry at 0x801000e0 ...                                                 
init regular console                                                           
Primary ICache: Sets 64 Size 128 Asso 4                                        
Primary DCache: Sets 1 Size 128 Asso 64                                        
Secondary DCache: Sets 128 Size 128 Asso 8                                     
GDB: debug ports: uart                                                         
GDB: current port: uart                                                        
KDB: debugger backends: ddb gdb                                                
KDB: current backend: ddb                                                      
kld_map_v: 0x8ff80000, kld_map_p: 0x0                                          
Copyright (c) 1996-2014, Juniper Networks, Inc.                                
All rights reserved.                                                           
Copyright (c) 1992-2006 The FreeBSD Project.                                   
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994       
        The Regents of the University of California. All rights reserved.      
JUNOS 12.1X44-D35.5 #0: 2014-05-19 21:36:43 UTC                                
    builder@dagmath.juniper.net:/volume/build/junos/12.1/service/12.1X44-D35.5l
JUNOS 12.1X44-D35.5 #0: 2014-05-19 21:36:43 UTC                                
    builder@dagmath.juniper.net:/volume/build/junos/12.1/service/12.1X44-D35.5l
real memory  = 536870912 (512MB)                                               
avail memory = 304193536 (290MB)                                               
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs                            
Security policy loaded: JUNOS MAC/pcap (mac_pcap)                              
Security policy loaded: JUNOS MAC/runasnonroot (mac_runasnonroot)              
netisr_init: !debug_mpsafenet, forcing maxthreads from 2 to 1                  
cpu0 on motherboard                                                            
: CAVIUM's OCTEON 5020 CPU Rev. 0.1 with no FPU implemented                    
        L1 Cache: I size 32kb(128 line), D size 8kb(128 line), sixty four way. 
        L2 Cache: Size 128kb, 8 way                                            
obio0 on motherboard                                                           
uart0: <Octeon-16550 channel 0> on obio0                                       
uart0: console (9600,n,8,1)                                                    
twsi0 on obio0                                                                 
dwc0: <Synopsis DWC OTG Controller Driver> on obio0                            
usb0: <USB Bus for DWC OTG Controller> on dwc0                                 
usb0: USB revision 2.0                                                         
uhub0: vendor 0x0000 DWC OTG root hub, class 9/0, rev 2.00/1.00, addr 1        
uhub0: 1 port with 1 removable, self powered                                   
uhub1: vendor 0x0409 product 0x005a, class 9/0, rev 2.00/1.00, addr 2          
uhub1: single transaction translator                                           
uhub1: 2 ports with 1 removable, self powered                                  
umass0: STMicroelectronics ST72682  High Speed Mode, rev 2.00/2.10, addr 3     
umass1: Kingston DT 101 G2, rev 2.00/1.00, addr 4                              
cpld0 on obio0                                                                 
pcib0: <Cavium on-chip PCI bridge> on obio0                                    
Disabling Octeon big bar support                                               
PCI Status: PCI 32-bit: 0xc041b                                                
pcib0: Initialized controller                                                  
pci0: <PCI bus> on pcib0                                                       
pci0: <serial bus, USB> at device 2.0 (no driver attached)                     
pci0: <serial bus, USB> at device 2.1 (no driver attached)                     
pci0: <serial bus, USB> at device 2.2 (no driver attached)                     
gblmem0 on obio0                                                               
octpkt0: <Octeon RGMII> on obio0                                               
cfi0: <AMD/Fujitsu - 4MB> on obio0                                             
Timecounter "mips" frequency 500000000 Hz quality 0                            
###PCB Group initialized for udppcbgroup                                       
###PCB Group initialized for tcppcbgroup                                       
da1 at umass-sim1 bus 1 target 0 lun 0                                         
da1: <Kingston DT 101 G2 PMAP> Removable Direct Access SCSI-0 device           
da1: 40.000MB/s transfers                                                      
da1: 15304MB (31342592 512 byte sectors: 255H 63S/T 1950C)                     
da0 at umass-sim0 bus 0 target 0 lun 0                                         
da0: <ST ST72682 2.10> Removable Direct Access SCSI-2 device                   
da0: 40.000MB/s transfers                                                      
da0: 1000MB (2048000 512 byte sectors: 64H 32S/T 1000C)                        
Trying to mount root from ufs:/dev/da0s2a                                      
WARNING: / was not properly dismounted                                         
Attaching /cf/packages/junos via /dev/mdctl...                                 
Mounted junos package on /dev/md0...                                           

Media check on da0                                                             
Automatic reboot in progress...                                                
** /dev/da0s2a                                                                 
** Last Mounted on /                                                           
** Root file system                                                            
** Phase 1 - Check Blocks and Sizes                                            
** Phase 2 - Check Pathnames                                                   
** Phase 3 - Check Connectivity                                                
** Phase 4 - Check Reference Counts                                            
** Phase 5 - Check Cyl groups                                                  
142 files, 75006 used, 75032 free (32 frags, 9375 blocks, 0.0% fragmentation)  

***** FILE SYSTEM MARKED CLEAN *****                                           
Verified junos signed by PackageProduction_12_1_0                              
Verified jboot signed by PackageProduction_12_1_0                              
Ignoring watchdog timeout during boot/reboot                                   
veriexec: cannot verify /packages/junos-12.1X44-D35.5-domestic.sig: ERROR: Faic
** /dev/bo0s3e                                                                 
** Last Mounted on /config                                                     
** Phase 1 - Check Blocks and Sizes                                            
** Phase 2 - Check Pathnames                                                   
** Phase 3 - Check Connectivity                                                
** Phase 4 - Check Reference Counts                                            
** Phase 5 - Check Cyl groups                                                  
19 files, 50 used, 12388 free (36 frags, 1544 blocks, 0.3% fragmentation)      

***** FILE SYSTEM MARKED CLEAN *****                                           
** /dev/bo0s3f                                                                 
** Last Mounted on /cf/var                                                     
** Phase 1 - Check Blocks and Sizes                                            
** Phase 2 - Check Pathnames                                                   
** Phase 3 - Check Connectivity                                                
** Phase 4 - Check Reference Counts                                            
** Phase 5 - Check Cyl groups                                                  
FREE BLK COUNT(S) WRONG IN SUPERBLK                                            
SALVAGE? yes                                                                   

SUMMARY INFORMATION BAD                                                        
SALVAGE? yes                                                                   

BLK(S) MISSING IN BIT MAPS                                                     
SALVAGE? yes                                                                   

637 files, 10808 used, 164510 free (254 frags, 20532 blocks, 0.1% fragmentatio)

***** FILE SYSTEM MARKED CLEAN *****                                           

***** FILE SYSTEM WAS MODIFIED *****                                           
Loading configuration ...                                                      
vn_read_compressed: inflate of bytepos 86966272, offset in file = 51491159, er}
panic: bad inflate                                                             
cpuid = 0                                                                      
KDB: stack backtrace:                                                          
SP 0: not in kernel                                                            
uart_z8530_class+0x0 (0,0,0,0) ra 0 sz 0                                       
pid 54, process: md0                                                           
###Entering boot mastership relinquish phase                                   
KDB: enter: panic                                                              
[thread pid 54 tid 100048 ]                                                    
Stopped at      breakpoint+0x4: jr      ra                                     
db>                                                                            

Please note the following:

  1. a USB drive was in the device
  2. a Serial cable was in the device
  3. The device was NOT on a UPS, and power instability might have occurred.
  4. 4 different networks were created, and 3 of these were monitored

I hope someone can shed some light on what might have gone wrong.

Update2:
Pressing the power button did nothing, but pressing and holding it for 6+ seconds turned the device OFF. When I turned it ON again, it loaded the config as normal. So the device was NOT wiped this time, unlike the first time.

Best Answer

It sounds like a couple of different issues are going on here...

Older versions of JunOS were known to corrupt things during a power failure. Remember that JunOS is based on FreeBSD, and so there was an implicit assumption that you would do a proper shutdown before killing power.
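For reference, a proper shutdown from the CLI before cutting power looks roughly like the sketch below. The `user@srx>` prompt is a placeholder, and I'm assuming a branch SRX where both operational commands are available. The first command halts the system, the second powers it off.

```
user@srx> request system halt
user@srx> request system power-off
```

Both commands ask for confirmation before they act. Wait for the console to report that the system has halted before removing power.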

To mitigate this, JunOS has a rescue config. If the regular config is corrupt or unreadable, it loads the rescue config instead. Did you have your rescue config set? If not, you should have. My best practice is to update the rescue config on production systems after any config changes are finalized, tested, and approved. A missing rescue config could explain the "revert to defaults" issue you had with the SRX210s. (The second possibility is that your cluster was unhealthy and configs were not syncing between the nodes as expected. See the commands here to verify the cluster is working.)
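A minimal sketch of what that looks like on the CLI, assuming a placeholder prompt of `user@srx`. The first two commands save the active config as the rescue config and display it. The `rollback rescue` and `commit` lines are run from configuration mode and only when you actually want to fall back to the rescue config. The `show chassis cluster` commands are the usual starting point for checking cluster health.

```
user@srx> request system configuration rescue save
user@srx> show system configuration rescue

user@srx> configure
user@srx# rollback rescue
user@srx# commit

user@srx> show chassis cluster status
user@srx> show chassis cluster interfaces
user@srx> show chassis cluster statistics
```

On a cluster, check that both nodes show up in `show chassis cluster status` with the expected primary/secondary roles before assuming the config is being synchronized.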

Additionally, it was possible to corrupt the root filesystem of an older JunOS device so badly that it would fail to boot at all and drop to a db> prompt for debugging.

In newer versions of JunOS, the concerns about corruption due to power failure are pretty much resolved. The addition of Resilient Dual Root Partitions helped a lot. Note that if you are upgrading from a much older version that predates these features, there are some extra boot loader / partition table changes that are required during the upgrade process. http://www.juniper.net/techpubs/en_US/junos11.4/information-products/topic-collections/security/software-all/initial-config/index.html?topic-56813.html

Make sure you are running the latest recommended JunOS build on both SRXes. If you upgraded from a much older version, make sure you followed the instructions and did not skip a step as you may miss out on the dual partition features.
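To check whether a box actually ended up with the dual-root layout, something like the following should help. This is a sketch, not a full upgrade procedure: the prompt and the package path are placeholders, and the `partition` option reformats the internal media, so back up the config first and read the upgrade notes for your release.

```
user@srx> show version
user@srx> show system storage partitions

user@srx> request system software add /var/tmp/<junos-package>.tgz no-copy no-validate partition reboot

user@srx> request system snapshot slice alternate
```

`show system storage partitions` reports the active and backup root partitions on dual-root systems. The `request system snapshot slice alternate` step copies the running root to the backup slice once you are happy with the new release.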

The failure you saw on the SRX100 looks like a root filesystem corruption issue ("panic: bad inflate" is a big clue). However, it looks like you upgraded to a new enough JunOS version that a corrupt root FS should never happen. Also, if you rebooted and it magically started working again, that smells like the built-in flash storage is dying. I would open a ticket with JTAC for a replacement or buy a new one.
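If you do open a JTAC case, it usually helps to collect some state from the box while it is up. A rough sketch, again with a placeholder prompt and an arbitrary output file name of my choosing:

```
user@srx> show system storage
user@srx> show system boot-messages
user@srx> show system core-dumps
user@srx> request support information | save /var/tmp/rsi.txt
```

Attach the saved file together with the serial console capture of the panic, since the `db>` output will not be in the device's own logs.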