Juniper SRX100 and SRX210 go to factory reset for no reason

juniper-srx

We had two SRX210s in an HA cluster that both accidentally lost power. When they came up again, both had lost ALL of their configuration (similar to a factory reset!). This was very strange and took us some time to fix. (This was a month ago.)

Later (this week), an SRX100 on a totally different network (on a different continent, too) died with a similar problem. When I looked at the configuration, it was all gone, and I had to recover from a backup.

Has anyone seen such a problem? The devices were running quite old firmware (10.x). Is it a bug? An attack? A hardware problem?

Update: the SRX100 was replaced with a backup device (also an SRX100), and the faulty device was upgraded to the latest stable firmware and loaded with roughly the same config as before. It was then set up in a test network and stress-tested for a few days, and this weekend it died again (it was showing a RED light on STATUS, and no traffic was passing through it). A serial console was attached the whole time, and this is its content:

U-Boot 1.1.6-JNPR-2.7 (Build time: Nov 26 2013 - 19:04:49)                     

Initializing memory this may take some time...                             
Measured DDR clock 266.62 MHz                                                  
SRX_100_LOWMEM board revision major:0, minor:0, serial #: AT0112AF1168         
OCTEON CN5020-SCP pass 1.1, Core clock: 500 MHz, DDR clock: 266 MHz (532 Mhz d)
DRAM:  512 MB                                                                  
Starting Memory POST...                                                        
Checking datalines... OK                                                       
Checking address lines... OK                                                   
Checking 512K memory for U-Boot... OK.                                         
Running U-Boot CRC Test... OK.                                                 
Flash:  4 MB                                                                   
USB:   scanning bus for devices... 4 USB Device(s) found                       
       scanning bus for storage devices... 2 Storage Device(s) found           
Clearing DRAM....... done                                                      
BIST check passed.                                                             
Boot Media: nand-flash usb                                                     
Net:   pic init done (err = 0)octeth0                                          
POST Passed                                                                    
Press SPACE to abort autoboot in 1 seconds                                     
ELF file is 32 bit                                                             
Loading .text @ 0x8f0000a0 (246560 bytes)                                      
Loading .rodata @ 0x8f03c3c0 (14144 bytes)                                     
Loading .reginfo @ 0x8f03fb00 (24 bytes)                                       
Loading .rodata.str1.4 @ 0x8f03fb18 (16516 bytes)                              
Loading set_Xcommand_set @ 0x8f043b9c (96 bytes)                               
Loading .rodata.cst4 @ 0x8f043bfc (20 bytes)                                   
Loading .data @ 0x8f044000 (5744 bytes)                                        
Loading .data.rel.ro @ 0x8f045670 (120 bytes)                                  
Loading .data.rel @ 0x8f0456e8 (136 bytes)                                     
Clearing .bss @ 0x8f045770 (11600 bytes)                                       
## Starting application at 0x8f0000a0 ...                                      
Consoles: U-Boot console                                                       
Found compatible API, ver. 2.7                                                 

FreeBSD/MIPS U-Boot bootstrap loader, Revision 2.7                             
(ccheng@svl-junos-d081.juniper.net, Tue Nov 26 19:05:43 PST 2013)              
Memory: 512MB                                                                  
[0]Booting from nand-flash slice 2                                             
Un-Protected 1 sectors                                                         
writing to flash...                                                            
Protected 1 sectors                                                            
Loading /boot/defaults/loader.conf                                             
/kernel data=0xb0496c+0x1344a4 syms=[0x4+0x8a9e0+0x4+0xc8f47]                  


Hit [Enter] to boot immediately, or space bar for command prompt.              
Booting [/kernel]...                                                           
Kernel entry at 0x801000e0 ...                                                 
init regular console                                                           
Primary ICache: Sets 64 Size 128 Asso 4                                        
Primary DCache: Sets 1 Size 128 Asso 64                                        
Secondary DCache: Sets 128 Size 128 Asso 8                                     
GDB: debug ports: uart                                                         
GDB: current port: uart                                                        
KDB: debugger backends: ddb gdb                                                
KDB: current backend: ddb                                                      
kld_map_v: 0x8ff80000, kld_map_p: 0x0                                          
Copyright (c) 1996-2014, Juniper Networks, Inc.                                
All rights reserved.                                                           
Copyright (c) 1992-2006 The FreeBSD Project.                                   
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994       
        The Regents of the University of California. All rights reserved.      
JUNOS 12.1X44-D35.5 #0: 2014-05-19 21:36:43 UTC                                
    builder@dagmath.juniper.net:/volume/build/junos/12.1/service/12.1X44-D35.5l
JUNOS 12.1X44-D35.5 #0: 2014-05-19 21:36:43 UTC                                
    builder@dagmath.juniper.net:/volume/build/junos/12.1/service/12.1X44-D35.5l
real memory  = 536870912 (512MB)                                               
avail memory = 304193536 (290MB)                                               
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs                            
Security policy loaded: JUNOS MAC/pcap (mac_pcap)                              
Security policy loaded: JUNOS MAC/runasnonroot (mac_runasnonroot)              
netisr_init: !debug_mpsafenet, forcing maxthreads from 2 to 1                  
cpu0 on motherboard                                                            
: CAVIUM's OCTEON 5020 CPU Rev. 0.1 with no FPU implemented                    
        L1 Cache: I size 32kb(128 line), D size 8kb(128 line), sixty four way. 
        L2 Cache: Size 128kb, 8 way                                            
obio0 on motherboard                                                           
uart0: <Octeon-16550 channel 0> on obio0                                       
uart0: console (9600,n,8,1)                                                    
twsi0 on obio0                                                                 
dwc0: <Synopsis DWC OTG Controller Driver> on obio0                            
usb0: <USB Bus for DWC OTG Controller> on dwc0                                 
usb0: USB revision 2.0                                                         
uhub0: vendor 0x0000 DWC OTG root hub, class 9/0, rev 2.00/1.00, addr 1        
uhub0: 1 port with 1 removable, self powered                                   
uhub1: vendor 0x0409 product 0x005a, class 9/0, rev 2.00/1.00, addr 2          
uhub1: single transaction translator                                           
uhub1: 2 ports with 1 removable, self powered                                  
umass0: STMicroelectronics ST72682  High Speed Mode, rev 2.00/2.10, addr 3     
umass1: Kingston DT 101 G2, rev 2.00/1.00, addr 4                              
cpld0 on obio0                                                                 
pcib0: <Cavium on-chip PCI bridge> on obio0                                    
Disabling Octeon big bar support                                               
PCI Status: PCI 32-bit: 0xc041b                                                
pcib0: Initialized controller                                                  
pci0: <PCI bus> on pcib0                                                       
pci0: <serial bus, USB> at device 2.0 (no driver attached)                     
pci0: <serial bus, USB> at device 2.1 (no driver attached)                     
pci0: <serial bus, USB> at device 2.2 (no driver attached)                     
gblmem0 on obio0                                                               
octpkt0: <Octeon RGMII> on obio0                                               
cfi0: <AMD/Fujitsu - 4MB> on obio0                                             
Timecounter "mips" frequency 500000000 Hz quality 0                            
###PCB Group initialized for udppcbgroup                                       
###PCB Group initialized for tcppcbgroup                                       
da1 at umass-sim1 bus 1 target 0 lun 0                                         
da1: <Kingston DT 101 G2 PMAP> Removable Direct Access SCSI-0 device           
da1: 40.000MB/s transfers                                                      
da1: 15304MB (31342592 512 byte sectors: 255H 63S/T 1950C)                     
da0 at umass-sim0 bus 0 target 0 lun 0                                         
da0: <ST ST72682 2.10> Removable Direct Access SCSI-2 device                   
da0: 40.000MB/s transfers                                                      
da0: 1000MB (2048000 512 byte sectors: 64H 32S/T 1000C)                        
Trying to mount root from ufs:/dev/da0s2a                                      
WARNING: / was not properly dismounted                                         
Attaching /cf/packages/junos via /dev/mdctl...                                 
Mounted junos package on /dev/md0...                                           

Media check on da0                                                             
Automatic reboot in progress...                                                
** /dev/da0s2a                                                                 
** Last Mounted on /                                                           
** Root file system                                                            
** Phase 1 - Check Blocks and Sizes                                            
** Phase 2 - Check Pathnames                                                   
** Phase 3 - Check Connectivity                                                
** Phase 4 - Check Reference Counts                                            
** Phase 5 - Check Cyl groups                                                  
142 files, 75006 used, 75032 free (32 frags, 9375 blocks, 0.0% fragmentation)  

***** FILE SYSTEM MARKED CLEAN *****                                           
Verified junos signed by PackageProduction_12_1_0                              
Verified jboot signed by PackageProduction_12_1_0                              
Ignoring watchdog timeout during boot/reboot                                   
veriexec: cannot verify /packages/junos-12.1X44-D35.5-domestic.sig: ERROR: Faic
** /dev/bo0s3e                                                                 
** Last Mounted on /config                                                     
** Phase 1 - Check Blocks and Sizes                                            
** Phase 2 - Check Pathnames                                                   
** Phase 3 - Check Connectivity                                                
** Phase 4 - Check Reference Counts                                            
** Phase 5 - Check Cyl groups                                                  
19 files, 50 used, 12388 free (36 frags, 1544 blocks, 0.3% fragmentation)      

***** FILE SYSTEM MARKED CLEAN *****                                           
** /dev/bo0s3f                                                                 
** Last Mounted on /cf/var                                                     
** Phase 1 - Check Blocks and Sizes                                            
** Phase 2 - Check Pathnames                                                   
** Phase 3 - Check Connectivity                                                
** Phase 4 - Check Reference Counts                                            
** Phase 5 - Check Cyl groups                                                  
FREE BLK COUNT(S) WRONG IN SUPERBLK                                            
SALVAGE? yes                                                                   

SUMMARY INFORMATION BAD                                                        
SALVAGE? yes                                                                   

BLK(S) MISSING IN BIT MAPS                                                     
SALVAGE? yes                                                                   

637 files, 10808 used, 164510 free (254 frags, 20532 blocks, 0.1% fragmentatio)

***** FILE SYSTEM MARKED CLEAN *****                                           

***** FILE SYSTEM WAS MODIFIED *****                                           
Loading configuration ...                                                      
vn_read_compressed: inflate of bytepos 86966272, offset in file = 51491159, er}
panic: bad inflate                                                             
cpuid = 0                                                                      
KDB: stack backtrace:                                                          
SP 0: not in kernel                                                            
uart_z8530_class+0x0 (0,0,0,0) ra 0 sz 0                                       
pid 54, process: md0                                                           
###Entering boot mastership relinquish phase                                   
KDB: enter: panic                                                              
[thread pid 54 tid 100048 ]                                                    
Stopped at      breakpoint+0x4: jr      ra                                     
db>

Please note the following:

A USB drive was plugged into the device.
A serial cable was connected to the device.
The device was NOT on a UPS, and power instability might have occurred.
4 different networks were created, and 3 of these were monitored.

I hope someone can shed some light on what might have gone wrong.

Update2:
Pressing the power button did nothing, but pressing and holding it for 6+ seconds turned the device OFF. When I turned it ON again, it loaded the config as normal. So the device was NOT wiped this time, unlike the first time.

Best Answer

It sounds like a couple of different issues are going on here...

Older versions of JunOS were known to corrupt things during a power failure. Remember that JunOS is based on FreeBSD, and so there was an implicit assumption that you would do a proper shutdown before killing power.

To mitigate this, JunOS has a rescue config. If the regular config is corrupt or unreadable, the device loads the rescue config instead. Did you have a rescue config set? If not, you should. My best practice is to update the rescue config on production systems after any config changes are finalized, tested, and approved. A missing rescue config could explain the "revert to defaults" issue you had with the SRX210s. (The second possibility is that your cluster was unhealthy and configs were not syncing between the nodes as expected. Commands such as show chassis cluster status let you verify that the cluster is working.)

Additionally, it was possible to actually corrupt the root filesystem of an older JunOS device, so that it would fail to boot at all and drop to a db> prompt for debugging.

In newer versions of JunOS, the concerns about corruption due to power failure are pretty much resolved. The addition of Resilient Dual Root Partitions helped a lot. Note that if you are upgrading from a much older version that predates these features, there are some extra boot loader / partition table changes that are required during the upgrade process. http://www.juniper.net/techpubs/en_US/junos11.4/information-products/topic-collections/security/software-all/initial-config/index.html?topic-56813.html

Make sure you are running the latest recommended JunOS build on both SRXes. If you upgraded from a much older version, make sure you followed the instructions and did not skip a step as you may miss out on the dual partition features.

The failure you saw on the SRX100 looks like a root filesystem corruption issue (panic: bad inflate is a big clue). However, it looks like you upgraded to a new enough JunOS version that a corrupt root FS should never happen. Also, if you rebooted and it magically started working again, that smells like the builtin flash storage is dying. I would open a ticket with JTAC for a replacement or buy a new one.
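The clues in the console capture can be picked out mechanically. Below is a minimal Python sketch that scans a saved serial log for the warning signs visible in the question's output. The function name, the signature list, and the labels are my own illustration (an assumption, not a Juniper tool or an official list of diagnostics), and the sample text is a shortened excerpt of the log above.

```python
import re

# Signatures taken from the console capture in the question. The labels are
# one reading of those lines, not official Juniper diagnostics.
SIGNATURES = [
    (r"was not properly dismounted", "unclean shutdown (power loss or hard reset)"),
    (r"^SALVAGE\?", "fsck had to repair a filesystem"),
    (r"vn_read_compressed: inflate", "decompression failure while reading a compressed file"),
    (r"^panic:", "kernel panic"),
    (r"^db>", "system stopped at the kernel debugger prompt"),
]


def scan_console_log(text):
    """Return (line_number, label, line) for every line matching a signature."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        for pattern, label in SIGNATURES:
            if re.search(pattern, stripped):
                findings.append((number, label, stripped))
    return findings


# Shortened excerpt of the console output from the question.
sample = """\
WARNING: / was not properly dismounted
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? yes
Loading configuration ...
vn_read_compressed: inflate of bytepos 86966272, offset in file = 51491159, er}
panic: bad inflate
db>
"""

for number, label, line in scan_console_log(sample):
    print(f"line {number}: {label}: {line}")
```

An unclean dismount followed by fsck repairs points at power loss, while an inflate failure on a filesystem that fsck has just marked clean is more suggestive of bad storage. That is why the combination matters more than any single line.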

Related Solutions

Juniper SRX Site-to-Site VPN Issues – Changing IP and Default Route

So, I think I have finally found the solution.

On both sides I ran this command:

show security ike security-associations

Site B showed

Index   State  Initiator cookie  Responder cookie  Mode           Remote Address   
<anumber> DOWN   <some data>  <0000000000000000>  Main           <correct ip>

which is logical, since the VPN IS down.

Then I ran the same command on the other side (site A).

It showed this:

Index   Remote Address   State  Initiator cookie  Responder cookie  Mode
                         UP                                         Main
                         UP                                         Main
                         UP                                         Main

which is strange since, again, the VPN is DOWN.

So I tried running this command:

clear security ike security-associations index <index from first one above>

and then the VPN came up again. I finally also removed the other "old IP" entry, and now have only one line (site A):

    <anumber> <newip>   UP     <some data>  <some data>  Main

and similarly site B shows

<anumber> UP     <some data>  <some data>  Main           <correct ip>

and the VPN seems stable.
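The manual check above (look for IKE SAs that still point at the old address, then clear them by index) can be sketched in Python. This is only an illustration: the function names are hypothetical, the sample addresses are made-up documentation addresses, and the parser assumes the column order shown for site B (Index, State, Initiator cookie, Responder cookie, Mode, Remote Address). Other Junos releases print the columns in a different order, as the site A output shows.

```python
def parse_ike_sas(output):
    """Parse data rows of 'show security ike security-associations'.

    Assumes the column order: Index, State, Initiator cookie,
    Responder cookie, Mode, Remote Address.
    """
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly six fields and start with a numeric index;
        # the header line splits into more fields and is skipped.
        if len(fields) == 6 and fields[0].isdigit():
            index, state, _init_cookie, _resp_cookie, _mode, remote = fields
            rows.append({"index": int(index), "state": state, "remote": remote})
    return rows


def clear_commands_for_stale(output, current_peer):
    """Build one 'clear' command per SA whose remote address is not the current peer."""
    return [
        f"clear security ike security-associations index {row['index']}"
        for row in parse_ike_sas(output)
        if row["remote"] != current_peer
    ]


# Made-up sample: the peer moved from 198.51.100.7 to 203.0.113.10.
sample = """\
Index   State  Initiator cookie  Responder cookie  Mode           Remote Address
1404200 UP     2f4f0465dc8c4556  d2e6022d0dc213c3  Main           198.51.100.7
1404202 UP     35e54d457be6132f  0444ae31577c71a2  Main           203.0.113.10
"""

for command in clear_commands_for_stale(sample, "203.0.113.10"):
    print(command)
```

The sketch only prints the commands. Review them before pasting anything into a production device, since clearing the wrong SA drops a working tunnel.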

I did not find this solution in any of the Juniper FAQs, and I googled a lot, but I managed to solve it with a bit of luck. So I wanted to document it here in case someone (like me) needs it.

If I am completely wrong here, and what happened was just luck and unrelated, I hope you will comment or edit. Thanks!

VPN – How to Set Up Backup IP for Site-to-Site VPN on Juniper SRX?

Junos does have DPD and you can use it in conjunction with multiple endpoint IP addresses in a single IKE tunnel.

There is a bit of info about it here (which I've copied below)

http://kb.juniper.net/InfoCenter/index?page=content&id=KB29211&actp=RSS

SUMMARY: This article explains how redundancy in a site-to-site VPN can be achieved using multiple addresses in the gateway and dead-peer-detection.

PROBLEM OR GOAL: How to use different modes of dead-peer-detection for VPN failover.

CAUSE:

SOLUTION: The gateway for VPN redundancy can be configured with the following commands:

set interfaces fe-0/0/0 unit 0 family inet address 1.1.1.2/24
set interfaces st0 unit 0 family inet
set routing-options static route 0.0.0.0/0 next-hop 1.1.1.1
set security ike policy p1 mode main
set security ike policy p1 proposal-set standard
set security ike policy p1 pre-shared-key ascii-text "$9$21oZjmfzCtOHqtO1RlegoJ"
set security ike gateway g1 ike-policy p1
set security ike gateway g1 address 2.2.2.1
set security ike gateway g1 address 3.3.3.1
set security ike gateway g1 dead-peer-detection interval 10
set security ike gateway g1 dead-peer-detection threshold 3
set security ike gateway g1 external-interface fe-0/0/0
set security ipsec policy p1 proposal-set standard
set security ipsec vpn v1 bind-interface st0.0
set security ipsec vpn v1 ike gateway g1
set security ipsec vpn v1 ike ipsec-policy p1
set security ipsec vpn v1 establish-tunnels immediately

The first address in the order of configuration is the one chosen to negotiate the tunnel:

gateway g1 {
    ike-policy p1;
    address [ 2.2.2.1 3.3.3.1 ];
    dead-peer-detection {
        interval 10;
        threshold 3;
    }
    external-interface fe-0/0/0;
}

The above configuration is in dead-peer-detection optimal mode. It sends probes if packets were sent out (encrypted packets), but no packets were received (decrypted) for the configured interval. Three probe-packets are sent at 10 second intervals.

root@srx# run show security ike sa 
Index State Initiator cookie Responder cookie Mode Remote Address 
6770125 UP d570a30c806721ea ccc1572d2f763981 Main 2.2.2.1 


root@srx# run show security ipsec sa 
Total active tunnels: 1
ID Algorithm SPI Life:sec/kb Mon lsys Port Gateway 
<131073 ESP:3des/sha1 1debda06 3397/ unlim - root 500 2.2.2.1 
>131073 ESP:3des/sha1 7a7dff24 3397/ unlim - root 500 2.2.2.1

As soon as the tunnel drops, dead-peer-detection comes into play. If a response is not received from the peer in 30 seconds, the failover takes place and the tunnel is negotiated with 3.3.3.1 and vice-versa.

root@srx# run show security ike sa
Index State Initiator cookie Responder cookie Mode Remote Address 
6770151 UP 36a2e145e0fd2c10 b3abc0b135cf33fe Main 3.3.3.1

root@srx# run show security ipsec sa 
Total active tunnels: 1
ID Algorithm SPI Life:sec/kb Mon lsys Port Gateway 
<131073 ESP:3des/sha1 2420b2bd 3598/ unlim - root 500 3.3.3.1 
>131073 ESP:3des/sha1 5c8bb9da 3598/ unlim - root 500 3.3.3.1
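The failover behaviour described above (first configured address is tried first, a peer is declared dead after interval x threshold seconds, then the next address is used, and vice versa) can be expressed as a toy model. This is a simplified sketch for reasoning about timing only. The class and method names are hypothetical, and it does not reproduce how Junos implements DPD internally.

```python
from dataclasses import dataclass


@dataclass
class GatewayModel:
    """Toy model of an IKE gateway with several peer addresses and DPD."""

    addresses: list
    interval: int = 10   # seconds between DPD probes
    threshold: int = 3   # unanswered probes before the peer is declared dead
    active: int = 0      # index of the address currently in use

    def detection_time(self):
        """Approximate seconds of silence before failover is triggered."""
        return self.interval * self.threshold

    def current(self):
        """Address the tunnel is currently negotiated with."""
        return self.addresses[self.active]

    def peer_dead(self):
        """Move to the next configured address, wrapping around at the end."""
        self.active = (self.active + 1) % len(self.addresses)
        return self.current()


g1 = GatewayModel(addresses=["2.2.2.1", "3.3.3.1"])
print(g1.current())         # first configured address is tried first
print(g1.detection_time())  # interval * threshold
print(g1.peer_dead())       # fails over to the second address
print(g1.peer_dead())       # and back to the first
```

With the values from the configuration above (interval 10, threshold 3), the model gives the 30 seconds quoted in the article. Lowering either value speeds up failover at the cost of more probe traffic and a higher chance of flapping on a lossy link.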

Always-Send mode for dead-peer-detection:

In order to instruct the device to send dead-peer-detection requests, regardless of whether or not there is outgoing IPSec traffic to the peer, the following command is also needed:

set security ike gateway g1 dead-peer-detection always-send

UPDATE

I have configured this in a test lab and confirm that it works well. I have 3 devices, S3, S4 and S5.

S4 and S5 both have a basic IPsec tunnel configured to connect to S3 (7.7.7.22 in my example). The config is dead simple and the same on both devices:

ike {
    gateway s3-gw {
        ike-policy ike-policy;
        address 7.7.7.22;
        external-interface ge-0/0/1.0;
    }
}
ipsec {
    policy standard-ipsec-policy {
        proposal-set standard;
    }
    vpn s3 {
        bind-interface st0.0;
        ike {
            gateway s3-gw;
            ipsec-policy standard-ipsec-policy;
        }
        establish-tunnels immediately;
    }
}

The device S3 has a config that is very similar to the above, but its gateway lists 2 peer addresses and has DPD enabled.

The relevant section is in the IKE config:

gateway s4-s5-gw {
    address [ 7.7.7.21 192.168.211.2 ];
    dead-peer-detection {
        always-send;
        interval 10;
        threshold 3;
    }
    external-interface ge-0/0/1.0;
}

This brings up the tunnel as such

root@TEST-srx3> show security ike security-associations
Index   State  Initiator cookie  Responder cookie  Mode           Remote Address
1404200 UP     2f4f0465dc8c4556  d2e6022d0dc213c3  Main           7.7.7.21

root@TEST-srx3> show security ipsec sa
  Total active tunnels: 1
  ID    Algorithm       SPI      Life:sec/kb  Mon lsys Port  Gateway
  <131073 ESP:3des/sha1 d4428f3  3170/ unlim   -   root 500   7.7.7.21
  >131073 ESP:3des/sha1 5cda9108 3170/ unlim   -   root 500   7.7.7.21

If I deactivate the IKE/IPsec config sections on S4, the tunnel drops and then comes back up connected to the 2nd gateway address:

root@TEST-srx3> show security ike security-associations

root@TEST-srx3> show security ipsec sa
  Total active tunnels: 0

Then after about 30 seconds (10 x 3)

root@TEST-srx3> show security ike security-associations
Index   State  Initiator cookie  Responder cookie  Mode           Remote Address
1404202 UP     35e54d457be6132f  0444ae31577c71a2  Main           192.168.211.2

root@TEST-srx3> show security ipsec sa
  Total active tunnels: 1
  ID    Algorithm       SPI      Life:sec/kb  Mon lsys Port  Gateway
  <131073 ESP:3des/sha1 93043b2  3595/ unlim   -   root 500   192.168.211.2
  >131073 ESP:3des/sha1 e5c551e4 3595/ unlim   -   root 500   192.168.211.2

If you need any help post some config snippets and I'll do my best to have a look!

UPDATE 2

I have built this whole thing in a mini lab. The problem I have found is that, while you can use multiple gateway addresses in your IKE configuration, you still need one IPsec tunnel per ISP on each device. This is because there are multiple source IP addresses from which you may want to build an IPsec tunnel.

Lab Config

To save me posting a lot of config: each SRX (A and B) has two IPsec tunnels configured as shown below. Note that I'm using a single tunnel interface on each device, set to multipoint. You could use multiple tunnel interfaces if you wanted.

This config will provide full redundancy if a single ISP at site A and/or site B goes down.

I tested this by dropping the link between SRX-A and SRX-1 and then dropping the link between SRX-B and SRX-4. Because I'm using BGP and DPD, it took just over a minute for the tunnel to come back up, but it worked well!

Hopefully this will ultimately help you sort out your config!

SRX-A IPSEC Config

ike {
    gateway SRX-B_via_ISP1 {
        ike-policy ike-policy;
        address [ 6.6.6.6 5.5.5.5 ];
        dead-peer-detection {
            always-send;
            interval 10;
            threshold 3;
        }
        external-interface lo0.10;
        local-address 7.7.7.5;
    }
    gateway SRX-B_via_ISP2 {
        ike-policy ike-policy;
        address [ 6.6.6.6 5.5.5.5 ];
        dead-peer-detection {
            always-send;
            interval 10;
            threshold 3;
        }
        external-interface lo0.10;
        local-address 8.8.8.9;
    }
}
ipsec {
    policy standard-ipsec-policy {
        proposal-set standard;
    }
    vpn SRX-B_via_ISP1 {
        bind-interface st0.0;
        ike {
            gateway SRX-B_via_ISP1;
            ipsec-policy standard-ipsec-policy;
        }
        establish-tunnels immediately;
    }
    vpn SRX-B_via_ISP2 {
        bind-interface st0.0;
        ike {
            gateway SRX-B_via_ISP2;
            ipsec-policy standard-ipsec-policy;
        }
        establish-tunnels immediately;
    }
}
