Networking VPN GCP – VPN Between On-Prem and GCP: Routes Shared but Ping Doesn’t Go Through

bgp google-cloud-platform ipsec networking vpn

I have been struggling with a VPN setup between on-prem and GCP for more than a week. I am completely out of ideas at this point and would love some help from network specialists.

Goal

The end goal is simple: get a VM instance on GCP to talk seamlessly to a VM on-prem, but with 2 routers in play.
The setup looks something like this:

       GCP_VM                                                           OP_VM
    10.0.0.25                                                    10.100.0.200
            |                                                    |
            |                                           (DC Router Gateway)
            |                                               10.100.0.80
            |                                                    |
            └-- HA_VPN (AS65001) <==========> Router (AS65002) --┘

     Public IP: xx.xx.xx.xx                   yy.yy.yy.yy
     Advertise: 10.0.0.0/24 BGP               10.100.0.0/24 BGP
  VPN IP Range: 169.254.0.1/30                169.254.0.2 (as Peer)
    Private IP: NA                            10.100.0.50

The complication is that Router is not directly connected to OP_VM. We have no control over this part of the on-prem setup: OP_VM gets its IP 10.100.0.200 from some other router, and our Router sits on the same LAN. We only have a single rack in the data centre, and need to reach OP_VM, which is hosted by another party in a different rack. Our rack is assigned 10.100.0.50.

With this, I want the following to work:

me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200

Current Status

With the above setup, VPN and BGP seem healthy from the logs on both sides.

From GCP_VM, I can ping 10.100.0.50 (Router) successfully.

me@GCP_VM:10.0.0.25:~$ ping 10.100.0.50
PING 10.100.0.50 (10.100.0.50) 56(84) bytes of data.
64 bytes from 10.100.0.50: icmp_seq=1 ttl=254 time=24.9 ms
...

Also, I confirmed that Router can ping 10.100.0.200 (OP_VM).

# With the Router setup of something like
#
#     ip route 10.100.0.0/24 gateway 10.100.0.80

root@Router:10.100.0.50:~$ ping 10.100.0.200
ping 10.100.0.200
received from 10.100.0.200: icmp_seq=0 ttl=63 time=0.583ms
received from 10.100.0.200: icmp_seq=1 ttl=63 time=0.571ms

2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max = 0.571/0.577/0.583 ms

From GCP_VM, though, ping to 10.100.0.200 (OP_VM) goes missing.

# With the Router setup of something like
#
#     ip route 10.100.0.0/24 gateway 10.100.0.80

me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200
PING 10.100.0.200 (10.100.0.200) 56(84) bytes of data.
^C
--- 10.100.0.200 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3051ms

I'm probably misunderstanding the gateway setup, but changing the route like below gives me a different result:

# With the Router setup of something like
#
#     ip route 10.100.0.0/24 gateway 10.100.0.50
#                                             ~~ <- Router itself

me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200
PING 10.100.0.200 (10.100.0.200) 56(84) bytes of data.
From 169.254.0.2 icmp_seq=7 Destination Host Unreachable
From 169.254.0.2 icmp_seq=6 Destination Host Unreachable
From 169.254.0.2 icmp_seq=5 Destination Host Unreachable
From 169.254.0.2 icmp_seq=4 Destination Host Unreachable
From 169.254.0.2 icmp_seq=3 Destination Host Unreachable
From 169.254.0.2 icmp_seq=2 Destination Host Unreachable
From 169.254.0.2 icmp_seq=1 Destination Host Unreachable
^C
--- 10.100.0.200 ping statistics ---
9 packets transmitted, 0 received, +7 errors, 100% packet loss, time 8141ms
pipe 7

With this gateway setup, Router can no longer ping OP_VM. To me this at least suggests that the VPN is established and the routes are advertised correctly, but the configuration does not look right from a networking point of view.

Questions

I don't think there is much more to be done on the GCP side, and the issue seems to be purely on the on-prem side.

Are there any setup issues or concerns that could cause misbehaviour of VPN, BGP, ARP, etc.? What would cause a situation where routes appear to be shared but the destinations are not actually reachable?


Other Notes

  • I have confirmed the ARP table on Router includes 10.100.0.200
  • I can see the routes propagated in GCP
  • I have tested with GCP VPC's Firewall setup to allow 169.254.0.0/30 and 10.100.0.0/24
  • I will need access from GKE in the end, but I have confirmed GKE is getting the same exact behaviour as GCP_VM
  • Router is from Yamaha
  • Tried tcpdump (packetdump on Yamaha routers), but did not see 10.0.0.25 in the log
  • The dump did show a trace of 10.0.0.25 when I ran nmap -Pn 10.100.0.200 from GCP_VM, but only as a single line like this:
2019/12/21 16:35:40: LAN1 OUT:IP TCP 10.100.0.227:50516 > 10.103.24.1:80

Update (24th Dec)

I ran tcpdump for a simple ping between GCP_VM and Router.

From GCP_VM to Router (logs from GCP_VM)

$ ping 10.100.0.50 > /dev/null &
$ sudo tcpdump -i eth0 | grep 10.100
...
18:49:18.696178 IP GCP_VM.(snip) > 10.100.0.50: ICMP echo request, id 32396, seq 0, length 64
18:49:18.700395 IP 10.100.0.50 > GCP_VM.(snip): ICMP echo reply, id 32396, seq 0, length 64

From Router to GCP_VM (logs from GCP_VM)

# ping from Router, with `ping 10.0.0.25`
$ sudo tcpdump -i eth0 | grep 169.254
...
18:40:18.554555 IP 169.254.0.2 > GCP_VM.(snip): ICMP echo request, id 3369, seq 0, length 72
18:40:18.554586 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo reply, id 3369, seq 0, length 72

Although tcpdump shows the reply is being sent here, it is never received by Router.
Also, ping to 169.254.0.2 from GCP_VM gets no reply.

$ ping 169.254.0.2 > /dev/null &
$ sudo tcpdump -i eth0 | grep 169.254
...
18:59:07.113101 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo request, id 32531, seq 0, length 64
18:59:08.137103 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo request, id 32531, seq 1, length 64
...

Update (27th Dec)

Ping from the Router was successful after setting its source address to 10.100.0.50, as it was trying to use 169.254.0.2 by default.
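
As a rough illustration of why the source address matters here, the sketch below checks both candidate source addresses against the prefix Router advertises over BGP. It assumes that replies from GCP can only be routed back to addresses inside the advertised 10.100.0.0/24; the helper name source_check is made up for this example.

```python
import ipaddress

# Prefix that Router advertises to GCP over BGP (from the diagram above).
ADVERTISED = ipaddress.ip_network("10.100.0.0/24")

def source_check(src):
    """Return (is_link_local, is_inside_advertised_prefix) for a source IP."""
    addr = ipaddress.ip_address(src)
    return addr.is_link_local, addr in ADVERTISED

for src in ("169.254.0.2", "10.100.0.50"):
    link_local, advertised = source_check(src)
    print(f"{src}: link-local={link_local}, inside advertised prefix={advertised}")
```

The tunnel address 169.254.0.2 is link-local and outside the advertised prefix, while 10.100.0.50 is inside it, which is consistent with the ping only succeeding once the source address was changed.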

The ping still doesn't reach OP_VM, and I still have a NAT configuration issue to sort out so that the translation works correctly.

Update (31st Dec)

The connection has finally been set up. I'll be summarising the steps taken in a separate answer to declutter the question.

Best Answer

It looks like a routing problem on-prem. I think OP_VM doesn't have a route to 10.0.0.0/24, so it sends its replies to its default gateway, DC Router Gateway (10.100.0.80). They are dropped there because DC Router Gateway doesn't have a route to 10.0.0.0/24 either (the BGP peering is on Router, not on the gateway).

To solve it, set a static route on OP_VM to 10.0.0.0/24 via Router (10.100.0.50) and keep DC Router Gateway as the default gateway.
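
The effect of that static route can be sketched with a small longest-prefix-match lookup. This is only a model: the routing tables below are assumptions about what OP_VM has (an on-link route for its own /24 plus a default route via 10.100.0.80), and next_hop is a hypothetical helper, not a real OS call.

```python
import ipaddress

def next_hop(routes, dst):
    """Longest-prefix match: next hop of the most specific route covering dst."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None  # no route at all
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Assumed OP_VM routing table before the fix.
op_vm_before = [
    ("10.100.0.0/24", "on-link"),   # its own LAN
    ("0.0.0.0/0", "10.100.0.80"),   # default via DC Router Gateway
]

# Same table with the suggested static route via Router added.
op_vm_after = op_vm_before + [("10.0.0.0/24", "10.100.0.50")]

print(next_hop(op_vm_before, "10.0.0.25"))  # reply goes to the DC gateway
print(next_hop(op_vm_after, "10.0.0.25"))   # reply goes to Router
```

In this model the reply to GCP_VM leaves via 10.100.0.80 before the change and via 10.100.0.50 after it, while all other traffic keeps using the default gateway.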

You also have to remove the route ip route 10.100.0.0/24 gateway 10.100.0.50 from Router: network 10.100.0.0/24 is directly connected to it.

EDIT

From GCP_VM, I can ping 10.100.0.50 (Router) successfully.

At this point it looks like you have properly configured peering between Router and HA_VPN.

You should be able to ping GCP_VM and OP_VM from Router, and also Router from OP_VM, to confirm you are on the right path.

With the Router setup of something like

 ip route 10.100.0.0/24 gateway 10.100.0.80

You don't need such a route because Router is directly connected to subnet 10.100.0.0/24 and has the IP 10.100.0.50.
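
A quick way to check the "directly connected" claim, assuming Router's LAN interface is configured as 10.100.0.50 with a /24 mask (the mask is an assumption taken from the advertised prefix; is_on_link is a hypothetical helper):

```python
import ipaddress

# Router's LAN interface; the /24 mask is assumed, not confirmed by the question.
router_lan = ipaddress.ip_interface("10.100.0.50/24")

def is_on_link(iface, dst):
    """True if dst is in the interface's own subnet and needs no gateway."""
    return ipaddress.ip_address(dst) in iface.network

print(is_on_link(router_lan, "10.100.0.200"))  # OP_VM
print(is_on_link(router_lan, "10.100.0.80"))   # DC Router Gateway
print(is_on_link(router_lan, "10.0.0.25"))     # GCP_VM, reached via the tunnel
```

Under that assumption both OP_VM and DC Router Gateway are on-link for Router, so no static route is needed to reach them; only GCP_VM is off-link.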

From GCP_VM, though, ping to 10.100.0.200 (OP_VM) goes missing.

That's expected: as mentioned above, OP_VM and DC Router Gateway don't have a route to 10.0.0.0/24, so the replies cannot get back. Set a static route on OP_VM to 10.0.0.0/24 via Router and keep DC Router Gateway as the default gateway.

EDIT 2

OP_VM sends its replies to DC Router Gateway because it doesn't have a route to 10.0.0.0/24, so it tries to reach that network via its default gateway. DC Router Gateway then drops them because it has no route to 10.0.0.0/24 either.

To solve it, add a static route to 10.0.0.0/24 via Router (10.100.0.50) on OP_VM or on DC Router Gateway.
