I have installed two Mellanox FDR dual-port ConnectX-3 HCAs (CX354A), one in each of two machines. The machines are connected directly to each other (switchless, back-to-back), with port 1 cabled to port 1 and port 2 to port 2.
Each port is configured as follows:
HCA1 port1: ib0 inet addr:192.168.10.13 Bcast:192.168.10.255 Mask:255.255.255.0
port2: ib1 inet addr:192.168.10.15 Bcast:192.168.10.255 Mask:255.255.255.0
HCA2 port1: ib0 inet addr:192.168.10.24 Bcast:192.168.10.255 Mask:255.255.255.0
port2: ib1 inet addr:192.168.10.26 Bcast:192.168.10.255 Mask:255.255.255.0
Running the two opensm commands below on HCA1, ibstat shows that all four ports are up and active.
root@HCA1# opensm -g <ib0 GUID> --daemon
root@HCA1# opensm -g <ib1 GUID> --daemon
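To confirm that each opensm instance actually bound the port you intended, the port state and subnet-manager status can be queried per port. A hedged sketch using the standard infiniband-diags tools; the CA name mlx4_0 is an assumption for a single ConnectX-3 card, check `ibstat -l` for yours:

```shell
# Show state and GUID of each local port
# (CA name mlx4_0 is an assumption; see "ibstat -l" for yours).
ibstat mlx4_0 1
ibstat mlx4_0 2
# Query the subnet manager reachable through each port.
sminfo -C mlx4_0 -P 1
sminfo -C mlx4_0 -P 2
```

If one of the sminfo queries times out, the fabric on that port pair has no active subnet manager.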
With the above configured, I can ping from any of these IPs to any of the others.
HOWEVER, when I disconnect the port 1 cable, ping stops working between the still-connected port 2 pair.
If I instead disconnect the port 2 pair and connect only the port 1 pair, ping works fine, even to the disconnected port 2 IPs (?)
What could be the reason for this, and how can I fix it? Please mention what extra info I should post.
What I'm trying to achieve is a totally isolated link for each port pair, so that I can run separate Open MPI processes to test and compare the bandwidth of two InfiniBand cables at the same time. Could anyone advise on how this could be done?
From what I have learnt, I think I need to create a different partition key for each port pair (currently both pairs use the default pkey 0xffff).
However, this default pkey cannot be changed once InfiniBand is configured during boot-up. Any suggestion or advice?
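For reference, OpenSM reads partition definitions from a partitions.conf file (under /etc/opensm or the OFED config directory, depending on the package). A hedged sketch of a non-default partition, in case one is needed per fabric:

```
# partitions.conf sketch (path and the need for this are assumptions):
# the default partition plus one full-membership IPoIB partition.
Default=0x7fff, ipoib : ALL=full;
Pair1=0x8001, ipoib : ALL=full;
```

Note, though, that with two back-to-back links each managed by its own opensm instance, each port pair is already a physically separate fabric, so distinct pkeys may not be required for isolation.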
Both machines are running CentOS 6.4 and I have installed Mellanox OFED 1.5.3.
This is the ifconfig output for the IPoIB interfaces on both machines:
[root@HCA1 Desktop]# ifconfig ib0
ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:81:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:192.168.10.13 Bcast:192.168.10.255 Mask:255.255.255.0
inet6 addr: fe80::202:c903:21:8f11/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:4144160 errors:0 dropped:0 overruns:0 frame:0
TX packets:4141376 errors:0 dropped:2 overruns:0 carrier:0
collisions:0 txqueuelen:1024
RX bytes:702746349 (670.1 MiB) TX bytes:719570861 (686.2 MiB)
[root@HCA1 Desktop]# ifconfig ib1
ib1 Link encap:InfiniBand HWaddr 80:00:00:49:FE:82:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:192.168.10.15 Bcast:192.168.10.255 Mask:255.255.255.0
inet6 addr: fe80::202:c903:21:8f12/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1024
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
[root@HCA2 Desktop]# ifconfig ib0
ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:81:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:192.168.10.24 Bcast:192.168.10.255 Mask:255.255.255.0
inet6 addr: fe80::202:c903:21:8f51/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:4141382 errors:0 dropped:0 overruns:0 frame:0
TX packets:4144161 errors:0 dropped:2 overruns:0 carrier:0
collisions:0 txqueuelen:1024
RX bytes:703005597 (670.4 MiB) TX bytes:719323129 (685.9 MiB)
[root@HCA2 Desktop]# ifconfig ib1
ib1 Link encap:InfiniBand HWaddr 80:00:00:49:FE:82:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:192.168.10.26 Bcast:192.168.10.255 Mask:255.255.255.0
inet6 addr: fe80::202:c903:21:8f52/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1024
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
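Incidentally, the port GUIDs that opensm -g expects can be read off the IPoIB link-local addresses above: the interface ID is the port GUID with the EUI-64 universal/local bit flipped. A small sketch of the conversion for ib0 on HCA1 (fe80::202:c903:21:8f11):

```shell
# Interface ID of fe80::202:c903:21:8f11 is 0x0202c90300218f11;
# flipping the universal/local bit (0x02 in the top byte) yields
# the port GUID.
printf '0x%016x\n' $(( 0x0202c90300218f11 ^ (0x02 << 56) ))
# -> 0x0002c90300218f11
```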
The loaded modules are as below:
[root@HCA1 Desktop]# /etc/init.d/openibd status
HCA driver loaded
Configured IPoIB devices:
ib0 ib1
Currently active IPoIB devices:
ib0
ib1
The following OFED modules are loaded:
rdma_ucm
rdma_cm
ib_addr
ib_ipoib
mlx4_core
mlx4_ib
mlx4_en
ib_mthca
ib_uverbs
ib_umad
ib_ucm
ib_sa
ib_cm
ib_mad
ib_core
iw_cxgb3
iw_nes
Best Answer
OK, I'm not entirely familiar with the setup on CentOS, but I think what is happening is this: one or both copies of opensm are managing the ib0 link but not the other, ib0 being the default port for OpenSM.
As I understand it, you need two copies of opensm running in this particular setup: without a switch binding all the HCAs together, you essentially have two separate fabrics, and each fabric needs its own subnet manager. You've correctly picked that up, but haven't actually run them correctly (specifically the second instance).
Ping appears to work when both cables are connected because Linux answers for both IPs over whichever interface is up, so all the traffic is actually passing over ib0 (pair 1).
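One way to remove that ambiguity, independent of the subnet-manager issue: both interfaces sit in the same IP subnet, so Linux will happily answer for either address on whichever link is up. Putting each port pair in its own subnet makes the ping test meaningful (addresses below are illustrative):

```shell
# Pair 1 stays in 192.168.10.0/24, pair 2 moves to 192.168.11.0/24
# (run the matching commands on the other machine as well).
ifconfig ib0 192.168.10.13 netmask 255.255.255.0
ifconfig ib1 192.168.11.15 netmask 255.255.255.0
```

With separate subnets, a ping to a pair-2 address can only succeed over the pair-2 link.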
Under Ubuntu, which I'm familiar with, there is a config file /etc/default/opensm; it sounds like this is different on CentOS. On Ubuntu that file tells the init script which ports to run opensm on, because you need an opensm subnet manager on each port.
Basically, what you want to do is not run
opensm --daemon
twice, but instead first list the local port GUIDs:
ibstat -p
which will give output like (GUIDs here read off the link-local addresses in your ifconfig output):
0x0002c90300218f11
0x0002c90300218f12
Then run one opensm instance per port GUID:
opensm -g 0x0002c90300218f11 --daemon
opensm -g 0x0002c90300218f12 --daemon
Under Ubuntu the init script actually automates that process for PORTS=ALL (read from /etc/default/opensm), where ALL is a keyword picked up by the init script.
There is likely an init script for opensm under CentOS as well. In the meantime the above commands can be used, or you can write your own startup script.
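A minimal startup sketch along those lines, assuming ibstat -p prints one port GUID per line (which it does in the infiniband-diags versions I've used):

```shell
#!/bin/sh
# Start one opensm daemon per local port GUID.
for guid in $(ibstat -p); do
    opensm --daemon -g "$guid"
done
```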
UPDATE: I'm not sure whether it will make a difference, but I also have two additional kernel modules loaded which you don't.
Have you also flashed your HCAs with the latest firmware? This is actually quite important; don't assume they ship from the factory with the latest.
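To check what firmware the cards are currently running (ibv_devinfo from libibverbs reports it, and mstflint can query the flash directly; the PCI address below is an example, find yours with lspci):

```shell
# Quick check of the running firmware version:
ibv_devinfo | grep fw_ver
# Or query the flash directly
# (PCI address 04:00.0 is an example; see "lspci | grep Mellanox"):
mstflint -d 04:00.0 query
```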