Linux – Postgresql behind firewall: Query takes too long

firewalliptableslinuxpostgresql

Here is my setup: two CentOS 5.2 boxes on VMWare ESXi 4.0. The first box ip is 192.168.22.52 on eth0 and 192.168.99.1 on eth1. The second box runs PostgreSQL 8.3 with ip 192.168.99.2 on eth0. Here are iptables for box1, for box2 see comment below.

I have set up port 5432 forwarding on box1 and am able to connect to PostgreSQL on box2 via pgAdminIII or psql from Vista notebook (192.168.22.1, there is no other boxes in this subnet, it has its own switch and is physically isolated). The database I am connecting to has two schemas, one is 'smaller' (basically just one table), another one is bigger (some 30 tables, 100 functions, etc.) So I am able to work with the smaller schema (browse the table and so on) but I when I try to expand the bigger schema – pgAdminIII freezes for 20 min or so.

PostgreSQL log shows there is a query which takes way too long:

2009-06-04 21:04:46 EEST LOG:  00000: duration: 493578.874 ms  statement: 
SELECT pr.oid, pr.xmin, pr.*, format_type(TYP.oid, NULL) AS typname, 
typns.nspname AS typnsp, lanname, proargnames, proconfig,
        pg_get_userbyid(proowner) as funcowner, description
              FROM pg_proc pr
              JOIN pg_type typ ON typ.oid=prorettype
              JOIN pg_namespace typns ON typns.oid=typ.typnamespace
              JOIN pg_language lng ON lng.oid=prolang
              LEFT OUTER JOIN pg_description des ON des.objoid=pr.oid
             WHERE proisagg = FALSE AND pronamespace = 2200::oid
               AND typname <> 'trigger'
             ORDER BY proname

Both box1 and box2 are clones of the development boxes, and the original network structure was different – box2 was directly accessible without port forwarding and there was no problem accessing the databases whatsoever.

Now, if I run the above query via psql on box2 or 'original' machine, or from box1 connecting to box2, it executes immediately.

During the query is run, tcpdump on the box2 periodically says:

12:45:39.770609 IP 192.168.99.2.postgres > 192.168.22.1.49484: . 8760:10220(1460) ack 1 win 54
12:45:39.968496 IP 192.168.22.1.49484 > 192.168.99.2.postgres: . ack 10220 win 16425
12:45:39.968541 IP 192.168.99.2.postgres > 192.168.22.1.49484: . 10220:11680(1460) ack 1 win 54
12:45:39.968574 IP 192.168.99.2.postgres > 192.168.22.1.49484: . 11680:13140(1460) ack 1 win 54
12:45:39.969250 IP 192.168.22.1.49484 > 192.168.99.2.postgres: . ack 13140 win 16425
12:45:39.969275 IP 192.168.99.2.postgres > 192.168.22.1.49484: . 13140:17520(4380) ack 1 win 54
12:45:39.969408 IP 192.168.22.52 > 192.168.99.2: ICMP 192.168.22.1 unreachable - need to frag (mtu 1500), length 556

Other than that, I do not see much traffic. MTU on all ethN interfaces is 1500. ping -l 1472 -f 192.168.99.1 from the notebook goes through without problems.

I suspect that I am missing something about iptables or network setup and would appreciate your advise.

Best Answer

Some things to try:

  1. Start off by verifying your network is behaving itself. Assuming you have managed switches, look at the interface statistics for speed/duplex mismatching or a mismatched MTU. Consider checking / replacing cabling if anything is running errors (eg: trying to run GigE over Cat5 instead of Cat5e will likely give grief).

  2. Run some tests to prove you can get wire-speed transfers between the two machines and to the external machine; netcat, ftp or http transfers are a good start here (scp may get CPU bound, and thus, may not be the best test).

  3. Test the same query locally on the Postgres server. If it completes in an appropriate timeframe, you know it's not the database. If it doesn't complete or takes "too long", then you have a bad query or other database problem to debug. Make sure to consider the storage I/O side of things; you may be saturating what your disks are capable of providing. Check the VMware performance graphs to confirm / deny.

  4. Assuming that works, disable the firewall and run the same query against the postgres server from "box1". If that works, the VM->VM connectivity is likely fine.

  5. Assuming that works, bring the firewall back up and test again. If that works, then your problem is likely external to that host, leaving the switch or the external host to debug.

Good luck.