Postgresql – Postgres, from time to time just hangs

databasepostgresql

We are using Postgres as DB backend for our Django project. Everything works fine except from time to time Postgres jus hangs. Load average spikes to 80 points, and there are many posgtres processes. As far I see problematic is our table with users online. On every request there is update, we are updating "last seen" column. On every minute there is cronjob deleting users inactive for 1 minute. And in example today, when there was Postgres hang I've seen many updates and deletes to and from this table. I think it's some kind of race condition?

This table with users online is not big. In rush hours there are ~800 records. Here is schema.

   Column   |           Type           |                            Modifiers                            
------------+--------------------------+-----------------------------------------------------------------
 id         | integer                  | not null default nextval('spoleczniak_online_id_seq'::regclass)
 postac_id  | integer                  | not null
 data       | timestamp with time zone | not null
 zalogowany | timestamp with time zone | not null
Indexes:
    "spoleczniak_online_pkey" PRIMARY KEY, btree (id)
    "spoleczniak_online_postac_id_key" UNIQUE CONSTRAINT, btree (postac_id)
    "spoleczniak_online_data" btree (data)
Foreign-key constraints:
    "spoleczniak_online_postac_id_fkey" FOREIGN KEY (postac_id) REFERENCES postac_postacie(id) DEFERRABLE INITIALLY DEFERRED

Normally Postgres is generating less than 1.5 load average points. We are running it on i7, 16 GB of RAM, and hardware RAID-10 (3×2 disks) for OS/data + RAID-1 (2 disks) for WAL. Hardware RAID has 512 MB of cache.

I've tried 9.0 and 9.1 beta (I even set no WAL logs for that table in 9.1).

I'm really thinking about moving to MySQL but don't know any conversion tools. 😛

PS. I don't know if it's more Serverfault or Stackoverflow… but as you can see I decided to put it on Serverfault. 😛

Edit:

Some information from logs:

Jul 31 20:37:16 postgres postgres[1420]: [3-1] LOG:  00000: process 1420 acquired ExclusiveLock on tuple (29,7) of relation 33107 of database 20005 after 2071.481 ms
Jul 31 20:37:16 postgres postgres[1420]: [3-2] LOCATION:  ProcSleep, proc.c:1076
Jul 31 20:37:16 postgres postgres[1420]: [3-3] STATEMENT:  UPDATE "spoleczniak_online" SET "postac_id" = 101651, "data" = '2011-07-31 20:39:18.000699', "zalogowany" = '2011-07-31 20:31:04.843741' WHERE "spoleczniak_online"."id" = 559650 
Jul 31 20:37:16 postgres postgres[1493]: [3-1] LOG:  00000: process 1493 acquired ExclusiveLock on tuple (29,7) of relation 33107 of database 20005 after 1393.154 ms
Jul 31 20:37:16 postgres postgres[1493]: [3-2] LOCATION:  ProcSleep, proc.c:1076
Jul 31 20:37:16 postgres postgres[1493]: [3-3] STATEMENT:  UPDATE "spoleczniak_online" SET "postac_id" = 101651, "data" = '2011-07-31 20:39:15.646537', "zalogowany" = '2011-07-31 20:31:04.843741' WHERE "spoleczniak_online"."id" = 559650 

So ok, it's locking issue… but how can I avoid that?

Best Answer

This seems like locking problem: some large query locks too much, so other running queries cannot complete until lock is freed. Turn on slow query logging, add more monitoring. I strongly suggest using some monitoring system (zabbix, zenoss, etc) to be aware of your postgresql database status.

As a quick 'monitoring', take a look to pg_stat_activity and pg_locks views during the hang. Here is a good query that I'm often starting with:

SELECT
  datname, 
  NOW() - query_start AS duration, 
  procpid, 
  current_query 
FROM
  pg_stat_activity 
WHERE
    current_query <> '<IDLE>'
  AND
    NOW() - query_start > '1 second'
ORDER BY duration DESC;

It often gives enough information, but sometimes I have to run more in order to understand what's going on.

Please update your question with output of this (or similar) query.