Linux – mmap slower than ioremap

Tags: gpio, kernel, linux, mmap

I am developing for an ARM device running Linux 2.6.37, and I am trying to toggle an IO pin as fast as possible. I wrote a small kernel module and a user space application, and tried two things:

  1. Manipulate the GPIO control registers directly from the kernel space using ioremap.
  2. mmap() the GPIO control registers without caching and use them from user space.

Both methods work, but the second is about 3 times slower than the first (observed on an oscilloscope). I think I disabled all caching mechanisms.

Of course, I'd like to get the best of both worlds: the flexibility and ease of development of user space with the speed of kernel space.

Does anybody know why the mmap() mapping could be slower than the ioremap() one?

Here's my code:

Kernel module code

static int ti81xx_usmap_mmap(struct file* pFile, struct vm_area_struct* pVma)
{
  pVma->vm_flags |= VM_RESERVED;
  pVma->vm_page_prot = pgprot_noncached(pVma->vm_page_prot);

  if (io_remap_pfn_range(pVma, pVma->vm_start, pVma->vm_pgoff,
                          pVma->vm_end - pVma->vm_start, pVma->vm_page_prot))
     return -EAGAIN;

  pVma->vm_ops = &ti81xx_usmap_vm_ops;
  return 0;
}

static void ti81xx_usmap_test_gpio(void)
{
  u32* pGpIoRegisters = ioremap_nocache(TI81XX_GPIO0_BASE, 0x400);
  const u32 pin = 1 << 24;
  int i;

  /* I should use IO read/write functions instead of pointer dereferencing,
   * but portability isn't the issue here */

  pGpIoRegisters[OMAP4_GPIO_OE >> 2] &= ~pin;    /* Set pin as output*/

  for (i = 0; i < 200000000; ++i)
  {
     pGpIoRegisters[OMAP4_GPIO_SETDATAOUT >> 2] = pin;
     pGpIoRegisters[OMAP4_GPIO_CLEARDATAOUT >> 2] = pin;
  }

  pGpIoRegisters[OMAP4_GPIO_OE >> 2] |= pin;    /* Set pin as input*/

  iounmap(pGpIoRegisters);
}

User space application code

int main(int argc, char** argv)
{
   int file, i;
   ulong* pGpIoRegisters = NULL;
   ulong pin = 1 << 24;

   file = open("/dev/ti81xx-usmap", O_RDWR | O_SYNC);

   if (file < 0)
   {
      printf("open failed (%d)\n", errno);
      return 1;
   }


   printf("Toggle from kernel space...");
   fflush(stdout);

   ioctl(file, TI81XX_USMAP_IOCTL_TEST_GPIO);

   printf(" done\n");    

   pGpIoRegisters = mmap(NULL, 0x400, PROT_READ | PROT_WRITE, MAP_SHARED, file, TI81XX_GPIO0_BASE);
   if (pGpIoRegisters == MAP_FAILED)
   {
      printf("mmap failed (%d)\n", errno);
      close(file);
      return 1;
   }

   printf("Toggle from user space...");
   fflush(stdout);

   pGpIoRegisters[OMAP4_GPIO_OE >> 2] &= ~pin;

   for (i = 0; i < 30000000; ++i)
   {
      pGpIoRegisters[OMAP4_GPIO_SETDATAOUT >> 2] = pin;
      pGpIoRegisters[OMAP4_GPIO_CLEARDATAOUT >> 2] = pin;
   }

   pGpIoRegisters[OMAP4_GPIO_OE >> 2] |= pin;

   printf(" done\n");
   fflush(stdout);
   munmap(pGpIoRegisters, 0x400);    

   close(file);    
   return 0;
}

Best Answer

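This is because ioremap_nocache() still leaves the CPU write buffer enabled for the mapping, whereas pgprot_noncached() disables both bufferability and cacheability. The two loops are therefore not writing through the same kind of mapping: stores through the mmap() mapping cannot be buffered.

For an apples-to-apples comparison, use ioremap_strongly_ordered() in the kernel test instead.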
