OpenCL on GEGL: Results up to now

Hello everyone! I’m glad to show you the results up to now of my GSoC project about adding OpenCL support to the General Graphics Library.

What I’ve done

GEGL has two basic data types:

  • GeglTile
  • GeglBuffer

A GeglBuffer can be seen as a layer in an image editing tool; buffers can be translated, cut, duplicated, etc., and a final image is a composition of buffers. A buffer is composed of many GeglTiles, which are rectangular regions of pixels of the same size, so pixel data such as color is stored in tiles. This architecture is very flexible and allows, for example, tiles to be stored on disk, over a network, or compressed.

What I want in my project is to be able to process tiles using an OpenCL device, like a GPU or even a multi-core CPU. The solution I implemented gives each tile two states: the host memory data and a pointer to an OpenCL memory buffer. Each state has its own revision number, and the two numbers are used for synchronization.

This synchronization is achieved through locks. For example, suppose gegl_buffer_get is called for a buffer whose tiles are being processed on the GPU. This function asks for the buffer's data to be copied to a pointer, so each of the buffer's tiles is locked for reading; this locking process verifies the revision numbers and moves data from the GPU to the CPU accordingly. The picture below illustrates this architecture:

An Example of Use

I’ll show an example of using GEGL buffer iterators to implement a Brightness-Contrast filter in OpenCL.

First, we define the OpenCL kernel that will be executed for each tile:

    const char* kernel_source[] =
    {
    "sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |              \n",
    "                    CLK_ADDRESS_NONE            |              \n",
    "                    CLK_FILTER_NEAREST;                        \n",
    "__kernel void kernel_bc(__read_only  image2d_t in,             \n",
    "                        __write_only image2d_t out,            \n",
    "                         float brightness,                     \n",
    "                         float contrast)                       \n",
    "{                                                              \n",
    "  int2 gid = (int2)(get_global_id(0), get_global_id(1));       \n",
    "  float4 in_v  = read_imagef(in, sampler, gid);                \n",
    "  float4 out_v;                                                \n",
    "  out_v.xyz = (in_v.xyz - 0.5f) * contrast + brightness + 0.5f;\n",
    "  out_v.w   =  in_v.w;                                         \n",
    "  write_imagef(out, gid, out_v);                               \n",
    "}                                                              \n",
    };

So, each tile is an OpenCL image2d_t, which can be read-only or write-only and must be fetched through a sampler.

Now let’s see the iterator code:

    i = gegl_buffer_iterator_new (buffer_write, NULL, NULL, GEGL_BUFFER_CL_WRITE);
    index = gegl_buffer_iterator_add (i, buffer_read, NULL, NULL, GEGL_BUFFER_CL_READ);
    while (gegl_buffer_iterator_next (i))
      {
        GeglClTexture *in_tex  = i->cl_data[index];
        GeglClTexture *out_tex = i->cl_data[0];
        size_t global_worksize[2] = {i->roi[0].width, i->roi[0].height};

        CL_SAFE_CALL( errcode = gegl_clSetKernelArg(kernel, 0, sizeof(cl_mem),   (void*)&in_tex->data) );
        CL_SAFE_CALL( errcode = gegl_clSetKernelArg(kernel, 1, sizeof(cl_mem),   (void*)&out_tex->data) );
        CL_SAFE_CALL( errcode = gegl_clSetKernelArg(kernel, 2, sizeof(cl_float), (void*)&brightness) );
        CL_SAFE_CALL( errcode = gegl_clSetKernelArg(kernel, 3, sizeof(cl_float), (void*)&contrast) );

        CL_SAFE_CALL( errcode = gegl_clEnqueueNDRangeKernel(gegl_cl_get_command_queue(), kernel, 2,
                                                      NULL, global_worksize, NULL,
                                                      0, NULL, NULL) );
        CL_SAFE_CALL( errcode = gegl_clFinish(gegl_cl_get_command_queue()) );
      }

The key point here is the GEGL_BUFFER_CL_WRITE and GEGL_BUFFER_CL_READ flags passed to the iterator. They mean that writing and reading will be done through whatever OpenCL device [GPU or CPU] we’re using. This code just executes the kernel defined above for each tile.

Before entering the iteration in buffer_read and buffer_write, all data from buffer_read is copied to the GPU [of course, only if it is the most recent]. At the end, the OpenCL revision numbers from buffer_write’s tiles are bumped.

If after all that we do this:

gegl_buffer_get (buffer_write, 1.0, NULL, NULL, buf_write, GEGL_AUTO_ROWSTRIDE);

This means we want to copy buffer_write’s data to a pointer in host memory, so host and GPU data versions have to be synchronized first. Thus all buffer functions always return the most recent data version and, at the same time, memory transfers are made only when necessary.

Here is a flowchart of what is happening in this code:

Full code

Performance Results

I ran the Brightness-Contrast code on a 1-megapixel image using an NVidia Tesla C2050 as the OpenCL device and an Intel Xeon E5506 for comparison [using just one core, though the code uses SSE2].

The time for memory transfers to the GPU was included in this benchmark [EDIT: this time includes transferring data back and forth between GPU and CPU].

  • CPU Elapsed time: 526 milliseconds
  • OpenCL Elapsed time: 483 milliseconds

Also, here is a chart from NVidia profiler showing how execution time was spent:

Almost 80% of total execution time was spent on memory transfers to and from the GPU. This is a good result, because even with this overhead the numbers were reasonable. Consider that the typical GEGL use case chains many operations in sequence, so the ratio of processing to memory transfers tends to be higher. In fact, the case presented here is the worst case.

Possible Improvements

There are a lot of things that can be done to speed up the current code:

Interleave kernel execution on some tiles with memory transfers of other tiles

GPU hardware (at least modern NVidia GPUs) has separate units for processing and memory transfers; we can use this to overlap tile processing with copying.

Tiles sharing the same OpenCL memory buffer

There is a lot of overhead in allocating a GPU texture for each tile, which is typically small [128x64]. I think the best way to tackle this problem is to allocate one big chunk of memory and use offsets into it during processing [it's impossible to have pointers to GPU memory]; the problem is that GEGL is supposed to abstract this kind of detail away from the user. Another idea is to serialize execution by having a pool of textures which can be reused by tiles. This would also be good because GPU memory is generally smaller than host memory, so the direct CPU<->GPU mapping cannot hold up in practice.

Multiple OpenCL Command Queues

Command queues can execute concurrently on the same device; NVidia's Fermi architecture can run 16 kernels at the same time, for example. This can also be used to hide the memory transfer overhead.

Next Steps

I still have to finish the implementation of an operator interface for OpenCL and write some OpenCL operators, in order to create a useful processing chain that runs entirely on the GPU.

As the time for a GSoC project is very limited, my mentor and I decided to leave optimizations out of its scope, but I intend to work on them as soon as I can :)

Conclusion

The use of locking to synchronize CPU and GPU data was the most challenging part of the implementation, but after extensive testing I think it's working now, though it took more time than I expected to make it run properly.

Moreover, the results so far show that using OpenCL to speed up GEGL is feasible and very interesting. Though there are still some challenges to be tackled, the tiled structure of GEGL allows a lot of optimizations.

The Gegl OpenCL branch is here.

This entry was posted in gsoc.

52 Responses to OpenCL on GEGL: Results up to now

  1. That’s really interesting, you’ve done a nice job so far my friend, keep it up!

  2. Really cool work !

  3. k says:

    Very cool :-)

    (btw, the card you tested on, would you recommend it for Linux-based GIMP users who want to take advantage of OpenCL when 3.0 arrives?)

    • Victor Oliveira says:

      I think any recent nvidia/amd card can do the job.
      The problem is more about OpenCL drivers in Linux, as there aren’t any free implementations yet.
      btw, the card I used is very expensive, but it’s massive. I used it because it’s the best one we have here in my lab. But as I said, any good card should do just fine :)

  4. Nikolay Antonov says:

    Nice work!
    What kind of filters and instruments can get boost from OpenCL?
    Can I use OpenCL-enabled GIMP now?

    • Victor Oliveira says:

      It’s still an ongoing gsoc project, but you can get the current version in my gegl’s branch: http://git.gnome.org/browse/gegl/?h=gsoc2011-opencl.
      I haven’t implemented any GEGL operator in OpenCL yet, but all that have a “regular” memory access pattern can be boosted, like: color conversions, overlay of layers, gaussian blur, brightness-contrast, etc.

  5. John says:

    Come on people… HURRY UP!! You didn’t release a stable version since 10/10/2010… ALMOST A YEAR. I’m tired of version 2.6.11, it crashes all the time, it made me lose lots of hours of work, the floating windows and palettes are a freaking unproductive nightmare, the support for graphic tablets is so poor, I can’t even use the mouse and the tablet at the same time!! And I’m also getting sick of that mediocre philosophy: “you can’t complain because it’s free”. Firefox is also free but users complain all they want.

    • Sonic4Spuds says:

      John,

      Your issue of not being able to use the tablet and mouse at the same time can only be addressed by the operating system (or, I suppose, by some major changes in the software itself). Second, a GSoC student is not the person to complain to about trunk development. Third, if 2.6.11 is crashing you should file a bug report against it, instead of just complaining that it crashes.

      -Sonic

  6. Pingback: GIMP Single-Window Mode Almost Ready, Hardware Acceleration Planned | PHP World

  7. Pingback: Ein-Fenster-Modus fast fertig, GPU-Beschleunigung geplant « dennis-dorsch.de | Webdesign | Programmierung | News | Leipzig

  8. Pingback: Gimp меняет цикл подготовки релизов. Выпуск 2.8 с однооконным интерфейсом почти готов | AllUNIX.ru – Всероссийский портал о UNIX-системах

  9. Sonic4Spuds says:

    Thanks for the report on your progress, I have been looking for reports on the progress of the different projects for the last few months without success.

    -Sonic

  10. Pingback: Gimp меняет цикл подготовки релизов. Выпуск 2.8 с однооконным интерфейсом почти готов

  11. Pingback: Gimp: Ein-Fenster-Modus fast fertig, GPU-Beschleunigung geplant « com-Nachrichten.de

  12. Jon Nordby says:

    How is the performance like with bigger problem sizes? Say 10 MPix or 50 MPix?

  13. Jon Nordby says:

    That 80% of the time is spent on memory means that is where the optimization needs to go.
    For a 1 MPixel problem size, transferring all the data in one direction should ideally take less than 100 ms [1]. Right now it takes about 200 ms.
    I suspect this poor memory transfer performance is due to the latency of the memory transfer that occurs for every tile being transferred separately. Have you considered coalescing the data transfer and processing of tiles?

    1. 1.0 MPix, 4 channels, 4 bytes per channel = 128 MB. PCI Express 16x 2.0 bandwidth is max 8GB/s. 8000MB/s / 128MB = 62 ms

    • Victor Oliveira says:

      That’s exactly what’s happening. But coalescing the memory transfer for many tiles means they would have to share the same buffer, which has the consequences I explained.
      Also, there is a problem with very large images, where OpenCL fails to process the image; I suspect it’s because there is a maximum number of textures [or descriptors, anyway], but it’s something I have to look into after GSoC.
      As I said, this architecture where each tile has its own GPU memory doesn’t scale well; we should think about another solution for that.

    • Mikez says:

      “1. 1.0 MPix, 4 channels, 4 bytes per channel = 128 MB. PCI Express 16x 2.0 bandwidth is max 8GB/s. 8000MB/s / 128MB = 62 ms”

      Are you sure? 1M * 4 * 4 = 16M to me. You’re counting bits, but PCI Express 16x 2.0 is 8 Gbytes/s, or 64 Gbits, according to Wikipedia.

      Actually your maths is wrong anyway: it isn’t 62 ms. The bus can transfer 128 MB of data 62 times per second = 16 ms, or with 16 MB that would be 2 ms elapsed. That seems more in line with what I’ve seen.

      • Jon Nordby says:

        Thanks for the correction. I must have been more than halfway asleep when I did those calculations. 2ms means the transfers are two orders of magnitude slower than ideal, which is indeed more expected results when doing such small memory transfers.

  14. Jon Nordby says:

    What is the utilization of the GPU computational units when running your kernel?
    I suspect it is very low as you are limited to the size of one tile. 128×64 is only 8k threads, which is pretty low for these devices.
    This is less of a point right now than above comment as the processing only takes 20% of the time, but is something to be aware of in general.

    • Victor Oliveira says:

      I think there should be a way to “group” tiles for memory transfers and processing; this would also solve the coalescing problem. We could also increase the tile size, but that isn’t a good solution.

      • Jon Nordby says:

        Yes, we would need this to make it perform. I guess the challenge is that the processing is currently very tile-centric in GEGL, but it should all be internal API that we can fix.

  15. Jon Nordby says:

    I find it a bit tricksy of you to ignore the time it takes to copy back the tiles when giving the total execution time. Especially because you include this further down in the discussion (when talking about 80% memory transfer time).
    If the point is to have several operations running on the GPU side, and you want numbers appropriate for that case, please use a benchmark that actually tests this instead of playing with the numbers (however pure your motives are) :)

    • Victor Oliveira says:

      Aha, I was expecting someone to point that out. Well, it’s embarrassing, but in the process of making a better benchmark I revised my results and discovered that the execution time I presented was _already_ considering the time to bring data back to the CPU. That also makes the memory coalescing problem smaller. Further tests are necessary.
      I’ve edited the post to say that, thanks a lot!

  16. Jon Nordby says:

    I’m looking forward to having the OpenCL interface for operations. Having this and the existing work would be a very successful GSOC I think. Good job so far!

    • Victor Oliveira says:

      Thanks a lot!
      I’m doing that right now and I expect to have some operations using it by the end of GSoC. But I intend to maintain this code, so it’s no problem :)

  17. Jon Nordby says:

    Do multiple operations (in series) on the GPU side actually work at the moment? If not, what is required for them to work?

    • Victor Oliveira says:

      They work! The locking mechanism ensures that tiles are transferred only when necessary. So if there is a chain of OpenCL operators, all of them will have CL locks and [hopefully] synchronization will happen only at the end of the chain.

  18. Nikolay Antonov says:

    offtopic:
    I just looked at the gimp and gegl sources, and it seems that all filters are single-threaded… why don’t they use all available cores?

    • On the GEGL side of things, for regular GEGL code that doesn’t use OpenCL there is an experimental option to use multiple threads, specified through the environment variable GEGL_THREADS. It works for most ops but has some concurrency issues resulting in broken renderings for others. This parallelization is done outside the operations by running multiple instances of the op (actually the full rendering graph) in parallel. It might not be desirable when driving OpenCL though, where CL itself might be doing the parallelization.

  19. Mikez says:

    Sorry dude, this is astoundingly slow for that hardware.

    Are you timing the allocations? That’s unnecessary since you only need to do that once. You should just do a separate allocation of a tile cache anyway, or more likely detach the tiled nature of the input data from the processing: how else are you going to implement convolutions and the really interesting stuff? Tiles are good for storage, but you can’t really process using them.

    You also need to batch up much more work – 128×64 isn’t enough to even get it warmed up before it’s done with. The hardware is probably spending more time launching the jobs than executing them. Particularly if you’re loading the data synchronously per-tile.

    As a comparison I’ve been working on some OpenCL stuff and also an image editor. In Java, using JOCL for OpenCL. These don’t use tiles, so they forego that overhead, but it shouldn’t be all the difference.

    My Java image editor uses all threads on the CPU (it’s a 6-core/HT thing, so ‘12’; it is very fast admittedly) and can do a brightness/contrast and re-compose and display the image many times per second – and all mathematics are done using floats. Just the image composition (on which I have distinct timings), which takes a row of generated ‘checkerboard’ and a row of ‘image’ and blends them (about the same work as brightness/contrast), takes about 5 ms for a whole 1024x1024 RGBA-float image.

    With OpenCL, I can easily take 720p video, copy it to the graphics card, do some opencl processing (much more than a brightness/contrast), copy it back to the cpu, pass it off to Swing, which then copies it back to the GPU to display it – at video frame rates with time to spare. GPU is a GTX 480 which afaict is about on par with the tesla.

    And tell your supervisor the only point in using OpenCL is speed, so optimization has to be on the table. It’s even more important with OpenCL – the difference between good code and simple code is not just 2-3x as with typical c, it can easily be 10-100x.

  20. Pingback: Poniedzielnik: wieści ze świata OpenSource. Numer 12 :: Czytelnia Ubuntu

  21. joao martins says:

    Does it work with Intel Sandy Bridge graphics (VA-API)?

  22. scriptumplus says:

    500 ms is really, really, really, really slow result.
    50 ms for Tesla? Impossible result. I get 1 ms on GeForce 6600 hardware (without transfering) using pixel shader.

    • Victor Oliveira says:

      Exactly, the problem is with memory transfers and tiling.
      I’ve been working on a better version at: http://git.gnome.org/browse/gegl/?h=gsoc2011-opencl-2

      • RPG says:

        Victor, did you try to use shaders? Photoshop CS4/CS5 mostly uses shaders, so it works on hardware that doesn’t support CUDA and the like.

        Shaders are very fast, and PBO is very fast too, so I believe you could get better results. Also, shaders are supported by most video cards since DirectX 9.0. Some operations (e.g. compositing) don’t even need shaders.

        Also, most desktops have only one video card, and using one big tile (maximum size of framebuffer) is the best way. Video chips love big textures.

  23. kj says:

    Using shaders is limited to the GPU, whereas OpenCL has no such limitation and runs on CPUs too. So in my opinion shaders are a step backward in some ways and require a different approach to the problem.

    • RPG says:

      OpenCL is limited by the GPU: almost 80% of all GPUs don’t support it (including ALL my machines). Some of these GPUs are pretty fast: GeForce 6600, Intel GMA X4500, Intel GMA X3500, and they cannot be used with OpenCL.

      Shaders are supported by all GPUs from the GeForce FX on.

  24. kj says:

    You’re wrong, OpenCL is NOT limited to GPUs. I’ve successfully run an OpenCL application on AMD and Intel x86-64/amd64 architectures with a noticeable speed increase without optimization. More than that, all NVIDIA and AMD GPUs starting from about 2009/10 support GPGPU, so can you give the source of the information that 80% of GPUs do not support OpenCL?

    http://developer.amd.com/sdks/AMDAPPSDK/pages/DriverCompatibility.aspx

    http://developer.nvidia.com/cuda-gpus

    I suppose that these are quite long lists.
    Besides that, I’m not an expert, but shaders are not intended for GPGPU and programming them is in my opinion harder, although I may be wrong.

  25. RPG says:

    OpenCL kernels are very, very similar to shaders. There is no problem supporting both solutions (I could provide shader programs for gegl; some operations – BCI, blending – are easy to implement).
    I want to make GIMP run faster on _old_ machines. New machines are fast enough to work with the current version of GIMP even without OpenCL.
    Intel GMA (30% of all PCs!) could help a lot with filters and blending ops. Old GPUs are much, much faster than old CPUs, so they could help too. It is unfair to deprive them of the possibility of hardware acceleration; hardware acceleration is more in demand on older machines (IMO).
    And I can help with it, but I need support from the developers: I know shaders, but I don’t know gegl.

    • kj says:

      Great to hear that.
      I’m currently starting work on an OpenCL GIMP plugin for test purposes for my MSc degree; now I see I must consider whether I could contribute to the gegl work. The “only” limit is my time, or rather lack of it :) Maybe I’ll be able to combine the two; if not, there is still the possibility of porting my solution to gegl in my free time….
      @Up: The only problem with supporting both solutions is that they would probably require separate “frameworks” to cooperate with gegl, and support for switching between shader/OCL modes.

      • RPG says:

        I think it’s simple:
        1) Try to use OpenCL,
        2) Try to use openGL GLSL,
        3) Using software renderer.

        And I think gegl already has this kind of solution to switch between OpenCL and software.

        Some filters may not have an OpenGL or OpenCL implementation; just select whichever one exists.

        I see many students do useful work for GIMP; I’m a student too and maybe I should try GIMP for graduate school :-D (sorry, I’m a Russian student and my country has different education terminology, so I don’t know what “msc” means; I hope you understand me:).
        The main limitation for me is the lack of time; my main work is Linux systems. I can write filters using my OpenGL engine, because I know it, but I’d have to know gegl well to integrate shaders with it. If I find a solution I’ll present it.

  26. Victor Oliveira says:

    I’m afraid it’s not that simple at all :/

    Take a look at my repo and see how much code it takes to implement decent OpenCL support, then multiply that by 2.
    I think OpenCL is the way to go because:

    1. the API is much cleaner than OpenGL’s (for GPGPU computing)
    2. CPU support, so we can use both GPU and CPU to process an image and the code can work even without a GPU
    3. soon we will have good open-source implementations for CPU and GPU

    Anyway, it’d be great if you guys helped test and gave feedback on my code; it already works for some filters in GIMP.

    thanks.

  27. RPG says:

    Hello, Victor!
    Very glad to see the project in development. How about performance? Are there any improvements? Can this solution work on old GPUs (not so old that they don’t support CUDA/OpenCL, but quite different from a Tesla)?

    I also saw a node editor project, that could be more suitable for OpenCL.

    • Hi RPG!
      So, the project has advanced a lot; I’ve basically rewritten a lot of things over the year… You can just run a gaussian blur in GEGL, for example, and see the difference.

      Here is a presentation I gave at LGM at the middle of the year about it:

      You can see 20x speedups for example, it’s good :)
      So, the code currently works on any OpenCL implementation, even for CPUs, of course, speedups depend on your GPU. For example, for development I’ve been using an integrated Intel GPU and it works just fine.

      I haven’t used the node editor for GEGL yet, but I intend to soon.
      Thanks for the interest in the project, I need testers for the OpenCL path in GEGL, so if you can just give a try, that’d be awesome :)

      Victor

      • RPG says:

        Must I get and build the new version from git and so on, or can I try it with my GIMP 2.8 distribution?

        And is there any guide about OpenCL kernel development? I have some machines with new GPUs, so I can run some tests and try to implement some algorithms over the holidays :)

        The new GSoC is coming. Do you need some help in development of OpenCL implementations?

      • You need to be able to create new GEGL filters, so I think the best would be to build it from source.
        About OpenCL development, the OpenCL reference pages are pretty good:

        http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/

        But I don’t know of any specific tutorial besides just googling it.

        You can give a look at some simple GEGL filters:
        - http://git.gnome.org/browse/gegl/tree/operations/common/brightness-contrast.c
        - http://git.gnome.org/browse/gegl/tree/operations/common/box-blur.c

        Next steps of the work are:

        - testing and solving bugs in different platforms (the code should be able to recover from errors in _any_ OpenCL call).
        - Simplifying the API, for point filters it’s great right now, but for more complicated filters it’s a mess as you can see.
        - improve gegl testing system to see if the output for OpenCL and normal code paths are the same.
        - implement all abyss policies from GEGL
        - OpenCL support for more filters (http://wiki.gimp.org/index.php/Hacking:Porting_filters_to_GEGL)
        - identify and solve remaining performance issues.

        A GSoC that I’d be happy to mentor can easily be derived from this list :)

      • RPG says:

        It seems that gegl alone is enough for filter development and testing, no GIMP needed? I’ll try to clone and build it.

        There is a little misunderstanding: I asked about your internal guides for OpenCL development, in case you implemented an API to create kernels or something similar. I see that you place all kernels inside the *.c sources – may it be replaced by dynamic kernels loading from files? About OpenCL itself – it is very similar to GLSL shaders, so I hope there will be no problems for me:)

      • “may it be replaced by dynamic kernels loading from files?”

        This is something I intend to work on; it’d be better if the OpenCL kernel sources were in separate files and loaded at runtime.

        About the API, you can see the brightness-constrast example and others.

        Most of the internal code is in:
        - gegl/opencl/*
        - gegl/buffer/gegl-buffer-cl-iterator.*
        - gegl/buffer/gegl-buffer-cl-cache.*
        - gegl/operation/gegl-operation-* (*_cl_process functions)
