OpenCL on GEGL: Results up to now

Hello everyone! I’m glad to show you the results so far of my GSoC project on adding OpenCL support to the Generic Graphics Library (GEGL).

What I’ve done

GEGL has two basic data types:

  • GeglTile
  • GeglBuffer

A GeglBuffer can be seen as a layer in an image editing tool; buffers can be translated, cut, duplicated, etc., and a final image is a composition of buffers. A buffer is composed of many GeglTiles, which are rectangular regions of pixels of the same size, so pixel data such as color is stored in tiles. This architecture is very flexible and allows, for example, tiles to be stored on disk, over a network, or in compressed form.

What I want in my project is to be able to process tiles using an OpenCL device, like a GPU or even a multi-core CPU. The solution I implemented is that each tile has two states: the host memory data and a pointer to an OpenCL memory buffer. Each one has its own revision number, which is used for synchronization.

This synchronization is achieved through locks. For example, suppose gegl_buffer_get is called for a buffer whose tiles are being processed in the GPU. This function asks for buffer data to be copied to a pointer, so each of the buffer’s tiles is locked for reading; this locking process verifies the revision numbers and moves data from the GPU to the CPU accordingly. The picture below illustrates this architecture:
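
To make the idea concrete, here is also a minimal sketch of the revision-number scheme in code. The struct fields and helper functions below are illustrative only, not GEGL’s actual API: they just show how a read lock on the host side can decide whether a GPU-to-CPU transfer is needed, and how a write on the OpenCL side bumps the corresponding revision.

    #include <glib.h>
    #include <CL/cl.h>

    /* Hypothetical tile layout: one copy of the pixels on each side,
       each with its own revision number. */
    typedef struct
    {
      gpointer host_data;  /* pixel data in host memory          */
      cl_mem   cl_data;    /* pixel data in OpenCL device memory */
      guint    host_rev;   /* revision of the host copy          */
      guint    cl_rev;     /* revision of the OpenCL copy        */
    } TileSketch;

    /* Hypothetical GPU -> CPU transfer; a real version would read the
       OpenCL image back into tile->host_data (e.g. with clEnqueueReadImage). */
    static void
    copy_cl_to_host (TileSketch *tile)
    {
      (void) tile;
    }

    /* The CPU wants to read the tile (e.g. because of gegl_buffer_get):
       transfer from the device only if the device copy is newer. */
    static void
    tile_lock_read_host (TileSketch *tile)
    {
      if (tile->cl_rev > tile->host_rev)
        {
          copy_cl_to_host (tile);
          tile->host_rev = tile->cl_rev;  /* both copies now in sync */
        }
    }

    /* An OpenCL kernel has just written to the tile:
       the device copy is now the most recent one. */
    static void
    tile_unlock_write_cl (TileSketch *tile)
    {
      tile->cl_rev = MAX (tile->cl_rev, tile->host_rev) + 1;
    }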

An Example of Use

I’ll show an example of using GEGL buffer iterators to implement a Brightness-Contrast filter with OpenCL.

First, we define the OpenCL kernel that will be executed for each tile:

    const char* kernel_source[] =
    {
    "sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |              \n",
    "                    CLK_ADDRESS_NONE            |              \n",
    "                    CLK_FILTER_NEAREST;                        \n",
    "__kernel void kernel_bc(__read_only  image2d_t in,             \n",
    "                        __write_only image2d_t out,            \n",
    "                         float brightness,                     \n",
    "                         float contrast)                       \n",
    "{                                                              \n",
    "  int2 gid = (int2)(get_global_id(0), get_global_id(1));       \n",
    "  float4 in_v  = read_imagef(in, sampler, gid);                \n",
    "  float4 out_v;                                                \n",
    "  out_v.xyz = (in_v.xyz - 0.5f) * contrast + brightness + 0.5f;\n",
    "  out_v.w   =  in_v.w;                                         \n",
    "  write_imagef(out, gid, out_v);                               \n",
    "}                                                              \n",
    };

So, each tile is an OpenCL image2d_t, which can be read-only or write-only and is fetched through a sampler when read.
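
Before the iterator code below can enqueue this kernel, the source above still has to be compiled into a cl_kernel object. GEGL wraps this in its own gegl_cl* helpers; the following is only a rough sketch using the plain OpenCL host API, assuming a context and device have already been created elsewhere (error checking omitted):

    #include <CL/cl.h>

    /* Build the brightness-contrast kernel from the kernel_source array above. */
    static cl_kernel
    build_bc_kernel (cl_context context, cl_device_id device)
    {
      cl_int     errcode;
      cl_program program;

      program = clCreateProgramWithSource (context,
                                           sizeof (kernel_source) / sizeof (kernel_source[0]),
                                           kernel_source, NULL, &errcode);
      clBuildProgram (program, 1, &device, NULL, NULL, NULL);

      return clCreateKernel (program, "kernel_bc", &errcode);
    }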

Now let’s see the iterator code:

    /* iterate over buffer_write and buffer_read in lockstep; the CL flags
       tell the iterator to expose OpenCL textures instead of host pointers */
    i = gegl_buffer_iterator_new (buffer_write, NULL, NULL, GEGL_BUFFER_CL_WRITE);
    index = gegl_buffer_iterator_add (i, buffer_read, NULL, NULL, GEGL_BUFFER_CL_READ);
    while (gegl_buffer_iterator_next (i))
      {
        /* OpenCL textures holding the current tile of each buffer */
        GeglClTexture *in_tex  = i->cl_data[index];
        GeglClTexture *out_tex = i->cl_data[0];
        size_t global_worksize[2] = {i->roi[0].width, i->roi[0].height};

        /* bind the arguments declared by kernel_bc above */
        CL_SAFE_CALL( errcode = gegl_clSetKernelArg(kernel, 0, sizeof(cl_mem),   (void*)&in_tex->data) );
        CL_SAFE_CALL( errcode = gegl_clSetKernelArg(kernel, 1, sizeof(cl_mem),   (void*)&out_tex->data) );
        CL_SAFE_CALL( errcode = gegl_clSetKernelArg(kernel, 2, sizeof(cl_float), (void*)&brightness) );
        CL_SAFE_CALL( errcode = gegl_clSetKernelArg(kernel, 3, sizeof(cl_float), (void*)&contrast) );

        /* launch one work-item per pixel of the tile and wait for completion */
        CL_SAFE_CALL( errcode = gegl_clEnqueueNDRangeKernel(gegl_cl_get_command_queue(), kernel, 2,
                                                      NULL, global_worksize, NULL,
                                                      0, NULL, NULL) );
        CL_SAFE_CALL( errcode = gegl_clFinish(gegl_cl_get_command_queue()) );
      }

The key point here is the GEGL_BUFFER_CL_WRITE and GEGL_BUFFER_CL_READ flags passed to the iterator. They mean that writing and reading will be done through whatever OpenCL device [GPU or CPU] we’re using. This code just executes the kernel defined above for each tile.

Before the iteration over buffer_read and buffer_write starts, all data from buffer_read is copied to the GPU [of course, only if the GPU copy isn’t already the most recent]. At the end, the OpenCL revision numbers of buffer_write’s tiles are bumped.

If after all that we do this:

    gegl_buffer_get (buffer_write, 1.0, NULL, NULL, buf_write, GEGL_AUTO_ROWSTRIDE);

This means we want to copy buffer_write’s data to a pointer in host memory, so host and GPU data versions have to be synchronized first. This way, all buffer functions always return the most recent data version and, at the same time, memory transfers are made only when necessary.

Here is a flowchart of what is happening in this code:

Full code

Performance Results

I ran the Brightness-Contrast code on a 1-megapixel image, using an NVidia Tesla C2050 as the OpenCL device and an Intel Xeon E5506 for comparison [using just one core, but the code uses SSE2].

The time of memory transfers to the GPU was considered in this benchmark [EDIT: this time includes transferring data back and forth between GPU and CPU].

  • CPU Elapsed time: 526 milliseconds
  • OpenCL Elapsed time: 483 milliseconds

Also, here is a chart from the NVidia profiler showing how execution time was spent:

Almost 80% of the total execution time was spent in memory transfers to and from the GPU. This is a good result, because even with this overhead the numbers were reasonable. Consider that the typical use case of GEGL is doing many operations in sequence, so the ratio of processing to memory transfers tends to be higher; in fact, the case presented here is the worst case.

Possible Improvements

There are a lot of things that can be done in order to increase current code speed:

Interleave execution of tiles with memory transfers of other tiles

GPU hardware (at least modern NVidia GPUs) has separate units for processing and memory transfers; we can use this to interleave tile processing and copying (see the sketch after the Multiple OpenCL Command Queues subsection below).

Tiles sharing the same OpenCL memory buffer

There is a lot of overhead in allocating a GPU texture for each tile, which is typically small [128×64]. I think the best way to tackle this problem is allocating a big chunk of memory and using offsets into this chunk when processing [it’s impossible to have pointers to GPU memory]; the problem is that GEGL is supposed to abstract this kind of detail away from the user. Another idea is to serialize execution by having a pool of textures which can be reused by tiles; this would also be good because GPU memory is generally smaller than host memory, so a direct CPU<->GPU mapping cannot hold up in practice.
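
The pool idea could look something like the sketch below. Everything here is hypothetical (names, pool size, reuse policy); it only illustrates handing a small, fixed set of OpenCL images to tiles as they are processed, instead of allocating one image per tile:

    #include <CL/cl.h>

    #define POOL_SIZE 16   /* arbitrary: enough textures to keep the device busy */

    static cl_mem pool[POOL_SIZE];
    static int    pool_next = 0;

    /* Allocate a fixed set of tile-sized images up front (OpenCL 1.1 call). */
    static void
    pool_init (cl_context context)
    {
      const cl_image_format format = { CL_RGBA, CL_FLOAT };
      cl_int errcode;
      int    i;

      for (i = 0; i < POOL_SIZE; i++)
        pool[i] = clCreateImage2D (context, CL_MEM_READ_WRITE, &format,
                                   128, 64, 0, NULL, &errcode);
    }

    /* Hand out pool slots round-robin. A real implementation would have to
       make sure the previous user of a slot has finished executing and has
       been flushed back to host memory before the slot is reused. */
    static cl_mem
    pool_acquire (void)
    {
      cl_mem tex = pool[pool_next];

      pool_next = (pool_next + 1) % POOL_SIZE;
      return tex;
    }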

Multiple OpenCL Command Queues

Command queues can execute concurrently on the same device; the Fermi architecture from NVidia can run 16 kernels at the same time, for example. This can also be used to reduce the memory transfer overhead.
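
Below is a rough sketch combining the two ideas above (interleaving transfers with processing, and multiple command queues): one queue dedicated to uploads and one to kernels, with an event making each kernel wait only for its own tile’s data. All names are illustrative and error checking is omitted; this is not how the current GEGL code works, just the direction the optimization could take:

    #include <CL/cl.h>

    static void
    process_tiles (cl_command_queue upload_queue,   /* transfers only */
                   cl_command_queue compute_queue,  /* kernels only   */
                   cl_kernel        kernel,
                   cl_mem          *tile_image,
                   void           **host_pixels,
                   int              n_tiles)
    {
      const size_t origin[3] = {0, 0, 0};
      const size_t region[3] = {128, 64, 1};        /* one tile       */
      const size_t global_worksize[2] = {128, 64};
      int n;

      for (n = 0; n < n_tiles; n++)
        {
          cl_event upload_done;

          /* non-blocking upload on the transfer queue: while the kernel for
             tile n runs, the upload of tile n+1 can already be in flight */
          clEnqueueWriteImage (upload_queue, tile_image[n], CL_FALSE,
                               origin, region, 0, 0, host_pixels[n],
                               0, NULL, &upload_done);

          /* the kernel waits only for its own tile's upload
             (remaining kernel arguments are assumed to be set elsewhere) */
          clSetKernelArg (kernel, 0, sizeof (cl_mem), &tile_image[n]);
          clEnqueueNDRangeKernel (compute_queue, kernel, 2, NULL,
                                  global_worksize, NULL,
                                  1, &upload_done, NULL);

          clReleaseEvent (upload_done);
        }

      clFinish (compute_queue);
    }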

Next Steps

I have yet to finish the implementation of an operator interface for OpenCL and write some OpenCL operators in order to create a useful processing chain that runs entirely on the GPU.

As the time for a GSoC project is very limited, my mentor and I decided to leave optimizations outside the project’s scope, but I intend to work on them as soon as I can 🙂

Conclusion

The use of locking to synchronize CPU and GPU data was the most challenging part of the implementation, but after extensive testing I think it’s working now, though it took me more time than I expected to make it run properly.

Moreover, the results so far show that using OpenCL to speed up GEGL is feasible and very interesting. Though there are still some challenges to be tackled, the tiled structure of GEGL allows for a lot of optimizations.

The GEGL OpenCL branch is here.

53 Responses to OpenCL on GEGL: Results up to now

  1. That’s really interesting, you’ve done a nice job so far my friend, keep it up!

  2. Really cool work !

  3. k says:

    Very cool 🙂

    (btw, the card you tested on, would you recommend it for Linux-based GIMP users who want to take advantage of OpenCL when 3.0 arrives?)

    • Victor Oliveira says:

      I think any recent nvidia/amd card can do the job.
      The problem is more about OpenCL drivers in Linux, as there aren’t any free implementations yet.
      btw, the card I used is very expensive, but it’s massive. I used it because it’s the best one we have here in my lab. But as I said, any good card should do just fine 🙂

  4. Nikolay Antonov says:

    Nice work!
What kinds of filters and tools can get a boost from OpenCL?
    Can I use OpenCL-enabled GIMP now?

    • Victor Oliveira says:

It’s still an ongoing GSoC project, but you can get the current version in my GEGL branch: http://git.gnome.org/browse/gegl/?h=gsoc2011-opencl.
I haven’t implemented any GEGL operator in OpenCL yet, but all the ones with a “regular” memory access pattern can be boosted, like color conversions, overlaying of layers, Gaussian blur, brightness-contrast, etc.

  5. John says:

Come on people… HURRY UP!! You haven’t released a stable version since 10/10/2010… ALMOST A YEAR. I’m tired of version 2.6.11, it crashes all the time, it made me lose lots of hours of work, the floating windows and palettes are a freaking unproductive nightmare, the support for graphics tablets is so poor, I can’t even use the mouse and the tablet at the same time!! And I’m also getting sick of that mediocre philosophy: “you can’t complain because it’s free”, Firefox is also free but users complain all they want.

    • Sonic4Spuds says:

      John,

Your issue of not being able to use the tablet and mouse at the same time is one that can only be addressed by the operating system (or, I suppose, by some major changes in the software itself). Second, a GSoC student is not the person to complain to about trunk development. Third, if 2.6.11 is crashing you should file a bug report against it, instead of just complaining that it crashes.

      -Sonic

  6. Pingback: GIMP Single-Window Mode Almost Ready, Hardware Acceleration Planned | PHP World

  7. Pingback: Ein-Fenster-Modus fast fertig, GPU-Beschleunigung geplant « dennis-dorsch.de | Webdesign | Programmierung | News | Leipzig

  8. Pingback: Gimp меняет цикл подготовки релизов. Выпуск 2.8 с однооконным интерфейсом почти готов | AllUNIX.ru – Всероссийский портал о UNIX-системах

  9. Sonic4Spuds says:

    Thanks for the report on your progress, I have been looking for reports on the progress of the different projects for the last few months without success.

    -Sonic

  10. Pingback: Gimp меняет цикл подготовки релизов. Выпуск 2.8 с однооконным интерфейсом почти готов

  11. Pingback: Gimp: Ein-Fenster-Modus fast fertig, GPU-Beschleunigung geplant « com-Nachrichten.de

  12. Jon Nordby says:

    How is the performance like with bigger problem sizes? Say 10 MPix or 50 MPix?

  13. Jon Nordby says:

    That 80% of the time is spent on memory means that is where the optimization needs to go.
    For a 1 MPixel problem size, transferring all the data in one direction should ideally take less than 100 ms [1]. Right now it takes about 200 ms.
    I suspect this poor memory transfer performance is due to the latency of the memory transfer that occurs for every tile being transferred separately. Have you considered coalescing the data transfer and processing of tiles?

    1. 1.0 MPix, 4 channels, 4 bytes per channel = 128 MB. PCI Express 16x 2.0 bandwidth is max 8GB/s. 8000MB/s / 128MB = 62 ms

    • Victor Oliveira says:

      That’s exactly what’s happening. But to coalesce the memory transfer for many tiles means they should share the same buffer, which has the consequences I explained.
Also, there is a problem for very large images where OpenCL fails to process the image; I suspect it’s because there is a maximum number of textures [or descriptors, anyway], but it’s something I have to look into after GSoC.
As I said, this architecture where each tile has its own GPU memory doesn’t scale well; we should think about another solution for that.

    • Mikez says:

      “1. 1.0 MPix, 4 channels, 4 bytes per channel = 128 MB. PCI Express 16x 2.0 bandwidth is max 8GB/s. 8000MB/s / 128MB = 62 ms”

      Are you sure? 1M * 4 * 4 = 16M to me. You’re counting the bits, but PCI Express 16x 2.0 is 8Gbytes/sec or 64Gbits according to wikipedia.

      Actually your maths is wrong anyway, it isn’t 62ms, it’s capable of transferring 128MB of data, 62 times per second = 16ms. Or with 16MB that would be 2ms elapsed. That seems more inline with what i’ve seen.

      • Jon Nordby says:

        Thanks for the correction. I must have been more than halfway asleep when I did those calculations. 2ms means the transfers are two orders of magnitude slower than ideal, which is indeed more expected results when doing such small memory transfers.

  14. Jon Nordby says:

    What is the utilization of the GPU computational units when running your kernel?
    I suspect it is very low as you are limited to the size of one tile. 128×64 is only 8k threads, which is pretty low for these devices.
    This is less of a point right now than above comment as the processing only takes 20% of the time, but is something to be aware of in general.

    • Victor Oliveira says:

I think there should be a way to “group” tiles for memory transfers and processing; this would also solve the coalescing problem. We could also increase the tile size, but this isn’t a good solution.

      • Jon Nordby says:

        Yes, we would need this to make it perform. I guess the challenge is that the processing is currently very tile-centric in GEGL, but it should all be internal API that we can fix.

  15. Jon Nordby says:

    I find it a bit tricksy of you to ignore the time it takes to copy back the tiles when giving the total execution time. Especially because you include this further down in the discussion (when talking about 80% memory transfer time).
    If the point is to have several operations running on the GPU side, and you want numbers appropriate for that case, please use a benchmark that actually tests this instead of playing with the numbers (however pure your motives are) 🙂

    • Victor Oliveira says:

Aha, I was expecting someone to point that out. Well, it’s embarrassing, but in the process of making a better benchmark I revised my results and discovered that the execution time I presented was _already_ considering the time to bring data back to the CPU. That also makes the memory coalescing problem smaller. Further tests are necessary.
      I’ve edited the post to say that, thanks a lot!

  16. Jon Nordby says:

    I’m looking forward to having the OpenCL interface for operations. Having this and the existing work would be a very successful GSOC I think. Good job so far!

    • Victor Oliveira says:

      Thanks a lot!
      I’m doing that right now and I expect to have some operations using it by the end of GSoC. But I intend to maintain this code, so it’s no problem 🙂

  17. Jon Nordby says:

Do multiple operations (in series) on the GPU side actually work at the moment? If not, what is required for them to work?

    • Victor Oliveira says:

They work! The locking mechanism ensures that tiles are transferred only when necessary. So if there is a chain of OpenCL operators, all of them will have CL locks and [hopefully] synchronization will happen only at the end of the chain.

  18. Nikolay Antonov says:

    offtopic:
I just looked at the gimp and gegl sources, and it seems that all filters are single-threaded… why don’t they use all available cores?

On the GEGL side of things, for regular GEGL code that doesn’t use OpenCL there is an experimental option to use multiple threads, specified through the environment variable GEGL_THREADS. It works for most ops but has some concurrency issues resulting in broken renderings for others. This parallelization is done outside the operations by running multiple instances of the op (actually the full rendering graph) in parallel. This parallelization might not be desirable when driving OpenCL though, where CL itself might be doing the parallelization.

  19. Mikez says:

    Sorry dude, this is astoundingly slow for that hardware.

Are you timing the allocations? That’s unnecessary since you only need to do that once. You should just do separate allocation of a tile cache anyway, or more likely detach the tile nature of the input data from the processing: how else are you going to implement convolutions and the really interesting stuff? Tiles are good for storage, but you can’t really process using them.

    You also need to batch up much more work – 128×64 isn’t enough to even get it warmed up before it’s done with. The hardware is probably spending more time launching the jobs than executing them. Particularly if you’re loading the data synchronously per-tile.

    As a comparison I’ve been working on some OpenCL stuff and also an image editor. In Java, using JOCL for OpenCL. These don’t use tiles, so they forego that overhead, but it shouldn’t be all the difference.

My Java image editor uses all threads on the cpu (it’s a 6 core/ht thing, so `12′, it is very fast admittedly) and can do a brightness/contrast and re-compose and display the image many times/second – and all mathematics are done using floats. Just the image composition (on which I have distinct timings), which takes a row of generated ‘checkerboard’ and a row of ‘image’ and blends them (about the same work as brightness/contrast), takes about 5ms for the whole 1024x1024xRGBA-float image.

    With OpenCL, I can easily take 720p video, copy it to the graphics card, do some opencl processing (much more than a brightness/contrast), copy it back to the cpu, pass it off to Swing, which then copies it back to the GPU to display it – at video frame rates with time to spare. GPU is a GTX 480 which afaict is about on par with the tesla.

    And tell your supervisor the only point in using OpenCL is speed, so optimization has to be on the table. It’s even more important with OpenCL – the difference between good code and simple code is not just 2-3x as with typical c, it can easily be 10-100x.

  20. Pingback: Poniedzielnik: wieści ze świata OpenSource. Numer 12 :: Czytelnia Ubuntu

  21. joao martins says:

Does it work with Intel Sandy Bridge graphics (VA-API)?

  22. scriptumplus says:

    500 ms is really, really, really, really slow result.
    50 ms for Tesla? Impossible result. I get 1 ms on GeForce 6600 hardware (without transfering) using pixel shader.

    • Victor Oliveira says:

Exactly, the problem is with memory transfers and tiling.
      I’ve been working in a better version at: http://git.gnome.org/browse/gegl/?h=gsoc2011-opencl-2

      • RPG says:

Victor, did you try to use shaders? Photoshop CS4/CS5 mostly uses shaders, so it works on hardware that doesn’t support CUDA and the like.

Shaders are very fast, and PBOs are very fast too; I believe you could get better results. Also, shaders are supported by most video cards from DirectX 9.0 onward. Some operations (e.g. compositing) don’t even need shaders.

        Also, most desktops have only one video card, and using one big tile (maximum size of framebuffer) is the best way. Video chips love big textures.

  23. kj says:

Using shaders is limited to the GPU, whereas OpenCL does not have such limitations and runs on CPUs too. So in my opinion shaders are a step backward in some way and require a different approach to the problem.

    • RPG says:

OpenCL is limited to GPUs: almost 80% of all GPUs don’t support it (including ALL my machines). Some of these GPUs are pretty fast: GeForce 6600, Intel GMA X4500, Intel GMA X3500, and they cannot be used with OpenCL.

Shaders are supported by all GPUs from the GeForce FX onward.

  24. kj says:

You’re wrong, OpenCL is NOT limited to GPUs. I’ve successfully run an OpenCL application on AMD and Intel x86-64/amd64 architectures with a noticeable speed increase without optimization. More than that, all NVIDIA and AMD GPUs starting from about 2009/10 support GPGPU, so can you give the source of the information that 80% of GPUs do not support OpenCL?
    http://developer.amd.com/sdks/AMDAPPSDK/pages/DriverCompatibility.aspx
    http://developer.nvidia.com/cuda-gpus
    I suppose that these are quite long lists.
Besides that, I’m not an expert, but shaders are not intended for GPGPU and programming them is in my opinion harder, although I may be wrong.

  25. RPG says:

OpenCL kernels are very, very similar to shaders. There is no problem in supporting both solutions (I could provide shader programs for gegl; some operations – BCI, blending – are easy to implement).
I wanna make GIMP run faster on _old_ machines. New machines are fast enough to work with the current version of GIMP even without OpenCL.
Intel GMA (30% of all PCs!) could help a lot with filters and blending ops. Old GPUs are much, much faster than old CPUs, so they could help too. It is unfair to deprive them of the possibility of hardware acceleration; hardware acceleration is more in demand on older machines (IMO).
And I can help with it, but I need support from developers; I know shaders, but I don’t know gegl.

    • kj says:

      Great to hear that.
I’m currently starting work on an OpenCL GIMP plugin for test purposes for my MSc degree; now I see I must consider whether I could contribute to the gegl work, the “only” limit is my time, or rather lack of it 🙂 Maybe I’ll be able to combine the two; if not, there is still the possibility of porting my solution to gegl in my free time….
@Up: The only problem with supporting both solutions is that they would probably require separate “frameworks” to cooperate with gegl, and support for switching between shader/OCL modes.

      • RPG says:

        I think it’s simple:
        1) Try to use OpenCL,
        2) Try to use openGL GLSL,
3) Use the software renderer.

And I think gegl already has this mechanism to switch between OpenCL and software.

Some filters may not have an OpenGL or OpenCL implementation; just select whichever one is available.

I see many students do useful work for GIMP; I’m a student too and maybe I should try GIMP to get into graduate school:-D (sorry, I’m a Russian student and my country has different education terminology, so I don’t know what “msc” means, I hope you understand me:).
The main limitation for me is the lack of time; my main work is Linux systems. I can write filters using my OpenGL engine, because I know it, but I need to know gegl well to integrate shaders with it. If I find a solution I’ll present it.

  26. Victor Oliveira says:

    I’m afraid it’s not that simple at all :/

Take a look at my repo and see how much code it takes to implement decent OpenCL support, then multiply that by 2.
    I think OpenCL is the way to go because:

1. the API is much cleaner than OpenGL’s (for GPGPU computing)
    2. CPU support, so we can use both GPU and CPU to process an image and the code can work even without a GPU
    3. soon we will have good open-source implementations for CPU and GPU

Anyway, it’d be great if you guys could help test and give feedback on my code; it already works for some filters in GIMP.

    thanks.

  27. RPG says:

    Hello, Victor!
Very glad to see the project in development. How about performance? Are there any improvements? Can this solution work on old GPUs (not so old that they don’t support CUDA/OpenCL, but quite different from a Tesla)?

    I also saw a node editor project, that could be more suitable for OpenCL.

    • Hi RPG!
      So, the project has advanced a lot, I’ve basically rewritten a lot of things along the year… You can just set a gaussian blur in GEGL for example and see the difference.

      Here is a presentation I gave at LGM at the middle of the year about it:

      You can see 20x speedups for example, it’s good 🙂
So, the code currently works on any OpenCL implementation, even on CPUs; of course, speedups depend on your GPU. For example, for development I’ve been using an integrated Intel GPU and it works just fine.

      I haven’t used the node editor for GEGL yet, but I intend to soon.
      Thanks for the interest in the project, I need testers for the OpenCL path in GEGL, so if you can just give a try, that’d be awesome 🙂

      Victor

      • RPG says:

Must I get and build the new version from git and so on, or can I try it in my GIMP 2.8 distribution?

And is there any guide about OpenCL kernel development? I have some machines with new GPUs, so I can run some tests and try to implement some algorithms over the holidays:)

The new GSoC is coming. Do you need some help with the development of OpenCL implementations?

      • You need to be able to create new GEGL filters, so I think the best would be to build it from source.
        About OpenCL development, the OpenCL reference pages are pretty good:
        http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/
        But I don’t know of any specific tutorial besides just googling it.

        You can give a look at some simple GEGL filters:
        http://git.gnome.org/browse/gegl/tree/operations/common/brightness-contrast.c
        http://git.gnome.org/browse/gegl/tree/operations/common/box-blur.c

        Next steps of the work are:

– testing and solving bugs on different platforms (the code should be able to recover from errors in _any_ OpenCL call).
        – Simplifying the API, for point filters it’s great right now, but for more complicated filters it’s a mess as you can see.
– improve the gegl testing system to see if the output for the OpenCL and normal code paths is the same.
        – implement all abyss policies from GEGL
        – OpenCL support for more filters (http://wiki.gimp.org/index.php/Hacking:Porting_filters_to_GEGL)
        – identify and solve remaining performance issues.

        A GSoC that I’d be happy to mentor can easily be derived from this list 🙂

      • RPG says:

It seems that gegl is enough for filter development and testing, no GIMP needed? I’ll try to clone and build it.

There is a little misunderstanding: I asked about your internal guides for OpenCL development, maybe you implemented an API to create kernels or something similar. I see that you place all kernels inside the *.c source – may it be replaced by dynamic kernels loading from files? About OpenCL itself – it is very similar to GLSL shaders, I hope there will be no problems for me:)

      • “may it be replaced by dynamic kernels loading from files?”

This is something I intend to work on; it’d be better if OpenCL kernel sources were in separate files and loaded at runtime.

About the API, you can see the brightness-contrast example and others.

        Most of the internal code is in:
        – gegl/opencl/*
        – gegl/buffer/gegl-buffer-cl-iterator.*
        – gegl/buffer/gegl-buffer-cl-cache.*
        – gegl/operation/gegl-operation-* (*_cl_process functions)

  28. Pingback: Hardware-Beschleunigung und andere Pläne für GIMP 2.8 – Portable-World.de
