Stream Compaction in WebGL

One of the big limitations when doing GPGPU computing is that since the architecture is designed to work on parallel, re arranging or compacting data is a non trivial task on the GPU, to overcome this limitation different stream compaction methods have been created, being histopyramids the algorithm that will be discussed in this post. Histopyramids are useful since it allows to allocate the data at the beginning of one texture, avoiding to evaluate complex shaders in fragments that don’t have useful information, that’s why this process is quite important making general computation using graphic cards.

The following image shows what is achieved with stream compaction. The texture represent a set of slices from a simulated 3D texture showing the different Z levels, in a 2D representation the data is scattered all over the texture, The colored line represents the same data compacted and displaying the 3D position of the voxels reallocated.


Implementing histopyramids is not a straightforward process, it requires two different steps which are reduction and traversal (compaction). The first phase, reduction, is the process where the algorithm generates a set of textures levels based on the original texture (base texture) holding the scattered data; the amount of textures to generate in the reduction process are based on the size of the original texture using the following equation.

Textures to Generate = Log2(original texture size). (for power of two textures) (1)

For the reduction process, a base texture has to be generated from the original data texture, the idea is to define a binary image where each position will be marked with a RGBA value of vec4(1.) in the binary new texture. This would define a transparent texture where the white texels represent the data’s position allocated in the scattered original texture.

This first step is quite similar as the mipmapping generation from the GPUs, but instead of using a median reduction for the generation of each texture, each pixel of the parent level of the pyramid is calculated using the sum of the four pixels contiguous of the leaf level texture. At the end of the reduction process a pyramid of images is created where each level represents the sum of the pixels from the leaf levels. The top level of the pyramid, a 1×1 texture represent the sum of all the data present in the base texture.

Once the reduction process is completed and the pyramid is generated, the traversal is required to compact the data. The total data to be allocated in the compacted texture is defined in the top level of the histo-pyramid (the 1×1 texture). Each one of the output elements of the resulting stream has to traverse the pyramid to finally obtain a UV position (texture coordinates). This texture position relates the output stream with a data value from the scattered texture.

The traversal is done using a key index for each texel from the output texture, meaning that each pixel would be associated with an incremental unique 1D index that would be used to compare against the different levels of the pyramid. Also each texel from the pyramid would be re interpreted as a key interval value.

This second step starts at the first level of the pyramid, checking if the key from the texel to evaluate is inside the total range of the data to compact. To do so the key value has to be lower than the total sum allocated at the top level of the pyramid. If the key is lower than the total range the traversal goes to the second level of the histopyramid, in the 2×2 texture, where each texel value can be represented as “a, b, c, d” and the following ranges can be generated:

A = [0, a), B = [a, a+ b), C = [a + b, a + b + c), D = [a + b + c, a + b + c + d)

Each traversal step evaluates where in the ranges the key index of the output stream can be allocated (or falls into), that range gives a 2D texture position for each level traversed. On each new level a new set of ranges have to be generated and a new UV position is extracted based on the range where the key falls into. It’s important to specify that the ranges are generated in a “Z” order from the four adjacent pixels.

In the final phase, the base texture is used and the ranges would be of a unit difference, meaning that this step gives a UV coordinate that can be used to read the corresponding data in the scattered texture and place that data in the output telex from the stream.

The previous image shows the traversal in two levels for the output texel with a key value of “4”, the sum of all the values is represented as the 1×1 texture which value is “9”, the first level represented with a texture of 2×2 has four values that defines the following ranges:

A = [0, 3), B = [3, 5), C = [5, 8), D = [8, 9)

The key evaluated falls in the B range, hence for the next level the algorithm should use the four values allocated in the [0, 1] position, in this level the new ranges are defined as:

A = [3, 3),  B = [3, 4), C = [4, 5), D = [5, 5)

In this case the key falls into the [2, 1] position of the base texture, since this texture is a 4×4 texture the corresponding UV coordinate would be [0.5, 0.25]. This coordinate would be used to read the data from that position in the scattered texture and write the read value inside the position relative to the key used to traverse the pyramid.

    Implementation of the Shaders required:

The implementation in webGL will be done using plain javascript, this allows to transfer the code to whatever library the user would like to use (three.js, regl, babylon.js). This tutorial also assumes a certain level of understanding of basic webGL operations like shaders compilation, quad generations and the use of the different tests (scissor, depth, stencil) that can be applied in the framebuffers. If the user feel the need to revisit these concepts the following URL ( is a good place to visit. First the required shaders will be discussed based on the algorithm previously explained, and then the javascript implementation will be addressed based on the programs generated.

Reduction Step:

Since all the important operations will be done in shaders, there are two main programs that have to be created, a reduction program and a traversal program. All the programs will be performed over a quad, meaning that the vertex shader is only responsible to generate the vertex positions of the quad and to provide a UV coordinate for each fragment in the fragment shader.

The vertex shader is defined with the following code:

//attribute for each vertex, in this case a simple index key.
attribute float aVI;
//the output UV coordinate for each fragment
varying vec2 vText;
void main(void) {
     float vI = aVI + 1.;
//Generate the position of the vertex based on its ID
     vec2 xy = vec2(mod(vI, 2.) == 0. ? 1. : -1., -1. + 2. * step(-vI, -2.1));
//Calculate the UV coordinate of the vertex to use for interpolation
     vText = xy * 0.5 + 0.5;
//Define the position of the vertex.
     gl_Position = vec4(xy, 0.0, 1.0);

The vertex shader evaluates all the data for the vertex on execution time, but the user can also provide attributes with positions and UV coordinates for each vertex if desired.

The fragment shader works similar to a mipmapping shader, hence four texels from the input texture have to be read and the resulting color would be the sum of those four values. This can be done with the following shader.

precision mediump float;
precision mediump sampler2D;
//input texture (previous level used to evaluate the new level)
//it must be a float point texture.
uniform sampler2D uPreviousLevel;
// 1. / size of the previous level texture.
uniform float uSize;
//uv coordinate of each fragment obtained from vertex shader.
varying vec2 vText;
void main(void) {
//This value is used to evaluate the adjacent pixel from the evaluated one
    float k = 0.5 * uSize;
//position of the texel to read from the input texture
    vec2 position = floor(vText / uSize) * uSize;
//base texel
    float a = texture2D(uPreviousLevel,  position + vec2(0., 0.)).r;
//texel to the right
    float b = texture2D(uPreviousLevel,  position + vec2(k, 0.)).r;
//texel on top
    float c = texture2D(uPreviousLevel,  position + vec2(0., k)).r;
//texel on the diagonal (top & right)
    float d = texture2D(uPreviousLevel,  position + vec2(k, k)).r;
//The frag color is the sum of the values read previouly
    gl_FragColor = vec4(a + b + c + d);

The fragment shader reads the adjacent information from the evaluated fragment and sums that data to obtain the final fragment color. Since this shader is applied to all the different levels of the pyramid during the reduction process, it requires to know the actual size of the texture evaluated, this is provided in the uSize uniform being more conveniently defined as the inverse of the size value.

For the traversal process there are two ways of addressing the problem, writing the traversal in the vertex shader or writing it in the fragment shader. Both ways require to know how many vertices of fragments would be required to traverse the pyramid, so this should be controlled using javascript instructions. Since a vertex shader for a quad generation is already defined this new program will make the traversal in the fragment shader using the following code.

precision lowp float;
precision lowp sampler2D;
//pyramid texture that holds all the levels previously generated in the reduction process
uniform sampler2D uPyramid;
//Original texture with the scattered data.
uniform sampler2D uBase;
//1x1 texture containing the total amount of data to compress.
uniform sampler2D uTotalData;
//UV coordinate for each fragment.
varying vec2 vText;
//Uniforms used for the execution on each level, defined in Javascript.
uniform vec4 uTexture3D;
void main(void) {
    float vVI = dot(floor(uTexture3D.x * vText), vec2(1., uTexture3D.x));
//This avoids to evaluate fragments that generate keys higher than the total
//amount of data to compress.
    if (vVI > texture2D(uTotalData, vec2(0.5)).r) discard;
    vec2 relativePosition = vec2(0.);
    vec2 position = vec2(0.);
    float start = 0.;
    float end = 0.;
    vec4 starts = vec4(0.);
  vec3 ends = vec3(0.);
    float offset = uTexture3D.x - 2.;
    float div0 = 1. / uTexture3D.y;
    float k = 1. / uTexture3D.x;
    float resta = 2.;
    vec2 pos1 = vec2(0.);
    vec2 pos2 = vec2(0.);
    vec2 pos3 = vec2(0.);
    vec2 pos4 = vec2(0.);
    vec3 mask = vec3(0.);
    vec4 m = vec4(0.);
//The traversal has to be done in all the levels of the pyramid, since the for loop requires
//a constant value a maximum of 14 levels can be evaluated. This allows to make a stream
//compaction for a 4096x4096 texture.
    for(int i = 1; i < 14; i++) {
//if the level is higher than the total levels that the pyramid has it breaks, this is required for
//pyramids with levels lower than 14.
        if(float(i) >= uTexture3D.w) break;
//The offset sets the texture reading to the position of the level to use in the pyramid texture.
        offset -= resta;
        resta *= 2.;
//Here the shader evaluates the three initial adjacent positions and calculates the
        relativePosition = position + k * vec2(offset, 0.);
        end = start + texture2D(uPyramid, relativePosition).r;
        pos1 = relativePosition;
        starts.x = start;
        ends.x = end;
        pos2 = relativePosition + vec2(k, 0.);
        starts.y = ends.x;
        ends.y = ends.x + texture2D(uPyramid, pos2).r;
        pos3 = relativePosition + vec2(0., k);
        starts.z = ends.y;
        ends.z = ends.y + texture2D(uPyramid, pos3).r;
//The fourth range doesn’t require to read the fourth pixel, if the key is not in the first three ranges
//it should be allocated in the fourth range
        pos4 = relativePosition + vec2(k, k);
        starts.w = ends.z;
        mask = vec3(greaterThanEqual(vec3(vVI), starts.rgb)) * vec3(lessThan(vec3(vVI), ends));
        m = vec4(mask, 1. - length(mask));
//the “m” value is a vector of four values which only one of those would be different than
//zero, to avoid if branching the four ranges are evaluated and the casted value, from
//boolean to float are used to define the new relative position (uv for the next level).
        relativePosition = m.x * pos1 + m.y * pos2 + m.z * pos3 + m.w * pos4;
        start = dot(m, starts);
//Sets the position for the next level traversal.
  position = 2. * (relativePosition - k * vec2(offset, 0.));
//The same traversal operation is done with the base texture instead of the pyramid texture.
    end = start + texture2D(uBase, position).r;
    pos1 = position;
    starts.x= start;
    ends.x = end;
    pos2 = position + vec2(k, 0.);
    starts.y = ends.x;
    ends.y = ends.x + texture2D(uBase, pos2).r;
    pos3 = position + vec2(0., k);
    starts.z = ends.y;
    ends.z = ends.y + texture2D(uBase, pos3).r;
    pos4 = position + vec2(k, k);
    starts.w = ends.z;
    mask = vec3(greaterThanEqual(vec3(vVI), starts.rgb)) * vec3(lessThan(vec3(vVI), ends));
    m = vec4(mask, 1. - length(mask));
    position = m.x * pos1 + m.y * pos2 + m.z * pos3 + m.w * pos4;
    vec2 index = position * uTexture3D.x;
//The color for this example is the 3D position relative to the 2D position of the UV obtained.
//This could also be defined as the data from the scattered texture that is allocated in the
//UV obtained (position).
//This could be done using gl_FragColor = texture2D(scatteredTexture, position);
    gl_FragColor = vec4(div0 * vec3(mod(index.y, uTexture3D.y), uTexture3D.z * floor(index.y * div0) + floor(index.x * div0), mod(index.x, uTexture3D.y)), texture2D(uBase, position).a);

The previous traversal shader requires 3 important inputs that are defined below:

– uTexture3D: this vector has four values that represent the following data:
* uTexture3D.x = size (length) of the base texture.
* uTexture3D.y = size of the 3d texture simulated in the 2d texture (ex. a 128x128x128 3d texture can be allocated in a 2048×2048 texture, the corresponding data for this value would be 128).
* uTexture3D.z = number of buckets of the 3D texture allocated on one side of the 2d texture. (based on the previous example the 2048 texture side can allocate up to 16 buckets of 128 x 128 layers).
* uTexture3D.w = quantity of levels generated in the reduction process.

Notice that this data is mostly used to evaluate the 3D position based on the UV position obtained in the final traversal (this traversal is adjusted for a marching cubes application), so on a generic case only two values would be required for this shader, the size of the base texture and the amount of levels generated in the reduction passes.

– uBase: this would be the base texture generated from the scattered data.

– uTotalData: this is the 1×1 level texture that defines the total data to compress.

– UPyramid: a single 2D texture that holds all the different levels generated from the base texture, since all the levels are defined as half of the size from the previous level. This texture is generated in javascript copying the data from each level, when it’s generated, to a single texture using the copyTexSubImage command.

Having a single pyramid texture avoids the use of different uniforms for each level, which also limits the shader to a defined amount of maximum levels based on those uniforms. The image below shows an example of a Pyramid texture generated in the reduction process.

Another advantage of this single image is that it allows to debug issues in the reduction process, since it’s straightforward to visualize any issues that could arise when reductions are generated.

The traversal process is done using a for loop which is limited by a constant upper bound since webGL only allows to use constants to define the loop, this shaders is limited to up to 14 traversal loops, hence textures up to 4096×4096 can be used in this shader. To avoid the use of if branching the traversal evaluates the first three ranges using vector masking and logical comparisons, the final range comparison can be evaluated as the difference between the total range of the whole loop and the first three partial ranges evaluated before, that avoids to evaluate another texture read.

Since the quantity of loops are the same one for each thread is the same, and there are no “if” branching inside each iteration, all the threads will execute the same amount of instruction ensuring that the shader won’t have any divergence.

The use of a unique pyramid texture requires to use an offset position that has to be updated for each new loop, this offset has to be included inside the UV coordinate obtained to traverse the next level. Since the base texture is not included in the pyramid texture, the last traversal is done outside the loop, this final step uses the base texture to define the final UV positions required.

The traversal is done in the fragment shader where each fragment has an unique 1D key defined by it’s position on the texture, the key generated is used for the traversal. If the user would like to write the parser in the vertex shader it’s important to define two attributes for the vertices, the first one would be the position of the vertex in the final texture and the second attribute would be the key for that vertex to use in the traversal process.

One important thing of this algorithm is that it doesn’t require scattering (or repositioning) of the data (vertices), hence all the final positions of the data will be known in advance. In that regard using the fragment shader for the traversal has the advantage of avoiding positioning the vertices with the vertex shader.

On a final note the user can observe that if the traversal shader is applied to the whole texture, there will be fragments that will generate keys outside of the total amount of data to compress To avoid this issue the level0 pyramid texture (1×1 texture) is provided as an uniform  using its value (the total data to compress) to compare against the key generated by the fragment’s position. If the key is higher than the value from the texture the fragment can be discarded. This could also be done with the final level allocated in the pyramid texture since the 1×1 texture is also copied into the pyramid.

Packing and unpacking of floating point values.

The last two programs are required to read floating point values in webGL using the gl.readPixel() command. This command only accepts 8 bit RGBA values to be read, hence the sum represented in the 1×1 texture can only be read packing the data from a floating point value to a RGBA 8 bit value that represents the 32 bits of the original data. To do so the following fragment shader is run over a 1×1 quad.

precision highp float;
precision highp sampler2D;
uniform sampler2D uPyT;
//Max cells that could be active in a texture
uniform float uMax;
varying vec2 vTextureUV;
void main(void) {
     vec4 enc = fract(vec4(1., 255., 65025., 160581375.) * texture2D(uPyT, vec2(0.)).r * uMax);
     enc -= enc.yzww * vec4(vec3(0.00392157), 0.);
     gl_FragColor = enc;

The previous packing shader is based on the information found at it also explains how to decode the RGBA value to a floating point meaningful information in Javascript.

Javascript implementation:

For the javascript implementation two helper functions are written to generate the corresponding required textures and frame buffers shown below

function createTexture(textureSize, format, maxFilter, minFilter, type, data) {
    var texture = gl.createTexture();
    texture.size = textureSize;
    gl.bindTexture(gl.TEXTURE_2D, texture);
    gl.texImage2D(gl.TEXTURE_2D, 0, format, textureSize, textureSize, 0, format, type, data);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, maxFilter);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, minFilter);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
   return texture;
function createFramebuffer(texture) {
    var gl = gl;
    var frameData = gl.createFramebuffer();
    gl.bindFramebuffer(gl.FRAMEBUFFER, frameData);
    frameData.size = texture.size;
    var renderbuffer = gl.createRenderbuffer();
    gl.bindRenderbuffer(gl.RENDERBUFFER, renderbuffer);
    gl.renderbufferStorage(gl.RENDERBUFFER, gl.DEPTH_COMPONENT16, texture.size, texture.size);
    gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, texture, 0);
 gl.framebufferRenderbuffer(gl.FRAMEBUFFER, gl.DEPTH_ATTACHMENT, gl.RENDERBUFFER, renderbuffer);
    return frameData;

This two functions are pretty straightforward, they just generate a texture with a clamp to edge filtering, allowing the user to define if the texture would be a floating point texture or an unsigned_byte, also if there’s data provided it’s passed to the texture during the creation of it. The second function allows to generate a frame buffer based on a texture provided, it has a color attachment for the texture and a depth attachment for z buffering if required.

With the previous function a set of textures are generated for the pyramid using the following code:

       tLevels = [];
       fbPyramid = [];
   tPyramid = createTexture(baseTextureSize, gl.RGBA, gl.NEAREST, gl.NEAREST, gl.float, true);
    for (var i = 0; i < Math.ceil(Math.log(baseTextureSize) / Math.log(2)); i++) {
        tLevels.push(createTexture(Math.pow(2, i), gl.RGBA, gl.NEAREST, gl.NEAREST, gl.float, true));

To read the data from the top level of the pyramid the following texture (unsigned_byte is used for this case to read it in Javascript using gl.readPixel)

    tRead = createTexture(1., gl.RGBA, gl.NEAREST, gl.NEAREST, gl.UNSIGNED_BYTE, true);
    fbRead = createFramebuffer(tRead);

Notice that the total levels to generate is based on equation (1), since the logarithmic function in Javascript is 10 based, the resulting value has to be divided by the constant “Math.log(2)” to change it to base 2. The baseTextureSize variable represents the length of the original scattered texture. The textures and frame buffers are saved in two arrays to be used in the compaction function.

tRead and fbRead makes reference to the texture and frambuffer used to make the packing of the floating point value present at the top level of the pyramid to be transferred to this texture using the RGBA 8bit encoding.

The compaction function is where the shaders will be executed, to so do three programs had to be generated beforehand using the shaders explained in the previous sections. the javascript application saves these programs using an array using string keys to invoke the corresponding program. The three required programs are:

“GENERATE_PYRAMID” program; this is generated using the quad shader and the reduction shader, it requires an ID attribute for the four vertex that generate the quad. It also requires to declare the needed uniforms for the reduction step, mainly the sampler used for the reduction, and the size of each sampler

“PACK_DATA” program: to compile this program the quad vertex shader is used and the packing shader is the corresponding fragment shader. It requires the same attribute than the previous program, and the uniforms are the 1×1 texture sampler input and a value representing the max value available to encode with the shader.

“PARSE_PYRAMID”: this last program uses the same vertex shader than the previous two ones and the fragment shader is the one that defines the traversal. The uniforms used are defined in the previous explanation for this shader.

To use the compaction function the three programs have to be generated, compiled and have their uniforms and attribute declarations ready. Pointers to the shaders uniforms are saved in the same array position that holds each program as a dynamic attribute, these pointers are used to send the data to the GPU on each invocation of the compaction function.

Compaction function

Once the textures are generated a compaction function is written to be used every time the application requires it, compaction can be done on every time the original data changes, hence it’s a good idea to have it on a helper function to call it whenever it’s needed. The following code shows the javascript calls to run the shaders.

function createPyramid(initialTexture, buffer, readCells, expectedCells) {
   var totalCells = expectedCells;
    //This part set the levels
    var levels = Math.ceil(Math.log(initialTexture.size) / Math.log(2));
    bindAttribBuffer(program["GENERATE_PYRAMID"].vertexIndex, this.bufferIndex, 1);
    var offset = 0;
    for (var i = 0; i < levels; i++) {
        gl.bindFramebuffer(gl.FRAMEBUFFER, this.fbPyramid[levels - i - 1]);
        var size = Math.pow(2, levels - 1 - i);
        gl.viewport(0, 0, size, size);
        gl.uniform1f(program["GENERATE_PYRAMID"].size, Math.pow(2, i + 1) / initialTexture.size);
        bindTexture(program["GENERATE_PYRAMID"].potentialTexture, i == 0 ? initialTexture : this.tLevels[levels - i], 0);
        gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
        gl.bindTexture(gl.TEXTURE_2D, tPyramid);
        gl.copyTexSubImage2D(gl.TEXTURE_2D, 0, offset, 0, 0, 0, size, size);
        gl.bindTexture(gl.TEXTURE_2D, null);
        offset += size;
    //This part reads the total active cells
    if (readCells) {
        bindAttribBuffer(program["PACK_DATA"].vertexIndex, this.bufferIndex, 1);
        gl.uniform1f(program["PACK_DATA"].maxData, 1. / Math.pow(initialTexture.size, 2));
        bindTexture(program["PACK_DATA"].readPixelTexture, this.tLevels[0], 0);
        gl.bindFramebuffer(gl.FRAMEBUFFER, this.fbRead);
        gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
        var pixels = new Uint8Array(4);
        gl.readPixels(0, 0, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, pixels);
        totalCells = Math.round(16448 * (pixels[0] + pixels[1] / 255 + pixels[2] / 65025 + pixels[3] / 160581375));
        console.log("counted cells: " + totalCells);
    //This part parses the pyramid for compaction
    bindAttribBuffer(program["PARSE_PYRAMID"].vertexIndex, this.bufferIndex, 1);
    bindTexture(program["PARSE_PYRAMID"].pyramid, this.tPyramid, 0);
    bindTexture(program["PARSE_PYRAMID"].base, initialTexture, 1);
    bindTexture(program["PARSE_PYRAMID"].totalData, tLevels[0], 2);
    gl.uniform4f(program["PARSE_PYRAMID"].gridPartitioning, this.texture3DSize, this.gridPartitions, this.sideBuckets, levels);
    gl.bindFramebuffer(gl.FRAMEBUFFER, buffer);
    gl.viewport(0, 0, baseTextureSize, baseTextureSize);
    var height = Math.ceil(totalCells / baseTextureSize);
    gl.scissor(0, 0, baseTextureSize, height);
    gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
    return totalCells;

The previous function required the initial texture to be compacted, the frame buffer where the result will be saved and an expected amount of the data to process if the user doesn’t want to use the gl.readPixels method to accelerate the process. The function itself reflects the two steps of the algorithm (reduction and compaction) in two different program uses, in between the user can decide to read the amount of data in the original texture using the readCells boolean variable, defining a third optional step (in between the reduction and compaction steps).

The first step, reduction, invokes a set of quad draw calls on the different pyramid textures using the reduction program, for each iteration a new size is calculated and the corresponding uniforms are send. After the new level is generated the end result is copied to the pyramid texture using the copySubTexImage method, using an offset to separate each level of the pyramid in the same texture.

This first step doesn’t require to use a depth test, but since the function can be run on each frame is important to clear the color buffer for each level. Notice that the program used for this step is the “GENERATE_PYRAMID” program.

Once the pyramid is generated an optional second step is used to determine the amount of data allocated in the scattered texture, to do so the “PACK_DATA” program is used, and the unpacking of the data is done in Javascript as shown in this article

Reading this data in javascript represents the bottleneck of this function, since this requires to synchronize the CPU with the GPU respond, meaning that the application can suffer an idle state while waiting for this information. This means that if the user opts to read the data on every frame the total framerate of the application can be hurt just trying to have a precise amount of data to evaluate after the compaction process.

To overcome this limitation an expected amount of data is defined as part of the arguments of the function to supply a way to avoid the use of the gl.readPixels method. The idea is that since the GPU can read the data in the traversal shader any fragment that doesn’t require to be traversed will be discarded, and for future steps the user can safely assume that the data processed is below the expectance defined by the user.

The traversal process invokes the “PARSE_PYRAMID” program and it generates a quad draw using the scissor test to avoid any operations on fragments that are outside of the range of data to compact. To calculate the region of fragments for the scissor an area defined as the size of the texture and the ratio between the totalCells divided by the size of the texture is used, if the total cells are not read using the second step then the expected amount of cells is used. With the scissor test and the discard method the traversal function is accelerated.

The function returns the total cells read in the second step, or the expected cells defined by the user (the value gets unmodified), it also saves the compacted data in the provided frame buffers ready to be used for subsequent GPGPU operations.

The previous function used to helper functions to bind textures to the corresponding program and to bind attributes too, these functions are defined below.

function bindAttribBuffer(attribLocation, buffer, bufferDataSize) {
    gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
    gl.vertexAttribPointer(attribLocation, bufferDataSize, gl.FLOAT, false, 0, 0);
function bindTexture(programData, texture, texturePos) {
    var textures = [gl.TEXTURE0, gl.TEXTURE1, gl.TEXTURE2, gl.TEXTURE3, gl.TEXTURE4, gl.TEXTURE5, gl.TEXTURE6, gl.TEXTURE7, gl.TEXTURE8, gl.TEXTURE9, gl.TEXTURE10, gl.TEXTURE11, gl.TEXTURE12, gl.TEXTURE13, gl.TEXTURE14];
    gl.bindTexture(gl.TEXTURE_2D, texture);
    gl.uniform1i(programData, texturePos);

    Shaders optimizations using vec4 histopyramids

The previous shaders are a straightforward implementation from the algorithm explained before, but there is a quite nice improvement that can be done in the traversal shader to jump from three texture reads on each level to only one; this is done using vec4 histopyramids. The following image illustrates the modifications to the algorithm in order to implement this variant.

These kind of pyramids make the reduction in a different manner, since the GPU allows to work with four channels per pixel the reduction shader can be rewritten in order to save the partial sums of the four pixels read instead of the total sum done before. This can be seen in the following fragment shader.

precision mediump float;
precision mediump sampler2D;
uniform sampler2D uPreviousLevel;
uniform float uSize;
varying vec2 vText;
void main(void) {
    float k = 0.5 * uSize;
    vec2 position = floor(vText / uSize) * uSize;
    float a = texture2D(uPreviousLevel,  position + vec2(0., 0.)).r;
    float b = texture2D(uPreviousLevel,  position + vec2(k, 0.)).r;
    float c = texture2D(uPreviousLevel,  position + vec2(0., k)).r;
    float d = texture2D(uPreviousLevel,  position + vec2(k, k)).r;
    gl_FragColor.a = a;
    gl_FragColor.b = a + b;
    gl_FragColor.g = gl_FragColor.b + c;
    gl_FragColor.r = gl_FragColor.g + d;

In the previous shader the program saves the partial sum on each channel of the final color, this allows to read the ranges using only the parent pixel, which means that the traversal only requires to allocate one texel from the texture level to traverse. Also the ranges can be evaluated faster since once the end of each range is defined creating the start values of the ranges is trivial. The following shader shows the implementation of the traversal with only one texture read.

precision lowp float;
precision lowp sampler2D;
uniform sampler2D uPyramid;
uniform sampler2D uBase;
varying vec2 vText;
uniform vec4 uTexture3D;
void main(void) {
    float start = 0.;
    vec4 starts = vec4(0.);
    vec4 ends = vec4(0.);
    float offset = uTexture3D.x - 2.;
    float div0 = 1. / uTexture3D.y;
    float k = 1. / uTexture3D.x;
    float diff = 2.;
    float vVI = dot(floor(uTexture3D.x * vText), vec2(1., uTexture3D.x));
    vec4 m = vec4(0.);
    vec2 relativePosition = k * vec2(offset, 0.);
    vec2 position = vec2(0.);
    vec4 partialSums = texture2D(uPyramid, relativePosition);
    for(int i = 1; i < 13; i++) {
        if(float(i) >= uTexture3D.w) break;
        offset -= diff;
        diff *= 2.;
        relativePosition = position + k * vec2(offset, 0.);
        vec2 pos1 = relativePosition;
        vec2 pos2 = relativePosition + vec2(k, 0.);
        vec2 pos3 = relativePosition + vec2(0., k);
        vec2 pos4 = relativePosition + vec2(k, k);
        ends = partialSums.wzyx + vec4(start);
        starts = vec4(start,;
        m = vec4(greaterThanEqual(vec4(vVI), starts)) * vec4(lessThan(vec4(vVI), ends));
        relativePosition = m.x * pos1 + m.y * pos2 + m.z * pos3 + m.w * pos4;
        start = dot(m, starts);
        position = 2. * (relativePosition - k * vec2(offset, 0.));
        partialSums = texture2D(uPyramid, relativePosition);
    vec2 pos1 = position;
    vec2 pos2 = position + vec2(k, 0.);
    vec2 pos3 = position + vec2(0., k);
    vec2 pos4 = position + vec2(k, k);
    ends = partialSums.wzyx + vec4(start);
    starts = vec4(start,;
    m = vec4(greaterThanEqual(vec4(vVI), starts)) * vec4(lessThan(vec4(vVI), ends));
    position = m.x * pos1 + m.y * pos2 + m.z * pos3 + m.w * pos4;
    vec2 index = position * uTexture3D.x;
    gl_FragColor = vec4(div0 * vec3(mod(index.y, uTexture3D.y), uTexture3D.z * floor(index.y * div0) + floor(index.x * div0), mod(index.x, uTexture3D.y)), texture2D(uBase, position).a);

Notice that the traversal shader doesn’t change too much, the texture fetchs and the ranges evaluation are based on a new variable, “partialSums”. This variable reads the partial sums from the parent level so there’s no need to make multiple texture fetching. Finally these two shaders can replace the previous ones with the same javascript implementation.

In the next post a marching cubes implementation for WebGL will be explained using this compaction method.

Bibliography: (original paper from Gernot Ziegler et al) (pdf with explanations Gernot Ziegler et al) (pdf containing the vec4-histopyramid description)