### GPU Marching Cubes from Point Clouds in WebGL (part 2: marching cubes steps)

15, Jan 2017/Categories 3D, Sin categoría, webGL/No Comments

The author assumes that the reader understands the implementation of the marching cubes algorithm in the CPU, also has knowledge on how to perform a stream compaction over a texture using histopyramids. Please refer to http://paulbourke.net/geometry/polygonise/ and http://www.miaumiau.cat/2016/10/stream-compaction-in-webgl/ for more information about these subjects.

Marching Cubes Steps:

The marching cubes algorithm can be separated in two steps which are the active voxels allocation and the vertices/normals generation. The first step uses a iso level value that defines a surface from the potential, where the active voxels are the ones that are intersected by the corresponding generated surface. To check the intersections all the edges from each voxel are evaluated.

An intersection is obtained if the potential value from one of the corners of the edge is below than the iso level and the other corner’s potential value is above. The different intersections represent one of the 256 possible combinations defined in the marching cubes implementation. These combinations, expressed in tables, help to know the total amount of vertices to generate, and the edges where these vertices would be allocated.

The second step is responsible to generate the vertices positions and normals in the active voxels, this is done taking into account the quantity of vertices/triangles required for the case obtained in the previous step. This pass allocates the edge where the the vertex will be placed, and sets the vertex position with a linear interpolation between the two corners from the edge using the iso level as an interpolator.

To apply the previous two steps explained it’s important to translate the different tables in the implementation from Paul Bourke into two textures.

The first texture saves the 15 possible vertices to generate for each of the 256 combinations presented in the marching cubes, the data can be obtained from the “triTable” found here. This two dimensional data table is expanded in a one dimensional array containing 3840 values (notice that 16th value from each dimension is always -1, so this last one is always discarded), and then allocated inside a 64×64 2D floating point texture. Each fragment saves the corresponding edge where the vertex would be generated in the active voxel where values with -1 represent that no vertices need to be created for that edge. The resulting texture is shown below.

The previous texture is a color encoded visualization from the original floating point texture, each square represents a position where the color defines the edge where the vertex will be allocated. Notice that every 15 squares a new combination is defined, and for each combination the colored squares represent the vertices to generate, and the black ones the “-1” values meaning that there’s no more data needed for that combination.

The second texture to create is also based from the “triTable” used before, but with a different interpretation, the idea is to save the amount to triangles to generate on each one of the cases, these 256 values are saved inside a 16×16 2D floating point texture where all the channels contain the same value; the amount of triangles to generate for each case can be obtained counting the values that are different from -1 and dividing the result by three. The following texture is shows the visual representation of the data.

Active Voxels Evaluation:

With the two previous textures it’s time to evaluate the active voxels, to do so the 3D expanded texture is used to check a condition where each corner of the voxels is tested to evaluate if the corresponding value is below the iso level required. The implementation from Paul Bourke uses bit masking to define which case is used based on a 12 bit number; in the current implementation, a floating point sum is used to represent the case that defines the triangles to generate since GLSL doesn’t allow bit manipulation. The process is done drawing a quad where the fragment shader evaluates all the voxels defined in the 2D texture to evaluate, doing so the whole 3D voxel space is evaluated at once. The following code executes the active voxels search.

precision mediump float;
precision mediump sampler2D;
//Texture container the potential values in the corners of the voxels.
uniform sampler2D uCornersTexture;
//Texture holding the numbers of triangles to generate on each of the 256 cases from the MC
uniform sampler2D uTrianglesIndex;
//Iso level used to define the surface required from the potential.
uniform float uRange;
varying vec2 vText;
//Constants used to simulate the 3D texture in a 2D texture
const vec4 C_3D = vec4(1. / 4096., 256., 16., 1. / 16.);
//Function used to evaluate the 2D index from a 3D position.
vec2 index2D(vec3 pos) {

return C_3D.x * (pos.xz + C_3D.y * vec2(mod(pos.y, C_3D.z), floor(pos.y * C_3D.w)) + vec2(0.5));

}
void main(void) {

//Obtain the 3D voxel position of the corresponding fragment to evaluate.
vec2 position = floor(vText / C_3D.x);
vec3 pos3D = vec3(mod(position.y, C_3D.y), C_3D.z * floor(position.y / C_3D.y) + floor(position.x / C_3D.y), mod(position.x, C_3D.y));
//The MC case to use in the voxel evaluated is calculated as the sum of corners that are below the iso level required.
float c = step(texture2D(uCornersTexture, index2D(pos3D)).r, uRange)
+ 2. * step( texture2D(uCornersTexture, index2D(pos3D + vec3(1., 0., 0.))).r, uRange)
+ 4. * step( texture2D(uCornersTexture, index2D(pos3D + vec3(1., 1., 0.))).r, uRange)
+ 8. * step( texture2D(uCornersTexture, index2D(pos3D + vec3(0., 1., 0.))).r, uRange)
+ 16. * step( texture2D(uCornersTexture, index2D(pos3D + vec3(0., 0., 1.))).r, uRange)
+ 32. * step( texture2D(uCornersTexture, index2D(pos3D + vec3(1., 0., 1.))).r, uRange)
+ 64. * step( texture2D(uCornersTexture, index2D(pos3D + vec3(1., 1., 1.))).r, uRange)
+ 128. * step( texture2D(uCornersTexture, index2D(pos3D + vec3(0., 1., 1.))).r, uRange);
c *= step(c, 254.);
//The total triangles to generate are obtained knowing which one of the 256 cases is required and reading the
//amount triangles from the 16×16 texture provided.
float totalTrianglesToGenerate = texture2D(uTrianglesIndex, vec2(mod(c, 16.), floor(c / 16.)) / 16.).r;
//The resulting fragment saves the amount of triangles to generate and the MC case obtained.
gl_FragColor = vec4(vec3(totalTrianglesToGenerate * 3.), c);

}

The previous shader works over all the fragments generated on the quad, where its UV position is used as a key to obtain the 3D position of the voxel simulated in the texture 2D, this is done using the data from the C_3D constants previously explained.

The sampler uTrianglesIndex is the second texture commented at the beginning of the post. Notice that the indexing is based on the actual marching cubes case obtained from the corresponding voxel/fragment. The variable “uCornersTexture” represents the values of each voxel’s corners potential saved in a texture3D, this is generated from the point cloud animation and it’s evaluated against a iso level defined with the uniform named uRange.

The shader starts finding the 3D position for each fragment, then it evaluates the corresponding corners and checks if the value is below the required iso level (uRange), if the test passes the case variable is modified to finally define which case from the 256 combinations is the one to implement, this is saved in the “c” variable. Also notice that since the case 0 and 255 represent the case where all the vertices are below or over the iso level no triangles will be generated, hence the C value is defined as zero.

Once the marching cubes case is defined the uTrianglesIndex sampler is used to obtain the total amount of vertices to generate for the corresponding voxel. Finally the output of the shader is generated encoding the amount of triangles in the RGB channel and the marching cubes case in the alpha channel. The output of this program is a texture3D which has the active voxels scattered on a 2D texture, this means that a stream compaction pass should be applied to the output of this shader to calculate the corresponding vertices only in the active cells. The result of this step can be seen in the following image.

The same shader can me modified to write a second debugging texture that shows the amount of triangles allocated per voxel, this would give a visual representation of how things are working in the first step. The idea is to use multiple render targets and rewrite the output the shader as follows:

//The resulting fragment saves the amount of triangles to generate and the MC case obtained.
gl_FragData[0] = vec4(vec3(totalTrianglesToGenerate * 3.), c);

//Color implementation to check the amount of triangles required inside each voxel
gl_FragData[1] = vec4(0.);
if(totalTrianglesToGenerate == 1.) gl_FragData[1] = vec4(0., 0., 1., 1.);
if(totalTrianglesToGenerate == 2.) gl_FragData[1] = vec4(0., 1., 0., 1.);
if(totalTrianglesToGenerate == 3.) gl_FragData[1] = vec4(1., 0., 0., 1.);
if(totalTrianglesToGenerate == 4.) gl_FragData[1] = vec4(0., 1., 1., 1.);
if(totalTrianglesToGenerate == 5.) gl_FragData[1] = vec4(1., 0., 1., 1.);

The second output represent a color visualization of the amount of triangle to generate, color convention is defined below:

– blue: one triangle.
– green: two triangles.
– red: three triangles.
– blue + green (aquamarine): four triangles.
– blue + red (purple): five triangles.

The image below shows the visual representation obtained from the active cells step, notice that most of the cells require two triangles per voxel (green color), meaning that the algorithm would be calculating a median of six vertices per active voxel.

Stream Compaction and vertices generation:

Once the active voxels are allocated and each case is defined for them, a stream compaction method should be applied to the resulting texture avoiding the generation of triangles over non active voxels. To do so the current implementation uses the expansion ability of the histopyramids allowing to allocate offsets of one particular value in the output stream during the compaction process.

The compaction process is partially based on this post. In the implementation explained in the previous article one initial base texture is generated from the data texture to compress. This base texture is based on a binary image where each position represent where the data is allocated. Inserting the binary texture in the reduction process the algorithm returns the amount of data (values) in the texture.

Instead of using a binary texture, the base texture would be the result from the active voxels step, where each “active “ fragment represents the amount of vertices required for each active voxel. Using this texture in the reduction process allows to know the total vertices required to generate.

Once the reduction process obtains the total amount of vertices, the compaction reorganize the texture from the active voxels step, generating contiguous offsets of the same voxel information as many times as vertices are defined for each voxel. The following image shows the difference between a compaction of a binary texture and the texture obtained from the active voxels step.

Since the compaction process also returns a unique key for each offset, the result represents the memory required to allocate the amount of vertices to generate, arranged by voxels and indexed by local voxels offsets. This means that the same shader can be used the generate the vertices and normals using the marching cubes algorithm.

The new compaction/expansion and vertices/normals shader is shown below.

#extension GL_EXT_draw_buffers : require
precision highp float;
precision highp sampler2D;
//Pyramid texture containing all the reduction steps.
uniform sampler2D uPyramid;
//Active voxels texture.
uniform sampler2D uBase;
//Potential corners values.
uniform sampler2D uPot;
//triTable data from Paul Bourke 64×64 texture
uniform sampler2D uTrianglesIndexes;
//level 0 of the pyramid.
uniform sampler2D uTotal;
//Uniform used to defined normals quality
uniform bool uFastNormals;
//iso level used to define the surface from the potential
uniform float uRange;
varying vec2 vText;
const float EPSILON = 0.00001;
//Inverse of the size from the 3D texture 256^3
const float p = 1. / 256.;
//Corners for any given voxel.
const vec3 p0 = vec3(p, 0., 0.);
const vec3 p1 = vec3(p, p, 0.);
const vec3 p2 = vec3(0., p, 0.);
const vec3 p3 = vec3(0., 0., p);
const vec3 p4 = vec3(p, 0., p);
const vec3 p5 = vec3(p, p, p);
const vec3 p6 = vec3(0., p, p);
//Constants used to simulate the 3D texture
const vec4 C3D = vec4(4096., 256., 16., 1. / 16.);
//Calculate the 2D index to read the data in the 2D texture based in a 3D position
vec2 index2D(vec3 pos) {

return (pos.xz + C3D.y * vec2(mod(pos.y, C3D.z), floor(pos.y / C3D.z)) + vec2(0.5)) / C3D.x;

}
void main(void) {

//Evaluate the 1D index of the fragment evaluated.
float vI = dot(floor(C3D.x * vText), vec2(1., C3D.x));
//If the fragment’s key is higher than the total of vertices needed to create the execution is halted.
else {

/*
*This is the compaction process, it’s explained in http://www.miaumiau.cat/2016/10/stream-compaction-in-webgl/
*/
float offset = C3D.x – 2.;
float k = 1. / C3D.x;
vec2 relativePosition = k * vec2(offset, 0.);
vec4 partialSums = texture2D(uPyramid, relativePosition);
float start = 0.;
vec4 starts = vec4(0.);
vec4 ends = vec4(0.);
float div0 = 1. / C3D.y;
float diff = 2.;
vec4 m = vec4(0.);
vec2 position = vec2(0.);
vec4 vI4 = vec4(vI);
//12 steps are required to parse the different levels of the pyramid.
for(int i = 1; i < 12; i++) {
offset -= diff;
diff *= 2.;
relativePosition = position + k * vec2(offset, 0.);
ends = partialSums.wzyx + vec4(start);
starts = vec4(start, ends.xyz);
m = vec4(greaterThanEqual(vI4, starts)) * vec4(lessThan(vI4, ends));
relativePosition += m.y * vec2(k, 0.) + m.z * vec2(0., k) + m.w * vec2(k, k);
start = dot(m, starts);
position = 2. * (relativePosition – k * vec2(offset, 0.));
partialSums = texture2D(uPyramid, relativePosition);

}
ends = partialSums.wzyx + vec4(start);
starts = vec4(start, ends.xyz);
m = vec4(greaterThanEqual(vI4, starts)) * vec4(lessThan(vI4, ends));
position += m.y * vec2(k, 0.) + m.z * vec2(0., k) + m.w * vec2(k, k);

/*
* MARCHING CUBES TO GENERATE THE VERTICES
* POSITIONS AND NORMALS
*/
//This data contains the 2D position of the voxel reallocated, the index offset for the vertex to generate in the corresponding voxel
//and the MC case used for that voxel.
vec4 data = vec4(position * C3D.x, vI – dot(m, starts), texture2D(uBase, position).a);
//Up to 15 vertices can be generated per voxel, the current vertex to generate is saved in this variable
float currentVertex = data.b;
//Calculate the 3D position of the voxel based on the 2D position in the scattered data.
data.xyz = p * vec3(mod(data.y, C3D.y), C3D.z * floor(data.y * p) + floor(data.x * p), mod(data.x, C3D.y));
//Obtain the one dimensional index to read the corresponding edge to use.
float mcIndex = 15. * data.a + currentVertex;
//Obtain the edge to use from the voxel using the previous index and reading the data from the triTable texture.
vec4 mcData = texture2D(uTrianglesIndexes, vec2(mod(mcIndex, 64.), floor(mcIndex / 64.)) / 64.);
mcIndex = mcData.r;
//To obtain the two points that define the edge the original implementation uses a set of if conditionals.
//The shader makes a sum of all the corners using masks to discard the values that are not needed.
vec4 m0 = vec4(mcIndex);
//Check if values are in the edge
vec4 m1 = vec4(equal(m0, vec4(0., 1., 2., 3.)));
vec4 m2 = vec4(equal(m0, vec4(4., 5., 6., 7.)));
vec4 m3 = vec4(equal(m0, vec4(8., 9., 10., 11.)));
//The two points are the voxel position plus the point active using the mask calculated before.
vec3 b0 = data.rgb + m1.y * p0 + m1.z * p1 + m1.w * p2 + m2.x * p3 + m2.y * p4 + m2.z * p5 + m2.w * p6 + m3.y * p0 + m3.z * p1 + m3.w * p2;
vec3 b1 = data.rgb + m1.x * p0 + m1.y * p1 + m1.z * p2 + m2.x * p4 + m2.y * p5 + m2.z * p6 + m2.w * p3 + m3.x * p3 + m3.y * p4 + m3.z * p5 + m3.w * p6;
//Potential values in the corresponding corners
vec4 n0 = texture2D(uPot, index2D(C3D.y * b0));
vec4 n1 = texture2D(uPot, index2D(C3D.y * b1));
vec2 diff1 = vec2(uRange – n0.a, n1.a – n0.a);
//Value used to evaluate the linear interpolation between the two corners points to define the position of the vertex to generate.
vec3 mult = vec3(lessThan(abs(vec3(diff1.x, uRange – n1.a, -diff1.y)), vec3(0.)));
vec3 normalA = vec3(0.);
vec3 normalB = vec3(0.);
/*
*or applying a 3D Sobel kernel.
*/
if(uFastNormals) {

vec2 deltaX = index2D(C3D.y * (b0 + vec3(p, 0., 0.)));
vec2 deltaY = index2D(C3D.y * (b0 + vec3(0., p, 0.)));
vec2 deltaZ = index2D(C3D.y * (b0 + vec3(0., 0., p)));
normalA = normalize(-vec3(n0.a – texture2D(uPot, deltaX).a, n0.a – texture2D(uPot, deltaY).a, n0.a – texture2D(uPot, deltaZ).a));

deltaX = index2D(C3D.y * (b1 + vec3(p, 0., 0.)));
deltaY = index2D(C3D.y * (b1 + vec3(0., p, 0.)));
deltaZ = index2D(C3D.y * (b1 + vec3(0., 0., p)));
normalB = normalize(-vec3(n1.a – texture2D(uPot, deltaX).a, n1.a – texture2D(uPot, deltaY).a, n1.a – texture2D(uPot, deltaZ).a));

} else {

//If more smooth gradients are required, a higher order Sobel operator is used to calculate them.
//this gives a more smoothed surface at the expense of less performance.
float op = 1.;
float scaler = 1.;
vec3 S1A_X = vec3(1., op, 1.);
vec3 S2A_X = vec3(op, scaler * op, op);
vec3 S3A_X = vec3(1., op, 1.);
vec3 S1B_X = vec3(0.);
vec3 S2B_X = vec3(0.);
vec3 S3B_X = vec3(0.);
vec3 S1C_X = vec3(-1., -op, -1.);
vec3 S2C_X = vec3(-op, -scaler * op, -op);
vec3 S3C_X = vec3(-1., -op, -1.);
vec3 S1A_Y = vec3(1., op, 1.);
vec3 S2A_Y = vec3(0., 0., 0.);
vec3 S3A_Y = vec3(-1., -op, -1.);
vec3 S1B_Y = vec3(op, scaler * op, op);
vec3 S2B_Y = vec3(0., 0., 0.);
vec3 S3B_Y = vec3(-op, -scaler * op, -op);
vec3 S1A_Z = vec3(-1., 0., 1.);
vec3 S2A_Z = vec3(-op, 0., op);
vec3 S3A_Z = vec3(-1., 0., 1.);
vec3 S1B_Z = vec3(-op, 0., op);
vec3 S2B_Z = vec3(-scaler * op, 0., scaler * op);
vec3 S3B_Z = vec3(-op, 0., op);
vec3 f1A = vec3(texture2D(uPot, index2D(C3D.y * (b0 + vec3(-p, p, p)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(0., p, p)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(p, p, p)))).a);
vec3 f2A = vec3(texture2D(uPot, index2D(C3D.y * (b0 + vec3(-p, 0., p)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(0., 0., p)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(p, 0., p)))).a);
vec3 f3A = vec3(texture2D(uPot, index2D(C3D.y * (b0 + vec3(-p, -p, p)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(0., -p, p)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(p, -p, p)))).a);
vec3 f1B = vec3(texture2D(uPot, index2D(C3D.y * (b0 + vec3(-p, p, 0.)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(0., p, 0.)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(p, p, 0.)))).a);
vec3 f2B = vec3(texture2D(uPot, index2D(C3D.y * (b0 + vec3(-p, 0., 0.)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(0., 0., 0.)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(p, 0., 0.)))).a);
vec3 f3B = vec3(texture2D(uPot, index2D(C3D.y * (b0 + vec3(-p, -p, 0.)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(0., -p, 0.)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(p, -p, 0.)))).a);
vec3 f1C = vec3(texture2D(uPot, index2D(C3D.y * (b0 + vec3(-p, p, -p)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(0., p, -p)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(p, p, -p)))).a);
vec3 f2C = vec3(texture2D(uPot, index2D(C3D.y * (b0 + vec3(-p, 0., -p)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(0., 0., -p)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(p, 0., -p)))).a);
vec3 f3C = vec3(texture2D(uPot, index2D(C3D.y * (b0 + vec3(-p, -p, -p)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(0., -p, -p)))).a, texture2D(uPot, index2D(C3D.y * (b0 + vec3(p, -p, -p)))).a);
float xGradient = dot(f1A, S1A_X) + dot(f2A, S2A_X) + dot(f3A, S3A_X) + dot(f1B, S1B_X) + dot(f2B, S2B_X) + dot(f3B, S3B_X) + dot(f1C, S1C_X) + dot(f2C, S2C_X) + dot(f3C, S3C_X);
float yGradient = dot(f1A, S1A_Y) + dot(f2A, S2A_Y) + dot(f3A, S3A_Y) + dot(f1B, S1B_Y) + dot(f2B, S2B_Y) + dot(f3B, S3B_Y) + dot(f1C, S1A_Y) + dot(f2C, S2A_Y) + dot(f3C, S3A_Y);
float zGradient = dot(f1A, S1A_Z) + dot(f2A, S2A_Z) + dot(f3A, S3A_Z) + dot(f1B, S1B_Z) + dot(f2B, S2B_Z) + dot(f3B, S3B_Z) + dot(f1C, S1A_Z) + dot(f2C, S2A_Z) + dot(f3C, S3A_Z);
f1A = vec3(texture2D(uPot, index2D(C3D.y * (b1 + vec3(-p, p, p)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(0., p, p)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(p, p, p)))).a);
f2A = vec3(texture2D(uPot, index2D(C3D.y * (b1 + vec3(-p, 0., p)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(0., 0., p)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(p, 0., p)))).a);
f3A = vec3(texture2D(uPot, index2D(C3D.y * (b1 + vec3(-p, -p, p)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(0., -p, p)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(p, -p, p)))).a);
f1B = vec3(texture2D(uPot, index2D(C3D.y * (b1 + vec3(-p, p, 0.)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(0., p, 0.)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(p, p, 0.)))).a);
f2B = vec3(texture2D(uPot, index2D(C3D.y * (b1 + vec3(-p, 0., 0.)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(0., 0., 0.)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(p, 0., 0.)))).a);
f3B = vec3(texture2D(uPot, index2D(C3D.y * (b1 + vec3(-p, -p, 0.)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(0., -p, 0.)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(p, -p, 0.)))).a);
f1C = vec3(texture2D(uPot, index2D(C3D.y * (b1 + vec3(-p, p, -p)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(0., p, -p)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(p, p, -p)))).a);
f2C = vec3(texture2D(uPot, index2D(C3D.y * (b1 + vec3(-p, 0., -p)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(0., 0., -p)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(p, 0., -p)))).a);
f3C = vec3(texture2D(uPot, index2D(C3D.y * (b1 + vec3(-p, -p, -p)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(0., -p, -p)))).a, texture2D(uPot, index2D(C3D.y * (b1 + vec3(p, -p, -p)))).a);
xGradient = dot(f1A, S1A_X) + dot(f2A, S2A_X) + dot(f3A, S3A_X) + dot(f1B, S1B_X) + dot(f2B, S2B_X) + dot(f3B, S3B_X) + dot(f1C, S1C_X) + dot(f2C, S2C_X) + dot(f3C, S3C_X);
yGradient = dot(f1A, S1A_Y) + dot(f2A, S2A_Y) + dot(f3A, S3A_Y) + dot(f1B, S1B_Y) + dot(f2B, S2B_Y) + dot(f3B, S3B_Y) + dot(f1C, S1A_Y) + dot(f2C, S2A_Y) + dot(f3C, S3A_Y);
zGradient = dot(f1A, S1A_Z) + dot(f2A, S2A_Z) + dot(f3A, S3A_Z) + dot(f1B, S1B_Z) + dot(f2B, S2B_Z) + dot(f3B, S3B_Z) + dot(f1C, S1A_Z) + dot(f2C, S2A_Z) + dot(f3C, S3A_Z);

}

//Save the vertex position, uses a linear interpolation between the two corners points of the edge and the iso value required.
gl_FragData[0] = vec4(mult.x * b0 + mult.y * b1 + mult.z * b0 + (1. – dot(mult, mult)) * mix(b0, b1, (diff1.x) / (diff1.y)), mcData.b);
//Save the vertex normal, calculate as the ponderated median of the two normals from the corners.
gl_FragData[1] = vec4(normalize(- (n0.a * normalA + n1.a * normalB) / max(n0.a + n1.a, EPSILON)), mcIndex);

}

}

This shader is divided into two steps, the vertices compaction/expansion process (highlighted in blue) and the vertex position and normal generation for each one of the vertices reallocated (highlighted in green). This fragment shader is also run over a quad that represents the 2D representation of the 3D voxel space. The uniforms for the programs are explained below.

uPyramid: input texture that contains all the different levels of the pyramid generation, is created while doing the reduction process of the histopyramids.

uBase: input texture containing the active cells result from the previous step.

uPot: input texture containing the potential values of the corners from the voxels of the 3D world, calculated in the corners step.

uTrianglesIndexes: input texture containing the edges where the vertices will be allocated, it’s based on the triTable from the Marching Cubes implementation of Paul Bourke.

uTotal: input texture of 1×1 containing the total amount of vertices to generate, calculated during the reduction process, this data could also be read using the uPyramid texture.

uFastNormals: input boolean value used to define if the normals are calculated using forward differences or a Sobel kernel.

uRange: iso level used to define the surface required from the potential.

The shader starts evaluating if the one dimensional index key of the fragment to evaluate is lower than the total amount of vertices required, if it’s higher the vertex is discarded. This is done to avoid useless calculations over the quad. If the fragment’s key is below the shader keeps the execution.

The compaction process reallocate the vertices needed for each active voxels in the fragments that pass the first condition, it uses the masks explained in here using the optimizations for the reduction process and the compaction step.

The result for this first step is the corresponding 2D position of the voxel allocated in the active voxels texture, this 2D position will be repeated as many times as vertices are needed for the corresponding voxel in the subsequent fragments. Once this position is obtained the shader continues to the second part, the positions and normals generation.

The second part of the shader starts calculating the 3D position of the voxel based on the 2D position from the input texture, and it reads the marching cubes combination obtained in the active voxels step, finally it calculates the key offset (vertex number) of the fragment relative to the corresponding voxel.

With those values calculated the next step is to obtain the corresponding edge where the vertex will be allocated, this is done calculating an index used to read the data from the triTable texture containing the edges. The index is saved in the mcIndex variable, and the corresponding edge saved inside the variable called mcData.

Paul Bourke’s original implementation uses a set of if conditionals to check which edge to use depending on the vertex to calculate inside the voxel. In the current implementation the shader uses masks to discard the edges that are not required inside sums of all the possible corners and the position of the voxel. These sums returns the positions of the edge’s corners saved in the variables b0 and b1.

The corresponding potential values for these corners are read in the corners texture using b0 and b1 as input indexes, the results are saved in the n1 and n1 variables, then two more variables “diff1” and “mult” are saved to perform the linear interpolation that handles the position of the vertex over the edge.

With the gradients calculated and the potentials of the corners obtained, the vertex normal and position is evaluated as a linear interpolation of the corner’s positions using the corresponding potentials datas and the iso surface value to make the interpolation. The shader outputs two textures with the compacted data for positions and normals shown below.

Vertex positions generated from the last step.

Vertex normals generated from the last step, encoded in the [0-1] space for visualization.

Notice that the current implementation recalculates vertices for adjacent triangles, meaning that much of the information from the last two textures is repeated. In order to implement the third method commented in http://http.developer.nvidia.com/GPUGems3/gpugems3_ch01.html a second histopyramid would be required.

The idea is to compact the information containing only the edges 0, 3 and 8 from the active cells step (with the second histopyramid) and calculate the corresponding data (vertex positions and normals for 0, 3 and 8 edges) with the last shader explained. The indexes can be calculated associating the original texture obtained in the active cells step, and the one with the 0, 3 and 8 edges.

The output of this implementation would be the positions, normals and indexes textures.

Tests done with this approach show that the current hystopyramid implementation becomes the bottleneck of the application in terms of time execution, hence requiring a second one for the indexes brings more penalties than benefits for performance. With a faster stream compaction method the idea of generating indexes would improve the performance of the algorithm in general.

Divergence and Branching:

The reader will notice that the last shader uses two conditional branches, one to discard the fragments where their 1D positional keys are higher than the total amount of vertices to generate, and a second one for the selection of the gradient type.

The gradient branching don’t present any divergence overt the threads since the instructions will be the same ones for each one of them, that’s because all the threads use the same boolean (uniform) passed to the shader.

The discard branching will present divergence on few warps/wavefronts since the data is compacted, this means that mostly each warp/wavefront containing 32 threads will execute the same code for all the fragments that generates positions and normals. On the other hand all the warps/wavefronts containing fragments with higher indexes would be grouped and discarded at the same time.

Divergence will be observed in the warps/wavefronts where there’s a group of fragment discarded, and the rest keep the execution (position/normal calculation), the different paths will force the warp to deal with both control paths and show a 50% performance loss on those warps. These warps would be allocated in the close to horizontal limit that separates the data generated and the empty data in the texture
This is one of the main reasons why data shouldn’t be calculated in a scattered fashion over a texture, in that case every warp could potential show divergence and performance would drop for each one of them, compaction limits the amount of warps where divergence happen.

Divergence can also be presented in the color implementation for the active cells evaluation, this is because the different colors are evaluated using a different set of if conditionals based on the result of each fragment’s different data, but since this is only meant for debugging no further optimizations where done in this section of the shader.

If the user would like to have a non divergence implementation for the color definition, it’s recommended to use float converted boolean masks on a weighted sum of the colors expected, it would represent more instructions, but all the threads will execute the same instructions at the same time.

More information about GPU pipeline, warps, branching and divergence can be found in the following articles https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline and http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/lecture5.pdf

The last part of these series will explain the javascript implementation of the shaders explained, and will provide different examples with different models at 128^3 and 256^3 resolutions.

### GPU Marching Cubes from Point Clouds in WebGL (part 1: Potential Generation)

10, Jan 2017/Categories 3D, Sin categoría, webGL/No Comments

In order to render fluids in webGL from particles simulations, an implicit surface method has to be used to create a mesh from a point cloud that represents the current state of the simulation on each frame. To do so there are many algorithms that can be applied, being the Marching Cubes a good approach for webGL since its implementation can take advantage of histopyramids to accelerate the process in the GPU.

This algorithm is an iso surface extraction method from a potential field that uses a divide and conquer process locating the surface inside a user defined 3D grid structure (voxels), generating triangles where the iso surface intersects the voxels of the 3D space. The intersection of the iso surface with each voxel can be defined by the cut of each of the edges from the cube that represents the voxel.

The algorithm checks the value of the two vertices that define the edge in order to check if one of the edges from the voxel has been crossed by the surface. If the two vertices are below or above the iso surface value the edge hasn’t been crossed, on the other hand if one of the vertices is below and the other vertex is above, the edge has been crossed by the iso surface.

Each crossed edge from the voxel generates a vertex position that will be part of one of the triangles to create in the corresponding cube, and since there are a limited amount of edges a finite amount of triangles positions can be defined, being a total of 256 possible combinations present in the algorithm with up to five triangles per voxel. The final vertex position between the two borders is evaluated based on a linear interpolation of the potential value from the two vertex of the edge, using the iso surface value as the input parameter for the interpolation.

The first image (taken from http://paulbourke.net/geometry/polygonise/) shows the vertices and edge numbering convention used for the marching cubes, and the second image shows how one voxel is crossed by the iso surface in the edges 11, 2 and 3, generating a set of three vertices in the edges commented, the triangle generated is allocated on those three vertices obtained.

Since the calculation of each vertex position can be done individually the process can be easy parallelized using the GPU. Also the total amount of vertices to generate can be obtained in advance, this allows to allocate the required memory to generate the geometry, since the process relays heavily on tables that define the amounts of vertices per voxel.

A common approach to implement the marching cubes algorithm in the GPU is to use geometry shaders since the method is expected to generate a set of vertices from a single voxel query, this represents a big limitation in webGL because the current pipeline only supports vertex and fragment shaders. To overcome this impediment histopyramids can provide an effective way to allocate the required vertices needed to create the triangles.

The complete process for the surface generation can be separated in three different blocks, which are the potential generation, active voxels evaluation and finally the generation of the corresponding vertices positions with its normals for the triangles to render. This three blocks are defined in the following diagram.

3D Voxels Compacted Texture Generation

The marching cubes algorithm requires a potential field that has to be generated from the particle cloud. This is done using a separable blur that spreads the particle data generating the required potentials, the resulting blur creates a simulated non signed distance field blurring in the 3 main axes where the coefficients and the the radius of the blur control the smoothness of the surface.

Since the blurring process is a quite demanding step it can become the bottleneck of the surface generation, it all depends on the size of the 3D texture used to allocate the voxels where the particles reside. In order to improve the performance of the blurring process a 3D compacted texture is generated, using the RGBA channels of the texture to compress the different depth buckets of a conventional 3D texture.

The idea is to take advantage of the vector capabilities of the GPU to place all the voxel data in the four channels of each fragment, that would allow to blur the 3D texture on a smaller texture size. As an example a 256^3 texture can be simulated on a 4096×4096 non compressed 2D texture, where the space will be segmented in 256 buckets of 256×256 pixels.

The previous texture can be compressed using a 2048×2048 2D texture where the “R” channel holds the 256×256 size buckets which depths are between [0-64), the “G” channel holds the buckets which depths are between [64-128), the “B” channel holds the buckets which depths are between [128-192) and the “A” channel holds the buckets between depths allocated in [192-256).

The image above shows two examples of a 256^3 compacted 3D textures. A simple inspection shows 64 (8X8) buckets of 256×256 pixels, where the depths can be read counting the bucket position and the corresponding RGBA color displayed. Notice that the same 2D position can represent more than one bucket (left 3D texture), in this case the white pixels represent that there’s a voxel with data for all the four depths encoded in those fragments.

A scattering program is written to populate the 3d texture where the vertex shader is responsible to allocate the particles inside the texture, and the fragment shader only assigns the corresponding color to each fragment. The following shaders are used to generate the required program

//2D index to allocate the position of the particle in the corresponding texture.
attribute vec2 aVertexIndex2D;
//Texture holding the positions of the particles from the simulation.
uniform highp sampler2D uPositionTexture;
//Size of the particle to draw in the 3d texture.
uniform float uSize;
//Displacement on the main axis to simulate the size of the particle in 3D
uniform float uDesface;
varying vec2 vV2I;
varying vec3 vPos;
//Constants used to simulate the 3D texture
const vec4 c3D = vec4(1. / 2048., 256., 8., 1. / 8.);
varying vec4 vColor;
void main(void) {

vV2I = aVertexIndex2D;
//obtain the position from the particle in [0-1] space.
vPos = texture2D(uPositionTexture, aVertexIndex2D).rgb;
//Define the 3D position in voxel space [0-256].
vec3 gPP = floor(vPos * c3D.y);
//Displace the main axis position to simulate a cube particle in 3D
gPP.y -= uDesface;
//Obtain the corresponding 2D position for the 2D texture.
vec2 gP = c3D.x * (gPP.xz + c3D.y * vec2(mod(gPP.y, c3D.z), floor(gPP.y * c3D.w)) + vec2(0.5));
//This defines the corresponding depth of the particle, it’s used to encode it in the
//corresponding channel.

float zLevel = floor(gP.y);
gP.y = fract(gP.y);
gP = 2. * gP – vec2(1.);
//The color of the channel depends on the corresponding depth of the particle.
vColor = vec4(equal(vec4(zLevel), vec4(0., 1., 2., 3.)));
//Set the size of the particle to render.
gl_PointSize = uSize;
gl_Position = vec4(gP, 0., 1.0);

}

precision highp float;
varying vec4 vColor;
void main(void) {

gl_FragColor = vColor;

}

The vertex shader uses the attribute aVertexIndex2D to define the UV required to read the 3d position of each particle represented as a vertex, these positions are saved in the uPositionTexture uniform. The uniform uSize is used to define the size of the particle in the voxel space, this is useful when there’re not too many particles in the simulation, or the geometry to generate is based on filaments generated from particles.

The constant c3D is a vector that is used to allocate the data in buckets that simulate the array of depths from a 3D texture. The corresponding channels are defined below:
c3D.x: inverse of the size of the texture 2D used to simulate the texture 3D (usually 1024 or 2048).
c3D.y: bucket side size for the texture3D simulated (usually 128 from 128^3 or 256 from 256^3)
c3D.z: number of buckets allocated in the horizontal row from the 2D texture. Usually 8 or 16 buckets.

Since the particles can have sizes bigger than one pixel, it’s required to represent them as cubes in the 3D texture. To do so the vertex shader uses the uOffset uniform, this value represents the displacement in the depth axis (in units) that the shader should offset to render the particle. This would represent the cube using the different slices to represent the size in 3D.

This means that the program has to be invoked as many times as the size of the particles, incrementing the value of uOffset on each pass to render the corresponding slices of the particle’s cube in the 3D texture.

One depth value is saved in the variable zLevel to define which channel will be used, since only 64 buckets can be represented per channel the zValue will define a value from [0-4). If the zValue is 0 the corresponding voxel will be encoding in the “R” channel, for a zValue of 1 the “G” channel is used, for zValue equal of 2 the “B” channel is used. Finally for a zValue of 3 the “A” channel holds the corresponding voxel.

3D Blur for Potential Field Generation

Once the 3D texture is filled with the particles a second program is required to generate the potentials using a 3D blur. This is done using a quad pass, and two different fragment shaders, one for the blur corresponding the non depth axis, and another fragment shader for the blur corresponding the depth axis. The fragment shader for the depth axis is defined below.

precision highp float;
precision highp sampler2D;
uniform sampler2D uDataTexture;
uniform float uSteps;
varying vec2 vText;
//strength of the pixel to blur agains the other ones used.
const float coeficient = 2.;
const vec4 c3D = vec4(1. / 2048., 256., 8., 1. / 8.);
const float border = 1.;
//Evaluates the 2d index value in the simulated 3D texture from the 3D pos.
vec2 index2D(vec3 pos) {

return c3D.x * (pos.xz + c3D.y * vec2(mod(pos.y, c3D.z), floor(pos.y * c3D.w)) + vec2(0.5));

}
void main(void) {

//Obtain the 3D pos of the corresponding fragment.
vec2 pos = floor(vText / c3D.x);
vec3 pos3D = vec3(mod(pos.y, c3D.y), c3D.z * floor(pos.y / c3D.y) + floor(pos.x / c3D.y), mod(pos.x, c3D.y));
vec3 newPos3D = vec3(0.);
vec2 uv = vec2(0.);
vec4 blend = vec4(0.);
float zLevel = 0.;
//Obtain the z level for the corresponding fragment.
float currentZLevel = floor(pos3D.y / 64.);
for (float i = 0.; i <= 40.; i ++) {
if(i > 2. * uSteps) break;
float j = i – uSteps;
//Obtain the new 3D pos of the fragment to use for blurring.
newPos3D = pos3D + j * vec3(0., 1., 0.);
//Obtain the z level for the new fragment to read.
zLevel = floor(newPos3D.y / 64.);
uv = index2D(newPos3D);
uv.y = fract(uv.y);
float k = j == 0. ? coeficient : 1.;
vec4 newBucket = texture2D(uDataTexture, uv);
//If the new fragment is in the same depth range than the original fragment to blur then the same channels are used.
//If the new zLevel is different than the current Z level the blurring have to be done taking into account the
//channel differences between the two fragments.

vec3 cases = vec3(bvec3(zLevel < currentZLevel, zLevel == currentZLevel, zLevel > currentZLevel));
blend += k * (vec4(0., newBucket.rgb) * cases.x + newBucket * cases.y + vec4(newBucket.gba, 0.) * cases.z);

}
blend /= (2. * uSteps + coeficient);
//This avoids to spread information between the different buckets.
blend *= float(mod(pos.x, 256.) > border && mod(pos.y, 256.) > border && mod(pos.x, 256.) < 255. - border && mod(pos.y, 256.) < 255. - border); gl_FragColor = blend;

}

Fragment shader for two non depth coordinates blur:

precision mediump float;
precision mediump sampler2D;
uniform sampler2D uDT;
uniform vec3 uAxis;
uniform float uSteps;
varying vec2 vText;
const float coeficient = 2.;
void main(void) {

vec4 blend = vec4(0.);
for (float i = 0.; i <= 40.; i ++) {
if(i > 2. * uSteps) break;
float j = i – uSteps;
float k = j == 0. ? coeficient : 1.;
blend += k * texture2D(uDT, vText + j * uAxis.xy);

}
blend /= (2. * uSteps + coeficient);
gl_FragColor = blend;

}

Since the 3D texture is simulated using contiguous buckets in the 2D texture, the blur for the depth axis has to take into account the depth range for every fragment used in the blurring process, hence a vector is defined to know if the depth of the fragment to use is in the range of the depth from the fragment to blur. There are three different scenarios depending on the two depth values (fragment to blur “currentZLevel” and fragment to use for blurring “zLevel”), these are defined below.

if zLevel is equal than the currentZlevel, the original channels are blended with the new channels with the following equation blend += newBucket.rgba.
If zLevel is below the currentZlevel, the original channels are blurred with the new data using the following equation… blend += vec4(0, newBucket.rgb).
if zLevel is over the currentZlevel, the original channels are blurred with the new data using the following equation… blend += vec4(newBucket.gba, 0).

The three previous scenarios are used to match the compressed RGBA data between the two values, these equations are not necessary for the two other axis (non depth ones) since the blurring is done inside the same depth bucket.

For the non depth axis a simple box blur is used defining the blurring direction based on a user provided uniform uAxis, this allows to make the 3D blur in three different separable passes. Notice that the filtering done is based on box coefficients, but these could be modified to make a gaussian filter or any type of filtering that could provide better results for the potential to simulate.

The following image shows the result of the potential generation for the two previous 3D textures, notice that for the left image the buckets are separated avoiding blurring between the buckets. It can be seen in the left 3D texture that the appearance of the green and dark blue colors show how the different depths are being blended, meaning that the potential is being softened among the different depths.

Corners Calculation in the Potential Field

Once the potential is obtained, the marching cubes algorithm requires to evaluate the values in the corners of the voxels. In order to avoid software trilinear interpolations another quad pass is done over the result from the 3D blur passes to calculate values in the corners. Since it’s a quad evaluation the same vertex shader from the previous passes can be reused, and the corresponding fragment shader for the program is defined below.

precision highp float;
precision highp sampler2D;
uniform sampler2D uDataTexture;
varying vec2 vText;
vec3 data[7];
const vec4 c3D = vec4(1. / 2048., 256., 8., 1. / 8.);
vec2 index2D(vec3 pos) {

return c3D.x * (pos.xz + c3D.y * vec2(mod(pos.y, c3D.z), floor(pos.y * c3D.w)) + vec2(0.5));

}
void main(void) {

vec2 pos = floor(vText / c3D.x);
vec3 pos3D = vec3(mod(pos.y, c3D.y), c3D.z * floor(pos.y / c3D.y) + floor(pos.x / c3D.y), mod(pos.x, c3D.y));
data[0] = vec3(-1., -1., -1.);
data[1] = vec3(0., -1., -1.);
data[2] = vec3(0., 0., -1.);
data[3] = vec3(-1., 0., -1);
data[4] = vec3(-1., -1., 0.);
data[5] = vec3(0., -1., 0.);
data[6] = vec3(-1., 0., 0.);
float currentZLevel = floor(pos3D.y / 64.);
vec2 uv = index2D(pos3D);
uv.y = fract(uv.y);
vec4 corner = texture2D(uDataTexture, uv);
vec3 newPos3D = vec3(0.);
float zLevel = 0.;
for(int i = 0; i < 7; i ++) {
newPos3D = pos3D + data[i];
zLevel = floor(newPos3D.y / 64.);
uv = index2D(newPos3D);
uv.y = fract(uv.y);
vec4 newBucket = texture2D(uDataTexture, uv);
vec3 cases = vec3(bvec3(zLevel < currentZLevel, zLevel == currentZLevel, zLevel > currentZLevel));
corner += vec4(0., newBucket.rgb) * cases.x + newBucket * cases.y + vec4(newBucket.gba, 0.) * cases.z;

}
gl_FragColor = vec4(corner * 0.125);

}

Notice that the corners evaluation is done using the compressed blurred texture from the previous pass, this means that the same blending equations have to be used to match the different channels based on the depths regions of the fragments to use. With this final pass the algorithm has two potentials textures, one for center values and other for corner values, this last one is the texture used for the marching cubes steps.

Voxel Potential Field Expansion

For the marching cubes steps it’s not necessary to work with a compressed texture, hence a expansion step is used to obtain a 3D texture where each voxel is represented using a single fragment from the new texture. This is also done using a fragment shader defined below.

precision highp float;
precision highp sampler2D;
uniform sampler2D uDataTexture;
const vec4 c3D_h = vec4(1. / 4096., 256., 16., 1. / 16.);
const vec4 c3D_l = vec4(1. / 2048., 256., 8., 1. / 8.);
varying vec2 vText;
void main(void) {

//Obtain the 3D position of the corresponding fragment in the expanded texture
vec2 pos = floor(vText / c3D_h.x);
vec3 pos3D = vec3(mod(pos.y, c3D_h.y), c3D_h.z * floor(pos.y / c3D_h.y) + floor(pos.x / c3D_h.y), mod(pos.x, c3D_h.y));
//Define the depth range for the corresponding fragment.
float zLevel = floor(pos3D.y / 64.);
vec2 uv = c3D_l.x * (pos3D.xz + c3D_l.y * vec2(mod(pos3D.y, c3D_l.z), floor(pos3D.y * c3D_l.w)) + vec2(0.5));
uv.y = fract(uv.y);
//Set the value of the fragment based on the depth and the corresponding channel.
gl_FragColor = vec4(dot(texture2D(uDataTexture, uv), vec4(equal(vec4(zLevel), vec4(0., 1., 2., 3.)))));

}

The previous shader expands a 3D compressed RGBA texture allocated in a 2048×2048 2D texture where the visual buckets are allocated in 8 x 8 array of 256×256 fragments buckets. The result is a 4096×4096 texture where the visual buckets are uncompressed into an array of 16 x 16 buckets of 256×256 fragments. The different constant values for the expansion are defined in c3D_h and c3D_l.

The shader evaluates the 3D position of the fragment in the expanded texture and search for that value in the compressed texture, using the depth range to define with RGBA channel will be read to obtain the corresponding information. The image below shows the result of the expansion from the texture.

The left side of the previous image shows the compressed data in a 2048^2 area, allowing to calculate the blur and corners evaluation much faster than using a non compressed data. A speed up to 4X is gained based on the fact that the compressed texture area is 4 times smaller than the non compressed one. The right side shows the expanded result where the 256 depth buckets can be inspected visually without color interpretations.

Once the potential is generated it’s time to obtain the active voxels to use for the marching cubes, and generate the corresponding vertices, this will be explained in a second post, finally a third post will explain the javascript implementation and provide different examples with it.

### Stream Compaction in WebGL

15, Oct 2016/Categories 3D, webGL/No Comments

One of the big limitations when doing GPGPU computing is that since the architecture is designed to work on parallel, re arranging or compacting data is a non trivial task on the GPU, to overcome this limitation different stream compaction methods have been created, being histopyramids the algorithm that will be discussed in this post. Histopyramids are useful since it allows to allocate the data at the beginning of one texture, avoiding to evaluate complex shaders in fragments that don’t have useful information, that’s why this process is quite important making general computation using graphic cards.

The following image shows what is achieved with stream compaction. The texture represent a set of slices from a simulated 3D texture showing the different Z levels, in a 2D representation the data is scattered all over the texture, The colored line represents the same data compacted and displaying the 3D position of the voxels reallocated.

Algorithm:

Implementing histopyramids is not a straightforward process, it requires two different steps which are reduction and traversal (compaction). The first phase, reduction, is the process where the algorithm generates a set of textures levels based on the original texture (base texture) holding the scattered data; the amount of textures to generate in the reduction process are based on the size of the original texture using the following equation.

Textures to Generate = Log2(original texture size). (for power of two textures) (1)

For the reduction process, a base texture has to be generated from the original data texture, the idea is to define a binary image where each position will be marked with a RGBA value of vec4(1.) in the binary new texture. This would define a transparent texture where the white texels represent the data’s position allocated in the scattered original texture.

This first step is quite similar as the mipmapping generation from the GPUs, but instead of using a median reduction for the generation of each texture, each pixel of the parent level of the pyramid is calculated using the sum of the four pixels contiguous of the leaf level texture. At the end of the reduction process a pyramid of images is created where each level represents the sum of the pixels from the leaf levels. The top level of the pyramid, a 1×1 texture represent the sum of all the data present in the base texture.

Once the reduction process is completed and the pyramid is generated, the traversal is required to compact the data. The total data to be allocated in the compacted texture is defined in the top level of the histo-pyramid (the 1×1 texture). Each one of the output elements of the resulting stream has to traverse the pyramid to finally obtain a UV position (texture coordinates). This texture position relates the output stream with a data value from the scattered texture.

The traversal is done using a key index for each texel from the output texture, meaning that each pixel would be associated with an incremental unique 1D index that would be used to compare against the different levels of the pyramid. Also each texel from the pyramid would be re interpreted as a key interval value.

This second step starts at the first level of the pyramid, checking if the key from the texel to evaluate is inside the total range of the data to compact. To do so the key value has to be lower than the total sum allocated at the top level of the pyramid. If the key is lower than the total range the traversal goes to the second level of the histopyramid, in the 2×2 texture, where each texel value can be represented as “a, b, c, d” and the following ranges can be generated:

A = [0, a), B = [a, a+ b), C = [a + b, a + b + c), D = [a + b + c, a + b + c + d)

Each traversal step evaluates where in the ranges the key index of the output stream can be allocated (or falls into), that range gives a 2D texture position for each level traversed. On each new level a new set of ranges have to be generated and a new UV position is extracted based on the range where the key falls into. It’s important to specify that the ranges are generated in a “Z” order from the four adjacent pixels.

In the final phase, the base texture is used and the ranges would be of a unit difference, meaning that this step gives a UV coordinate that can be used to read the corresponding data in the scattered texture and place that data in the output telex from the stream.

The previous image shows the traversal in two levels for the output texel with a key value of “4”, the sum of all the values is represented as the 1×1 texture which value is “9”, the first level represented with a texture of 2×2 has four values that defines the following ranges:

A = [0, 3), B = [3, 5), C = [5, 8), D = [8, 9)

The key evaluated falls in the B range, hence for the next level the algorithm should use the four values allocated in the [0, 1] position, in this level the new ranges are defined as:

A = [3, 3),  B = [3, 4), C = [4, 5), D = [5, 5)

In this case the key falls into the [2, 1] position of the base texture, since this texture is a 4×4 texture the corresponding UV coordinate would be [0.5, 0.25]. This coordinate would be used to read the data from that position in the scattered texture and write the read value inside the position relative to the key used to traverse the pyramid.

The implementation in webGL will be done using plain javascript, this allows to transfer the code to whatever library the user would like to use (three.js, regl, babylon.js). This tutorial also assumes a certain level of understanding of basic webGL operations like shaders compilation, quad generations and the use of the different tests (scissor, depth, stencil) that can be applied in the framebuffers. If the user feel the need to revisit these concepts the following URL (http://learningwebgl.com/blog/?page_id=1217) is a good place to visit. First the required shaders will be discussed based on the algorithm previously explained, and then the javascript implementation will be addressed based on the programs generated.

Reduction Step:

Since all the important operations will be done in shaders, there are two main programs that have to be created, a reduction program and a traversal program. All the programs will be performed over a quad, meaning that the vertex shader is only responsible to generate the vertex positions of the quad and to provide a UV coordinate for each fragment in the fragment shader.

The vertex shader is defined with the following code:

```//attribute for each vertex, in this case a simple index key. attribute float aVI; //the output UV coordinate for each fragment varying vec2 vText; void main(void) {      float vI = aVI + 1.; //Generate the position of the vertex based on its ID      vec2 xy = vec2(mod(vI, 2.) == 0. ? 1. : -1., -1. + 2. * step(-vI, -2.1)); //Calculate the UV coordinate of the vertex to use for interpolation      vText = xy * 0.5 + 0.5; //Define the position of the vertex.      gl_Position = vec4(xy, 0.0, 1.0); }```

The vertex shader evaluates all the data for the vertex on execution time, but the user can also provide attributes with positions and UV coordinates for each vertex if desired.

The fragment shader works similar to a mipmapping shader, hence four texels from the input texture have to be read and the resulting color would be the sum of those four values. This can be done with the following shader.

```precision mediump float; precision mediump sampler2D; //input texture (previous level used to evaluate the new level) //it must be a float point texture. uniform sampler2D uPreviousLevel; // 1. / size of the previous level texture. uniform float uSize; //uv coordinate of each fragment obtained from vertex shader. varying vec2 vText; void main(void) { //This value is used to evaluate the adjacent pixel from the evaluated one     float k = 0.5 * uSize; //position of the texel to read from the input texture     vec2 position = floor(vText / uSize) * uSize; //base texel     float a = texture2D(uPreviousLevel,  position + vec2(0., 0.)).r; //texel to the right     float b = texture2D(uPreviousLevel,  position + vec2(k, 0.)).r; //texel on top     float c = texture2D(uPreviousLevel,  position + vec2(0., k)).r; //texel on the diagonal (top & right)     float d = texture2D(uPreviousLevel,  position + vec2(k, k)).r; //The frag color is the sum of the values read previouly     gl_FragColor = vec4(a + b + c + d); }```

The fragment shader reads the adjacent information from the evaluated fragment and sums that data to obtain the final fragment color. Since this shader is applied to all the different levels of the pyramid during the reduction process, it requires to know the actual size of the texture evaluated, this is provided in the uSize uniform being more conveniently defined as the inverse of the size value.

For the traversal process there are two ways of addressing the problem, writing the traversal in the vertex shader or writing it in the fragment shader. Both ways require to know how many vertices of fragments would be required to traverse the pyramid, so this should be controlled using javascript instructions. Since a vertex shader for a quad generation is already defined this new program will make the traversal in the fragment shader using the following code.

```precision lowp float; precision lowp sampler2D; //pyramid texture that holds all the levels previously generated in the reduction process uniform sampler2D uPyramid; //Original texture with the scattered data. uniform sampler2D uBase; //1x1 texture containing the total amount of data to compress. uniform sampler2D uTotalData; //UV coordinate for each fragment. varying vec2 vText; //Uniforms used for the execution on each level, defined in Javascript. uniform vec4 uTexture3D; void main(void) {     float vVI = dot(floor(uTexture3D.x * vText), vec2(1., uTexture3D.x)); //This avoids to evaluate fragments that generate keys higher than the total //amount of data to compress.     if (vVI > texture2D(uTotalData, vec2(0.5)).r) discard;     vec2 relativePosition = vec2(0.);     vec2 position = vec2(0.);     float start = 0.;     float end = 0.;     vec4 starts = vec4(0.);   vec3 ends = vec3(0.);     float offset = uTexture3D.x - 2.;     float div0 = 1. / uTexture3D.y;     float k = 1. / uTexture3D.x;     float resta = 2.;     vec2 pos1 = vec2(0.);     vec2 pos2 = vec2(0.);     vec2 pos3 = vec2(0.);     vec2 pos4 = vec2(0.);     vec3 mask = vec3(0.);     vec4 m = vec4(0.); //The traversal has to be done in all the levels of the pyramid, since the for loop requires //a constant value a maximum of 14 levels can be evaluated. This allows to make a stream //compaction for a 4096x4096 texture.     for(int i = 1; i < 14; i++) { //if the level is higher than the total levels that the pyramid has it breaks, this is required for //pyramids with levels lower than 14.         if(float(i) >= uTexture3D.w) break; //The offset sets the texture reading to the position of the level to use in the pyramid texture.         offset -= resta;         resta *= 2.; //Here the shader evaluates the three initial adjacent positions and calculates the //ranges.         relativePosition = position + k * vec2(offset, 0.);         end = start + texture2D(uPyramid, relativePosition).r;         pos1 = relativePosition;         starts.x = start;         ends.x = end;         pos2 = relativePosition + vec2(k, 0.);         starts.y = ends.x;         ends.y = ends.x + texture2D(uPyramid, pos2).r;         pos3 = relativePosition + vec2(0., k);         starts.z = ends.y;         ends.z = ends.y + texture2D(uPyramid, pos3).r; //The fourth range doesn’t require to read the fourth pixel, if the key is not in the first three ranges //it should be allocated in the fourth range         pos4 = relativePosition + vec2(k, k);         starts.w = ends.z;         mask = vec3(greaterThanEqual(vec3(vVI), starts.rgb)) * vec3(lessThan(vec3(vVI), ends));         m = vec4(mask, 1. - length(mask)); //the “m” value is a vector of four values which only one of those would be different than //zero, to avoid if branching the four ranges are evaluated and the casted value, from //boolean to float are used to define the new relative position (uv for the next level).         relativePosition = m.x * pos1 + m.y * pos2 + m.z * pos3 + m.w * pos4;         start = dot(m, starts); //Sets the position for the next level traversal.   position = 2. * (relativePosition - k * vec2(offset, 0.));     } //The same traversal operation is done with the base texture instead of the pyramid texture.     end = start + texture2D(uBase, position).r;     pos1 = position;     starts.x= start;     ends.x = end;     pos2 = position + vec2(k, 0.);     starts.y = ends.x;     ends.y = ends.x + texture2D(uBase, pos2).r;     pos3 = position + vec2(0., k);     starts.z = ends.y;     ends.z = ends.y + texture2D(uBase, pos3).r;     pos4 = position + vec2(k, k);     starts.w = ends.z;     mask = vec3(greaterThanEqual(vec3(vVI), starts.rgb)) * vec3(lessThan(vec3(vVI), ends));     m = vec4(mask, 1. - length(mask));     position = m.x * pos1 + m.y * pos2 + m.z * pos3 + m.w * pos4;     vec2 index = position * uTexture3D.x; //The color for this example is the 3D position relative to the 2D position of the UV obtained. //This could also be defined as the data from the scattered texture that is allocated in the //UV obtained (position). //This could be done using gl_FragColor = texture2D(scatteredTexture, position);     gl_FragColor = vec4(div0 * vec3(mod(index.y, uTexture3D.y), uTexture3D.z * floor(index.y * div0) + floor(index.x * div0), mod(index.x, uTexture3D.y)), texture2D(uBase, position).a); }```

The previous traversal shader requires 3 important inputs that are defined below:

– uTexture3D: this vector has four values that represent the following data:
* uTexture3D.x = size (length) of the base texture.
* uTexture3D.y = size of the 3d texture simulated in the 2d texture (ex. a 128x128x128 3d texture can be allocated in a 2048×2048 texture, the corresponding data for this value would be 128).
* uTexture3D.z = number of buckets of the 3D texture allocated on one side of the 2d texture. (based on the previous example the 2048 texture side can allocate up to 16 buckets of 128 x 128 layers).
* uTexture3D.w = quantity of levels generated in the reduction process.

Notice that this data is mostly used to evaluate the 3D position based on the UV position obtained in the final traversal (this traversal is adjusted for a marching cubes application), so on a generic case only two values would be required for this shader, the size of the base texture and the amount of levels generated in the reduction passes.

– uBase: this would be the base texture generated from the scattered data.

– uTotalData: this is the 1×1 level texture that defines the total data to compress.

– UPyramid: a single 2D texture that holds all the different levels generated from the base texture, since all the levels are defined as half of the size from the previous level. This texture is generated in javascript copying the data from each level, when it’s generated, to a single texture using the copyTexSubImage command.

Having a single pyramid texture avoids the use of different uniforms for each level, which also limits the shader to a defined amount of maximum levels based on those uniforms. The image below shows an example of a Pyramid texture generated in the reduction process.

Another advantage of this single image is that it allows to debug issues in the reduction process, since it’s straightforward to visualize any issues that could arise when reductions are generated.

The traversal process is done using a for loop which is limited by a constant upper bound since webGL only allows to use constants to define the loop, this shaders is limited to up to 14 traversal loops, hence textures up to 4096×4096 can be used in this shader. To avoid the use of if branching the traversal evaluates the first three ranges using vector masking and logical comparisons, the final range comparison can be evaluated as the difference between the total range of the whole loop and the first three partial ranges evaluated before, that avoids to evaluate another texture read.

Since the quantity of loops are the same one for each thread is the same, and there are no “if” branching inside each iteration, all the threads will execute the same amount of instruction ensuring that the shader won’t have any divergence.

The use of a unique pyramid texture requires to use an offset position that has to be updated for each new loop, this offset has to be included inside the UV coordinate obtained to traverse the next level. Since the base texture is not included in the pyramid texture, the last traversal is done outside the loop, this final step uses the base texture to define the final UV positions required.

The traversal is done in the fragment shader where each fragment has an unique 1D key defined by it’s position on the texture, the key generated is used for the traversal. If the user would like to write the parser in the vertex shader it’s important to define two attributes for the vertices, the first one would be the position of the vertex in the final texture and the second attribute would be the key for that vertex to use in the traversal process.

One important thing of this algorithm is that it doesn’t require scattering (or repositioning) of the data (vertices), hence all the final positions of the data will be known in advance. In that regard using the fragment shader for the traversal has the advantage of avoiding positioning the vertices with the vertex shader.

On a final note the user can observe that if the traversal shader is applied to the whole texture, there will be fragments that will generate keys outside of the total amount of data to compress To avoid this issue the level0 pyramid texture (1×1 texture) is provided as an uniform  using its value (the total data to compress) to compare against the key generated by the fragment’s position. If the key is higher than the value from the texture the fragment can be discarded. This could also be done with the final level allocated in the pyramid texture since the 1×1 texture is also copied into the pyramid.

Packing and unpacking of floating point values.

The last two programs are required to read floating point values in webGL using the gl.readPixel() command. This command only accepts 8 bit RGBA values to be read, hence the sum represented in the 1×1 texture can only be read packing the data from a floating point value to a RGBA 8 bit value that represents the 32 bits of the original data. To do so the following fragment shader is run over a 1×1 quad.

``` precision highp float; precision highp sampler2D; uniform sampler2D uPyT; //Max cells that could be active in a texture uniform float uMax; varying vec2 vTextureUV; void main(void) {      vec4 enc = fract(vec4(1., 255., 65025., 160581375.) * texture2D(uPyT, vec2(0.)).r * uMax);      enc -= enc.yzww * vec4(vec3(0.00392157), 0.);      gl_FragColor = enc; } ```
The previous packing shader is based on the information found at http://aras-p.info/blog/2009/07/30/encoding-floats-to-rgba-the-final/ it also explains how to decode the RGBA value to a floating point meaningful information in Javascript.

Javascript implementation:

For the javascript implementation two helper functions are written to generate the corresponding required textures and frame buffers shown below

``` function createTexture(textureSize, format, maxFilter, minFilter, type, data) {     var texture = gl.createTexture();     texture.size = textureSize;     gl.bindTexture(gl.TEXTURE_2D, texture);     gl.texImage2D(gl.TEXTURE_2D, 0, format, textureSize, textureSize, 0, format, type, data);     gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, maxFilter);     gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, minFilter);     gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);     gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);    return texture; } function createFramebuffer(texture) {     var gl = gl;     var frameData = gl.createFramebuffer();     gl.bindFramebuffer(gl.FRAMEBUFFER, frameData);     frameData.size = texture.size;     var renderbuffer = gl.createRenderbuffer();     gl.bindRenderbuffer(gl.RENDERBUFFER, renderbuffer);     gl.renderbufferStorage(gl.RENDERBUFFER, gl.DEPTH_COMPONENT16, texture.size, texture.size);     gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, texture, 0);  gl.framebufferRenderbuffer(gl.FRAMEBUFFER, gl.DEPTH_ATTACHMENT, gl.RENDERBUFFER, renderbuffer);     return frameData; } ```

This two functions are pretty straightforward, they just generate a texture with a clamp to edge filtering, allowing the user to define if the texture would be a floating point texture or an unsigned_byte, also if there’s data provided it’s passed to the texture during the creation of it. The second function allows to generate a frame buffer based on a texture provided, it has a color attachment for the texture and a depth attachment for z buffering if required.

With the previous function a set of textures are generated for the pyramid using the following code:

```        tLevels = [];        fbPyramid = [];    tPyramid = createTexture(baseTextureSize, gl.RGBA, gl.NEAREST, gl.NEAREST, gl.float, true);     for (var i = 0; i < Math.ceil(Math.log(baseTextureSize) / Math.log(2)); i++) {         tLevels.push(createTexture(Math.pow(2, i), gl.RGBA, gl.NEAREST, gl.NEAREST, gl.float, true));         fbPyramid.push(createFramebuffer(tLevels[i]));     } ```

To read the data from the top level of the pyramid the following texture (unsigned_byte is used for this case to read it in Javascript using gl.readPixel)

```     tRead = createTexture(1., gl.RGBA, gl.NEAREST, gl.NEAREST, gl.UNSIGNED_BYTE, true);     fbRead = createFramebuffer(tRead); ```

Notice that the total levels to generate is based on equation (1), since the logarithmic function in Javascript is 10 based, the resulting value has to be divided by the constant “Math.log(2)” to change it to base 2. The baseTextureSize variable represents the length of the original scattered texture. The textures and frame buffers are saved in two arrays to be used in the compaction function.

tRead and fbRead makes reference to the texture and frambuffer used to make the packing of the floating point value present at the top level of the pyramid to be transferred to this texture using the RGBA 8bit encoding.

The compaction function is where the shaders will be executed, to so do three programs had to be generated beforehand using the shaders explained in the previous sections. the javascript application saves these programs using an array using string keys to invoke the corresponding program. The three required programs are:

“GENERATE_PYRAMID” program; this is generated using the quad shader and the reduction shader, it requires an ID attribute for the four vertex that generate the quad. It also requires to declare the needed uniforms for the reduction step, mainly the sampler used for the reduction, and the size of each sampler

“PACK_DATA” program: to compile this program the quad vertex shader is used and the packing shader is the corresponding fragment shader. It requires the same attribute than the previous program, and the uniforms are the 1×1 texture sampler input and a value representing the max value available to encode with the shader.

“PARSE_PYRAMID”: this last program uses the same vertex shader than the previous two ones and the fragment shader is the one that defines the traversal. The uniforms used are defined in the previous explanation for this shader.

To use the compaction function the three programs have to be generated, compiled and have their uniforms and attribute declarations ready. Pointers to the shaders uniforms are saved in the same array position that holds each program as a dynamic attribute, these pointers are used to send the data to the GPU on each invocation of the compaction function.

Compaction function

Once the textures are generated a compaction function is written to be used every time the application requires it, compaction can be done on every time the original data changes, hence it’s a good idea to have it on a helper function to call it whenever it’s needed. The following code shows the javascript calls to run the shaders.

``` function createPyramid(initialTexture, buffer, readCells, expectedCells) {    var totalCells = expectedCells;     //This part set the levels     var levels = Math.ceil(Math.log(initialTexture.size) / Math.log(2));     gl.useProgram(program["GENERATE_PYRAMID"]);     bindAttribBuffer(program["GENERATE_PYRAMID"].vertexIndex, this.bufferIndex, 1);     var offset = 0;     for (var i = 0; i < levels; i++) {         gl.bindFramebuffer(gl.FRAMEBUFFER, this.fbPyramid[levels - i - 1]);         var size = Math.pow(2, levels - 1 - i);         gl.viewport(0, 0, size, size);         gl.clear(gl.COLOR_BUFFER_BIT);         gl.uniform1f(program["GENERATE_PYRAMID"].size, Math.pow(2, i + 1) / initialTexture.size);         bindTexture(program["GENERATE_PYRAMID"].potentialTexture, i == 0 ? initialTexture : this.tLevels[levels - i], 0);         gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);         gl.bindTexture(gl.TEXTURE_2D, tPyramid);         gl.copyTexSubImage2D(gl.TEXTURE_2D, 0, offset, 0, 0, 0, size, size);         gl.bindTexture(gl.TEXTURE_2D, null);         offset += size;     }     gl.disableVertexAttribArray(program["GENERATE_PYRAMID"].vertexIndex);     //This part reads the total active cells     if (readCells) {         gl.useProgram(program["PACK_DATA"]);         bindAttribBuffer(program["PACK_DATA"].vertexIndex, this.bufferIndex, 1);         gl.uniform1f(program["PACK_DATA"].maxData, 1. / Math.pow(initialTexture.size, 2));         bindTexture(program["PACK_DATA"].readPixelTexture, this.tLevels[0], 0);         gl.bindFramebuffer(gl.FRAMEBUFFER, this.fbRead);         gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);         var pixels = new Uint8Array(4);         gl.readPixels(0, 0, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, pixels);         totalCells = Math.round(16448 * (pixels[0] + pixels[1] / 255 + pixels[2] / 65025 + pixels[3] / 160581375));         console.log("counted cells: " + totalCells);     }     gl.disableVertexAttribArray(program["PACK_DATA"].vertexIndex);     //This part parses the pyramid for compaction     gl.useProgram(program["PARSE_PYRAMID"]);     bindAttribBuffer(program["PARSE_PYRAMID"].vertexIndex, this.bufferIndex, 1);     bindTexture(program["PARSE_PYRAMID"].pyramid, this.tPyramid, 0);     bindTexture(program["PARSE_PYRAMID"].base, initialTexture, 1);     bindTexture(program["PARSE_PYRAMID"].totalData, tLevels[0], 2);     gl.uniform4f(program["PARSE_PYRAMID"].gridPartitioning, this.texture3DSize, this.gridPartitions, this.sideBuckets, levels);     gl.bindFramebuffer(gl.FRAMEBUFFER, buffer);     gl.viewport(0, 0, baseTextureSize, baseTextureSize);     gl.clear(gl.COLOR_BUFFER_BIT);     gl.enable(gl.SCISSOR_TEST);     var height = Math.ceil(totalCells / baseTextureSize);     gl.scissor(0, 0, baseTextureSize, height);     gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);     gl.disable(gl.SCISSOR_TEST);     gl.disableVertexAttribArray(program["PARSE_PYRAMID"].vertexIndex);     return totalCells; } ```

The previous function required the initial texture to be compacted, the frame buffer where the result will be saved and an expected amount of the data to process if the user doesn’t want to use the gl.readPixels method to accelerate the process. The function itself reflects the two steps of the algorithm (reduction and compaction) in two different program uses, in between the user can decide to read the amount of data in the original texture using the readCells boolean variable, defining a third optional step (in between the reduction and compaction steps).

The first step, reduction, invokes a set of quad draw calls on the different pyramid textures using the reduction program, for each iteration a new size is calculated and the corresponding uniforms are send. After the new level is generated the end result is copied to the pyramid texture using the copySubTexImage method, using an offset to separate each level of the pyramid in the same texture.

This first step doesn’t require to use a depth test, but since the function can be run on each frame is important to clear the color buffer for each level. Notice that the program used for this step is the “GENERATE_PYRAMID” program.

Once the pyramid is generated an optional second step is used to determine the amount of data allocated in the scattered texture, to do so the “PACK_DATA” program is used, and the unpacking of the data is done in Javascript as shown in this article http://aras-p.info/blog/2009/07/30/encoding-floats-to-rgba-the-final/

Reading this data in javascript represents the bottleneck of this function, since this requires to synchronize the CPU with the GPU respond, meaning that the application can suffer an idle state while waiting for this information. This means that if the user opts to read the data on every frame the total framerate of the application can be hurt just trying to have a precise amount of data to evaluate after the compaction process.

To overcome this limitation an expected amount of data is defined as part of the arguments of the function to supply a way to avoid the use of the gl.readPixels method. The idea is that since the GPU can read the data in the traversal shader any fragment that doesn’t require to be traversed will be discarded, and for future steps the user can safely assume that the data processed is below the expectance defined by the user.

The traversal process invokes the “PARSE_PYRAMID” program and it generates a quad draw using the scissor test to avoid any operations on fragments that are outside of the range of data to compact. To calculate the region of fragments for the scissor an area defined as the size of the texture and the ratio between the totalCells divided by the size of the texture is used, if the total cells are not read using the second step then the expected amount of cells is used. With the scissor test and the discard method the traversal function is accelerated.

The function returns the total cells read in the second step, or the expected cells defined by the user (the value gets unmodified), it also saves the compacted data in the provided frame buffers ready to be used for subsequent GPGPU operations.

The previous function used to helper functions to bind textures to the corresponding program and to bind attributes too, these functions are defined below.

``` function bindAttribBuffer(attribLocation, buffer, bufferDataSize) {     gl.enableVertexAttribArray(attribLocation);     gl.bindBuffer(gl.ARRAY_BUFFER, buffer);     gl.vertexAttribPointer(attribLocation, bufferDataSize, gl.FLOAT, false, 0, 0); } function bindTexture(programData, texture, texturePos) {     var textures = [gl.TEXTURE0, gl.TEXTURE1, gl.TEXTURE2, gl.TEXTURE3, gl.TEXTURE4, gl.TEXTURE5, gl.TEXTURE6, gl.TEXTURE7, gl.TEXTURE8, gl.TEXTURE9, gl.TEXTURE10, gl.TEXTURE11, gl.TEXTURE12, gl.TEXTURE13, gl.TEXTURE14];     gl.activeTexture(textures[texturePos]);     gl.bindTexture(gl.TEXTURE_2D, texture);     gl.uniform1i(programData, texturePos); } ```

The previous shaders are a straightforward implementation from the algorithm explained before, but there is a quite nice improvement that can be done in the traversal shader to jump from three texture reads on each level to only one; this is done using vec4 histopyramids. The following image illustrates the modifications to the algorithm in order to implement this variant.

These kind of pyramids make the reduction in a different manner, since the GPU allows to work with four channels per pixel the reduction shader can be rewritten in order to save the partial sums of the four pixels read instead of the total sum done before. This can be seen in the following fragment shader.

``` precision mediump float; precision mediump sampler2D; uniform sampler2D uPreviousLevel; uniform float uSize; varying vec2 vText; void main(void) {     float k = 0.5 * uSize;     vec2 position = floor(vText / uSize) * uSize;     float a = texture2D(uPreviousLevel,  position + vec2(0., 0.)).r;     float b = texture2D(uPreviousLevel,  position + vec2(k, 0.)).r;     float c = texture2D(uPreviousLevel,  position + vec2(0., k)).r;     float d = texture2D(uPreviousLevel,  position + vec2(k, k)).r;     gl_FragColor.a = a;     gl_FragColor.b = a + b;     gl_FragColor.g = gl_FragColor.b + c;     gl_FragColor.r = gl_FragColor.g + d; } ```

In the previous shader the program saves the partial sum on each channel of the final color, this allows to read the ranges using only the parent pixel, which means that the traversal only requires to allocate one texel from the texture level to traverse. Also the ranges can be evaluated faster since once the end of each range is defined creating the start values of the ranges is trivial. The following shader shows the implementation of the traversal with only one texture read.

``` precision lowp float; precision lowp sampler2D; uniform sampler2D uPyramid; uniform sampler2D uBase; varying vec2 vText; uniform vec4 uTexture3D; void main(void) {     float start = 0.;     vec4 starts = vec4(0.);     vec4 ends = vec4(0.);     float offset = uTexture3D.x - 2.;     float div0 = 1. / uTexture3D.y;     float k = 1. / uTexture3D.x;     float diff = 2.;     float vVI = dot(floor(uTexture3D.x * vText), vec2(1., uTexture3D.x));     vec4 m = vec4(0.);     vec2 relativePosition = k * vec2(offset, 0.);     vec2 position = vec2(0.);     vec4 partialSums = texture2D(uPyramid, relativePosition);     for(int i = 1; i < 13; i++) {         if(float(i) >= uTexture3D.w) break;         offset -= diff;         diff *= 2.;         relativePosition = position + k * vec2(offset, 0.);         vec2 pos1 = relativePosition;         vec2 pos2 = relativePosition + vec2(k, 0.);         vec2 pos3 = relativePosition + vec2(0., k);         vec2 pos4 = relativePosition + vec2(k, k);         ends = partialSums.wzyx + vec4(start);         starts = vec4(start, ends.xyz);         m = vec4(greaterThanEqual(vec4(vVI), starts)) * vec4(lessThan(vec4(vVI), ends));         relativePosition = m.x * pos1 + m.y * pos2 + m.z * pos3 + m.w * pos4;         start = dot(m, starts);         position = 2. * (relativePosition - k * vec2(offset, 0.));         partialSums = texture2D(uPyramid, relativePosition);     }     vec2 pos1 = position;     vec2 pos2 = position + vec2(k, 0.);     vec2 pos3 = position + vec2(0., k);     vec2 pos4 = position + vec2(k, k);     ends = partialSums.wzyx + vec4(start);     starts = vec4(start, ends.xyz);     m = vec4(greaterThanEqual(vec4(vVI), starts)) * vec4(lessThan(vec4(vVI), ends));     position = m.x * pos1 + m.y * pos2 + m.z * pos3 + m.w * pos4;     vec2 index = position * uTexture3D.x;     gl_FragColor = vec4(div0 * vec3(mod(index.y, uTexture3D.y), uTexture3D.z * floor(index.y * div0) + floor(index.x * div0), mod(index.x, uTexture3D.y)), texture2D(uBase, position).a); } ```

Notice that the traversal shader doesn’t change too much, the texture fetchs and the ranges evaluation are based on a new variable, “partialSums”. This variable reads the partial sums from the parent level so there’s no need to make multiple texture fetching. Finally these two shaders can replace the previous ones with the same javascript implementation.

In the next post a marching cubes implementation for WebGL will be explained using this compaction method.

Bibliography:

http://heim.ifi.uio.no/~erikd/pdf/hpmarcher_draft.pdf (original paper from Gernot Ziegler et al)

http://folk.uio.no/erikd/histo/hpmarchertalk.pdf (pdf with explanations Gernot Ziegler et al)

### Fluid Simulation with SPH (Smoothed particle hydrodynamics) in WebGL

29, Aug 2011/Categories 3D, webGL/No Comments

VIEW THE EXAMPLE

The next video is recorded on real time with an AMD Radeon 6970M HD (2Gb), feel free to watch the video if the simulation does not run in your computer.

When I was reading information about shadows particles for the previous posts, I saw one demo that really called my attention, this demo used one technique for fluids simulations that I really wanted to see how it worked. If you have seen the demo you will probably know this blog, here I found that the fluids where made using SPH (Smoothed particles hidrodinamics) for the simulation, so now I had a place to start searching for information. What I found was some good links that explain very well the maths behind the system.

If you want to understand how the simulation works you should read the paper from Matthias Müller, it explains the basics of fluid simulations using particles to calculate the properties of the fluid. You can also read this paper from Harada, this last paper is best suited for a GPU implementation.

There are two more links you should reed, Micky Kelager also explains the SPH, but the most important here is that you find how to deal with the collisions in the simulation using distance fields and he also defines some ways to calculate some important coeficients for the simulation. Another good link is the post from Iñigo Quilez, he defines some useful analytics distance fields that we can use to create collisions objects in the simulation.

Finally the GPU GEMS 3 has a good chapter for rigid body simulations that explains very well the creation of uniforms grids to find the neighbors of one.

The implementation

The process have some few steps explained in the next image, the most important of then are the neighborhood search and the velocity update. I do not explain too much in here because the previous links explain much better than me the implementation of SPH in a GPU.

Neighborhood search & Velocity update:

One of the most important thins in the simulation is to find the particles around the fragment particle, because they are responsible of the forces that will be applied to the simulation. We implemented a uniform grid based on the paper from Harada. This grid is a 3D voxel partition of the space that allows to keep track of up to four particles located in the voxel. A closer look of the algorithm is presented in the next image (taken from GPU GEMS 3).

The velocity update needs that the collision information among the colliding objects and the particles so you have to implement it in this step, what we did is to create spring forces based on the paper of Kelager using distance fields, one signed box, defined in the post from Iñigo.

The coeficients

Using the SPH to simulate fluids require to define some coeficients that alter he properties of the fluid. These are:

gridSize: Change the quantity of the voxels used for the uniform grid, if you make a bigger gridSize you will have more definition for the simulation, but the overall performance will be hurt. If you change this parameter also change the quantity of particles for the simulation.

volume: The volume defines the total space that the particles will use in a rest state, it is also used to define the mass for the particle, so this means that if you change the total rest volume, you will be changing the also the mass of the particle.

pressureK: this variable change the strength of the pressure from each particle, this variable is important because it avoids particles from collapsing, on the other hand, if you make it too strong, the particle will repel each other too much and the simulation will be unstable.

viscosity: the viscosity is a damping factor among the velocities of the particles, it keep particles moving smoothly relative to each other.

maxSearchRatio: this is perhaps the most delicate variable in the simulation, it defines the maximum kernel smoothing search ratio for the weigh kernells. If this value is too slow, there will be no particles to interact for the fragment particle, so there will be no simulation, but in the other hand, if the ratio is to big, the forces wil not be smoothed properly, and there will be some instabilities in the simulation.

restitution: this value is responsible for the strength of the collision forces, this variable acts as a spring force for the boundaries, this way you can control how the particles react in a collision (in a elastic or inelastic way).

dt: this is the integration time for the simulation, since we are using a simple integration equiation “At+1 = At + dt * n”, values should be slow to avoid errors produced for big changes in big time steps.

quality: this variable defines the quantity of particles used for each voxel to calculate the densities and forces, changing this value speed up the process but the simulation suffers from lack of information, so the visual quality is somehow hurt.

You can also see the position, velocity and density field applied as shader colors in the particles changing the shader option in the control panel. Finally you can swith from point to billboard sphere rendering method. All these variables can be updated with the control panel, once you update them you only have to restart the simulation.

Finally here are some images of our first breaking dam simulation.

Simulation showing the densities of each particle.

Simulation showing the position field of the particles.

Simulation showing the velocity field of the particles.

We would like to thanks AlteredQ for the help of testing the simulator, sadly we couldn´t find a solution for some configurations, so we have made a video if you can not play the simulation correctly. The simulations works fine with these configurations:

– Mac OS (Lion) with AMD Radeom HD 6970M 2Gb using Google Chrome and Firefox.
– Mac OS (Snow Leopard) with AMD Radeom HD 6750M 1Gb using Google Chrome and Firefox.
– Mac Os (Leopard) with Nvidia 8800 GT 512Mb using Firefox.
– Max Os (Snow Leopard) with Nvidia 9600GT 512Mb using Google Chrome and Firefox.
– Windows XP with Nvidia 8800GT 512Mb using Google Chrome.
– Windows 7 with AMD Radeom HD 6750M 1Gb using Google Chrome and Firefox.
– Windows 7 with Nvidia 9600GT 512Mb using Google Chrome and Firefox.

Feel free to try it, and if it´s broken please let us know your configuration, and what exactly you have seen while you played the simulation. It is our first version of the implementation, so we hope to make it faster using a more convenient neighborhood search. We hope that future posts will go on two branches, SPH optimizations and particles rendering using marching cubes.

### Curl Noise + Volume Shadow particles

05, Aug 2011/Categories 3D, webGL/No Comments

When we try to code anything in our studio, the first thing we do is to search for information about the subject in order to find papers and techniques that could help us to develop what we want. The bad thing about it is that you loose a lot of time in Google trying to find that specific paper that reveals the “secrets” of  the math behind the code.

But some time thing are even worst, because we do not even know what we are searching for, and this is when a good starting link is a good “gift”. In our case finding this blog has been a perfect starting point to find GPU techniques and papers related to those techniques.

The work made by these guys is amazing, but more important is that they talk about all the methods used in their works, so you can start searching by your own if you like to do something like. In our case we wanted to make some particles animations like the ones found here, and they have the perfect starting point in this post.

Two main things we would have to deal with in order to get a simple particle animation, these are:

• Create a fluid like animation for the particles.
• Shade the particles to give some volumetric effect.

Curl Noise:

In the post from DirectToVideo we found two lines that called our attention, they talked about a technique to fake velocity fields for a procedural fluid flow called “Curl Noise” created by Robert Bridson, and this is how we found this paper. In this paper the author defines a divergence free noise suitable for velocities simulations.

Using divergence free potentials is important for a flow simulation because it avoids sinks in the flow (try to think that with this you will not get all the particles going to a single final position, like a hole in space). Implementing is quite simple because you only have to calculate the differentials of the potentials (noises) used to create your field.

To create the potentials used for the velocities we used the equations from the next image. The only assumption that we made is that every scalar component from the potential vector field is a 2D function that would satisfy the rotor operator.

The second group of equations explains our assumption, we defined each axis potential as a function of the other two axes in order to use 2D textures for the tree potentials, this way we do not need 3d volume textures for each potential. With this assumption we ended with tree 2d textures that defined our potential so the velocity would be defined as the rotor operator applied to the potentials field as you can see in the equations. To calculate the partial derivates we only had to calculate the finite difference the given textures, and the good thing is that since we have our point defined in a texture space we could use this position to get the potential value for the tree defined potentials.

Now that we know how to create the velocity field we needed tree good noise textures to work with, this is an issue because if we load some external textures our development would be tied to those images. So we decided to create the noise textures by ourselves.

Our first approach was to make textures using perlin noise functions, but we then found that they are somehow expensive to calculate so we opted for a more practical solution, simplex noise, and the best thing is that we found one webGL  hack here. With the noise functions solved we only needed to make it turbulent, so we found that we could add higher frequencies to get a turbulent noise, you can read a full explanation here.

This is how we ended creating our own textures that we could change in realtime (not in every step), and would allow us to find the best noise suitable for our needs. In the next image you can see the same base noise with a without turbulence.

Saving the potentials as textures add a new problem to solve, textures in webGL only saves 8 bits per channel, so if you naively try to save the potential value in one channel you will only save a 256 units space value. This means that if your noise is defined in the rank [0-1] you will find jumps of 1/256 in your noise. To solve this you can pack your one dimentional value in the RGB components of the textures, this would give you a resolution of 2**24 for a given value.

We use two functions to pack and unpack floating point values in the textures, these are defined like this:

vec4 pack(float value) {
return vec4(floor(value * 256.0)/255.0, floor(fract(value * 256.0) * 256.0)/255.0 , floor(fract(value * 65536.0) * 256.0)/255.0, 1.0);
}

float unpack(vec4 pos) {
return pos[0] * 255.0 / 256.0 + pos[1] * 255.0 / 65536.0 + pos[2] * 255.0 / 16777216.0;
}

The first function wraps one value in RGB components, so you can save one value with very high resolution, the second function unwrap the RGB components to the original value. Note that the function do not save negative values, so is up to the programmer to define how the unwrapped value is going to be treated. The bad news about it is that you need a whole texture to save one value, so if you want to save one 3D position of one texture you will need three textures for it.

The last paragraph defines the main bottleneck of the application, since we have to save the particles positions to run the simulation we need to traverse all the particles three times to save each axe position in a different texture. In the next image you can see one of the axis aligned position textures used.

To shade the particles we applied the same technique from our previous posts (http://www.miaumiau.cat/2011/06/shadow-particles-part-ii-optimizations/), but we modified it to accept a variable bucket quantity so we could adapt the final resolution of the shadows in the volume. Since we can set many more slices we saw that the interpolation used between two slices was not necessary, so we took it away to get some more speed.

Now that we can change the buckets count we had to look for the best compromise among the buckets size, the quality of the shadows volume, and the quality of the floors shadow, for a 2048 texture size we found that the initial 8×8 buckets count is a very good compromise for the given parameters. With this in mind if you want to implement good quality shadows in your volume we recommend 16X16 buckets slices, but the final shadow for the floor will be very slow. In the next image you can see one test of the shadows with and without color.

The process:

With the curl noise, the volume shadows and the floating point issues solved we ended with an algorithm of seven steps, this process requires five render targets changes and we need to traverse all the particles five times. Three times to save the particle´s current position, one for the shadows and one final to paint them all.

We also perform some window, or 2D, calculations over the fragment shader to get the correct blurring for the final composition, we also calculate the potentials using the fragment shader.

Two things should be implemented to get a correct particle´s rendering, first of all, if we want to render all the particles we will need to use blending, but this requires that all the particles get to be sorted (if not you will se much less particles than the ones you expect). The other thing is to use another texture to control the life of the particle. This way we don´t have to reposition the particle in the emitter every time it reached it´s “y” limit.

Some workarounds:

The previous commented process have a huge bottleneck saving all the particles positions, so we wanted something faster to get the particles position. So instead of calculating the positions in a step way, we defined the positions based on a path (a Bezier path), this Bezier would be affected by the noise so we only would need one texture to save the Bezier points.

One good thing about using a Bezier path is that you can use the linear interpolation of the textures to get the current particle position in the curve, this way we don´t have to save all the particles  positions, instead we only have to save some Bezier points in a texture (to be concrete 512 points) and then we read the particles positions using the gl.LINEAR function to calculate the point interpolation from the texture reading (all of this for free).

So if you can have as many paths as the height of your texture, and each path can have as many points as the width of the texture. In our case we only used to paths, you can see some images of the implementation.

The previous image shows the positions texture using two paths, and the 3D result of the Bezier applying the curl noise. This way we don´t need to save the positions in three textures. The bad thing about it is that all the particles are constrained to the Bezier path, so we cannot perform some particles dispersion.

Instead of applying the curl noise to the Bezier path we could use some of the total particles to create a volumetric path, and use some more to make some particles dispersion with the first technique. In the next image you can see how the Bezier paths are rendered using just one texture.

So if we use the Bezier path to move some initial particles, and we also move the emitter position over the path, and we finally define the exit initial velocity for the dispersion particles in the emitter using the bezier´s tangent, we think we could make something like this, but with much less particles 🙁

### Shadow Particles (part II: Optimizations)

28, Jun 2011/Categories 3D, webGL/No Comments

VIEW THE EXAMPLE

In the last post we talked about the implementation of volume shadows in a particle system, this last approach used a couple of for loops in order to define the final shadow intensity for each particle.  This loops could make that the application has to evaluate 32 * nParticles the shadow map, making the process somehow very low.

So we tried a pair of optimizations that would deal with the defects of the previous work, the first thing we have done is to change all the texture reading to the fragment shader, and then we tried to make the sum of the shadows for each step with the fragment shader also. At the end of the working process we defined the next steps to get the shadows done.

The first step is no different that the last approach, we defined 64 buckets in the vertex shader using the depth of the particles assuming that all of them are inside a bounding box, then we render them in a shadow framebuffer defining a color depending on the shadowDarkness variable. For this step we use the same getDepthSlice function in the vertex Shader.

void getDepthSlice(float offset, mat4 transformMatrix) {
vec3 vertexPosition = mix(aVertexFinalPosition, aVertexPosition, uMorpher);
eyeVector = vec4(vertexPosition, 1.0);
eyeVector = transformMatrix * eyeVector;
gl_Position = uPMatrix * eyeVector;
desface.z = floor(desface.w);
desface.x = mod(desface.z, 8.0);
desface.y = floor(desface.z / 8.0);
desface.y = 8.0 – desface.y;
gl_Position.x -= gl_Position.w;
gl_Position.x += gl_Position.w / 8.0;
gl_Position.x += 2.0 * gl_Position.w * desface.x / 8.0;
gl_Position.y += gl_Position.w;
gl_Position.y -= gl_Position.w / 8.0;
gl_Position.y -= 2.0 * gl_Position.w * desface.y / 8.0;
}

There is one very important difference in this function from the previous one, this new one requires two parameters, one offset and the transformMatrix. The second one is used to tell the function which view (camera view or light view) is used to define the buckets, this is usefull If you try to use the buckets system to sort the particles and blend the buckets.

The first parameters offers you the possibility to obtain the next or previous bucket from a given depth, with this we can interpolate from two buckets or layers, that would give us a linear gradient of shadows between two layers instead of 64 fixed steps.

This step is the main optimization from this code, what it does is to blend the previous bucket into the next so you can define all the shadows for each layer in 64 steps. Blending the buckets require to use the gl.BLEND and to define the gl.blendFunc using gl.ONE and gl.ONE.

Then you have to render one quad to the shadow framebuffer using the shadow map of the shadow framebuffer as a texture, the most important thing here is to get the bucket coordinates for each of the 64 passes for the quad.  This is also done in the vertex shader, with each pass, one depth is defined for the quad and the vertex shader defines the UV texture coordinates using the next chunk of code

desface.x = mod(uDepth, 8.0);
desface.y = floor((uDepth) / 8.0);
gl_Position = vec4(aVertexPosition, 1.0);
gl_Position.xy += vec2(1.0, 1.0);
gl_Position.xy /= 2.0;
gl_Position.xy = (gl_Position.xy + desface.xy) / 4.0;
gl_Position.xy -= vec2(1.0, 1.0);
desface.x = mod((uDepth – 1.0), 8.0);
desface.y = floor((uDepth – 1.0) / 8.0);
vTextureUV.xy = (desface.xy + aVertexUV.xy) / 8.0;
}

There is no big deal here, what it does is to define the coordinates depending on the depth written as X,Y values (depth = 8 * y + x), as you can read we define one vec2 call “desface” that holds the X,Y values used for the texture. The next image shows the result of this blending

Third Step: Blur passes

This step has no difficulty, but requires the use of a composition framebuffer for a intermediate destination, this is made to avoid the forward loops in one texture. We defined the blur in two differents passes, one for the X component and another for the Y component, we used a simple box blur but you can perform a gausian blur or some other blur you like.

Fourth Step: Particles rendering

In this final step we render all the particles to the window framebuffer using the shadow texture, the lightTransformMatrix and the cameraTransformMatrix, first we transform then to the light view to read the correct shadow intensity from the map.

In this part of the process you could read the shadow force for the particle depending on the depth and the bucket, but this would make that all the particles in the same bucket would show the same “force” even though their initial depth are different. This is when we use the offset parameter from the getDepthSlice function, for a given depth and bucket we can obtain the current bucket and the next one and interpolate the shadow force of the current depth between those two layer.

Once the shadow force is defined we have to transform the particle position to the camera view to obtain the light attenuation and position of the particle, and finally render it.

In the vertex shader we have this chuck of code:

getDepthSlice(1.0, uLightTransformMatrix);
vTextureUVNext = vec2((gl_Position.x / gl_Position.w + 1.0) / 2.0, (gl_Position.y / gl_Position.w + 1.0) / 2.0);
vInterpolator = desface.z – desface.w;
getDepthSlice(0.0, uLightTransformMatrix);
vTextureUV = vec2((gl_Position.x / gl_Position.w + 1.0) / 2.0, (gl_Position.y / gl_Position.w + 1.0) / 2.0);
vec3 vertexPosition = mix(aVertexFinalPosition, aVertexPosition, uMorpher);
eyeVector = vec4(vertexPosition, 1.0);
vec4 lightVector = vec4(uLightVector, 1.0);
float distance = abs(dot(lightVector.xyz , eyeVector.xyz) – dot(lightVector.xyz , lightVector.xyz)) / length(lightVector.xyz);
vAttenuation = distance * distance;
vAttenuation = vAttenuation < 1.0 ? 1.0 : vAttenuation;
eyeVector = uCameraTransformMatrix * eyeVector;
gl_Position = uPMatrix * eyeVector;
}

Here we get the uvTexture coordinates for the current bucket and the next one (notice that we call the getDepthSlice function using an offset of 0.0 and 1,0), there is also the attenuation calculation in for each particle in this shader.

In the fragment shader the code goes like this:

vec4 color1 = vec4(0.7, 0.3, 0.2, 0.05);
vec4 color2 = vec4(0.9, 0.8, 0.2, 1.0);
gl_FragColor = mix(color2, color1, uMorpher);
if(uUseInterpolation) {
float shadowForce = texture2D(uSampler, vec2(vTextureUV.x, vTextureUV.y)).r;
float shadowForceNext = texture2D(uSampler, vec2(vTextureUVNext.x, vTextureUVNext.y)).r;
} else {
}
if(uUseAttenuation) gl_FragColor.rgb = 1.0 * gl_FragColor.rgb / vAttenuation;
gl_FragColor += ambientColor;
}

Here we can define a color tint for the shadow, we also read here the shadow from the texture in the current bucket and the next one to perform the interpolation fot the final shadow intensity.

And that is it, four steps to render the shadows, this steps gives you the chance to blur the shadows, make some interpolations between two shadow layers and the possibility to tint the shadows. But it has some disadvantages too, this process requires multiple changes of render targets, and it requires a good variables definition to make it work.

In the example (click on the first image if you haven´t done it yet) you can see these variables to play with:

shadowDarkness: as it name says, it makes the shadow darkness or lighter inside the volume and the shadow applied in the floor.

– lightAngle: it allows you to change the light direction from 0 to 45 degrees, this helps you to see how the shadows change in the volume and on the floor depending on the position of the directional light (remember that it´s always pointing to the center of the object).

– cameraRatio: this ratio gets you closer or farther to the volume.

– useAttenuation: it toggles the light attenuation over the distante (squared), if on  you will see the effects of the shadows plus the attenuation, if off you will see only the shadows in the volume.

– useInterpolation: it toggles the interpolation between two buckets for a givven depth of a particle in those two layers. With a high bRatio and this variable off you should see the segmentations of the shadow in the volume.

– showTexture: it shows you the final shadow map used for the volume shadows and the floor shadow. To read it right you should start from botton to the top. The closest points to the light are in the left down corner of the image, and the final shadow (the one used in the floor) is in the top right corner of the imagen.

– bRatio: this is the most important variable of the whole process, this float defines the bounding sphere ratio that contains the volume. You must define one sphere that holds all the particles inside or you will find some glitches in your rendering. But beware, you could be tempted to define a big bounding sphere for the current volume, but what you are really doing is assigning less layers for the volume.

For this example is really ease to define the bRatio, but when you are animating the lights and the particles things starts to get messy and the only thing left is to define a big bounding sphere. If this is the case you could end up with 16 to 8 layers for one volume, and that is when the interpolation of the buckets information helps. Try to play around with this variable and turn on and off the interpolation and you will see the effects of it.

As we said before we hope we could write a third article using this method to do some animations.

15, Jun 2011/Categories 3D, webGL/No Comments

VIEW EXAMPLE

About one year without writing so, this is the first of three very verbose posts about shadow particles…

One year and a half ago Félix and me were talking about how we could implement a good shading effect for a particle system made in Flash. In those days people could render 100.000 particles with no problem using point lighting or color shading based on the position of the particle, but we wanted something more, and there is when we started to talk about shadows on particles.

Shadowing the particles was not an easy task (for Flash in those days), but we could find very useful information about it in the web, the GPU GEMS has a very good article about rendering volume objects, and we also found a very handy paper about smoke particles written by Nvidia.

Both articles talked about the “half angle rendering”, this kind of rendering uses the half angle between the view direction and the light direction to sort the particles and calculate the shadows for them. The bad news about it is that we must sort the particles with that algorithm.

We tried to implement the algorithm in Flash but it was a total failure, because of that we parked the idea for a not so immediate future. Some time later we got news about webGL and Stage3D (Molehill) so we got very excited about the idea of having hardware acceleration (and the possibility to implement the shadows). Even though we still love Flash (with its future acceleration).

Two weeks ago we finally got some time to play around with webGL, so we started to learn the insides of it, we really recommend the reading of this page, it has many tuttotials that actually helped us. We also took a look to the OpenGL ES 2.0 Specification in order to see what functions we could for the shading.

Another good thing to get in touch with webGL is to read to “classes” of the Three.js Framework, these are WebGLRenderer and WebGLShaders, the first one is very explicit of how to render objects with webGL, but the second one is a must to read, this class wraps all the shaders written and you can explore all the shaders that Three.js has to get an idea of how to program your own shaders.

Shadowing particles is quite easy in OpenGL if you have all the extensions that it offers in the desktop (the full OpenGL version), but in the web (OpenGL ES 2.0) there are some issues that has to be solved to make it work, these issues are:

– There is no 3d Texture in webGL.
– There is no gl_FragData in the fragment shader.
– Sorting particles in the web should be done by Javascript or in the best case with the graphic card.

If we don´t have 3d Textures we can´t save the shadow information in layers as we first wanted, by the other hand if there is no way to use gl_FragData in the fragment shader we must render the different information in different passes for each channel data we want. Finally we decided that sorting in JS is not an option, and trying to sort in the graphic card is one task we didn´t even consider (too hard for us ☹).

So at the end we decided to use the following algorithm:

– All the particles would be “part” of one “object”, by this we mean that we would make only one call to the graphic card to paint all the particles, we would provide one array buffer of the quantity of the particles and their initial positions. This means that all the animations would be made in the vertex shader.

– Render the shadow information in a layered 2D texture: we would use a 8×8 layered texture in order to have 64 shadow layers, the main idea behind it is that each layer would represent the shadow that affects the next layer. It is like if you slice the volume in parts and you render the shadow that the first part gives to the second, the second to the third and so on.

– Use Render to texture from a framebuffer to define the 2D shadow texure: since OpenGL Es gives us the possibility to use framebuffers we thought that we should render the information in one buffer in order to get the texture. In the next pass we could use the texture in the shaders to define the points colors.

– Use a second pass to define the final color of the particles and paint them the window framebuffer.

So as you could read before we would use two passes rendering all the particles at once, the first pass would get the shadow information and the final pass would render the particles with the “correct” color.

The shadow map that we wanted to obtain is something like the following picture, it has 64 “buckets” for 64 different depth defined with transforming the vertex positions to the light UCS. Once the transformation is done the 64 layers must be between the minimum and the maximum z of the volume containing all the particles.

The bad news of this is that you must have a bounding box to define the zMin and the zMax, and it gets worst when you animate your geometry. To solve this instead of a bounding box we decided to use a bounding sphere that could give us one measure (the ratio) that contains all the geometry, this is how the “shadowBoundingRatio” was born.

The “shadowBoundingRatio” is a ratio defined by the user (the programmer) based on the geometry that is going to be renderer (if you are going to animate the geometry there are some hacks that could be done to properly change this variable).

Now that we have one way to define the 64 depth steps, we only have to use the vertex shader to render the particles in their bucket, the bucket is defined using these equations:

desface.x = mod(desface.z, sqrtSteps);
desface.y = floor(desface.z / sqrtSteps);

The first line of code define the actual depth step of the particle based on the shadowBoundingRatio (uniform uBoundingRadio in the vertex shader) and the uShadowRatio (we explain it later). The second and third line define the x and y bucket position from 0 to 7 in the 64 buckets texture to properly render the particle.

The uniform variable uShadowRatio is the equivalent of the camera distance from one target. It allows us to adapt the slice render to the position in the texture (think of it of the only way that you could see all the information without mixing it in other buckets o rendering it too small).

Once the depth step and the actual bucket are defined we only have to place the particle in the framebuffer, we alter the gl_Position of the particle using the previous information, so getting the actual bucket and defining the gl_Position is made with this function in the vertex shader:

void getDepthSlice() {

eyeVector = uLightTransformMatrix * eyeVector;
gl_Position = uPMatrix * eyeVector;
desface.x = mod(desface.z, sqrtSteps);
desface.y = floor(desface.z / sqrtSteps);

gl_Position.x -= gl_Position.w;
gl_Position.x += gl_Position.w / sqrtSteps;
gl_Position.x += 2.0 * gl_Position.w * desface.x / sqrtSteps;

gl_Position.y += gl_Position.w;
gl_Position.y -= gl_Position.w / sqrt(steps);
gl_Position.y -= 2.0 * gl_Position.w * desface.y / sqrtSteps;

}

Moving each vertex requires to know that the each position is normalized, this means that the each factor goes from the [0-1] range. The final position is defined by this equation:

X = 0.5 * (gl_Position.x / gl_Position.w + 1.0) * ViewportSize
Y = 0.5 * (gl_Position.y / gl_Position.w + 1.0) * ViewportSize

So every move have to be done taking into account the “w” factor of the position, that is why we move the particles relative to the “w” factor in the previous function. A further reading of this is highly recommended here.

With the previous function in the vertex shader we obtain one framebuffer that we could render to a texture for the next render pass, you can see one example of it clicking in the next image. This example is the shadow map for the main example, so you could se how the shadow changes because of the light´s movement.

The last example shows a white map with black particles, actually in the framebuffer we use a vec4(0.0, 0.0, 0.0, 0.0) rgba background color, and the particles are painted with a 1.0 / 64.0 * vec4(1.0, 1.0, 1.0, 0.0) particle color. We make this so if we want that a particle has no light incidence all the 64 layers should give shadow to the particle, the less layers affect the particle position, the less dark the shadow would be in the actual particle.

Once the particles are draw in the framebuffer we only have to get the final texture to the next pass, one good example of how to make this appears in this tutorial from LearningWebGL.

Rendering the second pass:

The second pass is used to simulate the light interaction with each particle, since particles does not have a normal not very much can be done with real lights simulations, but one thing can be done and is to define a light attenuation based on the distance from the particle to the light source. If the distance is defined with the light position, we can get a point light source, if it is defined with the plane conformed with the light vector (light.target – light.position) and the light position, we can define a direction light.

Since the shadows in the layers are renderer parallel to each layer, the final shadows are relative to a directional light source, but in our approximation we have a mix of point light attenuation and directional lights shadows (this is because the point to point is easier to calculate than a plane to point distance). But if you want to be rigorous you can change the attenuation distance calculation for a direction light.

The final attenuation is by definition the square of the distance obtained (via directional or point light), but it can be defined by any power of a fractional number to obtain a desired effect (linear attenuation, square attenuation, cubic attenuation).

To change the shadow darkness you can change the divider of the vec4 color used in the vertex shader (the 1.0/64.0), if the 64.0 divisor is slower you would have darker shadows, higher values would make the shadow softer.

If you think the attenuation is right, and the shadows are fine, but the whole thing is too dark, you can always perform an ambient light (adding a fractional color to all the particles) after you implement the shadow and attenuation.

Not everything is perfect….

The second pass has a little issue that we will implement for a future post, each particle has a shadow contribution of all the previous depth layers, so you have to read the shadow information from all the previous layers for each particle and them sum the collected information to get the shadow intensity for this particle.

This means that “for” bucles must be used in the vertex shader, so for the first depth there are no contributions of previous depths but for the n depth we have a contribution of (n-1) depths, and this make the shading quite slowly compared to the only use of the attenuation. How slowly does it make it, well if we calculate the sum of n from n = 1 to n = 64 we have 2048 passes (it is not THAT bad), but this number has to be divided by the total depths (64) so it gives us that it has a 32 textures readings for each depth (this slows down the performance a lot).

There is a work around to this problem, and it is as simple as to apply 64 more passes to the framebuffer to sum the individual buckets, so the first bucket would have no information, the second would be the sum of the first and the second, the third would be the sum of the first, the second and the third and so on.

In each step of the composition we add to the next bucket the previous sum of the previous step, each bucket would have all the information in itself. This means that we do not need to use the “for” bucles to read the texture and sum the components in the shader. Another benefit of this composition is that the final bucket contains all the shadow information that the volume produce on one surface, so you can project the shadow from the particles to any object in your scene.

This optimization of the code also allow us to redefine the second rendering pass in order to recreate a blending among the particles, we can render each depth with its final color like we did in the shadow texture and then we would have the particles sorted in layers. Finally we only would add each part of the frameBuffer color texture to get the proper blending to transparent particles.

There is another hack for the “for” issues, it´s actually what we are doing in this example, and it´s to “assume” that the stronger shadows for one leyer are the ones closest to it, so you don´t have to sum all the previous layers for one depth, but a limit of them (we are using only ten layers). The drawback of this hack is that it gives many artifacts (but it´s a little faster).

In the next post we will make the improvements we have talked before, a second post will talk about the composition in the frameBuffer (for the shadow texture and the second pass of the color texture), and finally a third post will talk about how to use this algorithm to animate particles (taking into account the limitations of the bounding sphere ratio), and how to render the particles shadow to a surface.

For this time we have a little example that shows all we have discussed in this post, one simple volume that you can press the scene to drag the particles, turning on and off the light attenuation and shadows to see the shading on the them. You can see it clicking on the image at the beginning of the post.