Utility functions

Optimization

theano.sandbox.gpuarray.opt_util.alpha_merge(cls, alpha_in, beta_in)

Decorator to merge multiplication by a scalar on the output.

This will find a pattern of scal * <yourop>(some, params, alpha, beta) and update it so that the scalar multiplication happens as part of your op.

The op needs to accept an alpha and a beta scalar which act this way:

out = Op() * alpha + out_like * beta

Where out_like is a buffer that has the same size as the output and gets added to the “real” output of the operation. An example of an operation that respects this pattern is GEMM from BLAS.

The decorated function must have this signature:

maker(node, *inputs)

The node argument you receive is the original apply node that contains your op. You should use it to grab relevant properties for your op so that the new version performs the same computation. The *inputs parameter contains the new inputs for your op. You MUST use those inputs instead of the ones on node. Note that this function can be as simple as:

def maker(node, *inputs):
    return node.op(*inputs)
Parameters:
  • cls (op class) – The class of the op you want to merge.
  • alpha_in (int) – The input index for the alpha scalar for your op (in node.inputs).
  • beta_in (int) – The input index for the beta scalar for your op (in node.inputs).
Returns:

an unregistered local optimizer that has the same name as the decorated function.

Return type:

local optimizer

Notes

This was factored out since the code to deal with intervening transfers and correctness in the presence of different values of alpha and beta scaling factors is not trivial.
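
For illustration, here is a minimal sketch of how the decorator might be used. The GpuGemm class and the input indices 1 and 4 are assumptions made for the example; check the actual input layout of your op before using them.

from theano.sandbox.gpuarray.opt_util import alpha_merge
from theano.sandbox.gpuarray.blas import GpuGemm  # assumed op class for this example

@alpha_merge(GpuGemm, alpha_in=1, beta_in=4)
def local_gemm_alpha_merge(node, *inputs):
    # Rebuild the op with the rewritten inputs; node is only consulted
    # for the original op instance and its properties.
    return node.op(*inputs)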

theano.sandbox.gpuarray.opt_util.find_node(v, cls, ignore_clients=False)

Find the node that has an op of type cls in v.

This digs through possibly redundant transfers to find the node of type cls. If ignore_clients is False (the default), it will only dig through nodes that have a single client, to avoid duplicating computations.

Parameters:
  • v – The variable to dig through
  • cls (Op class) – The type of the node we are looking for
  • ignore_clients (bool, optional) – Whether to ignore multiple clients or not.
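
A minimal usage sketch, reusing the GpuGemm class from the example above (the op class is an assumption for illustration only):

from theano.sandbox.gpuarray.opt_util import find_node
from theano.sandbox.gpuarray.blas import GpuGemm  # assumed op class for this example

def gemm_node_behind(var):
    # Returns the apply node whose op is a GpuGemm if `var` is
    # (possibly through transfers) its output, otherwise None.
    return find_node(var, GpuGemm)
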
theano.sandbox.gpuarray.opt_util.grab_cpu_scalar(v, nd)

Get a scalar variable value from the tree at v.

This function will dig through transfers and dimshuffles to get the constant value. If no such constant is found, it returns None.

Parameters:
  • v – Theano variable to extract the constant value from.
  • nd (int) – Expected number of dimensions for the variable (for broadcasted constants).
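
A minimal sketch of the intended use, assuming a graph in which a Python scalar was broadcast against a matrix:

import theano.tensor as T
from theano.sandbox.gpuarray.opt_util import grab_cpu_scalar

x = T.matrix('x')
y = 0.5 * x  # the 0.5 gets dimshuffled to 2 dimensions inside the graph

# One of the multiplication inputs wraps the constant 0.5; for the
# other (x itself), grab_cpu_scalar should return None.
found = [grab_cpu_scalar(i, nd=2) for i in y.owner.inputs]
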
theano.sandbox.gpuarray.opt_util.inplace_allocempty(op, idx)

Wrapper to make an inplace optimization that deals with AllocEmpty.

This will duplicate the AllocEmpty input if it has more than one client, so that the op can work on it inplace.

The decorated function must have this signature:

maker(node, inputs)

The node argument you receive is the original apply node that contains your op. You should use it to grab relevant properties for your op so that the new version performs the same computation. You should also switch the op to work inplace. The inputs parameter contains the new inputs for your op. You MUST use those inputs instead of the ones on node. Note that this function can be as simple as:

def maker(node, inputs):
    return [node.op.__class__(inplace=True)(*inputs)]
Parameters:
  • op (op class) – The op class to look for to make inplace
  • idx (int) – The index of the (possibly) AllocEmpty input (in node.inputs).
Returns:

an unregistered inplace local optimizer that has the same name as the decorated function.

Return type:

local optimizer
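
A minimal sketch, again assuming a GpuGemm-like op with an inplace constructor flag and its output buffer at input index 0; both the class and the index are assumptions for illustration.

from theano.sandbox.gpuarray.opt_util import inplace_allocempty
from theano.sandbox.gpuarray.blas import GpuGemm  # assumed op class for this example

@inplace_allocempty(GpuGemm, 0)
def local_gemm_inplace(node, inputs):
    # Rebuild the op in inplace mode over the (possibly duplicated)
    # AllocEmpty buffer passed through `inputs`.
    return [node.op.__class__(inplace=True)(*inputs)]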

theano.sandbox.gpuarray.opt_util.is_equal(var, val)

Returns True if var is always equal to val.

This only returns True when the variable is guaranteed to equal the value; if it might differ in some cases, False is returned.

Parameters:
  • var – Variable to compare
  • val – Python value
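
A minimal sketch of the intended behaviour; such checks are typically used to test whether an alpha or beta input is a constant 0 or 1 before fusing:

import numpy
import theano.tensor as T
from theano.sandbox.gpuarray.opt_util import is_equal

one = T.constant(numpy.asarray(1.0, dtype='float32'))
is_equal(one, 1.0)            # should be True: the variable is always 1.0
is_equal(T.scalar('a'), 1.0)  # should be False: a symbolic scalar might not be 1.0
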
theano.sandbox.gpuarray.opt_util.output_merge(cls, alpha_in, beta_in, out_in)

Decorator to merge addition by a value on the output.

This will find a pattern of val + <yourop>(some, params, alpha, beta, out_like) and update it so that the addition happens as part of your op.

The op needs to accept an alpha and a beta scalar which act this way:

out = Op() * alpha + out_like * beta

Where out_like is a buffer that has the same size as the output and gets added to the “real” output of the operation. An example of an operation that respects this pattern is GEMM from BLAS.

The decorated function must have this signature:

maker(node, *inputs)

The node argument you receive is the original apply node that contains your op. You should use it to grab relevant properties for your op so that the new version performs the same computation. The *inputs parameter contains the new inputs for your op. You MUST use those inputs instead of the ones on node. Note that this function can be as simple as:

def maker(node, *inputs):
    return node.op(*inputs)
Parameters:
  • cls (op class) – The class of the op you want to merge.
  • alpha_in (int) – The input index for the alpha scalar for your op (in node.inputs).
  • beta_in (int) – The input index for the beta scalar for your op (in node.inputs).
  • out_in (int) – The input index for the out_like input for your op (in node.inputs).
Returns:

an unregistered local optimizer that has the same name as the decorated function.

Return type:

local optimizer

Notes

This was factored out since the code to deal with intervening transfers and correctness in the presence of different values of alpha and beta scaling factors is not trivial.

This also correctly handles the case where the added value is broadcasted (by not performing the replacement).
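
A minimal sketch mirroring the alpha_merge example above, with the assumed GpuGemm input indices extended by the out_like index (the class and all indices are assumptions for illustration):

from theano.sandbox.gpuarray.opt_util import output_merge
from theano.sandbox.gpuarray.blas import GpuGemm  # assumed op class for this example

@output_merge(GpuGemm, alpha_in=1, beta_in=4, out_in=0)
def local_gemm_output_merge(node, *inputs):
    return node.op(*inputs)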

Kernel generation

Helper routines for generating GPU kernels for nvcc.

theano.sandbox.gpuarray.kernel_codegen.code_version(version)

Decorator to support a version-based cache mechanism.
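
A minimal sketch, assuming the decorator tags the code-generating function with the given version tuple so that cached compilations can be invalidated when the generated code changes:

from theano.sandbox.gpuarray.kernel_codegen import code_version

@code_version((1, 0))
def my_codegen():
    # Hypothetical code generator; the version tuple above would be
    # bumped whenever the returned source changes.
    return "/* generated CUDA C */"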

theano.sandbox.gpuarray.kernel_codegen.inline_reduce(N, buf, pos, count, manner_fn)

Return C++ code for a function that reduces a contiguous buffer.

Parameters:
  • N – Length of the buffer.
  • buf – Buffer pointer.
  • pos – Index of executing thread.
  • count – Number of executing threads.
  • manner_fn

    A function that accepts strings for arguments a and b, and returns C code for their reduction. For example,

    return "%(a)s + %(b)s"

    for a sum reduction.

Notes

buf should be in GPU shared memory, as it is accessed many times.

This function leaves the answer in position 0 of the buffer. The rest of the buffer is trashed by this function.
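
A minimal sketch of generating the reduction code, assuming N, buf, pos and count name variables that already exist in the surrounding kernel source, and that manner_fn must return the combined C expression:

from theano.sandbox.gpuarray.kernel_codegen import inline_reduce

def sum_fn(a, b):
    # C expression combining two partial results.
    return "%(a)s + %(b)s" % locals()

reduce_code = inline_reduce("N", "buf", "threadIdx.x", "blockDim.x", sum_fn)
# reduce_code is spliced into a kernel body; the result ends up in buf[0].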

theano.sandbox.gpuarray.kernel_codegen.inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count, manner_fn, manner_init, b='', stride_b='', load_b='', dtype='float32')

Return C++ code for a function that reduces a contiguous buffer.

This function leaves the answer in position 0 of the buffer. The rest of the buffer is trashed by this function.

Parameters:
  • N – Length of the buffer.
  • buf – Pointer to a buffer of size warpSize * sizeof(dtype).
  • x – Input data.
  • stride_x – Input data stride.
  • load_x – Wrapper to read from x.
  • pos – Index of executing thread.
  • count – Number of executing threads.
  • b – Optional, pointer to the bias.
  • stride_b – Optional, the stride of b if b is provided.
  • load_b – Optional, wrapper to read from b if b is provided.
  • dtype – Optional, the dtype of the output.
  • manner_fn

    A function that accepts strings for arguments a and b, and returns C code for their reduction. For example,

    return "%(a)s + %(b)s"

    for a sum reduction.

  • manner_init – A function that accepts a string for argument a and returns C code for its initialization.

Notes

buf should be in GPU shared memory, as it is accessed many times.

theano.sandbox.gpuarray.kernel_codegen.inline_softmax(N, buf, buf2, threadPos, threadCount, dtype='float32')

Generate code for a softmax.

On entry, buf and buf2 must contain two identical copies of the input to softmax.

After the code returns, buf contains the softmax and buf2 contains the un-normalized softmax.

Parameters:
  • N – Length of the buffer.
  • threadPos – Index of executing thread.
  • threadCount – Number of executing threads.
  • dtype – Dtype of the softmax’s output.

Notes

buf and buf2 should be in GPU shared memory, as they are accessed many times.

We use __i as an int variable in a loop.
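
A minimal sketch, assuming buf and buf2 name two shared-memory buffers that already hold identical copies of the input row, as described above:

from theano.sandbox.gpuarray.kernel_codegen import inline_softmax

softmax_code = inline_softmax(
    "N", "buf", "buf2", "threadIdx.x", "blockDim.x", dtype="float32")
# softmax_code is spliced into a larger kernel body (see nvcc_kernel below).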

theano.sandbox.gpuarray.kernel_codegen.inline_softmax_fixed_shared(N, buf, x, stride_x, load_x, sm, sm_stride, write_sm, threadPos, threadCount, b='', stride_b='', load_b='', dtype='float32')

Generate code to perform softmax with a fixed amount of shared memory.

On entry, buf is assumed to be empty.

On exit, buf[0] contains the softmax and buf2 contains the un-normalized softmax.

Parameters:
  • N – Length of the buffer, at least warpSize (32).
  • buf – A shared memory buffer of size warpSize * sizeof(dtype).
  • x – A pointer to the GPU memory where the row is stored.
  • stride_x – The stride between each element in x.
  • load_x – Wrapper to read from x.
  • sm – A pointer to the GPU memory where the result is stored.
  • sm_stride – The stride between each sm element.
  • write_sm – Wrapper before writing to sm.
  • threadPos – Index of executing thread.
  • threadCount – Number of executing threads.
  • b – Optional, pointer to the bias.
  • stride_b – Optional, the stride of b if b is provided.
  • load_b – Optional, wrapper to read from b if b is provided.
  • dtype – Optional, the dtype of the softmax’s output if not float32.

Notes

buf should be in GPU shared memory, as it is accessed many times.

We use tx as an int variable in a loop.

theano.sandbox.gpuarray.kernel_codegen.nvcc_kernel(name, params, body)

Return the C code of a kernel function.

Parameters:
  • params – The parameters to the function as one or more strings.
  • body – The [nested] list of statements for the body of the function. These will be separated by ';' characters.
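
A minimal sketch assembling a small kernel; the statements are given without trailing ';' since the body entries are joined with ';' characters:

from theano.sandbox.gpuarray.kernel_codegen import nvcc_kernel

kernel_src = nvcc_kernel(
    "k_scale",
    params=["float *x", "const int n", "const float a"],
    body=[
        "const int i = blockIdx.x * blockDim.x + threadIdx.x",
        "if (i < n) { x[i] = a * x[i]; }",
    ],
)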