sandbox.cuda – List of CUDA GPU Ops implemented

Normally you should not call these Ops directly! Theano should automatically transform CPU ops into their GPU equivalents, so this list is mainly useful to let people know what is implemented on the GPU.
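
For illustration, here is a minimal sketch of what that means in practice (assuming the old CUDA backend is enabled, e.g. THEANO_FLAGS='device=gpu,floatX=float32'; the exact ops that appear in the compiled graph depend on your Theano version and on which optimizations apply):

    import numpy
    import theano
    import theano.tensor as T

    x = T.fmatrix('x')             # ordinary CPU symbolic matrix
    y = T.exp(x).sum()             # ordinary tensor expression
    f = theano.function([x], y)    # the optimizer may rewrite Elemwise -> GpuElemwise,
                                   # CAReduce -> GpuCAReduce, and insert
                                   # GpuFromHost / HostFromGpu transfers
    theano.printing.debugprint(f)  # inspect which GPU ops were inserted
    print(f(numpy.ones((3, 4), dtype='float32')))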

Basic Op

class theano.sandbox.cuda.basic_ops.CopyOnNegativeStrides(use_c_code='/usr/bin/g++')

Checks if the input contains negative strides.

If it does, returns a C-contiguous copy.

class theano.sandbox.cuda.basic_ops.GpuAdvancedIncSubtensor1(inplace=False, set_instead_of_inc=False)

Implement AdvancedIncSubtensor1 on the gpu.

class theano.sandbox.cuda.basic_ops.GpuAdvancedIncSubtensor1_dev20(inplace=False, set_instead_of_inc=False)

Implement AdvancedIncSubtensor1 on the gpu, but use functions only available on compute capability 2.0 and more recent.

make_node(x, y, ilist)

It differs from GpuAdvancedIncSubtensor1 in that it makes sure the indices are of type long.

class theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1(sparse_grad=False)

Implement AdvancedSubtensor1 on the gpu.
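
As a hedged sketch, integer-vector indexing written with theano.tensor is what the optimizer may lower to GpuAdvancedSubtensor1 and GpuAdvancedIncSubtensor1 on this backend (the names and shapes below are illustrative only):

    import theano
    import theano.tensor as T

    x = T.fmatrix('x')
    idx = T.ivector('idx')
    taken = x[idx]                           # candidate for GpuAdvancedSubtensor1
    added = T.inc_subtensor(x[idx], taken)   # candidate for GpuAdvancedIncSubtensor1
    f = theano.function([x, idx], [taken, added])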

class theano.sandbox.cuda.basic_ops.GpuAlloc(memset_0=False)

Implement Alloc on the gpu.

The memset_0 param is an optimization. When True, we call cudaMemset, which is faster.

class theano.sandbox.cuda.basic_ops.GpuAllocEmpty(use_c_code='/usr/bin/g++')

Implement Alloc on the gpu, but without initializing memory.

class theano.sandbox.cuda.basic_ops.GpuCAReduce(reduce_mask, scalar_op, pre_scalar_op=None)

GpuCAReduce is a Reduction along some dimensions by a scalar op.

The dimensions along which to reduce are specified by the reduce_mask that you pass to the constructor. The reduce_mask is a tuple of booleans (actually integers 0 or 1) that specifies, for each input dimension, whether to reduce it (1) or not (0).

Parameters:pre_scalar_op – If present, must be a scalar op with only 1 input. We will execute it on the input value before reduction.

Notes

This Op is a work in progress.

This op was recently upgraded from just GpuSum to a general CAReduce. Not many code cases are supported for scalar_op being anything other than scalar.Add instances yet.

Important note: if you implement new cases for this op, be sure to benchmark them and make sure that they actually result in a speedup. GPUs are not especially well-suited to reduction operations so it is quite possible that the GPU might be slower for some cases.

Examples

When scalar_op is a theano.scalar.basic.Add instance:

  • reduce_mask == (1,) sums a vector to a scalar
  • reduce_mask == (1,0) computes the sum of each column in a matrix
  • reduce_mask == (0,1) computes the sum of each row in a matrix
  • reduce_mask == (1,1,1) computes the sum of all elements in a 3-tensor.

Note

Any reduce_mask of all zeros is a sort of ‘copy’, and may be removed during graph optimization.
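
The same reductions, written with theano.tensor so that the GPU optimizer can substitute GpuCAReduce with the corresponding mask (a hedged sketch; the actual rewrites depend on the flags and Theano version in use):

    import theano
    import theano.tensor as T

    v = T.fvector('v')
    m = T.fmatrix('m')

    s_all  = v.sum()          # reduce_mask == (1,)   : vector -> scalar
    s_cols = m.sum(axis=0)    # reduce_mask == (1, 0) : sum of each column
    s_rows = m.sum(axis=1)    # reduce_mask == (0, 1) : sum of each row
    f = theano.function([v, m], [s_all, s_cols, s_rows])
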
c_code_reduce_01X(sio, node, name, x, z, fail, N)
Parameters:N (int) – The number of 1s in the pattern: N=1 -> 01, N=2 -> 011, N=3 -> 0111. Works for N=1, 2, 3.
c_code_reduce_ccontig(sio, node, name, x, z, fail)

WRITEME

IG: I believe, based on how this is called in c_code, that it is for the case where we are reducing on all axes and x is C contiguous.

supports_c_code(inputs)

Returns True if the current op and reduce pattern have functioning C code.

class theano.sandbox.cuda.basic_ops.GpuContiguous(use_c_code='/usr/bin/g++')

Always return a C-contiguous output. Copy the input only if it is not already C-contiguous.

class theano.sandbox.cuda.basic_ops.GpuDimShuffle(input_broadcastable, new_order)

Implement DimShuffle on the gpu.

class theano.sandbox.cuda.basic_ops.GpuElemwise(scalar_op, inplace_pattern=None, sync=None)

Implement a generic elemwise on the gpu.

class theano.sandbox.cuda.basic_ops.GpuFlatten

Implement Flatten on the gpu.

Note

The GpuFlatten interface is deprecated; you should use gpu_flatten instead.

class theano.sandbox.cuda.basic_ops.GpuFromHost(use_c_code='/usr/bin/g++')

Implement the transfer from cpu to the gpu.

class theano.sandbox.cuda.basic_ops.GpuIncSubtensor(idx_list, inplace=False, set_instead_of_inc=False, destroyhandler_tolerate_aliased=None)

Implement IncSubtensor on the gpu.

Notes

The optimization to make this inplace is in tensor/opt. The same optimization handles IncSubtensor and GpuIncSubtensor. This Op has c_code too; it inherits tensor.IncSubtensor’s c_code. The helper methods like do_type_checking, copy_of_x, etc. specialize the c_code for this Op.
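
As a hedged illustration, increments written with theano.tensor.inc_subtensor (or set_subtensor) are what the optimizer may lower to GpuIncSubtensor, and the inplace optimization mentioned above may then apply:

    import theano
    import theano.tensor as T

    x = T.fmatrix('x')
    y = T.fmatrix('y')
    z = T.inc_subtensor(x[1:3], y)   # candidate for GpuIncSubtensor (possibly inplace)
    f = theano.function([x, y], z)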

copy_into(view, source)
Parameters:
  • view (str) – C code expression for an array.
  • source (str) – C code expression for an array
Returns:

A C code expression to copy source into view, and 0 on success.

Return type:

str

copy_of_x(x)
Parameters:x (str) – A string giving the name of a C variable pointing to an array.
Returns:C code expression to make a copy of x.
Return type:str

Notes

Base class uses PyArrayObject *, subclasses may override for different types of arrays.

do_type_checking(node)

Should raise NotImplementedError if c_code does not support the types involved in this node.

get_helper_c_code_args()

Return a dictionary of arguments to use with helper_c_code.

make_view_array(x, view_ndim)
Parameters:
  • x (str) – A string identifying an array to be viewed.
  • view_ndim (str) – A string specifying the number of dimensions to have in the view. This doesn’t need to actually set up the view with the right indexing; we’ll do that manually later.
class theano.sandbox.cuda.basic_ops.GpuJoin(use_c_code='/usr/bin/g++')

Implement Join on the gpu.

class theano.sandbox.cuda.basic_ops.GpuReshape(ndim, name=None)

Implement Reshape on the gpu.

class theano.sandbox.cuda.basic_ops.GpuShape(use_c_code='/usr/bin/g++')

Implement Shape on the gpu.

class theano.sandbox.cuda.basic_ops.GpuSubtensor(idx_list)

Implement subtensor on the gpu.

class theano.sandbox.cuda.basic_ops.HostFromGpu(use_c_code='/usr/bin/g++')

Implement the transfer from gpu to the cpu.
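
A minimal sketch of the explicit transfer ops (normally the optimizer inserts GpuFromHost / HostFromGpu for you; gpu_from_host and host_from_gpu are the pre-built instances in basic_ops):

    import theano
    import theano.tensor as T
    from theano.sandbox.cuda.basic_ops import gpu_from_host, host_from_gpu

    x = T.fmatrix('x')
    x_gpu = gpu_from_host(x)          # CPU ndarray -> CudaNdarray on the device
    x_back = host_from_gpu(x_gpu)     # CudaNdarray -> CPU ndarray
    f = theano.function([x], x_back)  # the optimizer typically removes such round trips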

theano.sandbox.cuda.basic_ops.col(name=None, dtype=None)

Return a symbolic column variable (ndim=2, broadcastable=[False,True]).

Parameters:
  • dtype – Numeric type (None means to use theano.config.floatX).
  • name (str) – A name to attach to this variable.
theano.sandbox.cuda.basic_ops.gpu_flatten(x, outdim=1)

Implement flatten on the gpu. Reshapes the variable x by keeping the first outdim-1 dimension size(s) of x the same, and making the last dimension size of x equal to the product of its remaining dimension size(s).

Parameters:
  • x (theano.tensor.var.TensorVariable) – the variable that should be reshaped.
  • outdim (int) – the number of dimensions of the returned variable
Returns:

the flattened variable with dimensionality of outdim

Return type:

theano.tensor.var.TensorVariable
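
A short hedged example (the shapes are illustrative; with a (3, 4, 5) input, outdim=2 keeps the first dimension and collapses the remaining ones into a trailing dimension of size 20):

    import theano
    import theano.tensor as T
    from theano.sandbox.cuda.basic_ops import gpu_flatten, gpu_from_host

    x = T.ftensor3('x')                          # e.g. shape (3, 4, 5) at call time
    y = gpu_flatten(gpu_from_host(x), outdim=2)  # shape (3, 20), on the GPU
    f = theano.function([x], y)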

theano.sandbox.cuda.basic_ops.matrix(name=None, dtype=None)

Return a symbolic matrix variable.

Parameters:
  • dtype – Numeric type (None means to use theano.config.floatX).
  • name – A name to attach to this variable.
theano.sandbox.cuda.basic_ops.row(name=None, dtype=None)

Return a symbolic row variable (ndim=2, broadcastable=[True,False]).

Parameters:
  • dtype – Numeric type (None means to use theano.config.floatX).
  • name (str) – A name to attach to this variable.
theano.sandbox.cuda.basic_ops.scalar(name=None, dtype=None)

Return a symbolic scalar variable.

Parameters:
  • dtype – Numeric type (None means to use theano.config.floatX).
  • name (str) – A name to attach to this variable.
theano.sandbox.cuda.basic_ops.tensor3(name=None, dtype=None)

Return a symbolic 3-D variable.

Parameters:
  • dtype – Numeric type (None means to use theano.config.floatX).
  • name (str) – A name to attach to this variable.
theano.sandbox.cuda.basic_ops.tensor4(name=None, dtype=None)

Return a symbolic 4-D variable.

Parameters:
  • dtype – Numeric type (None means to use theano.config.floatX).
  • name (str) – A name to attach to this variable.
theano.sandbox.cuda.basic_ops.vector(name=None, dtype=None)

Return a symbolic vector variable.

Parameters:
  • dtype – Numeric type (None means to use theano.config.floatX).
  • name – A name to attach to this variable.
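
A hedged sketch of these constructors; they build symbolic variables of the CUDA ndarray type (float32 only on this backend), so theano.config.floatX should normally be 'float32' when using them:

    import theano
    from theano.sandbox.cuda.basic_ops import matrix, vector, row, col

    W = matrix('W')     # ndim=2
    b = vector('b')     # ndim=1
    r = row('r')        # ndim=2, broadcastable=[True, False]
    c = col('c')        # ndim=2, broadcastable=[False, True]
    f = theano.function([r, c], r + c)   # broadcasts to a full matrix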

Blas Op

class theano.sandbox.cuda.blas.BaseGpuCorr3dMM(border_mode='valid', subsample=(1, 1, 1), pad=(0, 0, 0))

Base class for GpuCorr3dMM, GpuCorr3dMM_gradWeights and GpuCorr3dMM_gradInputs. Cannot be used directly.

c_code_helper(bottom, weights, top, direction, sub, height=None, width=None, depth=None)

This generates the C code for GpuCorrMM (direction=”forward”), GpuCorrMM_gradWeights (direction=”backprop weights”), and GpuCorrMM_gradInputs (direction=”backprop inputs”). Depending on the direction, one of bottom, weights, top will receive the output, while the other two serve as inputs.

Parameters:
  • bottom – Variable name of the input images in the forward pass, or the gradient of the input images in backprop wrt. inputs.
  • weights – Variable name of the filters in the forward pass, or the gradient of the filters in backprop wrt. weights.
  • top – Variable name of the output images / feature maps in the forward pass, or the gradient of the outputs in the backprop passes.
  • direction ({‘forward’, ‘backprop weights’, ‘backprop inputs’}) – “forward” to correlate bottom with weights and store results in top, “backprop weights” to do a valid convolution of bottom with top (swapping the first two dimensions) and store results in weights, and “backprop inputs” to do a full convolution of top with weights (swapping the first two dimensions) and store results in bottom.
  • sub – Dictionary of substitutions usable to help generate the C code.
  • height – If self.subsample[0] != 1, a variable giving the height of the filters for direction=”backprop weights” or the height of the input images for direction=”backprop inputs”. If self.pad == ‘half’, a variable giving the height of the filters for direction=”backprop weights”. Ignored otherwise.
  • width – If self.subsample[1] != 1, a variable giving the width of the filters for direction=”backprop weights” or the width of the input images for direction=”backprop inputs”. If self.pad == ‘half’, a variable giving the width of the filters for direction=”backprop weights”. Ignored otherwise.
  • depth – If self.subsample[2] != 1, a variable giving the depth of the filters for direction=”backprop weights” or the depth of the input images for direction=”backprop inputs”. If self.pad == ‘half’, a variable giving the depth of the filters for direction=”backprop weights”. Ignored otherwise.
flops(inp, outp)

Useful with the hack in profilemode to print the MFlops.

class theano.sandbox.cuda.blas.BaseGpuCorrMM(border_mode='valid', subsample=(1, 1), pad=(0, 0))

Base class for GpuCorrMM, GpuCorrMM_gradWeights and GpuCorrMM_gradInputs. Cannot be used directly.

Parameters:
  • border_mode ({‘valid’, ‘full’, ‘half’}) – Additionally, the padding size can be directly specified by an integer or a pair of integers.
  • subsample – Perform subsampling of the output (default: (1, 1)).
  • pad – Deprecated; now you should always use border_mode.
c_code_helper(bottom, weights, top, direction, sub, height=None, width=None)

This generates the C code for GpuCorrMM (direction=”forward”), GpuCorrMM_gradWeights (direction=”backprop weights”), and GpuCorrMM_gradInputs (direction=”backprop inputs”). Depending on the direction, one of bottom, weights, top will receive the output, while the other two serve as inputs.

Parameters:
  • bottom – Variable name of the input images in the forward pass, or the gradient of the input images in backprop wrt. inputs
  • weights – Variable name of the filters in the forward pass, or the gradient of the filters in backprop wrt. weights
  • top – Variable name of the output images / feature maps in the forward pass, or the gradient of the outputs in the backprop passes
  • direction ({‘forward’, ‘backprop weights’, ‘backprop inputs’}) – “forward” to correlate bottom with weights and store results in top, “backprop weights” to do a valid convolution of bottom with top (swapping the first two dimensions) and store results in weights, and “backprop inputs” to do a full convolution of top with weights (swapping the first two dimensions) and store results in bottom.
  • sub – Dictionary of substitutions usable to help generate the C code.
  • height – If self.subsample[0] != 1, a variable giving the height of the filters for direction=”backprop weights” or the height of the input images for direction=”backprop inputs”. If self.border_mode == ‘half’, a variable giving the height of the filters for direction=”backprop weights”. Ignored otherwise.
  • width – If self.subsample[1] != 1, a variable giving the width of the filters for direction=”backprop weights” or the width of the input images for direction=”backprop inputs”. If self.border_mode == ‘half’, a variable giving the width of the filters for direction=”backprop weights”. Ignored otherwise.
flops(inp, outp)

Useful with the hack in profilemode to print the MFlops.

class theano.sandbox.cuda.blas.GpuConv(border_mode, subsample=(1, 1), logical_img_hw=None, logical_kern_hw=None, logical_kern_align_top=True, version=-1, direction_hint=None, verbose=0, kshp=None, imshp=None, max_threads_dim0=None, nkern=None, bsize=None, fft_opt=True)

Implement the batched and stacked 2d convolution on the gpu.

Parameters:
  • version – Each version of c_code implements many kernels for the convolution. By default we try to guess the best one. You can force one version with this parameter. This parameter is used by the tests.
  • direction_hint ({‘forward’, ‘bprop weights’, ‘bprop inputs’}) – Serves as a hint for graph optimizers replacing GpuConv by other implementations. If the GpuConv is inserted automatically, we take its value from ConvOp.
  • verbose – For values of 1, 2 and 3, print more information during the execution of the convolution. Mostly used for optimization or debugging.
  • kshp – The size of the kernel. If provided, can generate faster code. If the GpuConv op is automatically inserted, we take its value automatically from the Conv op.
  • imshp – The size of the image. Not used for code generation but allows selecting an experimental new version in another repo.
  • max_threads_dim0 – The maximum number of threads for block size dimension 0 (blockDim.x) used by the GPU function.
  • nkern – The number of kernels. Not used for this op, but can be used by graph optimizers to select a more optimal convolution implementation. If the GpuConv op is inserted automatically, we take its value from the Conv op.
  • bsize – The batch size. Not used for this op, but can be used by graph optimizers to select a more optimal convolution implementation. If the GpuConv op is inserted automatically, we take its value from the Conv op.
  • fft_opt – Deactivate the fft_opt optimization at the op level when set to False. Note that by default FFT optimizations aren’t enabled. See the convolution documentation to enable them.
flops(inputs, outputs)

Useful with the hack in profilemode to print the MFlops.

class theano.sandbox.cuda.blas.GpuCorr3dMM(border_mode='valid', subsample=(1, 1, 1), pad=(0, 0, 0))

GPU correlation implementation using Matrix Multiplication.

Parameters:
  • border_mode – Currently supports “valid” only; “full” can be simulated by setting pad=”full” (at the cost of performance), or by using GpuCorrMM_gradInputs.
  • subsample – The subsample operation applied to each output image. Should be a tuple with 3 elements. (sv, sh, sl) is equivalent to GpuCorr3dMM(...)(...)[:,:,::sv, ::sh, ::sl], but faster. Set to (1, 1, 1) to disable subsampling.
  • pad – The width of a border of implicit zeros to pad the input image with. Should be a tuple with 3 elements giving the numbers of rows, columns and slices to pad on each side, or “half” to set the padding to (kernel_rows // 2, kernel_columns // 2, kernel_depth // 2), or “full” to set the padding to (kernel_rows - 1, kernel_columns - 1, kernel_depth - 1) at runtime. Set to (0, 0, 0) to disable padding.

Notes

Currently, the Op requires the inputs, filters and outputs to be
C-contiguous. Use gpu_contiguous on these arguments if needed.

Warning

For 700 series Nvidia GPUs of compute capability 3.5 and CUDA 5.0 to 6.0, there is a bug in CUBLAS’ matrix multiplication function that can make GpuCorrMM or its gradients crash for some input and filter shapes. So if you have a Tesla K20, Tesla K40, Quadro K6000, GeForce GT 640 (DDR5), GeForce GTX 780 (or Ti), GeForce GTX TITAN (or Black or Z) and experience a crash, switching to CUDA 6.5 or CUDA 4.2 should fix it. If this is not possible, changing the input or filter shapes (e.g., the batchsize or number of filters) may also work around the CUBLAS bug.

class theano.sandbox.cuda.blas.GpuCorr3dMM_gradInputs(border_mode='valid', subsample=(1, 1, 1), pad=(0, 0, 0))

Gradient wrt. inputs for GpuCorr3dMM.

Notes

You will not want to use this directly, but rely on Theano’s automatic differentiation or graph optimization to use it as needed.

class theano.sandbox.cuda.blas.GpuCorr3dMM_gradWeights(border_mode='valid', subsample=(1, 1, 1), pad=(0, 0, 0))

Gradient wrt. filters for GpuCorr3dMM.

Notes

You will not want to use this directly, but rely on Theano’s automatic differentiation or graph optimization to use it as needed.

class theano.sandbox.cuda.blas.GpuCorrMM(border_mode='valid', subsample=(1, 1), pad=(0, 0))

GPU correlation implementation using Matrix Multiplication.

Parameters:
  • border_mode – The width of a border of implicit zeros to pad the input with. Must be a tuple with 2 elements giving the numbers of rows and columns to pad on each side, or a single integer to pad the same on all sides, or a string shortcut setting the padding at runtime: 'valid' for (0, 0) (valid convolution, no padding), 'full' for (kernel_rows - 1, kernel_columns - 1) (full convolution), 'half' for (kernel_rows // 2, kernel_columns // 2) (same convolution for odd-sized kernels). Note that the two widths are each applied twice, once per side (left and right, top and bottom).
  • subsample – The subsample operation applied to each output image. Should be a tuple with 2 elements. (sv, sh) is equivalent to GpuCorrMM(...)(...)[:,:,::sv, ::sh], but faster. Set to (1, 1) to disable subsampling.
  • pad – Deprecated alias for border_mode.

Notes

Currently, the Op requires the inputs, filters and outputs to be C-contiguous. Use gpu_contiguous on these arguments if needed.

You can either enable the Theano flag optimizer_including=conv_gemm to automatically replace all convolution operations with GpuCorrMM or one of its gradients, or you can use it as a replacement for conv2d, called as GpuCorrMM(subsample=...)(image, filters). The latter is currently faster, but note that it computes a correlation – if you need to compute a convolution, flip the filters as filters[:,:,::-1,::-1].
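
A hedged usage sketch following the note above (a direct call as a conv2d replacement, with gpu_contiguous applied to the arguments and the filters flipped when a true convolution is wanted):

    import theano
    import theano.tensor as T
    from theano.sandbox.cuda.basic_ops import gpu_contiguous
    from theano.sandbox.cuda.blas import GpuCorrMM

    images = T.ftensor4('images')    # (batch, channels, rows, cols)
    filters = T.ftensor4('filters')  # (nfilters, channels, krows, kcols)

    corr = GpuCorrMM(subsample=(1, 1))(gpu_contiguous(images),
                                       gpu_contiguous(filters))
    # For a convolution rather than a correlation, flip the filters first:
    conv = GpuCorrMM(subsample=(1, 1))(gpu_contiguous(images),
                                       gpu_contiguous(filters[:, :, ::-1, ::-1]))
    f = theano.function([images, filters], [corr, conv])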

Warning

For 700 series Nvidia GPUs of compute capability 3.5 and CUDA 5.0 to 6.0, there is a bug in CUBLAS’ matrix multiplication function that can make GpuCorrMM or its gradients crash for some input and filter shapes. So if you have a Tesla K20, Tesla K40, Quadro K6000, GeForce GT 640 (DDR5), GeForce GTX 780 (or Ti), GeForce GTX TITAN (or Black or Z) and experience a crash, switching to CUDA 6.5 or CUDA 4.2 should fix it. If this is not possible, changing the input or filter shapes (e.g., the batchsize or number of filters) may also work around the CUBLAS bug.

class theano.sandbox.cuda.blas.GpuCorrMM_gradInputs(border_mode='valid', subsample=(1, 1), pad=(0, 0))

Gradient wrt. inputs for GpuCorrMM.

Notes

You will not want to use this directly, but rely on Theano’s automatic differentiation or graph optimization to use it as needed.

class theano.sandbox.cuda.blas.GpuCorrMM_gradWeights(border_mode='valid', subsample=(1, 1), pad=(0, 0))

Gradient wrt. filters for GpuCorrMM.

Notes

You will not want to use this directly, but rely on Theano’s automatic differentiation or graph optimization to use it as needed.

class theano.sandbox.cuda.blas.GpuDot22(use_c_code='/usr/bin/g++')

Implement dot(2d, 2d) on the gpu.

class theano.sandbox.cuda.blas.GpuDot22Scalar(use_c_code='/usr/bin/g++')

Implement dot(2d, 2d) * scalar on the gpu.

Notes

Not used anymore. Keep to allow unpickle of old graph.

class theano.sandbox.cuda.blas.GpuDownsampleFactorMax(ds, ignore_border=False)

Implement downsample with max on the gpu.
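
A hedged sketch: pooling written with the standard downsample interface is what the GPU optimizer may replace with GpuDownsampleFactorMax (the module path follows the old Theano API and may differ between versions):

    import theano
    import theano.tensor as T
    from theano.tensor.signal.downsample import max_pool_2d

    images = T.ftensor4('images')    # (batch, channels, rows, cols)
    pooled = max_pool_2d(images, ds=(2, 2), ignore_border=True)
    f = theano.function([images], pooled)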

class theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGrad(ds, ignore_border)

Implement the grad of downsample with max on the gpu.

class theano.sandbox.cuda.blas.GpuDownsampleFactorMaxGradGrad(ds, ignore_border)

Implement the grad of downsample with max on the gpu.

class theano.sandbox.cuda.blas.GpuGemm(inplace)

Implement gemm on the gpu.

class theano.sandbox.cuda.blas.GpuGemv(inplace)

Implement gemv on the gpu.

class theano.sandbox.cuda.blas.GpuGer(inplace)

Implement ger on the gpu.

theano.sandbox.cuda.blas.batched_dot()

Nnet Op

class theano.sandbox.cuda.nnet.GpuCrossentropySoftmax1HotWithBiasDx(**kwargs)

Implement CrossentropySoftmax1HotWithBiasDx on the gpu.

class theano.sandbox.cuda.nnet.GpuCrossentropySoftmaxArgmax1HotWithBias(use_c_code='/usr/bin/g++')

Implement CrossentropySoftmaxArgmax1HotWithBias on the gpu.

class theano.sandbox.cuda.nnet.GpuSoftmax(use_c_code='/usr/bin/g++')

Implement Softmax on the gpu.

class theano.sandbox.cuda.nnet.GpuSoftmaxWithBias(use_c_code='/usr/bin/g++')

Implement SoftmaxWithBias on the gpu.
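
As a hedged sketch, softmax expressions written with theano.tensor.nnet are what the optimizer may replace with GpuSoftmax or GpuSoftmaxWithBias on this backend:

    import theano
    import theano.tensor as T

    x = T.fmatrix('x')
    b = T.fvector('b')
    p = T.nnet.softmax(x + b)   # candidate for GpuSoftmaxWithBias
    q = T.nnet.softmax(x)       # candidate for GpuSoftmax
    f = theano.function([x, b], [p, q])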

Curand Op

Random generator based on the CURAND library. It is not inserted automatically.

Define CURAND_RandomStreams - backed by CURAND.

class theano.sandbox.cuda.rng_curand.CURAND_Base(output_type, seed, destructive)

Base class for a random number generator implemented in CURAND.

The random number generator itself is an opaque reference managed by CURAND. This Op uses a generic-typed shared variable to point to a CObject that encapsulates this opaque reference.

Each random variable is created with its generator set to False. The actual random number generator is allocated from the seed, on the first call to generate random numbers (see c_code).

Parameters:
  • output_type – A theano type (e.g. tensor.fvector).
  • seed (int) –
  • destructive – True or False (on the generator)

Notes

One caveat is that the random number state is simply not serializable. Consequently, attempts to serialize functions compiled with these random numbers will fail.

as_destructive()

Return a destructive version of self.

classmethod new_auto_update(generator, ndim, dtype, size, seed)

Return a symbolic sample from generator.

cls dictates the random variable (e.g. uniform, normal).

class theano.sandbox.cuda.rng_curand.CURAND_Normal(output_type, seed, destructive)

Op to draw normal numbers using CURAND.

class theano.sandbox.cuda.rng_curand.CURAND_RandomStreams(seed)

RandomStreams instance that creates CURAND-based random variables.

One caveat is that generators are not serializable.

Parameters:seed (int) –
next_seed()

Return a unique seed for initializing a random variable.

normal(size=None, avg=0.0, std=1.0, ndim=None, dtype='float64')

Return symbolic tensor of normally-distributed numbers.

Parameters:size – Can be a list of integers or a Theano variable (e.g. the shape of another Theano variable).
uniform(size, low=0.0, high=1.0, ndim=None, dtype='float64')

Return symbolic tensor of uniform numbers.

updates()

List of all (old, new) generator update pairs created by this instance.

class theano.sandbox.cuda.rng_curand.CURAND_Uniform(output_type, seed, destructive)

Op to draw uniform numbers using CURAND.
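
A hedged usage sketch of CURAND_RandomStreams, mirroring the methods documented above (remember the caveat that these generators are not serializable, so functions built with them cannot be pickled):

    import theano
    from theano.sandbox.cuda.rng_curand import CURAND_RandomStreams

    rng = CURAND_RandomStreams(seed=1234)
    u = rng.uniform(size=[3, 4], low=0.0, high=1.0, dtype='float32')
    n = rng.normal(size=[3, 4], avg=0.0, std=1.0, dtype='float32')

    # updates() lists the (old, new) generator pairs created so far; passing them
    # to theano.function advances the generator state on each call.
    f = theano.function([], [u, n], updates=rng.updates())
    samples_u, samples_n = f()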