BM3DCUDA

Abstract
Author	WolframRhodium
Version	test10 (58dbc0a)
Download	BM3DCUDA_AVS-test10.zip
Category	Denoisers
License	GPLv2
Discussion	Doom9 Forum

Description

BM3D denoising filter for AviSynth+, implemented in CUDA. Also includes a CPU version, implemented with AVX and AVX2 intrinsics, that serves as a reference implementation. However, bitwise-identical output is not guaranteed between the CPU and CUDA implementations.

Requirements

[x64]: AviSynth+
Supported color formats: 32-bit Y/YUV/RGB planar

CPU with AVX support (AVX2 required for BM3D_CPU).
CUDA-enabled GPU(s) of compute capability 5.0 or higher (Maxwell+).
GPU driver 450 or newer.

Syntax and Parameters

BM3D_CPU (clip, clip "ref", float[] "sigma", int[] "block_step", int[] "bm_range", int "radius", int[] "ps_num", int[] "ps_range", bool "chroma")

BM3D_CUDA (clip, clip "ref", float[] "sigma", int[] "block_step", int[] "bm_range", int "radius", int[] "ps_num", int[] "ps_range", bool "chroma", int "device_id", bool "fast", int "extractor_exp")

clip =

The input clip. Must be in planar 32-bit float format. Each plane is denoised separately if chroma is set to false.

clip ref =

The reference clip. Must have the same format, width, height, and number of frames as clip.

float[] sigma = [3.0,3.0,3.0]

The strength of denoising.

The strength is similar (but not strictly equal) to that of VapourSynth-BM3D due to differences in implementation (for example, coefficient normalization is not implemented).

int[] block_step = [8,8,8]

Sliding step to process every next reference block, valid range [1, 8].

Total number of reference blocks to be processed can be calculated approximately by (width / block_step) * (height / block_step).

A smaller step processes more reference blocks and is therefore slower.
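As a rough illustration of the formula above, the following Python sketch estimates the reference-block count for one plane. The function name approx_reference_blocks and the 1920x1080 plane size are illustrative assumptions and are not part of the plugin.

```python
def approx_reference_blocks(width: int, height: int, block_step: int) -> float:
    """Approximate count: (width / block_step) * (height / block_step)."""
    if not 1 <= block_step <= 8:
        raise ValueError("block_step must be in the range [1, 8]")
    return (width / block_step) * (height / block_step)


if __name__ == "__main__":
    # Hypothetical 1920x1080 plane: halving block_step roughly
    # quadruples the number of reference blocks to process.
    for step in (8, 4, 2, 1):
        print(step, approx_reference_blocks(1920, 1080, step))
```

This is only the approximation stated above; the exact count in the plugin may differ at the frame borders.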

int[] bm_range = [9,9,9]

Length of the side of the search neighborhood for block-matching.

The size of the search window is (bm_range * 2 + 1) x (bm_range * 2 + 1).

Larger values are slower but give more chances to find similar patches.
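To make the cost of a larger bm_range concrete, this small Python sketch evaluates the window-size formula above. The function name search_window is hypothetical, and treating every position in the window as a candidate is an assumption that gives an upper bound, not a statement about how the plugin searches.

```python
def search_window(bm_range: int) -> tuple:
    """Return (side length, number of positions) of the search window."""
    if bm_range < 0:
        raise ValueError("bm_range must be non-negative")
    side = bm_range * 2 + 1
    # Assumption: every position in the window is a candidate (upper bound).
    return side, side * side


if __name__ == "__main__":
    for r in (8, 9, 16):
        side, positions = search_window(r)
        print(f"bm_range={r}: {side} x {side} window, up to {positions} positions")
```

With the default bm_range of 9 the window is 19 x 19, so the number of positions grows with the square of the parameter.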

int radius = 0

The temporal radius for denoising, valid range [1, 16]. The default of 0 disables temporal filtering, so only spatial denoising is performed.

For each processed frame, (radius * 2 + 1) frames will be requested, and the filtering result will be returned to these frames by BM3D_VAggregate.

Increasing radius adds only a small computational cost in block-matching and aggregation and does not affect collaborative filtering, but memory consumption can grow quadratically.

Thus, feel free to use a large radius as long as you have enough RAM :D
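The frame-request arithmetic described above can be sketched in Python as follows. The function names are hypothetical, and the window is nominal: how the plugin handles frames near the start or end of the clip is not modeled here.

```python
def frames_requested(radius: int) -> int:
    """Number of frames requested per processed frame: radius * 2 + 1."""
    # 0 is the documented default; [1, 16] is the documented temporal range.
    if not 0 <= radius <= 16:
        raise ValueError("radius must be in the range [0, 16]")
    return radius * 2 + 1


def nominal_temporal_window(n: int, radius: int) -> list:
    """Frame indices n - radius .. n + radius (clip-edge handling not modeled)."""
    return list(range(n - radius, n + radius + 1))


if __name__ == "__main__":
    print(frames_requested(2))             # frames needed for radius=2
    print(nominal_temporal_window(10, 2))  # nominal window around frame 10
```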

int[] ps_num = [2,2,2]

The number of matched locations used for predictive search, valid range [1, 8].

Larger values increase the chance of matching more similar blocks at a small additional computational cost. However, the original MATLAB implementation of V-BM3D fixes it to 2 for all profiles except "lc", which suggests that a larger value may not always improve quality.

int[] ps_range = [4,4,4]

Length of the side of the search neighborhood for predictive-search block-matching, valid range [1, +inf).

Note: the parameters sigma, block_step, bm_range, ps_num, and ps_range are arrays. If chroma is set to true, only the first value takes effect. Otherwise, a separate value may be specified for each plane. radius is not an array and always applies to all planes.
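The rule in the note above can be summarized with a short Python sketch. The helper effective_values is purely illustrative and is not part of the plugin; it only mirrors the documented behavior for a fully specified array and makes no claim about how shorter arrays are expanded.

```python
def effective_values(values, chroma: bool) -> list:
    """Return the array entries that take effect for a given chroma setting."""
    if not values:
        raise ValueError("at least one value is required")
    if chroma:
        # With chroma enabled, only the first value takes effect.
        return [values[0]]
    # Otherwise each plane uses its own entry.
    return list(values)


if __name__ == "__main__":
    sigma = [3.0, 1.0, 1.0]  # hypothetical per-plane strengths
    print(effective_values(sigma, chroma=True))
    print(effective_values(sigma, chroma=False))
```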

bool chroma = false

Enables the CBM3D algorithm. The input clip must be in YUV444PS format.

The Y channel is used for block-matching of the chroma channels.

int device_id = 0

Sets the GPU to be used.

bool fast = true

Enables multi-threaded copying between CPU and GPU at the expense of 4x memory consumption.

int extractor_exp = 0

Used for deterministic (bitwise) output. This parameter is not present in the CPU version, since that implementation always produces deterministic output.

Pre-rounding is employed for associative floating-point summation.

To enable this feature, set the value to a positive integer not less than 3; it may need to be higher depending on the source video and filter parameters. The default of 0 leaves it disabled.
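For readers unfamiliar with pre-rounding, the following Python sketch shows the general idea in double precision. It is not the plugin's code: the mapping from the exponent to an extractor M = 2**extractor_exp, the function names, and the random test data are all assumptions made for illustration, and the GPU implementation works in single precision.

```python
import math
import random


def pre_round(x: float, extractor_exp: int) -> float:
    """Round x to a multiple of the extractor's ulp via (x + M) - M."""
    extractor = math.ldexp(1.0, extractor_exp)  # assumed M = 2**extractor_exp
    return (x + extractor) - extractor


def pre_rounded_sum(values, extractor_exp: int) -> float:
    """Sum pre-rounded values; exact while the running total stays below M."""
    total = 0.0
    for v in values:
        total += pre_round(v, extractor_exp)
    return total


if __name__ == "__main__":
    random.seed(0)
    data = [random.random() for _ in range(1000)]  # hypothetical inputs in [0, 1)
    shuffled = data[:]
    random.shuffle(shuffled)
    # Plain summation may depend on the order of the terms.
    print(sum(data) == sum(shuffled))
    # Pre-rounded summation is order-independent for these inputs.
    print(pre_rounded_sum(data, 10) == pre_rounded_sum(shuffled, 10))
```

After pre-rounding, every term is a multiple of the same power of two, so additions are exact and the order of summation no longer matters, at the cost of a small rounding error per term. In this sketch the exponent has to be large enough for the running total to stay below M, which is one plausible reason the required value depends on the source and the filter parameters.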

BM3D_VAggregate

BM3D_VAggregate should be called after temporal filtering.

BM3D_VAggregate (clip, int "radius")

clip =

The input clip. Must be in 32-bit float format.

int radius = 0

Same as in BM3D_CPU / BM3D_CUDA.

Examples

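DGSource("sample.dgi")
ConvertBits(bits=32)            # the filter requires 32-bit float input
BM3D_CUDA(sigma=0.5, radius=2)  # temporal filtering (radius > 0)
BM3D_VAggregate(radius=2)       # aggregate the temporal results, same radius
ConvertBits(bits=16)            # convert back to 16-bit for further processing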

Changelog

Version      Date            Changes

test10       2023/01/25      - update to cuda 12
                             - bug fixes
                             - add support for Ada Lovelace
                             - remove support for Kepler and x86
test9        2022/07/15      - fix temporal padding (x86 build is being deprecated)
test8        2022/02/15      - remove avx
test7        2022/02/14      - fix performance regression introduced in test6
                             - restore avx on win64 and msvc rt dynamic linking
test6        2022/02/14      - add support for cc 3.5, link to the static msvc rt, remove avx
                             - (don't use; accidentally built with debug config)
test5        2021/10/16      - bm_range now defaults to 9 instead of 8, fixes parameter check
test4        2021/09/08      - fix array parameter
test3        2021/08/06      - Separates VAggregate and compiles for AVX
test2        2021/08/01      - CPU version
test1        2021/07/25      - Initial release

External Links

GitHub - Source code repository.
