BM3DCUDA

From Avisynth wiki
Abstract
Author: WolframRhodium
Version: test10 (58dbc0a)
Download: BM3DCUDA_AVS-test10.zip
Category: Denoisers
License: GPLv2
Discussion: Doom9 Forum

Description

BM3D denoising filter for AviSynth+, implemented in CUDA. A CPU version, implemented with AVX and AVX2 intrinsics, is also included and serves as a reference implementation. However, bitwise-identical output is not guaranteed between the CPU and CUDA implementations.

Requirements

  • CPU with AVX support (AVX2 required for BM3D_CPU).
  • CUDA-enabled GPU(s) of compute capability 5.0 or higher (Maxwell+).
  • GPU driver 450 or newer.


Syntax and Parameters

BM3D_CPU (clip, clip "ref", float[] "sigma", int[] "block_step", int[] "bm_range", int "radius", int[] "ps_num", int[] "ps_range", bool "chroma")

BM3D_CUDA (clip, clip "ref", float[] "sigma", int[] "block_step", int[] "bm_range", int "radius", int[] "ps_num", int[] "ps_range", bool "chroma", int "device_id", bool "fast", int "extractor_exp")


clip   =
The input clip. Must be in a planar 32-bit float format. Each plane is denoised separately if chroma is set to false.


clip  ref =
The reference clip. Must have the same format, width, height, and number of frames as clip.


float[]  sigma = [3.0,3.0,3.0]
The strength of denoising.
The strength is similar (but not strictly equal) to that of VapourSynth-BM3D due to differences in implementation (coefficient normalization is not implemented, for example).


int[]  block_step = [8,8,8]
Sliding step between consecutive reference blocks, valid range [1, 8].
The total number of reference blocks to be processed is approximately (width / block_step) * (height / block_step).
A smaller step processes more reference blocks and is therefore slower.
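
To get a feel for how block_step scales the workload, the approximation above can be evaluated directly. This is a minimal sketch in Python, not part of the plugin; the function name approx_reference_blocks is hypothetical, and integer division is an assumption, since the formula is only stated as approximate.

```python
def approx_reference_blocks(width: int, height: int, block_step: int) -> int:
    """Approximate reference-block count: (width / block_step) * (height / block_step)."""
    if not 1 <= block_step <= 8:
        raise ValueError("block_step must be in [1, 8]")
    # Integer division is an assumption; the documented formula is approximate.
    return (width // block_step) * (height // block_step)


# Compare the default step with smaller steps on a 1920x1080 plane.
for step in (8, 4, 1):
    print(step, approx_reference_blocks(1920, 1080, step))
```

Halving block_step roughly quadruples the number of reference blocks, which is why small steps are noticeably slower.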


int[]  bm_range = [9,9,9]
Length of the side of the search neighborhood for block-matching.
The size of search window is (bm_range * 2 + 1) x (bm_range * 2 + 1).
A larger value is slower but gives more chances of finding similar patches.
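
The window-size formula can be checked with a few lines of Python. This is only an illustrative sketch; the function name search_window is hypothetical and the code does not reflect how the plugin walks the window internally.

```python
def search_window(bm_range: int) -> tuple[int, int]:
    """Return (side, area) of the search window, with side = bm_range * 2 + 1."""
    if bm_range < 0:
        raise ValueError("bm_range must be non-negative")
    side = bm_range * 2 + 1
    return side, side * side


side, area = search_window(9)  # default bm_range
print(f"{side} x {side} window, {area} positions")
```

Because the area grows with the square of the side, raising bm_range from 9 to 18 moves from a 19 x 19 window to a 37 x 37 one, close to four times as many positions.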


int  radius = 0
The temporal radius for denoising, valid range [1, 16]. The default of 0 disables temporal filtering, so only spatial denoising is performed.
For each processed frame, (radius * 2 + 1) frames are requested, and the filtering result is returned to these frames by BM3D_VAggregate.
Increasing radius adds only a small computational cost in block-matching and aggregation and does not affect collaborative filtering, but memory consumption can grow quadratically.
Thus, feel free to use a large radius as long as your RAM is large enough :D
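
The (radius * 2 + 1) rule can be sketched as follows. This is a minimal Python illustration with a hypothetical function name, temporal_window; how the plugin handles frames near the start or end of the clip is not described here, so the indices are deliberately left unclamped.

```python
def temporal_window(n: int, radius: int) -> list[int]:
    """Frame indices n - radius .. n + radius, i.e. radius * 2 + 1 frames in total."""
    if not 0 <= radius <= 16:
        raise ValueError("radius must be in [0, 16]")
    # Unclamped on purpose: clip-boundary handling is not modeled in this sketch.
    return list(range(n - radius, n + radius + 1))


print(temporal_window(10, 2))  # the five frames centred on frame 10
```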


int[]  ps_num = [2,2,2]
The number of matched locations used for predictive search, valid range [1, 8].
A larger value increases the chance of matching more similar blocks, with a tiny increase in computational cost. However, in the original MATLAB implementation of V-BM3D it is fixed to 2 for all profiles except "lc", so a larger value may not always be better for quality.


int[]  ps_range = [4,4,4]
Length of the side of the search neighborhood for predictive-search block-matching, valid range [1, +inf).


Note: the parameters sigma, block_step, bm_range, ps_num, and ps_range are arrays. If chroma is set to true, only the first value is in effect. Otherwise, a separate value may be specified for each plane. radius is not an array; it takes a single value.


bool  chroma = false
Enables the CBM3D algorithm. The input clip must be in YUV444PS format.
The Y channel is used in block-matching of the chroma channels.


int  device_id = 0
Index of the GPU to be used.


bool  fast = true
Enables multi-threaded copying between CPU and GPU at the expense of 4x memory consumption.


int  extractor_exp = 0
Used for deterministic (bitwise-reproducible) output. This parameter is not present in the CPU version, since that implementation always produces deterministic output.
Pre-rounding is employed to make floating-point summation associative.
The value should be a positive integer not less than 3, and may need to be higher depending on the source video and filter parameters.
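
The idea behind pre-rounding can be illustrated outside the plugin. Floating-point addition is not associative, so summing the same values in a different order (as parallel GPU threads may do) can change the last bits of the result. If every value is first rounded to a common fixed grid, the additions become exact and the order no longer matters. The sketch below is a Python illustration in 64-bit floats under stated assumptions: grid_exp, preround, and prerounded_sum are hypothetical names, and the relationship between grid_exp and the plugin's extractor_exp is not specified here.

```python
import math


def preround(x: float, grid_exp: int) -> float:
    """Round x to the nearest multiple of 2**-grid_exp (hypothetical grid)."""
    scale = math.ldexp(1.0, grid_exp)  # exactly 2**grid_exp
    return round(x * scale) / scale


def prerounded_sum(values, grid_exp: int = 20) -> float:
    """Sum values after snapping each one to the same fixed grid.

    The additions are exact as long as the running total stays well
    below 2**(53 - grid_exp), so the order of summation is irrelevant.
    """
    total = 0.0
    for v in values:
        total += preround(v, grid_exp)
    return total


values = [0.1, 0.2, 0.3]
# Plain summation depends on the order of the operands.
plain_differs = sum(values) != sum(reversed(values))
# Pre-rounded summation gives the same bits in either order.
fixed_matches = prerounded_sum(values) == prerounded_sum(reversed(values))
print(plain_differs, fixed_matches)
```

The trade-off is precision against range: a coarser grid tolerates larger sums but discards more low-order bits, which is consistent with the note that the setting may need adjusting for a given source and filter parameters.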


BM3D_VAggregate

BM3D_VAggregate should be called after temporal filtering.

BM3D_VAggregate (clip, int "radius")

clip   =
The input clip. Must be in a 32-bit float format.


int  radius = 0
Same as the radius passed to BM3D_CPU or BM3D_CUDA.


Examples

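DGSource("sample.dgi")            # load the source clip
ConvertBits(bits=32)              # convert to 32-bit float, as the filter requires
BM3D_CUDA(sigma=0.5, radius=2)    # temporal denoising with radius 2
BM3D_VAggregate(radius=2)         # aggregate the temporal results, same radius
ConvertBits(bits=16)              # convert back to 16-bit integer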


Changelog

Version   Date         Changes
test10    2023/01/25   - update to cuda 12
                       - bug fixes
                       - add support for Ada Lovelace
                       - remove support for Kepler and x86
test9     2022/07/15   - fix temporal padding (x86 build is deprecating)
test8     2022/02/15   - remove avx
test7     2022/02/14   - fix performance regression introduced in test6
                       - restore avx on win64 and msvc rt dynamic linking
test6     2022/02/14   - add support for cc 3.5, links to the static msvc rt, remove avx
                       - (don't use; accidentally build with debug config)
test5     2021/10/16   - bm_range now defaults to 9 instead of 8, fixes parameter check
test4     2021/09/08   - fix array parameter
test3     2021/08/06   - Separates VAggregate and compiles for AVX
test2     2021/08/01   - CPU version
test1     2021/07/25   - Initial release


External Links

  • GitHub - Source code repository.



