BM3DCUDA
Abstract | |
---|---|
Author | WolframRhodium |
Version | test10 (58dbc0a) |
Download | BM3DCUDA_AVS-test10.zip |
Category | Denoisers |
License | GPLv2 |
Discussion | Doom9 Forum |
Description
BM3D denoising filter for AviSynth+, implemented in CUDA. It also includes a CPU version, implemented with AVX and AVX2 intrinsics, that serves as a reference implementation. However, bitwise-identical output is not guaranteed across the CPU and CUDA implementations.
Requirements
- CPU with AVX support (AVX2 required for BM3D_CPU).
- CUDA-enabled GPU(s) of compute capability 5.0 or higher (Maxwell+).
- GPU driver 450 or newer.
Syntax and Parameters
BM3D_CPU (clip, clip "ref", float[] "sigma", int[] "block_step", int[] "bm_range", int "radius", int[] "ps_num", int[] "ps_range", bool "chroma")
BM3D_CUDA (clip, clip "ref", float[] "sigma", int[] "block_step", int[] "bm_range", int "radius", int[] "ps_num", int[] "ps_range", bool "chroma", int "device_id", bool "fast", int "extractor_exp")
- clip =
- The input clip. Must be planar 32 bit float format. Each plane is denoised separately if chroma is set to False.
- clip ref =
- The reference clip. Must be of the same format, width, height, number of frames as clip.
- float[] sigma = [3.0,3.0,3.0]
- The strength of denoising.
- The strength is similar to, but not strictly equal to, that of VapourSynth-BM3D due to differences in implementation (coefficient normalization is not implemented, for example).
- int[] block_step = [8,8,8]
- Sliding step between successive reference blocks, valid range [1, 8].
- The total number of reference blocks to be processed is approximately (width / block_step) * (height / block_step).
- A smaller step processes more reference blocks and is slower.
- int[] bm_range = [9,9,9]
- Length of the side of the search neighborhood for block-matching.
- The size of search window is (bm_range * 2 + 1) x (bm_range * 2 + 1).
- Larger values are slower but give more chances to find similar patches.
- int radius = 0
- The temporal radius for denoising, valid range [1, 16]. The default of 0 disables temporal denoising.
- For each processed frame, (radius * 2 + 1) frames will be requested, and the filtering result will be returned to these frames by BM3D_VAggregate.
- Increasing radius adds only a small computational cost in block-matching and aggregation and does not affect collaborative filtering, but memory consumption can grow quadratically.
- Thus, feel free to use a large radius as long as your RAM is large enough :D
- int[] ps_num = [2,2,2]
- The number of matched locations used for predictive search, valid range [1, 8].
- A larger value increases the chance of matching more similar blocks, at a small additional computational cost. However, the original MATLAB implementation of V-BM3D fixes it to 2 for all profiles except "lc", so a larger value may not always improve quality.
- int[] ps_range = [4,4,4]
- Length of the side of the search neighborhood for predictive-search block-matching, valid range [1, +inf).
- Note: parameters sigma, block_step, bm_range, ps_num, and ps_range are arrays. If chroma is set to True, only the first value is in effect. Otherwise an array of values may be specified for each plane (except radius).
- bool chroma = false
- CBM3D algorithm. Input clip must be of YUV444PS format.
- Y channel is used in block-matching of chroma channels.
- int device_id = 0
- Set GPU to be used.
- bool fast = true
- Multi-threaded copy between CPU and GPU at the expense of 4x memory consumption.
- int extractor_exp = 0
- Used for deterministic (bitwise) output. This parameter is not present in the CPU version, since that implementation always produces deterministic output.
- Pre-rounding is employed for associative floating-point summation.
- The value should be a positive integer not less than 3, and may need to be higher depending on the source video and filter parameters.
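The idea behind pre-rounding can be illustrated outside the plugin. Floating-point addition is not associative, so a sum accumulated in a different order (as can happen in a parallel reduction) may differ in its last bits. Rounding every term to a common power-of-two grid first makes each addition exact, so the result no longer depends on the order. The following is a minimal Python sketch under stated assumptions: it uses 64-bit floats rather than the 32-bit floats the filter works on, and the function name `pre_round`, its `exp` argument, and the constant `1.5 * 2**exp` are illustrative choices, not the plugin's actual code or the exact meaning of extractor_exp.

```python
import random


def pre_round(x: float, exp: int) -> float:
    """Round x to a fixed power-of-two grid (hypothetical helper).

    Adding and then subtracting a large constant discards the low-order
    bits of x. Assumes |x| is much smaller than 2**exp.
    """
    extractor = 1.5 * 2.0 ** exp  # illustrative constant, not the plugin's
    return (x + extractor) - extractor


def summed(values, order):
    """Accumulate values[i] for i in the given order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Plain floating-point addition depends on grouping.
not_associative = (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)

rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
forward = list(range(len(values)))
shuffled = forward[:]
rng.shuffle(shuffled)

# After pre-rounding, every term is a multiple of 2**-32 and every partial
# sum is exactly representable, so the order of summation does not matter.
rounded = [pre_round(v, 20) for v in values]
same_after_rounding = summed(rounded, forward) == summed(rounded, shuffled)

print(not_associative, same_after_rounding)
```

In this sketch a larger exponent gives a coarser grid and more headroom before sums stop being exact, at the price of a larger rounding error per term. That trade-off may be why the required value depends on the source video and filter parameters.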
BM3D_VAggregate
BM3D_VAggregate should be called after temporal filtering.
BM3D_VAggregate (clip, int "radius")
- clip =
- The input clip. Must be of 32 bit float format.
- int radius = 0
- Same as radius in BM3D_CPU and BM3D_CUDA.
Examples
DGSource("sample.dgi")
ConvertBits(bits=32)
BM3D_CUDA(sigma=0.5, radius=2)
BM3D_VAggregate(radius=2)
ConvertBits(bits=16)
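The sizing formulas given in the parameter descriptions can be evaluated directly to get a rough feel for how block_step, bm_range, and radius scale the workload. This is a minimal Python sketch: the helper names are hypothetical and not part of the plugin, the 1920x1080 frame size is an arbitrary example, and the block count is the same approximation the parameter description gives, not an exact figure.

```python
def approx_reference_blocks(width: int, height: int, block_step: int) -> float:
    """Approximate number of reference blocks per plane."""
    return (width / block_step) * (height / block_step)


def search_window_side(bm_range: int) -> int:
    """Side length of the block-matching search window."""
    return bm_range * 2 + 1


def frames_requested(radius: int) -> int:
    """Frames requested for each processed frame in temporal mode."""
    return radius * 2 + 1


# Defaults (block_step=8, bm_range=9) on a 1920x1080 plane.
print(approx_reference_blocks(1920, 1080, 8))  # 32400.0
print(search_window_side(9))                   # 19, i.e. a 19 x 19 window

# Halving block_step roughly quadruples the number of reference blocks.
print(approx_reference_blocks(1920, 1080, 4))  # 129600.0

# radius=2, as in the script above.
print(frames_requested(2))                     # 5
```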
Changelog
Version | Date | Changes |
---|---|---|
test10 | 2023/01/25 | update to cuda 12; bug fixes; add support for Ada Lovelace; remove support for Kepler and x86 |
test9 | 2022/07/15 | fix temporal padding (x86 build is being deprecated) |
test8 | 2022/02/15 | remove avx |
test7 | 2022/02/14 | fix performance regression introduced in test6; restore avx on win64 and msvc rt dynamic linking |
test6 | 2022/02/14 | add support for cc 3.5, link to the static msvc rt, remove avx (don't use; accidentally built with debug config) |
test5 | 2021/10/16 | bm_range now defaults to 9 instead of 8; fix parameter check |
test4 | 2021/09/08 | fix array parameter |
test3 | 2021/08/06 | separate VAggregate and compile for AVX |
test2 | 2021/08/01 | CPU version |
test1 | 2021/07/25 | initial release |
External Links
- GitHub - Source code repository.