AVSTP

Abstract
Author	Firesledge (aka cretindesalpes)
Version	v1.0.4.1
Download	avstp-1.0.4.1.7z
Category	Multi-threading support, development tools
License	WTFPLv2
Discussion	Doom9 Thread

Description

AVSTP (Avisynth Thread Pool) is a programming library for Avisynth plug-in developers. It helps support native multi-threading in plug-ins. It works by sharing a thread pool between multiple plug-ins, so the number of threads stays low regardless of the number of instantiated plug-ins. This helps save resources, especially when working in an Avisynth MT environment. This documentation is mostly targeted at plug-in developers, but contains installation instructions for Avisynth users too.

Requirements

[x86] AviSynth+ or AviSynth 2.60
[x64] AviSynth+

Using AVSTP

From the Avisynth user's point of view, an AVSTP-enabled plug-in requires the avstp.dll file to be installed. Put it in the usual AviSynth 2.5\plugins\ directory, or load it manually with LoadPlugin("path\avstp.dll") in your script. The dll is shared between all plug-ins using AVSTP, so keep only one avstp.dll file in your plug-in set. If you’re updating from a previous version, make sure that Avisynth can only access the latest one.

If a plug-in requiring AVSTP cannot find the dll file, it could crash, emit an error, or fall back gracefully on a single-threaded mode, depending on its design and implementation. There is no mandatory or pre-defined behaviour for such a case.

The number of threads is automatically set to the number of available logical processors. The thread count can also be controlled via an Avisynth function, so multi-threading can be disabled globally if not desired.

avstp_set_threads

avstp_set_threads (var c, int n)

This function sets the global number of AVSTP threads running in the Avisynth process. The call is optional. When n is set to 1, AVSTP is single-threaded. When n is set to 0, AVSTP uses the number of detected logical processors, which is the default when the function is not called. To work smoothly, the function should preferably be called only once, at the end of the main script, just before returning the final clip. Behaviour is unspecified if it is called multiple times, or from the Avisynth runtime system (ScriptClip, ConditionalFilter, etc.). The c parameter is ignored and is defined only to make the function work as desired when applied to a clip.

Implementation note: the number of additional instantiated threads is equal to n − 1. Actually, the main program thread is used as a worker thread too, making the total number of working threads equal to n. This is because the tasks are supposed to be globally synchronous, not totally asynchronous like with a generic thread pool system.

Programming with AVSTP

Concept

AVSTP provides plug-ins with a thread pool, an object that can perform tasks asynchronously. This means you just need to split the main task into several smaller tasks able to work in parallel, send them all to the thread pool and wait for their completion. The thread pool will handle the job queue and dispatch the tasks to the available worker threads.

Source code content

The src directory contains the following items:

avstp: source code to build avstp.dll
avstp_example: a simple Avisynth plugin (average of two clips) showing how to use the AVSTP helpers.
common: miscellaneous C++ helper classes and wrappers on the top of the main AVSTP API.

Overview

If you use the main API, add avstp.lib to your project, or load the dll manually on startup to resolve the avstp_* symbols. avstp.h should be the only included file you’ll need for the basic operations.

If your programming language is C++, you can also use the AvstpWrapper singleton class to take care of finding the library, loading it and resolving the symbols. It will use a single-threaded mode if the library cannot be found. This method is recommended to access the API.

MTSlicer is a simple class for easily splitting a frame into horizontal bands, each processed by a thread. This method is suited to a majority of plug-ins. See the next section for more information. Another class, MTFlowGraphSched, can handle a simple graph of dependencies between tasks.

Also, the conc namespace offers a set of tools for dealing with data in a concurrent environment while minimizing the locks. More specifically, lock-free pools of objects are useful to handle temporary recyclable storage for the threaded tasks. See ObjPool and ObjFactoryInterface.
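
To illustrate the idea of recyclable temporary storage, here is a minimal sketch of a fixed-size buffer pool that tasks can use without blocking. It is not the conc::ObjPool implementation and does not reproduce its interface: BufPool and its acquire(), use_buf() and release() members are hypothetical names, and the pool relies on std::atomic flags from the C++ standard library.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a pool of recyclable buffers. Each cell has a
// flag telling whether a task is currently using its buffer.
class BufPool
{
private:
   struct Cell
   {
      std::atomic <bool>  _used { false };
      std::vector <float> _buf;
   };
   std::vector <Cell> _cell_arr;

public:
   // All the buffers are allocated once, here, and never in the tasks.
   BufPool (int nbr_buf, std::size_t len)
   :  _cell_arr (nbr_buf)
   {
      for (Cell &cell : _cell_arr)
      {
         cell._buf.resize (len);
      }
   }

   // Returns the index of a free buffer, or -1 if all are in use.
   // Never blocks: exchange() atomically claims the first free cell found.
   int acquire ()
   {
      for (std::size_t k = 0; k < _cell_arr.size (); ++k)
      {
         if (! _cell_arr [k]._used.exchange (true))
         {
            return int (k);
         }
      }
      return -1;
   }

   std::vector <float> & use_buf (int index)
   {
      return _cell_arr [index]._buf;
   }

   // Gives the buffer back so another task can recycle it.
   void release (int index)
   {
      _cell_arr [index]._used.store (false);
   }
};
```

If the pool holds at least as many buffers as there are worker threads, and each task holds at most one buffer at a time, acquire() should always find a free cell.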

A typical use of the low-level API looks like this:

Call avstp_create_dispatcher() at the beginning of a frame to process, or do it once in the filter constructor
Process the frame using this dispatcher
Call avstp_destroy_dispatcher() when done (or in the destructor).

Each threaded operation looks like this:

Initialize the tasks with the source data
Call avstp_enqueue_task() on all these tasks
Call avstp_wait_completion() to wait for synchronization.

A task is made of two entities:

A static function performing the task
A pointer to your private data structure.
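
The steps above can be sketched as follows. So that the example compiles on its own, the four avstp_* functions are replaced with stand-ins in a mock namespace that run each task immediately, as AVSTP does in single-threaded mode. The task function signature (dispatcher pointer plus user data pointer) is an assumption to check against avstp.h, and FillTask, fill_range() and fill_in_parallel() are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Stand-ins for the avstp.h entry points (assumptions, not the real library).
namespace mock
{
struct avstp_TaskDispatcher { int _nbr_done; };
typedef void (*avstp_TaskPtr) (avstp_TaskDispatcher *td_ptr, void *user_data_ptr);

avstp_TaskDispatcher * avstp_create_dispatcher ()
{
   return new avstp_TaskDispatcher { 0 };
}
void avstp_destroy_dispatcher (avstp_TaskDispatcher *td_ptr)
{
   delete td_ptr;
}
int avstp_enqueue_task (avstp_TaskDispatcher *td_ptr, avstp_TaskPtr task_ptr, void *user_data_ptr)
{
   task_ptr (td_ptr, user_data_ptr);   // The task completes before we return
   ++ td_ptr->_nbr_done;
   return 0;
}
int avstp_wait_completion (avstp_TaskDispatcher *)
{
   return 0;
}
}

// Private data of one task: a half-open range [_beg ; _end) to fill.
struct FillTask
{
   std::vector <int> * _buf_ptr;
   int            _beg;
   int            _end;
};

// The static function performing the task.
static void fill_range (mock::avstp_TaskDispatcher *, void *user_data_ptr)
{
   FillTask &     task = *static_cast <FillTask *> (user_data_ptr);
   for (int i = task._beg; i < task._end; ++i)
   {
      (*task._buf_ptr) [i] = i * 2;
   }
}

// Creates a dispatcher, enqueues nbr_tasks tasks (nbr_tasks >= 1), waits for
// them and releases the dispatcher. Returns a negative value on failure.
int fill_in_parallel (std::vector <int> &buf, int nbr_tasks)
{
   mock::avstp_TaskDispatcher * td_ptr = mock::avstp_create_dispatcher ();
   if (td_ptr == 0)
   {
      return -1;
   }

   // The task data must stay alive and unmoved until the wait returns.
   std::vector <FillTask> task_arr (nbr_tasks);
   const int      len     = int (buf.size ());
   int            ret_val = 0;
   for (int t = 0; t < nbr_tasks && ret_val >= 0; ++t)
   {
      task_arr [t] = { &buf, len * t / nbr_tasks, len * (t + 1) / nbr_tasks };
      ret_val = mock::avstp_enqueue_task (td_ptr, &fill_range, &task_arr [t]);
   }

   // Wait even after a failed enqueue: earlier tasks may still be running.
   const int      wait_ret = mock::avstp_wait_completion (td_ptr);
   mock::avstp_destroy_dispatcher (td_ptr);

   return (ret_val < 0) ? ret_val : wait_ret;
}
```

Each task writes only to its own range of the buffer, so no synchronization is needed between the tasks themselves.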

Notes

It’s important to always have in mind that the tasks launched in parallel can be started and finished in any order. A task could even be completed before avstp_enqueue_task() returns. Actually this is the case when AVSTP works in single-threaded mode.

Be very careful when sharing data between tasks. Avoid multiple tasks writing to the same location. It will cause inconsistencies and kill performance.

Don’t call waiting functions (mutex, disk I/O…) from a task. The thread would stall, wasting CPU time. For maximum performance, the tasks should only perform number crunching. Use separate threads to perform blocking operations.

If you want to make your plug-in compatible with MT mode 1, that is multiple threads accessing simultaneously a single filter instance, create the dispatcher dynamically. Do not store it in the class members, nor any other temporary resource. Otherwise you’ll be limited to modes 2 and above (but nobody expects plug-ins to run in mode 1 anyway…)

Dispatchers are not destroyed, they are stored for further recycling. This guarantees that creations and destructions of dispatchers are almost free of overhead after the first run.

The AVSTP low-level API uses locks only in the internal thread pool construction, when setting the number of threads and when the communication system runs out of free message cells. There shouldn’t be any memory allocation except in these cases.

If you’re going to use floating point data, always start the task with an _mm_empty() instruction. Indeed, there is no guarantee on the state of the FP/MMX registers, since the thread could have been previously used by other tasks using the MMX instruction set without restoring the registers. And if you’re using the MMX instruction set, it is good practice to always call _mm_empty() before leaving a task.
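
Two of the notes above, private write locations and the _mm_empty() call at the start of a floating point task, can be sketched together. The example uses std::thread in place of AVSTP tasks so it stays self-contained. The 64-byte padding is an assumed cache line size, and Slot, reset_fp_mmx_state() and sum_in_tasks() are hypothetical names. The MMX header is only included where the compiler advertises it.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

#if defined (__MMX__) || defined (_M_IX86)
   #include <mmintrin.h>
   #define SKETCH_HAS_MMX 1
#else
   #define SKETCH_HAS_MMX 0
#endif

// Puts the FP/MMX registers back into a state usable for floating point.
static inline void reset_fp_mmx_state ()
{
#if SKETCH_HAS_MMX
   _mm_empty ();
#endif
}

// One result slot per task. The padding keeps two slots at least 64 bytes
// apart (assumed cache line size) so tasks never write to the same line.
struct Slot
{
   double         _sum;
   char           _pad [64 - sizeof (double)];
};

double sum_in_tasks (const std::vector <float> &data, int nbr_tasks)
{
   std::vector <Slot>        slot_arr (nbr_tasks);   // Zero-initialized
   std::vector <std::thread> thread_arr;
   const std::size_t len = data.size ();

   for (int t = 0; t < nbr_tasks; ++t)
   {
      const std::size_t beg = len *  std::size_t (t)      / nbr_tasks;
      const std::size_t end = len * (std::size_t (t) + 1) / nbr_tasks;
      thread_arr.emplace_back ([&data, &slot_arr, t, beg, end] ()
      {
         reset_fp_mmx_state ();      // First thing in a floating point task

         double         acc = 0;     // Local accumulator, private to the task
         for (std::size_t i = beg; i < end; ++i)
         {
            acc += data [i];
         }
         slot_arr [t]._sum = acc;    // Single write, to this task's own slot
      });
   }
   for (std::thread &th : thread_arr)
   {
      th.join ();
   }

   // The merge is done once, after all the tasks have finished.
   double         total = 0;
   for (const Slot &slot : slot_arr)
   {
      total += slot._sum;
   }
   return total;
}
```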

Simple use

Most filters can work by computing independently single pixels or small blocks of the destination frame. These filters can be multithreaded by splitting the frame into as many horizontal bands as there are threads. This is similar to the MT() function, but much more flexible because the processing functions can access the whole input frames, not just a window corresponding to the rendered band.

Filters that first require a global evaluation (for example computing the average luma of the input) don’t fall directly into this category, but may be easily transformed to fit the concept. For example, you can use a conc::AtomicInt<int64_t> to store the global luma sum, which will be updated at the end of each band processing. Then use a second pass of band splitting to do the actual process using the calculated average. For more complex data partitioning and merging, you can use the map/reduce concept.
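
The two-pass approach can be sketched as follows, assuming std::atomic<int64_t> as a stand-in for conc::AtomicInt<int64_t> and std::thread in place of the AVSTP slicer. The names for_each_band() and remove_average() are hypothetical, and the plane is assumed to be stored contiguously (stride equal to the width).

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Runs fnc (y_beg, y_end) on nbr_bands horizontal bands, one thread each,
// and returns when all the bands are done (stand-in for start + wait).
template <class F>
void for_each_band (int height, int nbr_bands, F fnc)
{
   std::vector <std::thread> thread_arr;
   for (int b = 0; b < nbr_bands; ++b)
   {
      const int      y_beg = height *  b      / nbr_bands;
      const int      y_end = height * (b + 1) / nbr_bands;
      thread_arr.emplace_back (fnc, y_beg, y_end);
   }
   for (std::thread &th : thread_arr)
   {
      th.join ();
   }
}

// Subtracts the average luma from every pixel, clamping at 0.
// Returns the computed average.
int remove_average (std::vector <uint8_t> &plane, int w, int h, int nbr_bands)
{
   std::atomic <int64_t> luma_sum (0);

   // Pass 1: each band sums locally, then updates the global sum only once.
   for_each_band (h, nbr_bands, [&] (int y_beg, int y_end)
   {
      int64_t        band_sum = 0;
      for (int i = y_beg * w; i < y_end * w; ++i)
      {
         band_sum += plane [i];
      }
      luma_sum += band_sum;
   });

   const int      avg = int (luma_sum.load () / (int64_t (w) * h));

   // Pass 2: the actual processing, using the average from pass 1.
   for_each_band (h, nbr_bands, [&plane, w, avg] (int y_beg, int y_end)
   {
      for (int i = y_beg * w; i < y_end * w; ++i)
      {
         plane [i] = uint8_t (std::max (int (plane [i]) - avg, 0));
      }
   });

   return avg;
}
```

Updating the atomic once per band rather than once per pixel keeps contention on the shared sum negligible.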

A sliced-frame example

Here is a step-by-step guide to build a multi-threaded filter with the MTSlicer class. We’ll assume you’re already familiar with single-threaded Avisynth filter programming. The following code is taken from the avstp_example project, which is a plug-in averaging the two input clips.

Declarations

First, the include files. There are the usual Windows.h and avisynth.h, plus the AVSTP-related files. Nothing fancy here, except for the NOMINMAX definition, which must be placed before the Windows header inclusion. Indeed, the Windows min and max macro definitions interfere with the standard std::min and std::max template functions declared in <algorithm>. The latter functions are used in the library.

#define NOMINMAX
#define WIN32_LEAN_AND_MEAN
#include "Windows.h"
#include "avisynth.h"
#include "avstp.h"
#include "MTSlicer.h"

In the filter class definition, we’ll have to declare a structure containing data that all the tasks may need. We’ll put here mainly the current frame structure information. You can put any data related to the current frame (input and output). However _this_ptr is a mandatory member and must be declared as a pointer to your filter class.

   class TaskDataGlobal
   {
   public:
      AvstpAverage * _this_ptr;
      ::BYTE *       _dst_ptr;
      const ::BYTE * _sr1_ptr;
      const ::BYTE * _sr2_ptr;
      int            _stride_dst;
      int            _stride_sr1;
      int            _stride_sr2;
      int            _w;
   };

Separating the global frame data from the main filter class is MT-mode-1-safe. If compatibility with MT mode 2 is enough for your needs, just put everything as members of your Avisynth filter class. In this case, you can omit _this_ptr.

Then, we typedef the slicer, for easy later use. The first template parameter is your filter class (actually the same as pointed to by _this_ptr). The second one is the data structure common to all slices.

   typedef  MTSlicer <AvstpAverage, TaskDataGlobal> Slicer;

Averaging frames can be done on a per-plane basis: all planes are independent from each other. Instead of starting N threads for a plane, waiting for their completion and starting again for the other planes, we can send all the tasks simultaneously and finally wait for the completion only once. However, this requires multiple common structures, as the data for each plane are different. A second consequence is that it also requires several slicers, because plane heights may be different (luma vs chroma). So we bundle a TaskDataGlobal with a Slicer for easier access and storage.

   class PlaneProc
   {
   public:
      TaskDataGlobal _tdg;
      Slicer         _slicer;
   };

Then we declare our main processing function. This is a simple member function that is fed with a specific structure containing the top and bottom lines to process as well as a pointer to the global structure. The function code is described below.

   void           process_subplane (Slicer::TaskData &td);

Functions

The GetFrame() function. First, we loop on planes to fill the structure and call start() on the slicer.

The height parameter is used by the slicer to generate the line numbers for each slice. Generally, you may want to set it to the current plane height in pixels, because dividing the frame horizontally has a greater cache efficiency. However, it really can be any unit you want. For example, if you’re processing the frame by blocks, it’s better to count block rows instead of pixel rows, because you’re sure that the frame will be sliced correctly. If the blocks are huge in size (low number of blocks), you can count an absolute number of blocks to be sure that the frame will be equally sliced.

The second parameter is the global data to attach, and the third one the processing function. Note the funky way to get a pointer to a member function.
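
To make the choice of unit concrete, here is a sketch of how a height could be divided into half-open ranges, and how counting block rows keeps every slice boundary on a block edge. This is only an illustration of the principle, not the actual MTSlicer algorithm. Band, split_height() and count_block_rows() are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Half-open interval [_y_beg ; _y_end), in whatever unit was passed in.
struct Band
{
   int            _y_beg;
   int            _y_end;
};

// Divides height units into at most nbr_slices contiguous bands of nearly
// equal size. Empty bands are skipped when height < nbr_slices.
std::vector <Band> split_height (int height, int nbr_slices)
{
   std::vector <Band> band_arr;
   for (int s = 0; s < nbr_slices; ++s)
   {
      const int      y_beg = height *  s      / nbr_slices;
      const int      y_end = height * (s + 1) / nbr_slices;
      if (y_end > y_beg)
      {
         band_arr.push_back ({ y_beg, y_end });
      }
   }
   return band_arr;
}

// Number of block rows covering a plane, the last row possibly partial.
int count_block_rows (int height_pix, int block_h)
{
   return (height_pix + block_h - 1) / block_h;
}
```

For instance, a plane of 1080 pixel rows processed in blocks of 16 rows has 68 block rows. Passing 68 as the height and multiplying the resulting bounds by 16 inside the processing function (clipping the last one to 1080) ensures that no block is shared between two slices.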

Don’t forget to initialize _this_ptr, too.

In the second part, we loop on the slicers to wait for their respective task completion. The loop exits only when all the tasks are finished.

::PVideoFrame __stdcall   AvstpAverage::GetFrame (int n, ::IScriptEnvironment *env_ptr)
{
   ::PVideoFrame  fsr1_sptr = _src1_sptr->GetFrame (n, env_ptr);
   ::PVideoFrame  fsr2_sptr = _src2_sptr->GetFrame (n, env_ptr);
   ::PVideoFrame  fdst_sptr = env_ptr->NewVideoFrame (vi);

   // Sets one slicer per plane, because planes may have different
   // heights (YV12). We could have done it differently and handle
   // the chroma subsampling from the plane processing function,
   // but it would have been less trivial to deal with.
   PlaneProc      pproc_arr [3];
   for (int p = 0; p < _nbr_planes; ++p)
   {
      static const int  plane_id_arr [3] = { PLANAR_Y, PLANAR_U, PLANAR_V };
      const int      plane_id = plane_id_arr [p];
      PlaneProc &    pproc    = pproc_arr [p];

      // Collects the source and destination plane information
      pproc._tdg._this_ptr   = this;
      pproc._tdg._dst_ptr    = fdst_sptr->GetWritePtr (plane_id);
      pproc._tdg._sr1_ptr    = fsr1_sptr->GetReadPtr (plane_id);
      pproc._tdg._sr2_ptr    = fsr2_sptr->GetReadPtr (plane_id);
      pproc._tdg._stride_dst = fdst_sptr->GetPitch (plane_id);
      pproc._tdg._stride_sr1 = fsr1_sptr->GetPitch (plane_id);
      pproc._tdg._stride_sr2 = fsr2_sptr->GetPitch (plane_id);
      pproc._tdg._w          =   (vi.IsPlanar ())
                               ? fdst_sptr->GetRowSize (plane_id)
                               : vi.RowSize ();
      const int height       = fdst_sptr->GetHeight (plane_id);

      // Setup and run the tasks. We need to provide the slicer with
      // the plane height so it can generate coherent top and bottom
      // row indexes for each slice.
      pproc._slicer.start (height, pproc._tdg, &AvstpAverage::process_subplane);
   }

   // Waits for all the tasks to terminate
   for (int p = 0; p < _nbr_planes; ++p)
   {
      pproc_arr [p]._slicer.wait ();
   }

   return (fdst_sptr);
}

Finally, the processing function. The td structure contains data specific to the slice. It’s private to the slice, not shared. td._y_beg is the first row to process. Important: the interval is half-open, so td._y_end is not the last row, it’s the row after the last one. Therefore td._y_end - td._y_beg gives the number of lines to process.

void   AvstpAverage::process_subplane (Slicer::TaskData &td)
{
   assert (&td != 0);

   const TaskDataGlobal &   tdg = *(td._glob_data_ptr);

   // Sets our pointers to the slice top location
   ::BYTE *       dst_ptr = tdg._dst_ptr + td._y_beg * tdg._stride_dst;
   const ::BYTE * sr1_ptr = tdg._sr1_ptr + td._y_beg * tdg._stride_sr1;
   const ::BYTE * sr2_ptr = tdg._sr2_ptr + td._y_beg * tdg._stride_sr2;

   // Slice processing
   for (int y = td._y_beg; y < td._y_end; ++y)
   {
      for (int x = 0; x < tdg._w; ++x)
      {
         dst_ptr [x] = ::BYTE ((sr1_ptr [x] + sr2_ptr [x] + 1) >> 1);
      }

      dst_ptr += tdg._stride_dst;
      sr1_ptr += tdg._stride_sr1;
      sr2_ptr += tdg._stride_sr2;
   }
}

td._slicer_ptr, which is ignored here, can be used to access the dispatcher via the slicer object. Thus, additional tasks may be enqueued; they will be attributed to the same group as the current task (and waited for by the same call).

That’s about everything you need to know to build a robust and efficient Avisynth multi-threaded plug-in.

Complex tasks

There are tasks that are complex and cannot be decomposed into multiple parallel sub-tasks from the beginning to the end. Some calculations may have dependencies on other calculations, showing a lot of vertices on the processing flowgraph. Here are a few general tips:

Use avstp_wait_completion() only for the final node, or for global synchronization points.

You can enqueue one or more tasks at any moment from another task. This is useful to manage a branching after a calculation.

For partial synchronizations, associate an atomic counter with the synchronization point. At the beginning, set the counter to the number of tasks to be synchronized. Decrement the counter at the end of each of these sub-tasks. If the counter reaches 0, the synchronization is done, you can continue with further calculations. If it is still positive, just leave the task immediately. For the atomic counter, you can use the conc::AtomicInt template class or just the InterlockedIncrement/InterlockedDecrement system functions.

On branching, to save some calls to the library (and maybe system calls related to semaphores), enqueue N-1 tasks and continue with the Nth task, instead of enqueuing all the new tasks and leaving.

If your tasks need temporary resources or buffers that cannot fit in the stack, you can use a conc::ObjPool and implement conc::ObjFactoryInterface to share them efficiently between threads.
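
The atomic counter and the "continue with the Nth task" tips can be sketched together. The example uses std::atomic and std::thread from the C++ standard library as stand-ins for conc::AtomicInt and for tasks enqueued on a dispatcher. fork_and_join() is a hypothetical name, and the follow-up calculation is a simple sum.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Branches into nbr_tasks sub-tasks (nbr_tasks >= 1). The sub-task which
// brings the counter down to 0 runs the follow-up calculation.
int fork_and_join (int nbr_tasks)
{
   std::atomic <int> remaining (nbr_tasks);   // Tasks still to synchronize
   std::vector <int> partial_arr (nbr_tasks, 0);
   int            result = 0;   // Written only by the last finishing task

   auto           sub_task = [&] (int t)
   {
      partial_arr [t] = t + 1;   // The sub-task's own work

      // fetch_sub() returns the previous value: 1 means we reached 0, so
      // all the other sub-tasks are done and their results are visible.
      if (remaining.fetch_sub (1) == 1)
      {
         for (int v : partial_arr)
         {
            result += v;
         }
      }
      // Otherwise the counter is still positive: just leave.
   };

   // "Enqueue" N-1 sub-tasks...
   std::vector <std::thread> thread_arr;
   for (int t = 0; t < nbr_tasks - 1; ++t)
   {
      thread_arr.emplace_back (sub_task, t);
   }
   // ...and continue with the Nth one in the current thread.
   sub_task (nbr_tasks - 1);

   // Stands for the final avstp_wait_completion() on the last node.
   for (std::thread &th : thread_arr)
   {
      th.join ();
   }
   return result;
}
```

Whichever sub-task finishes last performs the continuation, so no thread ever waits at the intermediate synchronization point.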

API reference

All the following functions are declared in avstp.h. They are all thread-safe, so they can be called concurrently from any thread. If the call fails, a null pointer is returned, or a negative error code if the return value is an int.

avstp_get_interface_version

int avstp_get_interface_version ();

Returns the interface version. See the avstp_INTERFACE_VERSION value in the header.

avstp_create_dispatcher

avstp_TaskDispatcher * avstp_create_dispatcher ();

Call this function once to create a dispatcher, before processing anything. The call must be paired with a call to avstp_destroy_dispatcher() to destroy the dispatcher.

avstp_destroy_dispatcher

void avstp_destroy_dispatcher (avstp_TaskDispatcher *td_ptr);

When you’re done with multi-threading, call avstp_destroy_dispatcher() to release the dispatcher. Not doing this may result in resource leaks. Destroy the dispatcher only when all the associated tasks are known to be finished. td_ptr is the dispatcher to destroy.
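
One way to keep the create and destroy calls paired in C++ is a small scope guard. The sketch below is an illustration under assumptions, not part of the AVSTP helpers: DispatcherGuard is a hypothetical class, and the two functions in the mock namespace are stand-ins for the real ones so the example compiles on its own.

```cpp
#include <cassert>

// Stand-ins for the avstp.h declarations (assumptions for this sketch).
namespace mock
{
struct avstp_TaskDispatcher {};
int            nbr_alive = 0;   // Dispatchers created and not yet destroyed

avstp_TaskDispatcher * avstp_create_dispatcher ()
{
   ++ nbr_alive;
   return new avstp_TaskDispatcher;
}
void avstp_destroy_dispatcher (avstp_TaskDispatcher *td_ptr)
{
   delete td_ptr;
   -- nbr_alive;
}
}

// Creates a dispatcher on construction and releases it on scope exit.
class DispatcherGuard
{
public:
   DispatcherGuard ()
   :  _td_ptr (mock::avstp_create_dispatcher ())
   {
   }
   ~DispatcherGuard ()
   {
      // All associated tasks must be finished at this point.
      if (_td_ptr != nullptr)
      {
         mock::avstp_destroy_dispatcher (_td_ptr);
      }
   }
   DispatcherGuard (const DispatcherGuard &) = delete;
   DispatcherGuard & operator = (const DispatcherGuard &) = delete;

   // False if the creation failed (null pointer returned).
   bool is_valid () const { return (_td_ptr != nullptr); }
   mock::avstp_TaskDispatcher * get () const { return _td_ptr; }

private:
   mock::avstp_TaskDispatcher * _td_ptr;
};

// Shows that exactly one dispatcher exists while the guard is in scope.
int count_alive_inside_scope ()
{
   DispatcherGuard   guard;
   return mock::nbr_alive;
}
```

Declaring the guard as a local variable in GetFrame() would create the dispatcher dynamically for each frame, which matches the advice given earlier for MT mode 1 compatibility.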

avstp_get_nbr_threads

int avstp_get_nbr_threads ();

Returns the current number of working threads, including the main thread. The value may vary during the filter construction stage, depending on an avstp_set_threads() script function call.

avstp_enqueue_task

int avstp_enqueue_task (avstp_TaskDispatcher *td_ptr,
                        avstp_TaskPtr task_ptr,
                        void *user_data_ptr);

Schedules the asynchronous execution of a task. td_ptr is the dispatcher to which the task will be attached. task_ptr and user_data_ptr are respectively a pointer to the function to execute and a pointer to its private data, given as a parameter. You can enqueue new tasks at any moment, even if the main thread has started waiting.

avstp_wait_completion

int avstp_wait_completion (avstp_TaskDispatcher *td_ptr);

When called, the function will wait for the completion of all the tasks attached to the dispatcher. If you do not want to wait for some tasks, create a second dispatcher for them. The function will return immediately if no task has been previously scheduled.

Changelog

v1.0.4.1, 2023.01.24

Fixes issue: a hang of avstp 1.0.4, see issue #1 for more information.

v1.0.4, 2020.10.07

Compiled with MSVC 2019
C++11 or above is now required for compilation
Windows XP the 32-bit version is not supported anymore

v1.0.3, 2015.12.30

Compiled with MSVC 2013
Upgraded internal files

v1.0.2, 2013.11.04

Added a 64-bit version.

v1.0.1, 2012.05.27

Removed any floating point code from the implementation, so avstp doesn’t get confused when the client code doesn’t flush the FP register state after MMX operations. It was occasionally causing deadlocks.

v1.0.0, 2012.03.11

Initial release

Archived Downloads

Version	Download	Mirror
v1.0.4	avstp-1.0.4.zip	avstp-1.0.4.zip
v1.0.3	avstp-1.0.3.zip	avstp-1.0.3.zip
