Temporary storage for a reentrant function

A place to discuss the implementation and style of computer programs.

Moderators: phlip, Moderators General, Prelates

User avatar
Qaanol
The Cheshirest Catamount
Posts: 3069
Joined: Sat May 09, 2009 11:55 pm UTC

Temporary storage for a reentrant function

Postby Qaanol » Fri Jan 03, 2014 4:54 am UTC

I’m writing a C function that uses an array of 1024 floats to hold intermediate results of vectorized calculations. Is there an elegant alternative to having a pointer input parameter through which the caller supplies temp storage?

A static local array is insufficient because it breaks reentrancy, and malloc/free is undesirable because I don’t want to waste time setting up new memory with every function call.

About the only solution I can come up with is to keep a static linked list of pointers to temporary arrays that are available to be reused. If the function gets called with no available arrays then it mallocs one, and adds it to the list of available arrays when it finishes.

Of course, then I’d want some way to free the extra arrays when they are no longer needed. One approach would be a helper function that just frees the reusable arrays and has to be called manually. Another is to track how long it’s been since all the arrays were used simultaneously, and free one if it’s been more than a certain interval; but I believe that would involve a separate thread, which is more hassle than I want.

So…I’ll probably just require the caller to provide temporary storage space, unless there’s a better option?
wee free kings

EvanED
Posts: 4331
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI
Contact:

Re: Temporary storage for a reentrant function

Postby EvanED » Fri Jan 03, 2014 5:27 am UTC

I presume that you don't like the idea of just allocating it on the stack in the function for some reason? (Or does it need to persist across calls? Doesn't sound like it to me, but it's also not totally clear.)

I personally would prefer the malloc/free version until you have hard evidence that it's costing you performance. One thing you could do as an alternative if most of the time there will only be a small number of calls active at once (for some definition of "small") is to have a fixed number of static buffers, and if one is available grab it, otherwise fall back to malloc/free. (Conceptually this is just like malloc/free except you're not going to the typical allocator when something hits in your "cache".) But I'm not sure this is worth it. (Though I'm not sure exactly what malloc will do here... I think it might actually go and memmap then memunmap pages for you at allocations that big, which might be a little slow.)

User avatar
Thesh
Made to Fuck Dinosaurs
Posts: 6580
Joined: Tue Jan 12, 2010 1:55 am UTC
Location: Colorado

Re: Temporary storage for a reentrant function

Postby Thesh » Fri Jan 03, 2014 6:28 am UTC

If you must use dynamically allocated memory instead of a local array, and you are worried about having to know the size of the array in advance, you could have a state struct that is allocated using utility function, so you don't have to worry about the caller knowing details of the implementation. You can even forward declare the struct, so if you change storage requirements in the library it won't require recompiling the executable (since you are only dealing with pointers).

Example:

util.h

Code: Select all

struct foo;

struct foo *init();
float something(struct foo *bar);
struct foo *dest(struct foo *bar);


util.c

Code: Select all

#include "stdlib.h"
#include "util.h"

struct foo {
   float baz[1024];
};

struct foo *init()
{
   return (struct foo *)malloc(sizeof(struct foo));
}

float something(struct foo *bar)
{
   int i;
   float f;
   for (i=0; i<1024; i++) {
      bar->baz[i] = (float)i;
   }
   f = 1;
   for (i=0; i<4096; i++) {
      f += bar->baz[i%1024];
      bar->baz[i%1024] /= f;
   }
   return f;
}

struct foo *dest(struct foo *bar)
{
   free(bar);
   return NULL;
}


main.c

Code: Select all

#include <stdio.h>
#include "util.h"

int main() {
   float f;
   struct foo *bar;
   bar = init();
   f = something(bar);
   bar = dest(bar);

   printf("%f\n", f);

   return 0;
}
Summum ius, summa iniuria.

User avatar
Qaanol
The Cheshirest Catamount
Posts: 3069
Joined: Sat May 09, 2009 11:55 pm UTC

Re: Temporary storage for a reentrant function

Postby Qaanol » Sun Jan 05, 2014 3:41 am UTC

EvanED wrote:I presume that you don't like the idea of just allocating it on the stack in the function for some reason? (Or does it need to persist across calls? Doesn't sound like it to me, but it's also not totally clear.)

Just for performance reasons, I’d prefer not to allocate and deallocate memory with every function call. The values do not need to be stored, but I would like to reuse the same arrays as much as feasible.

EvanED wrote:I personally would prefer the malloc/free version until you have hard evidence that it's costing you performance. One thing you could do as an alternative if most of the time there will only be a small number of calls active at once (for some definition of "small") is to have a fixed number of static buffers, and if one is available grab it, otherwise fall back to malloc/free. (Conceptually this is just like malloc/free except you're not going to the typical allocator when something hits in your "cache".) But I'm not sure this is worth it. (Though I'm not sure exactly what malloc will do here... I think it might actually go and memmap then memunmap pages for you at allocations that big, which might be a little slow.)

Hmm, that is worth considering. I could even do it with just one static array, and allocate on the stack for any additional calls. Even then, though, to handle cases where my function gets called from different threads, I’d need to use a lock, right? If not on the static array itself, then at least on the flag that says whether the static array is available, I think.

Thesh wrote:If you must use dynamically allocated memory instead of a local array, and you are worried about having to know the size of the array in advance, you could have a state struct that is allocated using utility function, so you don't have to worry about the caller knowing details of the implementation. You can even forward declare the struct, so if you change storage requirements in the library it won't require recompiling the executable (since you are only dealing with pointers).

Hmm, also worth considering, though I’m not sure how it addresses the specific problems of reusing memory and also freeing it after a while.

What I might do is take a (float** tempPtr) parameter, and also keep one static local array. If the static array is available, the function will use it. If not, and tempPtr is zero, memory is allocated and freed on the stack inside the function. Otherwise, if *temp is non-zero the function will use the memory pointed to by *tempPtr as intermediate storage; but if *tempPtr == 0 then the function will allocate memory and put the pointer to it in *tempPtr, so the caller can reuse the same pointer in subsequent calls without worrying. When the caller is done it can free *temp.

So my function would look like this:

Code: Select all

void doVecOps(float *outVec, const float *inVec1, const float *inVec2, unsigned int numElems, float **tempPtr)
{
  static const int VEC_LENGTH = 1024;
  static float localTempMem[VEC_LENGTH];
  static int localTempAvail = 1;
 
  int needsFree = 0;
  float *temp;
  if (localTempAvail) {
    localTempAvail = 0;
    temp = localTempMem;
  } else if (!tempPtr) {
    temp = malloc(VEC_LENGTH * sizeof(float));
    if (!temp) {
      // Handle malloc failure
      return;
    }
    needsFree = 1;
  } else if (!*tempPtr) {
    temp = malloc(VEC_LENGTH * sizeof(float));
    if (!temp) {
      // Handle malloc failure
      return;
    }
    *tempPtr = temp; // Caller is responsible for freeing this memory
  } else {
    temp = *tempPtr; // Caller provided scratch space
  }
 
  // Do the actual work in a helper function
  vecOpHelper(outVec, inVec1, inVec2, numElems, temp);
 
  if (temp == localTempMem) {
    localTempAvail = 1;
  } else if (needsFree) {
    free(temp);
  }
}


And the caller would do something like this:

Code: Select all

float *temp = 0;
doVecOps(vOut, vIn1, vIn2, n, &temp);
doVecOps(vOut, vOut, vIn3, n, &temp);
doVecOps(vOut, vOut, vIn4, n, &temp);
free(temp);


Although, I don’t think this is thread safe, because I don’t take a lock on localTempMem (or at least localTempAvail). So either I should eliminate the static array, or I should use locks, or I should go with a different approach altogether.
wee free kings

EvanED
Posts: 4331
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI
Contact:

Re: Temporary storage for a reentrant function

Postby EvanED » Sun Jan 05, 2014 3:52 am UTC

Qaanol wrote:
EvanED wrote:I presume that you don't like the idea of just allocating it on the stack in the function for some reason? (Or does it need to persist across calls? Doesn't sound like it to me, but it's also not totally clear.)

Just for performance reasons, I’d prefer not to allocate and deallocate memory with every function call. The values do not need to be stored, but I would like to reuse the same arrays as much as feasible.
Stack-bound arrays will take virtually no performance from you. Most of the time, adding a new local variable or array will cost literally nothing, because it will simply change the amount that is subtracted from the stack pointer (e.g. adding a couple doubles might change sub esp, 16 at the start of your function into sub esp, 32, and that's basically it). Now, going from the point where the stack frame takes less than one page to more than one page (which adding your array will almost certainly do) will add a tiny bit to that because it will have to make sure that there are pages allocated, but that's still almost free. (It will do that by touching an address 1 page apart. So for instance, at the start of your function prolog you'd have an extra mov eax, [ebp+4096]; mov eax, [ebp+8192] or something. (OK, that will be wrapped up inside a function call and a loop, but you're still talking maybe two dozen dynamic instructions at the most in the common case.)

The main reason why I think you'd want to avoid stack allocation is if you think you could have a lot of calls to this function active at once and are worried about stack overflow.

Hmm, that is worth considering. I could even do it with just one static array, and allocate on the stack for any additional calls. Even then, though, to handle cases where my function gets called from different threads, I’d need to use a lock, right? If not on the static array itself, then at least on the flag that says whether the static array is available, I think.
On the "next available static buffer", yes. Though you don't need a full lock -- you could get by with some sort of atomic action. I guess with multiple threads you're not guaranteed to allocate and release the buffers in a stack-like manner so would need to make some sort of list. That's still not to hard to do with lockless atomic CAS actions though. Let me know if you want more detail.

User avatar
Qaanol
The Cheshirest Catamount
Posts: 3069
Joined: Sat May 09, 2009 11:55 pm UTC

Re: Temporary storage for a reentrant function

Postby Qaanol » Sun Jan 05, 2014 4:30 am UTC

EvanED wrote:Stack-bound arrays will take virtually no performance from you. Most of the time, adding a new local variable or array will cost literally nothing, because it will simply change the amount that is subtracted from the stack pointer (e.g. adding a couple doubles might change sub esp, 16 at the start of your function into sub esp, 32, and that's basically it). Now, going from the point where the stack frame takes less than one page to more than one page (which adding your array will almost certainly do) will add a tiny bit to that because it will have to make sure that there are pages allocated, but that's still almost free. (It will do that by touching an address 1 page apart. So for instance, at the start of your function prolog you'd have an extra mov eax, [ebp+4096]; mov eax, [ebp+8192] or something. (OK, that will be wrapped up inside a function call and a loop, but you're still talking maybe two dozen dynamic instructions at the most in the common case.)

The main reason why I think you'd want to avoid stack allocation is if you think you could have a lot of calls to this function active at once and are worried about stack overflow.

Good to know, thanks. I’m glad I asked. Testing with just one call at a time shows you are correct. Switching from “static float temp[1024];” to “float temp[1024];” had essentially no performance impact on dozens of sequential calls, and might have even sped things up a tiny bit, though I have not statistically analyzed the running times. (The vectorized version runs about 56% faster than the scalar version, so that part is working well.)

For the purposes at hand I don’t expect anything close to a stack overflow, so it should be fine.

EvanED wrote:
Hmm, that is worth considering. I could even do it with just one static array, and allocate on the stack for any additional calls. Even then, though, to handle cases where my function gets called from different threads, I’d need to use a lock, right? If not on the static array itself, then at least on the flag that says whether the static array is available, I think.
On the "next available static buffer", yes. Though you don't need a full lock -- you could get by with some sort of atomic action. I guess with multiple threads you're not guaranteed to allocate and release the buffers in a stack-like manner so would need to make some sort of list. That's still not to hard to do with lockless atomic CAS actions though. Let me know if you want more detail.

I’d be glad to learn, since I’ve never worked with locks before and might need to in the future. I have a basic grasp of mutex locks, but I am unfamiliar with lockless atomic CAS actions. So if you have time and inclination to explain, please do.

That said, for the function at hand I’ll use stack-allocated arrays as you recommend, and no lock is needed.

Edit: Actually, a different place in the same program uses a static variable that needs some sort of lock. A function calculates two values at once, returns one, and stores the other. If it gets called while a value is stored it returns the stored value, and marks that no value is stored so the next call will calculate two more. Essentially I want the following code to run without possibility of thread-related shenanigans:

Code: Select all

float numberMaker()
{
  static int valueIsStored = 0;
  static float storedValue;
  if (valueIsStored) {
    valueIsStored = 0;
    return storedValue;
  }
  // Calculate x and y
  storedValue = x;
  valueIsStored = 1;
  return y;
}


It is only really the “if” block that needs to be protected, to ensure two simultaneous calls don’t return the same stored value. I am fine with simultaneous calls each creating two values and overwriting the existing stored value.

It would be nice if simultaneous calls could be paired up so no extra work is done beyond what is needed to get a value to each caller, but that strikes me as probably difficult to implement.
wee free kings

EvanED
Posts: 4331
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI
Contact:

Re: Temporary storage for a reentrant function

Postby EvanED » Sun Jan 05, 2014 6:10 am UTC

"Non-blocking" is the word I was looking for, or "lock-free".

Basically a lock-free algorithm is one that achieves safety in the face of concurrency via a mechanism other than locks. Let's start with something simple. Suppose you want a count for the number of times something has happened. With just the "usual" primitives you'd have to lock the accesses, something like

Code: Select all

lock(mutex);
++count;
unlock(mutex);

But this has problems -- you're paying the cost to lock even in the case where there is little or no contention for the count variable. A better alternative in this case is to use an atomic increment. You can get this in C++11 or boost with the atomic template (I don't know details) or some platform-specific things like Window's InterlockedIncrement function. Then you can just say InterlockedIncrement(&count).

But this doesn't solve more complicated cases like data structures. For that you need a more powerful atomic primitive called compare-and-swap, or CAS. CAS has the effect of the following function (or something similar):

Code: Select all

CAS(word* address, word old_value, word new_value) {
    if (*address == old_value) {
        *address = new_value;
        return true;
    }
    else
        return false;
}

where atomicity is enforced by the hardware. (Typically CAS will return either the old or new value, but we'll stick with this to make the following a little simpler.) In other words, it updates the given address with the new value, but only if that memory address has the old value. The way you'd use it is to look at the current value, make some candidate changes, then try to "commit" those changes using CAS, where the conditional makes sure (with some caveats) that no other threads have come stomping through since you started work on the candidate changes. If CAS returns true, then your changes were committed.

For example, you can write your own lock-free InterlockedIncrement as follows (this code not guaranteed free of bugs):

Code: Select all

InterlockedIncrement(int* address) {
    int current_value, new_value;
    do {
        current_value = *address;
        new_value = current_value + 1;
    } while(!CAS(address, current_value, new_value));
    return new_value;
}

where this function does not need any kind of concurrency control outside of the use of the atomic CAS.

CAS (and some fancier variants like multi-word CAS, if your hardware supports it) can be used to implement a number of data structures in a way that doesn't use locks. I'm not clear how useful in practice they are, but it's my impression is that when contention is low, they should be faster than lock-based solutions.

Finally, I'd say don't try to write your own algorithms based on CAS until you've done a lot of studying of the potential issues. It's really easy to write something that looks correct but isn't. Like I sort of said above, I wouldn't be surprised if my InterlockedIncrement had some subtle concurrency bug.

lgw
Posts: 437
Joined: Mon Apr 12, 2010 10:52 pm UTC

Re: Temporary storage for a reentrant function

Postby lgw » Mon Jan 06, 2014 8:56 pm UTC

In general, "lock free" algorithms are great at bug generation, and poor at improving performance, unless you have a measured problem with lock contention. Most OSs have fast locking where, assuming you get the lock, do only a single CAS and a couple other instructions, while doing anything non-trivial without a lock always ends up being much more complicated than that.

It's also worth noting that a single CAS is a non-trivial performance hit on modern systems - piercing 3 layers of cache across multiple cores just sucks - so an approach that just needs no form of synchronization between threads at all is usually best, when such alternatives are practical (not that common, but worth looking for).

Another good tool to have handy is "thread local storage" (which, BTW, it looks like C++11 just got into the standard library) which can greatly simplify code when you really want to avoid waiting on a locks. TLS solves a narrow set of problems, but sure is nice when you happen to have one of them.
"In no set of physics laws do you get two cats." - doogly


Return to “Coding”

Who is online

Users browsing this forum: No registered users and 7 guests