Mar 31, 2023

Fun with references in C# - Part II / II

Leia este post em português

Lire cet post en Français.

In my previous post I've presented the following code and posed a couple of questions.


Of course it is contrived and explicitly coded to expose a specific behavior but nonetheless it is not an unrealistic example. That said, without further ado, lets answer each of the questions...

 

What is its output ?

Unfortunately the output will probably be different in each machine/os combination and even on the same machine I've observed differences from run to run, although at least one part of the output is completely deterministic and should always produce the same result: method M2() output should always be something like:

---------M2----------
 4
 3
 2
 1

By the other hand, the output of M3() is undefined and will show what looks like to be gibberish, something like (from one of my runs):

      ---------M3----------
      RefStruct created.
      fe
     7f
     0
     0

but before deeming it as the culprit let's explore M2()/M3() call chains a little deeper...

Can you spot/explain the issue in the code?

In order to more easily demonstrate the problem I'll present a series of simplified call stacks in which:

  1. The stack grows from bottom to top (i.e, higher to lower address)
  2. For simplicity, data stored in each stack entry is assumed to consume a single `spot` (or address), no matter the size of the data (when in reality in most platforms the amount of consumed stack space depends on the specifics of each stored data type).
  3. Gray-ed areas represents parts of the  stack that have been previously used (allocated) but are not valid anymore, i.e, they have been freed.

Even not being 100% accurate this abstraction serve the purpose of this post very well

So, lets start with the call to M2():

 

The call stack 1 represents the state when running Instantiate() for the chain Main() -> M2() -> Instantiate(); in this call chain, Main() called M2() passing 0x01020304 as the v parameter (which has been stored in entry 98 at the stack); M2() called Instantiate() passing a reference to v, i.e, v's address or in other words, 98, as the Instantiate()'s i parameter which instantiated a RefStruct (represented in the stack by a variable named tmp) passing i value (i.e, 98) to its constructor which ended up setting a ref field to 98 and then returning that instance by value.

Upon returning to M2() (call stack 2), the local variable rs is set to the returned value. Notice that at this point 98 is stored in rs.value (or in other words, rs.value points to the location at address 98). 

Finally M2() calls Dump() method (call stack 3) passing a reference to rs, i.e, the address 97; notice that Dump() method is reusing the address 96 (which were previously used to store Instantiate()'s i parameter). At this point, Dump() will simply interpret whatever is at address 97 as as RefStruct and and will print the contents referenced by the rs.value field, now taking whatever is stored at 98 (the value stored at that field) and interpreting it as an int and printing the four bytes that makes up the integer value.

Contrasting that with the M3() call chain,

call stack 1 above represents the call Main() -> M3() -> InstantiateAndLog() -> Instantiate(). Very similar to M2() call chain, but in this case M3() calls into an intermediate method (InstantiateAndLog()) passing in 0x01020304 to its v parameter (by value) which then calls into Instantiate() passing the just received v parameter by reference (i.e, Instantiate() received 96, or, in oder words the address of v in the stack) and just instantiates a RefStruct (at address 93, represented by a variable named tmp) initializing its value field with the reference it just received.

This instance is then just returned, first to InstantiateAndLog() (call stack 2), then to M3()  (call stack 3) which simply stores it in a local variable (rs , at address 97) as shown bellow:
 

Notice that in M3(), the returned struct has its value field pointing (or referencing) to address 96 which is not valid anymore (since it has already been freed) and thereafter any assignment/reading to/from that field is bound to lead to undefined behavior; 

The exact behavior will depend on how that position is further used but in our specific example, after returning from InstantiateAndLog(), M3() calls into Dump() (call stack 4) passing rs as a reference (actually that is not important; it could have been passed by value, and the behaviour would still be wrong) overwriting the previous value at address 96. Now when Dump() calls into BitConverter.GetBytes() it first deferences the RefStruct reference it has been passed which means it goes to address 97 and assumes that there's a RefStruct stored there (which is correct); the next step is to deference the field value (remember, it is a ref field) which will produce (in our example) the value 97 instead of 0x01020304.

How could one pinpoint this issue ? 

The simple answer: always question the code; the use of the unsafe should raise a big red sign and prompt developers to question what may be not safe in that code. In this specific example, removing such modifier immediately results in a compiler error:

error CS8166: Cannot return a parameter by reference 'v' because it is not a ref parameter

Notice that even with the unsafe modifier the compiler have tried its best to warn us about a potential issue by emitting the following warning (which was promptly ignored, of course):

warning CS9087: This returns a parameter by reference 'v' but it is not a ref parameter

So, in my opinion, the most effective way to detect potential issues is to i) have good development practices (like, configure warnings as errors, have a code review process, etc), ii) understand the implications of any technology/approach/library you use in your code, iii) to never ignore compiler warnings and iv) never be afraid to raise questions in code reviews.

See you in the next post.

Have fun!

Adriano

No comments: