Examining Execution Speed of JITted Code With CF 2.0

Chris Tacke, eMVP

OpenNETCF Consulting

August 3, 2005

Introduction

I've always thought that writing a device driver in managed code was a compelling idea, but there are a couple of large hurdles that have to be overcome.  The first is the lack of deterministic behavior in a managed code environment, and the second is the purported lack of performance of managed code.

The first hurdle - the non-deterministic nature of a garbage-collected environment - has been discussed and demonstrated fairly well, and while it's something I'm still trying to work around, I feel it doesn't need much discussion here.

The second hurdle - performance - is something different. I've seen it argued on several occasions that because managed code is not truly compiled code and runs against the .NET Common Language Runtime, it must inherently perform worse than native code.  Surprisingly, I've never actually seen anything that specifically set out to quantify the difference, so I decided that before I just accepted "common knowledge," a little testing was in order.

The Baseline - Toggling a GPIO with C

Before I could test managed code and have meaningful results, I needed to get a set of control data.  How fast can some meaningful action occur with typical unmanaged code?  I decided that a reasonable test would be to toggle a processor general-purpose input/output (GPIO) line as fast as possible since it's a common action in a driver, and using an oscilloscope it would be very easy to get a quantifiable measurement of speed.

So I put together the following piece of code to toggle a GPIO as fast as possible:

#define GPIO3 (1 << 3)

...

DWORD *p = (DWORD*)MapAddress(0x40E00000);

 

DWORD *gpdr = p + (0x0C / sizeof(DWORD));

DWORD *gpsr = p + (0x18 / sizeof(DWORD));

DWORD *gpcr = p + (0x24 / sizeof(DWORD));

 

*gpdr |= GPIO3;

while(true)

{

        *gpsr = GPIO3;

        *gpcr = GPIO3;

}

For the curious, MapAddress is a function that wraps VirtualAlloc and VirtualCopy, in the same way MmMapIoSpace does, to get a mapped virtual address for a specified physical address.  Here I passed in the base physical address of the PXA255's GPIO registers, then set up pointers to the direction (GPDR), set (GPSR) and clear (GPCR) registers.  Basically, you set the state of a bit in GPDR to determine whether the pin is an input or an output, then you set the same bit in GPSR to turn it on or in GPCR to turn it off.
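The register behavior described above can be sketched as a small software model that runs on any desktop compiler. This is only an illustration, not hardware access: the gpio_model type and the model_* function names are hypothetical, and the model ignores everything except the direction bits and the output latch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GPIO3 (1u << 3)

/* Hypothetical software model of one GPIO bank; this is not real hardware. */
typedef struct {
    uint32_t direction; /* models GPDR: a 1 bit marks the pin as an output */
    uint32_t latch;     /* models the output state selected via GPSR/GPCR */
} gpio_model;

/* Model of a write to GPDR: the written value replaces the direction bits. */
void model_write_gpdr(gpio_model *g, uint32_t value)
{
    assert(g != NULL);
    g->direction = value;
}

/* Model of a write to GPSR: 1 bits set the latch, 0 bits change nothing. */
void model_write_gpsr(gpio_model *g, uint32_t mask)
{
    assert(g != NULL);
    g->latch |= mask;
}

/* Model of a write to GPCR: 1 bits clear the latch, 0 bits change nothing. */
void model_write_gpcr(gpio_model *g, uint32_t mask)
{
    assert(g != NULL);
    g->latch &= ~mask;
}

/* A pin is driven high only when it is an output and its latch bit is set. */
int model_pin_is_high(const gpio_model *g, uint32_t mask)
{
    assert(g != NULL);
    return (g->direction & g->latch & mask) != 0;
}
```

Because zero bits written to the set and clear registers change nothing, the toggle loop can use plain assignments rather than read-modify-write operations, which keeps each state change down to a single store.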

The measurement would be nothing more than the two statements in the while loop that set and clear GPIO3 as fast as possible.  Below is the compiler output for the previous code.

; 49   : while(true)

; 50   : {

; 51   :        *gpsr = GPIO3;

  0004c e5930000         ldr       r0, [r3]

  00050 e5804000         str       r4, [r0]

; 52   :        *gpcr = GPIO3;

  00054 e5921000         ldr       r1, [r2]

  00058 e5814000         str       r4, [r1]

; 53   : }

You can see that the compiler has turned this into two pairs of load and store operations.  While this could have been made faster by writing it in assembly, the purpose of this test wasn't to get the best possible time from unmanaged code, but to compare managed code with typical unmanaged code.

Figure 1 is a captured scope trace of the output produced by the unmanaged code, measured right on the pin of the processor.  The important piece of information to see is that a state change (high to low or low to high) is pretty consistent and is about 110ns.

 

Figure 1 – Oscilloscope traces for unmanaged code
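To put the 110ns figure in more familiar terms, one full cycle of the square wave is two state changes, so the pin is toggling at roughly 4.5 MHz. A minimal sketch of that arithmetic follows; the function name is just an illustration.

```c
#include <assert.h>

/* Frequency in Hz of a square wave whose level changes every edge_ns nanoseconds. */
double toggle_frequency_hz(double edge_ns)
{
    assert(edge_ns > 0.0);
    /* One period is one high phase plus one low phase. */
    double period_s = 2.0 * edge_ns * 1e-9;
    return 1.0 / period_s;
}
```

Calling toggle_frequency_hz(110.0) gives a value of about 4.5 million, that is, a period of 220ns.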

Using C#

Now that we've got the control measured, let's take a look at how we can implement the same feature (toggling GPIO3) in managed code and the speed we see from it.  For my testing I chose to use C# instead of VB.NET.  Initially this choice was simply a matter of personal preference, but as we'll see shortly, some features available in C# but not VB.NET gave faster results.

The base of my first tests was the OpenNETCF.IO.PhysicalAddressPointer class.  This is a pretty simple class that P/Invokes the VirtualAlloc and VirtualCopy APIs to map a physical address to a virtual address, just like the C code used earlier.  The calling code can be seen here:

int gpio3 = (1 << 3);

 

// map all of GPIO space

PhysicalAddressPointer pap;

pap = new PhysicalAddressPointer(0x40E00000, 0x6B);

 

// make GPIO3 an output

int gpdr = pap.ReadInt32(0x0C);

pap.WriteInt32(gpdr | gpio3, 0x0C);

 

while(true)

{

      // turn it off

      pap.WriteInt32(gpio3, 0x24);

 

      // turn it on

      pap.WriteInt32(gpio3, 0x18);

}

 

An important difference between the hardware access in this code versus what was done with the unmanaged code is that the virtual address is stored as an IntPtr in managed code.  This means that any reads from or writes to the address are done through a call to Marshal.Copy instead of directly to the pointer address like we were able to do in C.  Intuitively I felt that this was going to add some overhead, and the resulting scope trace, seen in Figure 2, shows that it is indeed slower. 
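To make the extra work concrete, here is a rough C analogue of a wrapper that writes a value at a byte offset through a function call and a copy. It is a sketch only and is not how PhysicalAddressPointer is implemented: the mapped_block type and both function names are hypothetical, and an ordinary byte array stands in for the mapped registers. A byte-wise copy like this would not be appropriate for real device registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a mapped register window. */
typedef struct {
    uint8_t *base; /* start of the block */
    size_t   size; /* length of the block in bytes */
} mapped_block;

/* Copy a 32-bit value into the block at a byte offset.
   Returns 0 on success, -1 if the write would run past the block. */
int block_write_int32(mapped_block *b, int32_t value, size_t offset)
{
    assert(b != NULL && b->base != NULL);
    if (offset > b->size || b->size - offset < sizeof value)
        return -1;
    memcpy(b->base + offset, &value, sizeof value);
    return 0;
}

/* Copy a 32-bit value out of the block at a byte offset.
   Returns 0 on success, -1 if the read would run past the block. */
int block_read_int32(const mapped_block *b, size_t offset, int32_t *out)
{
    assert(b != NULL && b->base != NULL && out != NULL);
    if (offset > b->size || b->size - offset < sizeof *out)
        return -1;
    memcpy(out, b->base + offset, sizeof *out);
    return 0;
}
```

Every access pays for a call, a range check and a copy, where the C version of the test performed a single store through a pointer.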

Even though the managed code was significantly slower it was very consistent, and was still faster than I had expected considering managed code had to make a function call to the Marshal class, which then had to marshal the data to the IntPtr address location.  The question remained "how much of the difference is the overhead of the extra calls, and how much can be attributed to the Common Language Runtime (CLR) itself that the code runs in?"  To determine that, I needed a better way to get at the hardware address, something that takes the IntPtr and Marshal calls out of the picture.  This is where I had to turn to a C# code feature: unsafe code.  Unsafe code simply means that if I set a specific compiler option, I'm allowed to allocate and use pointers in my managed code.

To use a pointer I had to make a slight modification to the OpenNETCF.IO.PhysicalAddressPointer implementation to give external classes access to the internal virtual address as a uint* using the IntPtr.ToPointer method.  Using the newly exposed function I modified my test code to look like this:

 

// toggle GPIO 3

int gpio3 = (1 << 3);

 

// map all of GPIO space

PhysicalAddressPointer pap;

pap = new PhysicalAddressPointer(0x40E00000, 0x6B);

 

unsafe

{

      int *p = (int*)pap.GetUnsafePointer();

      int *gpsr = p + (0x18 / 4);

      int *gpcr = p + (0x24 / 4);

      int *gpdr = p + (0x0C / 4);

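      // make GPIO3 an output

      *gpdr |= gpio3;

      while(true)

      {

            // set the pin

            *gpsr = gpio3;

            // clear the pin

            *gpcr = gpio3;

      }

}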

 

When I measured the state changes this time I was pleasantly surprised.  The traces, as seen in Figure 3, were identical to the unmanaged traces, meaning that the CLR was adding zero measurable overhead to the hardware access.  All of the latency measured in the first managed code test lay in the overhead of the call to the Marshal class.

My longer term goal was to make access to the hardware a little more user friendly by providing a wrapper class for the entire PXA255 processor, but I also wanted maximum performance to remain a goal.  I wanted VB.NET developers to have the same advantages that C# developers would get, so I did some rethinking on how to get at the virtual address without going through the Marshal class.

The first thought was to try to get a struct that would map its members directly to the registers in the processor, and then pin those into memory.  Unfortunately even with a pinned struct, you're still relegated to using the Marshal class for passing data to the mapped target address.
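For comparison, this is what the struct-overlay idea looks like in C, where it works without any marshaling. Treat it as a sketch under assumptions: the 0x18 and 0x24 offsets come from the code earlier in this article, while the struct name, the member names and the grouping into three 32-bit registers per function are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the first part of the GPIO register block. */
typedef struct {
    volatile uint32_t gplr[3]; /* 0x00-0x08: pin level */
    volatile uint32_t gpdr[3]; /* 0x0C-0x14: pin direction */
    volatile uint32_t gpsr[3]; /* 0x18-0x20: output set */
    volatile uint32_t gpcr[3]; /* 0x24-0x2C: output clear */
} gpio_regs;

/* Overlay the struct on a block of memory. On a device this would be the
   mapped virtual address; any suitably sized buffer works for experiments. */
gpio_regs *gpio_overlay(void *base)
{
    assert(base != NULL);
    return (gpio_regs *)base;
}
```

With the overlay in place, a store such as regs->gpsr[0] = (1u << 3) lands at byte offset 0x18 from the base, the same location the pointer arithmetic in the earlier listings reaches.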

I then decided that if the PXA255 wrapper used unsafe pointers internally that were wrapped by CLS compliant properties, VB developers would be able to directly access hardware as well as benefit from the speed of unsafe code.  I then put together a comparable test using the PXA255 class and checked the performance with the scope.

PXA25x pxa = new PXA25x();

 

// set gpio3 as an output

pxa.GPIO.GPDR0 |= PXA25x.GPIO3;

 

while(true)

{

      // set the pin

      pxa.GPIO.GPSR0 = PXA25x.GPIO3;

      // clear the pin

      pxa.GPIO.GPCR0 = PXA25x.GPIO3;

}

Once again I was surprised by the result, but this time the surprise wasn't pleasant.  It turned out that even though the class was using unsafe pointers, the result was a similarly large latency (see Figure 4).  It appeared that the performance hit wasn't coming from the internals of the Marshal class; it was simply the fact that a method call was being made.

The last step was to physically verify that hunch, so I wrote a last bit of test code using the PXA255 class, but retrieving its internal pointer and then using the pointer locally in the test.  Of course this isn't VB-accessible, but it would prove the theory about the location of the performance bottleneck.

PXA25x pxa = new PXA25x();

 

unsafe

{

      uint *gpio = pxa.GetGPIORegistersUnsafePointer();

      uint *gpsr = gpio + (0x18 / 4);

      uint *gpcr = gpio + (0x24 / 4);

      uint *gpdr = gpio + (0x0C / 4);

 

      // make GPIO an output

      *gpdr |= PXA25x.GPIO3;

 

      while(true)

      {

            // set the pin

            *gpsr = PXA25x.GPIO3;

            // clear the pin

            *gpcr = PXA25x.GPIO3;

      }

}

In Figure 5 you can see that using the pointer again provided the same level of performance that the unmanaged code did, proving that the expense is simply the fact that a method call had been made, not that the Marshal class has any inherent bottleneck.  In fact this shows that the Marshal class internally is actually quite performant, adding very little overhead beyond the call into it.
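The two access patterns compared in these tests can be sketched in C against ordinary variables standing in for the registers. This only shows the shape of the code, not the timing: a C compiler is free to inline reg_set, whereas the measurements above suggest the managed property accessors were costing a real call. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GPIO3 (1u << 3)

/* Accessor in the style of a property setter: each store goes through a call. */
void reg_set(volatile uint32_t *reg, uint32_t value)
{
    assert(reg != NULL);
    *reg = value;
}

/* Toggle through the accessor; returns the number of stores issued. */
unsigned toggle_via_call(volatile uint32_t *set_reg, volatile uint32_t *clr_reg,
                         unsigned cycles)
{
    unsigned stores = 0;
    for (unsigned i = 0; i < cycles; ++i) {
        reg_set(set_reg, GPIO3);
        reg_set(clr_reg, GPIO3);
        stores += 2;
    }
    return stores;
}

/* Toggle through local pointers; the same stores with no call in the loop. */
unsigned toggle_direct(volatile uint32_t *set_reg, volatile uint32_t *clr_reg,
                       unsigned cycles)
{
    unsigned stores = 0;
    for (unsigned i = 0; i < cycles; ++i) {
        *set_reg = GPIO3;
        *clr_reg = GPIO3;
        stores += 2;
    }
    return stores;
}
```

Both functions issue exactly the same stores to the same addresses; the only difference is whether a call sits between the loop and the store, which is the difference the scope traces were measuring.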

Conclusion

We now see that the performance difference between managed code and unmanaged code can be negligible if the developer writing the code is cognizant of the behavioral characteristics of the managed environment.  What does that buy us as a community of developers?  Potentially, the implications are immense.

As it stands right now, we easily have the required performance to write device drivers for many devices that are tolerant of large, but typically rare, latencies: things like I²C or SPI serial buses or other GPIO devices.  We've also seen that managed code can perform on par with unmanaged code, so if we can find a way to eke out deterministic behavior from the CLR, then we easily have the performance required for a whole host of devices.

With a little ingenuity on our part and a little cooperation from those developing future versions of managed compilers and specifications, writing device drivers in managed code could become a commonplace task.  I'm not advocating that we do away with unmanaged code - it certainly has its place, just as assembly still does - but we don't need to fear managed developers playing with hardware any more than the assembly developers of yesteryear needed to worry about C and C++ developers.  Change is what has always driven the industry, and what I do advocate is embracing that change, because it looks like it's going to be fun.