Tuesday, October 30, 2012

Installing 64-bit drivers from 32-bit installer

If you are using a 32-bit Windows installer, it is not straightforward to have it install a 64-bit driver. There are at least two reasons why you might be in this situation:
  1. Your installer software is not the latest (or perhaps doesn't even have a 64-bit version yet), or
  2. Most of your components are 32-bit, with just one or two that you want to differentiate 32-bit vs. 64-bit.
The problem arises because when a 32-bit installer shells out to msiexec.exe (in the case of InstallShield, whether from InstallScript or from a Custom Action), WOW64 file-system redirection causes the 32-bit C:\Windows\SysWOW64\msiexec.exe to be executed instead of the 64-bit C:\Windows\System32\msiexec.exe.
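
For illustration only (this snippet is not part of the installer), a 32-bit C++ program can at least detect that it is running under WOW64, and therefore that this redirection will apply, using the Win32 IsWow64Process call:

#include <windows.h>
#include <iostream>

int main() {
    BOOL isWow64 = FALSE;
    // For a 32-bit process, IsWow64Process reports whether it is running
    // under WOW64, i.e. on a 64-bit edition of Windows.
    if (IsWow64Process(GetCurrentProcess(), &isWow64) && isWow64) {
        std::cout << "32-bit process on 64-bit Windows: System32 paths "
                     "(including msiexec.exe) resolve to SysWOW64\n";
    } else {
        std::cout << "No WOW64 redirection in effect\n";
    }
    return 0;
}

This mirrors the SYSINFO.bIsWow64 check that the InstallScript below performs.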

The basic answer comes from TechNet, and the VB.Net code below is adapted from it with a slight improvement. By compiling the VB.Net code into an executable and shelling out to that as an intermediary, you can escape the 32-bit world. The slight improvement in the code below is that it preserves quotes around quoted arguments, such as pathnames with spaces.

Module Module1
    Sub Main()
        Dim arrArgs As Array
        Dim Args As String = ""
        Dim intCount As Integer = 0

        ' GetCommandLineArgs() returns this executable's path as element 0,
        ' followed by the arguments that were passed to it.
        arrArgs = System.Environment.GetCommandLineArgs()
        For Each Arg In arrArgs
            ' Skip element 0 (the path of this executable itself).
            If intCount <> 0 Then
                ' Restore quotes around any argument containing spaces
                ' (e.g. pathnames) so it survives the trip through cmd.exe.
                If Arg.IndexOf(" ") > -1 Then
                    Args = Args & " """ & Arg & """"
                Else
                    Args = Args & " " & Arg
                End If
            End If
            intCount = intCount + 1
        Next
        ' Compiled as AnyCPU (or x64), this helper runs as a 64-bit process on
        ' 64-bit Windows, so the cmd.exe it launches (and anything that cmd.exe
        ' launches in turn, such as msiexec.exe) is the 64-bit version.
        Shell("cmd.exe /C" & Args, AppWinStyle.NormalFocus, True)
    End Sub
End Module

Then the InstallScript to invoke it is below.  It detects whether the OS is 64-bit, and if so installs the 64-bit drivers via the VB.Net code above (which is compiled to an executable cmd64.exe); otherwise, it installs the 32-bit drivers.

if ( REMOVEALLMODE=0 ) then
    // Only install the drivers if they are not already present.
    if (Is(FILE_EXISTS, WINSYSDIR^"CsSsm.dll") = FALSE) then
        if (SYSINFO.bIsWow64) then
            // 64-bit OS: run the 64-bit msiexec.exe via the cmd64.exe intermediary.
            svProgramCmd64 = TARGETDIR^"GaGe64\\cmd64.exe";
            svCmd64MsiExecPath = WINSYSDIR64^"msiexec.exe";
            LongPathToQuote(svCmd64MsiExecPath, TRUE);
            svCmd64MsiPath = TARGETDIR^"GaGe64\\CompuScope.msi";
            LongPathToQuote(svCmd64MsiPath, TRUE);
            svCmd64Param = svCmd64MsiExecPath + " /i " +
                           svCmd64MsiPath + " /passive /norestart";
            LaunchAppAndWait(svProgramCmd64, svCmd64Param, WAIT);
        else
            // 32-bit OS: call the 32-bit msiexec.exe directly.
            svProgramMsiExec = WINSYSDIR^"msiexec.exe";
            svGaGe32MsiPath = TARGETDIR^"GaGe32\\CompuScope.msi";
            LongPathToQuote(svGaGe32MsiPath, TRUE);
            svGaGe32Param = "/i " + svGaGe32MsiPath +
                            " /passive /norestart";
            LaunchAppAndWait(svProgramMsiExec, svGaGe32Param, WAIT);
        endif;
    endif;
endif;

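To make the 64-bit branch concrete (using a hypothetical TARGETDIR of C:\Program Files (x86)\GaGe), the LaunchAppAndWait call ends up launching roughly:

C:\Program Files (x86)\GaGe\GaGe64\cmd64.exe "C:\Windows\System32\msiexec.exe" /i "C:\Program Files (x86)\GaGe\GaGe64\CompuScope.msi" /passive /norestart

cmd64.exe, running as a 64-bit process, passes those arguments on to the 64-bit cmd.exe, so it is the 64-bit msiexec.exe that actually installs the driver.
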
The code above is for installing drivers for a GaGe CompuScope analog-to-digital converter board.  I am a user of GaGe boards, not an employee or representative of GaGe.

Sunday, October 21, 2012

XML/XSL/HTML5 for reports instead of PDF

Since video of my actual presentation to the Denver HTML5 Meetup on October 22, 2012 won't be posted for a few more months, I quickly recorded the 10-minute YouTube video below.


Below are the slides: "XML/XSL/HTML5 for reports instead of PDF."

The official w3.org documentation on embedding XSL in XML actually dates from circa 2000.  Firefox still allows it.  Here is the overall structure of the .XML file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="#stylesheet"?>
<!DOCTYPE doc [
<!ATTLIST xsl:stylesheet
id ID #REQUIRED>
]>
<doc>
 <!--Start XSL-->
 <xsl:stylesheet id="stylesheet"
  version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >

  <xsl:template match="xsl:stylesheet" />
  <xsl:template match="/doc">
   <html>
    <head>
     <style type="text/css">

       <!-- Your CSS goes here -->
     </style>
     <script type="text/javascript">
       <!-- Your Javascript goes here -->
     </script>
    </head>
    <body>
     <table>
      <thead>

       <!-- Whatever is in here will get printed at the top of every page -->
      </thead>
      <tbody>
       <!-- Main HTML, including "canvas" tags etc. -->
      </tbody>
     </table>
    </body>
   </html>
  </xsl:template>
 </xsl:stylesheet>

 <!--Start XML-->
 <seriesdata>
  <datapoint x="2.1" y="3.0" />
  <datapoint x="2.9" y="5.2" />
 </seriesdata>
</doc>


Below is the bit of magic to pull the XML data into JavaScript memory space.  Assuming there is a JavaScript constructor called ChartSeries that takes four parameters (name, array of x values, array of y values, color), the code below uses XSL to emit the x and y values inline, comma-separated, into the JavaScript.

var mychartseries = new ChartSeries("Channel 1",[0
 <xsl:for-each select="seriesdata/datapoint">
  <xsl:value-of select="concat(',',@x)"/>
 </xsl:for-each>
 ],[0
 <xsl:for-each select="seriesdata/datapoint">
  <xsl:value-of select="concat(',',@y)"/>
 </xsl:for-each>
 ], "Yellow");

Friday, October 19, 2012

Memory writes expensive but parallelizable on Radeon GPGPU

Using a Radeon 7970 as a GPGPU, I was running into some seeming limitations on how quickly I could download data off of the board into main CPU RAM.  There seemed to be about a 20 MB/sec limit for the board, which is of course nowhere near the 16 GB/sec limit of PCIe 3.0 x16.  It turns out the limitation applies to a single work unit (out of the 2048 work units/processors on the board).  It also turns out that because writes to global memory (i.e. memory sharable with the CPU host) are so expensive, it can often be more important to parallelize the memory writes than to parallelize the computations!  To me, this was counterintuitive because I envisioned writes to shared memory as being serial and fast, but they instead seem to be on some kind of time multiplex across the multiple work units.

Consider the following code that computes the first 1024 Fibonacci numbers, and does so 1024 times over:

#include <iostream>
#include <Windows.h>
#include <CL/cl.h>

int main(int argc, char ** argv) {
 // Boilerplate: grab the first GPU device, then create a context and queue on it.
 cl_platform_id platform;
 clGetPlatformIDs(1, &platform, NULL);
 cl_device_id device;
 clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
 cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
 cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);
 // The kernel computes 1024 Fibonacci numbers into __local memory, then copies
 // each 1024-element batch out to __global memory; it repeats this 1024 times.
 const char *source =
 "__kernel void fibonacci(__global double* dst) {\n"
 "    __local double buff[1026];\n"
 "    buff[0] = 0, buff[1] = 1;\n"
 "    for (int i = 0; i < 1024; i++) {\n"
 "        for (int j = 0; j < 1024; j++)\n"
 "            buff[j+2] = buff[j+1] + buff[j];\n"
 "        async_work_group_copy(&dst[i*1024], &buff[2], 1024, 0);\n"
 "    }\n"
 "}\n";
 const size_t global_work_size = 1;  // a single work unit does all the work
 cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
 clBuildProgram( program, 1, &device, NULL, NULL, NULL);
 cl_kernel kernel = clCreateKernel( program, "fibonacci", NULL);
 // 1024 batches x 1024 doubles x 8 bytes = 8 MB of output.
 cl_mem buf = clCreateBuffer(context, CL_MEM_WRITE_ONLY, 1024 * 1024 * 8, NULL, NULL);
 clSetKernelArg(kernel, 0, sizeof(buf), (void*)&buf);
 LARGE_INTEGER pcFreq = {}, pcStart = {}, pcEnd = {};
 QueryPerformanceFrequency(&pcFreq);
 QueryPerformanceCounter(&pcStart);
 clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL);
 clFinish(queue);
 QueryPerformanceCounter(&pcEnd);
 // 8 MB written in (pcEnd - pcStart) / pcFreq seconds.
 std::cout << 8.0 * pcFreq.QuadPart / (pcEnd.QuadPart-pcStart.QuadPart) << " MB/sec";
}

Running on an i5-2500, the benchmarks are:
As-is: 21 MB/sec
Memory transfer commented out: 2814 MB/sec
Inner for loop commented out: 38 MB/sec

Clearly the memory transfer is taking the bulk of the time, and the computation of Fibonacci numbers hardly any time at all.  The way to speed it up is to speed up the memory write, but what could possibly be faster than async_work_group_copy()?  It turns out there is a bit of intelligent cache maintenance going on behind the scenes.  If we can write to buff[] from multiple work units, then async_work_group_copy() can pull the data from the memory associated with multiple work units, and it goes much faster.

But how can Fibonacci be parallelized, when it is seemingly a serial, recursive calculation?  We can do so with lookahead.  Based on the basic calculation x2 = x1 + x0, we have:
x3 = x2 + x1
   = x1 + x0 + x1
   = 2 * x1 + x0
x4 = x3 + x2
   = 2 * x1 + x0 + x1 + x0
   = 3 * x1 + 2 * x0
x5 = x4 + x3
   = 3 * x1 + 2 * x0 + 2 * x1 + x0
   = 5 * x1 + 3 * x0
x6 = x5 + x4
   = 5 * x1 + 3 * x0 + 3 * x1 + 2 * x0
   = 8 * x1 + 5 * x0
x7 = x6 + x5
   = 8 * x1 + 5 * x0 + 5 * x1 + 3 * x0
   = 13 * x1 + 8 * x0
x8 = x7 + x6
   = 13 * x1 + 8 * x0 + 8 * x1 + 5 * x0
   = 21 * x1 + 13 * x0
x9 = x8 + x7
   = 21 * x1 + 13 * x0 + 13 * x1 + 8 * x0
   = 34 * x1 + 21 * x0
The coefficients are just Fibonacci numbers themselves, so each of eight work units can compute one of x2 through x9 directly from x1 and x0 (buff[j+1] and buff[j] in the code below). And our new parallel code is:

 const char *source =
 "__kernel void fibonacci(__global double* dst) {\n"
 "    __local double buff[1026];\n"
 "    // Look-ahead coefficients: x(n+k+2) = coef[k][0]*x(n+1) + coef[k][1]*x(n)\n"
 "    __private double coef[8][2] = {{1,1}, {2,1}, {3,2}, {5,3},\n"
 "                                   {8,5}, {13,8}, {21,13}, {34,21}};\n"
 "    buff[0] = 0, buff[1] = 1;\n"
 "    for (int i = 0; i < 1024; i++) {\n"
 "        // Each of the 8 work units fills in one of the next 8 Fibonacci numbers.\n"
 "        for (int j = 0; j < 1024; j += 8)\n"
 "            buff[j+2+get_global_id(0)] =\n"
 "                coef[get_global_id(0)][0] * buff[j+1]\n"
 "                + coef[get_global_id(0)][1] * buff[j];\n"
 "        async_work_group_copy(&dst[i*1024], &buff[2], 1024, 0);\n"
 "    }\n"
 "}\n";
 const size_t global_work_size = 8;  // eight work units instead of one


With 8 work units, this runs at 122 MB/sec.  That's a 6x speedup for increasing the number of work units by 8x!  We could no doubt speed it up even more by increasing the look-ahead to increase the number of work units.

Recall that when we commented out the computation completely it was only 38 MB/sec, so the speedup is from parallelizing the memory writes, not from parallelizing the computation.

Thanks once again to the folks at stackoverflow.com for helping me work through this.

Tuesday, October 2, 2012

Supercomputing for $500

Desktop supercomputing is now cheap, mainstream, and mature.  Using GPGPU (General-Purpose computing on a Graphics Processing Unit), you can write C programs that execute 25x as fast as on a high-end desktop computer alone, for just $500 more.

The OpenCL standard, started in 2008, is now mature.  It provides a way for C/C++ programs on Windows and Linux to compile and load special OpenCL C programs onto GPGPUs, which are just off-the-shelf high-end graphics cards that videogame enthusiasts usually buy.  When you buy one of these cards for your supercomputing project, expect lots of snickers from your purchasing or shipping/receiving department when it arrives with computer videogame monsters on the box.

As an example, the approximately $500 Radeon 7970 has 2048 processing cores on it, each capable of double-precision floating point, running at about 1 GHz and executing on average one double-precision floating-point operation per clock cycle.  Double precision is actually new to this generation of Radeon, and the OpenCL PDF standard document hasn't even been updated yet to include the data type, even though the API SDK header files have been.
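
As an aside (this sketch is mine, not from the SDK documentation), OpenCL C kernels typically have to opt in to double precision through an extension pragma before declaring double variables:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable  // some AMD SDKs use cl_amd_fp64

// Trivial kernel just to show the double data type in use.
__kernel void scale(__global double* data, const double factor)
{
    size_t i = get_global_id(0);
    data[i] *= factor;
}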

Using the freeware GPU Caps software, the Radeon 7970 by itself (without assistance from my desktop computer's 3.3 GHz Intel i5-2500) clocks in at 25x the computational power of the four-core (single-processor) Intel i5 by itself.

To get a dual-processor Intel motherboard and a second Intel processor is a $1000 increment, and that's only a 2x speedup, so a 25x speedup for a $500 increment isn't just a better deal, it's a new paradigm.  As Douglas Engelbart said, a large enough quantitative change produces a qualitative change.

Up to four such cards can be ganged together in a single computer for a total 100x speedup.  But since each card is physically three cards wide (to accommodate the built-in liquid cooling and fans) even though it has just one PCIe connector, you will need a special rack-mount motherboard to go to that extreme (note I have not tried this!).

By comparison, going 100x in the other direction, to a computer with 1% of the computational power of my desktop i5, would require going back 15 years to a Pentium II.  So a four-Radeon system represents a sudden 15-year leap into the future.