Abstract
This thesis describes the use of an existing SIMD (Single Instruction, Multiple Data) subroutine library to speed up the execution of Java programs on MMX-capable CPUs. We have developed a technology that uses a method substitution system to dynamically link the library on the appropriate CPU. The library is downloaded and the library paths are corrected at runtime, so that the user experiences a seamless speed-up. Our tests show an average speed-up of 3 to 6 times.
1. Introduction
The goal of this research is to improve the performance of some Java programs by using SIMD architectures (i.e., those that enable fast processing of vectors via fine-grained parallelism) via native methods. We wrapped a SIMD library [33] using Java. The library utilizes the vector registers present in modern Intel processors and enables x86-based computers to accelerate the execution of Java programs. At present, this code works on Linux-based x86 processors with MMX capability.
1.1 Problem Statement
We are given a SIMD-capable (MMX) machine, and our goal is to use this instruction set to accelerate Java programs. A series of experiments was conducted to demonstrate the speed-ups available to code that makes use of the SIMD architecture present in the vector-processing unit. We obtain typical speed-ups of 3 to 6 times.
1.2 Approach
The vector-processing unit has a set of registers that are often unused by the Java runtime. Several subroutine libraries are available that are already tuned to take advantage of these instructions. Performance speed-ups are achieved by forcing the virtual machine to use these registers for computation-intensive methods via native calls.
Our approach is to wrap the given MMX-accelerated native libraries in Java and to benchmark their execution on a variety of algorithms, including matrix multiplication and a Dhrystone benchmark.
1.3 Motivation
Modern CPUs are equipped with vector-processing units, which have an additional set of registers that are rarely used. In the case of Java, slow performance is an important drawback, and the Java virtual machine does not know how to use these vector-processing units. In this thesis we introduce explicit calls to the vector-processing units for computation-intensive methods via native libraries. This compels the Java virtual machine to use the available registers in the vector-processing unit. Considerable speed-ups in execution can be achieved in this manner, leading to an improvement in the performance of programs written in Java.
2. Related Work
Considerable research has been done to enhance Java performance. Java has problems with the way it organizes and manages multidimensional arrays, which poses performance issues; customized array packages have been developed to address them. These arrays are used for high-performance computing and have their own sets of invocations to perform array computations in an optimized fashion [11, 13, 15, 17]. Exception management also adds overhead to Java performance, and techniques to eliminate unwanted exceptions have been proposed. In cases where exceptions are used in iterations, worst of all in nested iterations, algorithms exist to modify the bytecodes so that these iterations are wrapped into a single exception, thereby reducing the overhead and contributing to performance speed-ups [4].
Since FORTRAN, C and C++ tend to perform better than Java, techniques have been proposed to convert Java source or bytecodes to a native language, be it C, C++, assembly or FORTRAN. Sometimes profilers [18] are used to detect methods that are computation-intensive and consume most of the execution time; such methods are converted to native code, and wrappers for the native code are placed in the bytecodes so that invocations use the native code instead of the Java code. Hybrid compilation models have also been proposed. Because of concurrent execution, more than one method may be waiting to be translated to native code, and the decision as to which method goes first affects the performance of the VM; such decisions are made on the basis of historical runs. Alternatively, the selection can be made between JIT compilation and interpretation [12, 20]. Other techniques involve manipulating the JIT and the workings of the virtual machine [2].
Considerable research on how the underlying hardware architecture can be used to enhance Java compilation techniques is described in [1, 17]. Just-in-time compilation heuristics have been proposed to manipulate bytecodes to take advantage of the hardware architecture; such heuristics target specific architectures like IA-64 [9, 10]. Tools and algorithms also exist that can automatically detect implicit loop parallelism and exploit it at the bytecode level [3, 5, 6, 7, 16].
3. Analysis of some Existing Java Virtual Machines vs. C Compiler
There are several virtual machines available. Some are based on the Sun Microsystems virtual machine [21], while others, such as GCJ for Linux, are independent of the Sun Microsystems implementation. Each of these virtual machines has its own approach to compiling and running programs. The graph below depicts some of these virtual machines and their performance. The benchmark used is the Dhrystone benchmark, executed for 500,000 cycles in each case.
Fig 3.1 Analysis of Dhrystone benchmark
Note: Opt refers to compilation with optimizations on.
The graph clearly indicates that the performance of C is better than that of any Java virtual machine. The results also indicate that performance improvements are achieved by introducing native methods. In this thesis we attempt to achieve better performance by calling native code from Java via JNI. There are several other high-performance libraries for scientific and engineering purposes, such as JAMA [22], a high-performance Java matrix library implemented purely in Java. Another pure-Java high-performance library is that of Lyon and Rao [23], written for their book. The Java Advanced Imaging library uses a combination of native code and Java to obtain speed-ups. InAspect, a signal and image processing package for Java, also exists and essentially targets embedded systems [24].
4. MMX Overview
MMX is an extension added to Intel processors, and MMX registers are present in virtually any modern computer you purchase. The MMX registers in these CPUs can boost performance by anywhere between 5% and 20%, and dramatic performance gains are obtained in applications that are compiled specifically to use the MMX instruction set. There are 57 new instructions that manipulate these registers; applications can use them to take advantage of the underlying architecture.
Pentium processors with MMX also use a technology known as Single Instruction Multiple Data, or SIMD. In a nutshell, this technique allows the chip to reduce the number of processing loops that are usually associated with multimedia functions. By utilizing SIMD, the same operation can be performed on multiple sets of data at the same time [36].
The MMX technology added eight registers to the architecture, numbered MM0 – MM7 and referred to collectively as MMn. Each of these registers is 64 bits wide. In reality, these new "registers" were just aliases for the existing x87 FPU stack registers.
The MMX technology also introduces four new data types: three packed data types (bytes, words and double words, respectively 8, 16 and 32 bits wide for each data element) and a new 64-bit entity.
The four MMX technology data types are:
Packed byte -- 8 bytes packed into one 64-bit quantity
Packed word -- 4 16-bit words packed into one 64-bit quantity
Packed double word -- 2 32-bit double words packed into one 64-bit quantity
Quad word -- one 64-bit quantity
Consider an example where pixels are represented by 8-bit integers, or bytes. With MMX technology, eight of these pixels are packed together in a 64-bit quantity and moved into an MMX register; when an MMX instruction executes, it takes all eight pixel values at once from the MMX register, performs the arithmetic or logical operation on all eight elements in parallel, and writes the result into an MMX register. The degree of parallelism that can be achieved with MMX technology depends on the size of the data, ranging from 8 when using 8-bit data down to 1 (i.e., no parallelism) when using 64-bit data [8]. The MMX instruction set covers several functional areas: arithmetic, comparison, conversion, logical, shift, data transfer and state-management operations.
For typical MMX syntax, refer to Appendix A.
The Java virtual machine does not know how to use these additional registers. In this thesis we attempt to force the use of the MMX instruction set via native calls to C libraries in which the MMX/SSE2 instructions are inlined.
5. Method Substitution and SIMD Library
The working system can be categorized into a method substitution module and a SIMD library module.
5.1 Method Substitution System
The method substitution system takes bytecodes as input. The program scans the bytecodes and recognizes the methods that can be accelerated. It also gives the user the option of selectively accelerating methods.
Fig 5.1.1 Method substitution system
The heart of the method substitution system is a text file. The text file consists of the method prototypes to be detected and their equivalent accelerated functions. In this way the user has full control over the methods replaced by the method substitution system. The method substitution system uses Javassist [32] to read class files, build a list of existing methods and replace the method calls at the bytecode level.
Working of the Method Substitution System
Assume this line exists in the prototype text file:
getSQRT ( float )float -- nativeImplementation.MMXMath.sqrt($$)
This means the name of the method to be searched for is getSQRT, which takes one parameter of type float and also returns a value of type float. Once this method is found, the call will be replaced at the bytecode level with nativeImplementation.MMXMath.sqrt, which also takes a float and returns a float.
Original source code:
System.out.println("Square root Value is " + getSQRT(x));
The modified class, when decompiled, will have:
System.out.println("Square root Value is " + nativeImplementation.MMXMath.sqrt(x));
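The replacement itself can be expressed compactly with Javassist's expression editor. The sketch below shows one way the substitution above could be performed; the class name TestProgram and the driver code are assumptions, while $$ and $_ are Javassist's placeholders for the original argument list and the result.

import javassist.CannotCompileException;
import javassist.ClassPool;
import javassist.CtClass;
import javassist.expr.ExprEditor;
import javassist.expr.MethodCall;

public class SubstituteSqrt {
    public static void main(String[] args) throws Exception {
        ClassPool pool = ClassPool.getDefault();
        CtClass cc = pool.get("TestProgram");          // hypothetical class being modified
        cc.instrument(new ExprEditor() {
            public void edit(MethodCall m) throws CannotCompileException {
                if (m.getMethodName().equals("getSQRT")) {
                    // Replace the call: $$ expands to the original arguments, $_ is the result.
                    m.replace("$_ = nativeImplementation.MMXMath.sqrt($$);");
                }
            }
        });
        cc.writeFile();                                // writes the modified TestProgram.class
    }
}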
5.2 SIMD Library
The SIMD library in Java was built from an existing library written in C and assembly [33]. The SIMD library makes explicit calls to the registers in Intel processors using the SSE and MMX instruction sets. This existing library is accessed from Java via native methods.
Native methods in Java are methods written in some other language, usually C or C++. Sun has included the Java Native Interface (JNI) since JDK 1.1. JNI is a powerful framework that allows programs to benefit from both worlds, Java and C/C++; it also gives native applications much of the functionality of Java, allowing programmers to call Java methods and access Java variables. The Java Native Interface has several drawbacks, such as runtime communication overhead and tedious programming.
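As a minimal illustration of the Java side of such a binding, the sketch below declares a native sqrt and loads the shared library; the library name and the exact shape of the NativeMath wrapper are illustrative, not the actual artifacts generated for the SIMD library.

public class NativeMath {
    static {
        // Loads libnativemath.so on Linux; the library name here is illustrative.
        System.loadLibrary("nativemath");
    }

    // Implemented in C. The generated JNI header declares the symbol
    // JNIEXPORT jdouble JNICALL Java_NativeMath_sqrt(JNIEnv *, jclass, jdouble).
    public static native double sqrt(double value);
}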
There are several tools available that promise to automate the generation of native interface wrappers, such as JNI++ [25], JACE [26], Noodle Glue [27], Start Java [28], JNI Wrapper [34] and JACAW [29]. We used the JACAW package to generate the wrappers for the existing C functions. Although JACAW generated the native wrappers, manual effort was needed to resolve bugs.
Fig 5.2.1 SIMD Library
5.3 Implementing the SIMD Library in Java
The Java wrappers for the SIMD library were written manually, with some help from JACAW [29], an existing native-method wrapping tool. The Factory design pattern is used as the framework for the library. When using native methods, the native library must be on the Java library path; on Linux the native library is a shared object (.so) file. When a call is made to a native method, the JVM looks for the library and throws an error if it cannot be found. To ensure that the native methods work properly, these libraries must be included on the path, either manually or dynamically. The framework ensures that the libraries are on the path before native calls are executed, and it also detects the existence of MMX support before attempting to use the native libraries.
Fig 5.3.1 Factory Interface for SIMD Library
The Factory interface holds the method definitions for all the methods supported by the Java SIMD library. The native implementation and the Java implementation are classes that implement the Factory interface; the only difference is that the native implementation resolves to native calls, while the Java implementation resolves to pure Java calls, which act as an alternative when the native methods fail to load.
Factory Interface for Math Methods: code extract

public interface MathInterface {
    public double sqrt(double value);
}
Native Implementation for Math Methods: code extract

public class MMXMathImplementation implements MathInterface {
    public double sqrt(double value) {
        // Delegate to the native (MMX/SSE) implementation
        return NativeMath.sqrt(value);
    }
}
Java Implementation for Math Methods: code extract

public class JavaMathImplementation implements MathInterface {
    public double sqrt(double value) {
        // Fall back to the pure Java implementation
        return Math.sqrt(value);
    }
}
Generic Implementation for Math Methods: code extract

public class GenericMath {
    public static double sqrt(double value) {
        // isNativeLibFixed is set by the framework (see Section 5.4)
        if (isNativeLibFixed) {
            return new MMXMathImplementation().sqrt(value);
        } else {
            return new JavaMathImplementation().sqrt(value);
        }
    }
}
Usage
System.out.println(GenericMath.sqrt(doubleValue));
Please refer to Appendix B for a complete method listing.
5.4 Handling Failures to Load Native Library
To ensure the proper working of the SIMD library, the shared libraries (.so files on Linux) must be on the path. The JVM attempts to locate the library when the following lines are encountered:
static {
    // Loads the native shared library (.so on Linux, .dll on Windows)
    System.loadLibrary("libraryName");
}
Failure to load the library may occur because (1) the native library is absent; (2) the native library is present but not included on the Java library path; or (3) the native library present is an older version. These native-library errors are fixed by altering the library path at runtime [35].
Fixing Native Libraries: code extract

public class FixNativeLib {
    private boolean isNativeLibFixed;

    public FixNativeLib() {
        ManageNativeLib manageNativeLib = new ManageNativeLib();
        try {
            // Attempt to load the libraries, fixing the library path if needed
            LoadLibs(manageNativeLib);
            // Check for MMX support before enabling the native path
            if (DetectMMX.detectMMX() == 1) {
                isNativeLibFixed = true;
            } else {
                isNativeLibFixed = false;
            }
        }
        catch (UnsatisfiedLinkError e) {
            System.out.println("Error, Using pure java methods");
            isNativeLibFixed = false;
        }
        catch (Exception e) {
            System.out.println("Error, Using pure java methods");
            isNativeLibFixed = false;
        }
    }
}
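The runtime path correction cited above ([35]) can be approximated by the sketch below. It is only one common technique, and the sys_paths field it resets is an undocumented implementation detail of Sun-era class loaders, not a public API.

import java.io.File;
import java.lang.reflect.Field;

public class LibraryPathFixer {
    // Appends a directory to java.library.path and clears the class loader's
    // cached search path so the next System.loadLibrary() call re-reads it.
    public static void addLibraryDir(String dir) throws Exception {
        String current = System.getProperty("java.library.path");
        System.setProperty("java.library.path", current + File.pathSeparator + dir);
        Field sysPaths = ClassLoader.class.getDeclaredField("sys_paths");
        sysPaths.setAccessible(true);
        sysPaths.set(null, null);   // null forces re-initialization from the property
    }
}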
6. Benchmarks
6.1 SIMD Library
All tests were performed on a 2.0 GHz Linux PC.
Dhrystone benchmark analysis: the Dhrystone benchmark was run for 500,000 cycles.
Fig 6.1.1 Dhrystone benchmark analysis
Computed using             | Dhrystones | Time (ms)
JAVA                       | 2702702    | 185
JNI                        | 2841361    | 186
GCC                        | 3067936    | 122
JNI optimized (Vectorized) | 9435564    | 178
GCC optimized              | 12501875   | 98
Matrix multiplication benchmark
Matrix multiplication was performed as a benchmark on M x N matrices. A comparison was made against JAMA (Java Matrix), a library written purely in Java, and the standard JDK. The matrix multiplication was performed for 100 cycles in each case.
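The timing loop had the usual form, sketched below for the 512 x 512 case; the wrapper class name SIMDMatrix is an assumption, standing in for whichever class exposes the BlockMul routine listed in Appendix B.

public class MatrixBenchmark {
    public static void main(String[] args) {
        float[][] a = new float[512][512];
        float[][] b = new float[512][512];
        float[][] result = null;

        long start = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            // BlockMul is the Appendix B routine; the enclosing class name is hypothetical.
            result = SIMDMatrix.BlockMul(a, b);
        }
        double seconds = (System.currentTimeMillis() - start) / 1000.0;
        System.out.println("512 x 512, 100 cycles: " + seconds + " s");
    }
}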
Fig 6.1.2 Analysis of Matrix multiplication benchmark
LIB Used | Matrix Size M x N | MFLOPS   | Time in Secs
JAMA     | 64 x 64           | 225.2    | 1.80
JAMA     | 128 x 128         | 202.428  | 2.072
JAMA     | 256 x 256         | 22.477   | 41.79
JAMA     | 512 x 512         | 3.501    | 306.70
JDK      | 64 x 64           | 139.3    | 2.77
JDK      | 128 x 128         | 135.0822 | 3.10
JDK      | 256 x 256         | 22.155   | 42.405
JDK      | 512 x 512         | 2.253    | 476.521
SIMD     | 64 x 64           | 511.5    | 0.31
SIMD     | 128 x 128         | 424.152  | 0.671
SIMD     | 256 x 256         | 242.207  | 3.878
SIMD     | 512 x 512         | 51.067   | 21.026
JNI Overhead for SIMD Library
Matrix Size | JNI Overhead (s)
64 x 64     | 0.04
128 x 128   | 0.09
256 x 256   | 0.33
512 x 512   | 1.07
6.2 Method Substitution System
Let us consider the following program.
Fig 6.2.1 Sample Java program
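The sample program appears only as a figure; a minimal program of the kind used would look roughly like the following, where the class name is an assumption and getSQRT matches the prototype-file entry from Section 5.1.

public class TestProgram {
    // Plain Java square root; this is the call the substitution system replaces.
    static float getSQRT(float x) {
        return (float) Math.sqrt(x);
    }

    public static void main(String[] args) {
        float x = 16.0f;
        System.out.println("Square root Value is " + getSQRT(x));
    }
}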
We compile the program, and the following output is obtained
Fig 6.2.2 Compile the program
Now we invoke the method substitution system, and choose the .class file generated above.
Fig 6.2.3 First screen of the Method
substitution system
Click next; the method substitution system will scan the methodprototype.txt file and list all the methods that have an entry in methodprototype.txt and appear in our test program.
Fig 6.2.4 Second screen, listing the methods to be accelerated
In this case the system detects calls to getSQRT(), defined in the test program, and allows the user to substitute calls to this method with the SIMD function
nativeImplementation.MMXMath.getInstance().rsqrt(float)
The method substitution information is obtained from the methodprototype.txt file. In this case we use rsqrt (which yields 1/sqrt) from the MMXMath package for test purposes.
Choose the methods to be modified from the list and click done.
Now we attempt to execute the modified class file, without re-compiling.
Fig 6.2.5 Executing the modified class file without re-compiling
The output obtained is different because
nativeImplementation.MMXMath.getInstance().rsqrt(float)
is called instead of the original getSQRT() method.
7. Conclusion
The SIMD package is designed to enhance the performance of some Java programs on CPU architectures with MMX support. The library interfaces with an existing native x86 SIMD library to achieve higher performance than pure Java. The library is also coupled with a method substitution system that largely automates the process of substituting accelerated methods at the bytecode level.
8. Future Research
In this thesis we have presented a method for using the underlying CPU architecture to boost the performance of Java. The SIMD library we built supports a small set of core functions that utilize the MMX instruction set. This library can easily be extended to support other functions in the fields of numerical computing, image processing and signal processing. The SIMD library supports only Intel processors; it could be extended to support other processors, such as the Motorola G4 with AltiVec technology. A new function could be added to the framework to select the library based on the underlying processor.
The SIMD library communicates with native code via the Java Native Interface. The communication overhead of JNI is high, and it can sometimes cancel out the performance gains achieved by the native code. Efforts have been made to minimize the communication overhead and to use the code optimally, but errors and inconsistencies may exist. The binary distribution of the SIMD library targets the Linux operating system only; it could be extended to support other operating systems such as Windows and Mac OS.
The method substitution system is rigid and depends solely on the text file. It would be better to devise a method that recognizes bottlenecks in code using a profiler and substitutes accelerated library calls.
Appendix A
MMX Instruction Syntax
A typical MMX instruction has this syntax [8]:
Prefix: P for Packed
Instruction operation: for example ADD, CMP, or XOR
Suffix: US for Unsigned Saturation, S for Signed Saturation; B, W, D, Q for the data type: packed byte, packed word, packed double word, or quad word
As an example, PADDSB is an MMX instruction (P) that sums (ADD) the 8 bytes (B) of the source and destination operands and saturates the result (S).
Appendix B
//Returns the square root of a double
double sqrt(double value)
//Returns the square root of a float
float sqrtf(float value)
//Returns the reciprocal of the square root of a double
double rsqrt(double value)
//Returns the reciprocal of the square root of a float
float rsqrtf(float value)
//Difference of matrices of size 3 x 3
float[][] mat3Diff(float[][] m1, float[][] m2)
//Multiplication of matrices of size 3 x 3
float[][] mat3Mul(float[][] m1, float[][] m2)
//Scales a matrix of size 3 x 3 by a float
float[][] mat3ScaleOf(float[][] arg0, float m)
//Sum of matrices of size 3 x 3
float[][] mat3Sum(float[][] m1, float[][] m2)
//Transpose of a matrix of size 3 x 3
float[][] mat3TransposeOf(float[] arg0)
//Difference of matrices of size 4 x 4
float[][] mat4Diff(float[][] m1, float[][] m2)
//Multiplication of matrices of size 4 x 4
float[][] mat4Mul(float[][] m1, float[][] m2)
//Scales a matrix of size 4 x 4 by a float
float[][] mat4ScaleOf(float[][] arg0, float m)
//Sum of matrices of size 4 x 4
float[][] mat4Sum(float[][] m1, float[][] m2)
//Transpose of a matrix of size 4 x 4
float[][] mat4TransposeOf(float[] arg0)
//Difference of matrices (3 x 3 or 4 x 4), one-dimensional array
float[] matrixDiff(float[] m1, float[] m2, int a, int b, int c, boolean t)
//Multiplication of matrices (3 x 3 or 4 x 4), one-dimensional array
float[] matrixMul(float[] m1, float[] m2, int a, int b, int c, boolean t)
//Scale of a matrix (3 x 3 or 4 x 4), one-dimensional array
float[] matrixScaleOf(float[] arg0, int retSize, int arg0Size, float m, boolean needCopy)
//Sum of matrices (3 x 3 or 4 x 4), one-dimensional array
float[] matrixSum(float[] m1, float[] m2, int a, int b, int c, boolean t)
//Transpose of a matrix (3 x 3 or 4 x 4), one-dimensional array
float[] matrixTransposeOf(float[] arg0, int retSize, int arg0Size, boolean needCopy)
//Conjugate of a quaternion
float[] conjugate(float[] f1)
//Length of a quaternion
float length(float[] f1)
//Squared length of a quaternion
float lengthSq(float[] f1)
//Multiplication of quaternions
float[] multiply(float[] f1, float[] f2)
//Normalized quaternion
float[] normalize(float[] f1)
//Rotation of a quaternion
float[] rotate(float[] f1, float rads, float x, float y, float z)
//Quaternion to matrix, one-dimensional array
float[] toMatrix(float[] f1)
//Cross product of vectors
float[] vectorCrossOf(float[] f1, float[] f2)
//Difference of vectors
float[] vectorDiff(float[] f1, float[] f2)
//Dot product of vectors
float vectorDot(float[] f1, float[] f2)
//Normalize a vector
float[] vectorNormalizeOf(float[] f1)
//Scale a vector
float[] vectorScale(float[] f1, float f2)
//Sum of vectors
float[] vectorSum(float[] f1, float[] f2)
//Efficient addition of large block matrices
float[][] BlockAdds(float[][] f1, float[][] f2)
//Efficient subtraction of large block matrices
float[][] BlockDiff(float[][] f1, float[][] f2)
//Efficient multiplication of large block matrices
float[][] BlockMul(float[][] f1, float[][] f2)
//Efficient scaling of a large block matrix
float[][] BlockScaleOf(float[][] f1, float f2)
//Efficient transpose of a block matrix
float[][] BlockTarnspose(float[][] f1)
9. References
[1] Douglas Lyon: Java Optimization for Superscalar and Vector Architectures. Journal of Object Technology, vol. 4, no. 2, 2005, pp. 27-39.
[2] Qiaoyin Li: Java Virtual Machine – Present and Near Future. Proceedings of the Technology of Object-Oriented Languages and Systems, 3-7 Aug 1998, page 480.
[3] Austin Kim and Morris Chang: Java Virtual Machine Performance Analysis with Java Instruction Level Parallelism and Advanced Folding Scheme. Performance, Computing, and Communications Conference.
[4] Kazuaki Ishizaki, Tatushi Inagaki, Hideaki Komatsu, Toshio Nakatani: Eliminating Exception Constraints of Java Programs for IA-64. Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, page 259.
[5] Pascal A. Felber: Semi-Automatic Parallelization of Java Applications.
[6] Pedro V. Artigas, Manish Gupta, Samuel P. Midkiff, Jose E. Moreira: Automatic. Proceedings of the 14th International Conference on Supercomputing.
[7] Aart J.C. Bik and Dennis B. Gannon: JAVAB – A Prototype Bytecode Parallelization Tool. Computer Science Dept.
[8] MMX Primer: http://www.tommesani.com/MMXPrimer.html
[9] Svante Arvedahl: Java Just-In-Time Compilation for the IA-64 Architecture.
[10] Samuel K. Sanseri: Toward an Optimizing JIT Compiler for IA-64. Master's thesis.
[11] Pedro V. Artigas, Manish Gupta, Samuel P. Midkiff, José E. Moreira: High Performance Numerical Computing in Java: Language and Compiler Issues. Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing, pp. 1-17, August 4-6, 1999.
[12] Iffat H. Kazi, Howard H. Chen, Berdenia Stanley, David J. Lilja: Techniques for Obtaining High Performance in Java Programs. ACM Computing Surveys 32(3), pages 213-240, 2000.
[13] Jose: A Standard Java Array Package for Technical Computing. In Proceedings of the Ninth.
[14] Tim Lindholm and Frank Yellin: The Java Virtual Machine Specification. Addison-Wesley Longman Publishing Co., Inc., 1999.
[15] Ronald F. Boisvert, Jack J. Dongarra, Roldan Pozo, Karin A. Remington, G.W. Stewart: Developing Numerical Libraries in Java. ACM Concurrency: Practice and Experience.
[16] Ian Piumarta and Fabio Riccardi: Optimizing Direct Threaded Code by Selective Inlining. INRIA Rocquencourt, B.P. 105, 78153 Le.
[17] Matthias Schwab, Joel Schroeder: Algebraic Java Classes for Numerical Optimization. Java High Performance Network Computing, 1998, volume 10, issues 11-13, pages 1155-1164.
[18] Jack K. Shirazi: Java Performance Tuning. O'Reilly, September 2000.
[19] Zheng Weimin, Zheng Fengzhou, Yang Bo, Wang Yanling: A Java Virtual Machine Design Based on Hybrid Concurrent Compilation Model. Proc. Conf. Technology of Object-Oriented Languages and Systems (TOOLS 36), pp. 18-23, 2000.
[20] Vector Signal Image Processing Library: http://www.vsipl.org
[21] Sun Microsystems: http://java.sun.com/
[22] JAMA (Java Matrix) high performance library: http://math.nist.gov/javanumerics/jama/
[23] Lyon and Rao: Java Digital Signal Processing. M&T Books; Book & CD edition (November 30, 1997).
[24] Tory Alfred, Vijay P. Shah, Anthony Skjellum, Nicholas H. Younan: InAspect: Interfacing Java and VSIPL Applications. ACM Java Grande-ISCOPE Conference, Part II, 2002, Volume 17, Issues 7-8, pages 919-940.
[25] JNI++: http://sourceforge.net/projects/jnipp
[26] Jace: http://sourceforge.net/projects/jace/
[27] Noodle Glue: www.noodleheaven.net/JavaOSG/javaosg.html
[28] Start Java: http://sourceforge.net/projects/startjava
[29] JACAW: http://users.cs.cf.ac.uk/Yan.Huang/research/JACAW/JACAW.htm
[30] ASM:
[31] BCEL:
[32] Javassist: http://www.csg.is.titech.ac.jp/~chiba/javassist/
[33] SIMD Library: http://sourceforge.net/projects/simdx86
[34] JNI Wrapper:
[35] Dr. Douglas Lyon:
[36] MMX Article: http://www.netpanel.com/articles/computer/mmx.htm