A NOVEL AUTOMATIC C TO NVIDIA CUDA CODE OPTIMIZATION FRAMEWORK

Abstract: With the continuous demand for high-performance computing, reducing application execution time is a standing research challenge. Execution time depends not only on the hardware architecture but also on the algorithm design. Improving the hardware demands higher investment, and cost must also be optimized; hence the major optimization task focuses on algorithm design. Many algorithm design techniques are available, and these techniques have largely reached their optimization limits. Thus, beyond improvements in algorithm design, parallel execution of programs must also be considered. GPUs are processing units commonly used to speed up application execution in the game development domain, and they can be used to parallelize application execution so that the available cores are utilized to the maximum. The major challenge is to design, or redesign, application code written in traditional serial programming languages as parallel code that can exploit the GPU cores. This conversion is not easy: it demands a deep understanding of parallel programming, and GPU internals are opaque to a beginner. The application development industry therefore needs a conversion framework that automatically translates serial source code into parallel programs. This work presents a novel C to NVIDIA CUDA code converter that gives legacy programs the chance to run on a parallel architecture. The work can be considered a baseline for further research and can be used for benchmarking applications. The results demonstrate a substantial reduction in execution time.


INTRODUCTION
The evolution of GPUs has been driven primarily by the demand for richer graphics in the game development industry and by scientific applications requiring more processing capability. Rendering the 3D graphics modules of games requires highly parallel, programmable, pipelined processors, which can deliver parallel execution at significantly lower cost. By these measures, the performance of graphics processing units has overtaken that of central processing units. The notable work of Shane Ryoo et al. [1] demonstrated the execution-time improvements of multithreaded applications on GPUs compared to CPUs. These surprising reductions in execution time have driven multiple research organizations and processor vendors to build more sophisticated GPUs for general-purpose floating-point computation. R. Kresch et al. [2] traced the evolution and scaling of general-purpose GPUs from the 1970s to date. The preliminary focus of this development was to make GPUs ready for general-purpose calculation, so that the benefits of parallel processing could be extended to general-purpose workloads such as scientific, customer-centric, business, and financial applications. Recent hardware, as reported by D. L. N. Research [3], can deliver 500 GFLOPS, roughly four times the throughput of the CPU cores available in the market.
The significant improvements GPUs demonstrated for the application development industry made a substantial impact among researchers, and the demand for programming in parallel languages increased. Nevertheless, parallel programming demands a level of efficiency that is difficult to obtain because the GPU components are not directly visible, which makes the task challenging for application developers. On the other hand, programming languages such as CUDA have evolved that can exploit the parallel cores of any supported GPU. Yet many legacy applications are built in C, a primarily serial programming language, and they too demand to be upgraded to take advantage of the available GPU. Converting such code is therefore a primary task for developers. It includes identifying kernels, finding independent sets of instructions, controlling and unrolling loops, and finally parallelizing the code using threads. Consequently, the bottleneck remains the same: the demand for parallelization and for building an expert development team. This leads to a demand for rule sets that convert source code into CUDA code automatically and so reap the rewards of general-purpose GPUs. This work presents a novel code conversion technique to convert legacy C source code into CUDA code. The rest of the work is organized as follows: Section II reviews the literature to understand recent advancements in this domain; Section III reviews the CUDA architecture to the extent needed to establish the framework theory for code conversion; Section IV presents the algorithm in light of the mathematical models; Section V presents and compares the results; and Section VI concludes the work.

LITERATURE SURVEY
The landmark parallel architecture, the Compute Unified Device Architecture (CUDA), was introduced by NVIDIA in 2006 [3]. This invention opened the gate to high-performance computing on the GPU, reducing execution time for scientific and graphics-intensive applications. The architecture was widely accepted by researchers and developers because it was made available on personal computers as well as on servers running low, medium, and high computational loads. Another reason for this wide acceptance was its use of multicore processors and a shared-memory architecture. A notable proof of this concept was presented by Shane Ryoo et al. [4], who optimized highly complex scientific applications such as the Fast Fourier Transform. NVIDIA also developed a software development kit (SDK) consisting of hardware simulation, drivers, and libraries for the benefit of developers. The CUDA software stack is composed of several layers: a hardware driver (CUDA Driver), an API and its runtime (CUDA Runtime), and two higher-level general-purpose mathematical libraries (CUDA Libraries) [Fig. 1]. The improvement of GPU performance over the traditional CPU architecture evolved from the hardware organization: NVIDIA strongly recommends that high GPU utilization and optimal use of the memory hierarchy are the two major drivers of GPU performance. The notable work of Christian Tenllado et al. [5] founded guidelines for generating parallel code from a serial algorithm that focus mainly on these two principles.
Researchers have presented several experiments aimed at analysing the relative importance of these two principles. The results indicate that code transformations targeting efficient memory usage are the major determinant of actual performance; overall, they ensure the best performance even if some resources remain underutilized. Maximizing occupancy should therefore be examined at a later stage in the compilation process, once data-related issues have been properly addressed.
The NVIDIA compiler, NVCC, can optimize code, but the best-optimized code is written at the assembly level. That is very difficult for large algorithms and projects, so determining occupancy becomes an important issue.
With the availability of NVIDIA GPUs, research has focused on techniques for automatically converting serial code into parallel code. However, automatic conversion is always debated because of the lack of control during the conversion. To overcome this problem, this work consolidates the necessary guidelines formulated by various researchers in their notable works.

ARCHITECTURE OF CUDA
The CUDA architecture is designed and developed by NVIDIA, and for the benefit of application developers and researchers NVIDIA also provides sufficient documentation of the programming model and the shared-memory architecture to exploit the available GPU. In this section the work furnishes the understanding relevant to serial-to-parallel code conversion.

Understanding the Programming Model for CUDA

The GPU is viewed as a compute device that acts as a coprocessor to the CPU, has its own device memory, and runs many threads in parallel [6]. Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads. The differences between GPU and CPU threads [8] are:
• GPU threads are extremely lightweight and require very little creation overhead.
• A GPU needs thousands of threads for full efficiency, whereas a multicore CPU needs only a few.
There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. Blocks are organized into a one-dimensional or two-dimensional grid of thread blocks [Fig. 2].
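The grid/block/thread hierarchy described above can be sketched as a minimal CUDA kernel launch. The kernel name, launch sizes, and variable names here are illustrative assumptions, not taken from the paper:

```cuda
// Minimal sketch of the CUDA programming model: each thread derives a
// unique global index from its block ID and its thread ID within the block.
__global__ void identify(int *out, int n)
{
    // Global index = block offset + thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // guard: the grid may cover more than n elements
        out[i] = i;
}

int main(void)
{
    const int n = 1024;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    // Launch a one-dimensional grid: enough 256-thread blocks to cover n.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    identify<<<blocks, threadsPerBlock>>>(d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```

The guard `if (i < n)` matters because the grid size is rounded up to a whole number of blocks, so the last block may contain threads with no element to process.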

A. Understanding Memory Model for CUDA
CUDA threads may access data from multiple memory spaces during their execution, as illustrated by Figure 3. Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block. All threads have access to the same global memory. With this detailed understanding, the work is now ready to propose the novel code conversion algorithm.
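The three memory spaces can be made concrete with a small sketch; the kernel and identifiers below are assumptions for illustration only:

```cuda
// Illustrative sketch of the CUDA memory spaces: per-thread local
// variables, per-block __shared__ memory, and device-wide global memory.
__global__ void memorySpaces(const float *g_in, float *g_out)
{
    int local = threadIdx.x;          // per-thread private variable

    __shared__ float s_buf[256];      // shared: visible to all threads of
                                      // this block, lifetime of the block
    int gi = blockIdx.x * blockDim.x + local;
    s_buf[local] = g_in[gi];          // g_in resides in global memory
    __syncthreads();                  // wait until every thread has written

    // Threads of the same block can now read each other's values cheaply
    // from shared memory instead of re-reading global memory.
    float neighbour = s_buf[(local + 1) % blockDim.x];
    g_out[gi] = neighbour;            // result goes back to global memory
}
```

The `__syncthreads()` barrier is essential: without it, a thread could read a shared-memory slot before its neighbour has filled it.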

NOVEL CODE CONVERSION ALGORITHM
With the recommendations collected from various research attempts, this work proposes a novel algorithm to convert a serial C program, designed to run on the CPU, into a parallel CUDA C program that can take full advantage of the benefits provided by GPUs. The proposed algorithm is described as two individual components, Algorithm 1 and Algorithm 2. Algorithm 1 takes care of converting the functions, checking them for independence, and finally rewriting them in CUDA syntax. The second algorithm converts the basic syntax into CUDA C syntax and converts the independent modules into CUDA threads to run on the GPU. The steps of the algorithm are described here:

Algorithm -1
Step-1. Read the C source file.
Step-2. Find the initial variables and kernel variables.

The results of this algorithm are discussed in the next section.
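As an illustration of the kind of transformation the two algorithms aim to automate, a serial C loop and one plausible CUDA translation are sketched below. This mapping is a hand-written assumption of what such a converter could emit, not the framework's literal output:

```cuda
/* Original serial C (runs on the CPU):
 *
 *   for (int i = 0; i < n; i++)
 *       out[i] = 2 * in[i];
 */

// Plausible converted form: the loop body becomes a kernel, and the
// loop index becomes the global thread index.
__global__ void loopBody(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)             // replaces the loop's bound check
        out[i] = 2 * in[i];
}

// The loop itself is replaced by a kernel launch covering n iterations:
//
//   loopBody<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```

A conversion of this shape is valid only when the iterations are independent, which is exactly the independence check Algorithm 1 performs before a function is rewritten in CUDA syntax.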

RESULTS AND DISCUSSION
The intention of this work is to demonstrate the performance improvement of the parallel application over the serial application. Since the automatic conversion of source code is always debated, this work provides concrete evidence of the improvements. In this section, the results demonstrate the converted source code and analyse the serial and parallel execution times.

A. Analysis of the Performance on Binary Search
The first analysis is demonstrated on the popular binary search code. First, the source C program is converted into CUDA C automatically using the novel code converter proposed in this work; only a fragment of the converted listing survives in this copy. Furthermore, the comparison of CPU time is also analysed. The result is analysed visually for the CPU [Fig. 4] and for the GPU [Fig. 5], and again for the CPU [Fig. 6] and for the GPU [Fig. 7].
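Since the converted binary search listing did not survive intact, a hedged sketch of how binary search is commonly parallelised in CUDA is given below: each thread runs one independent search for one query key over the same sorted array. All identifiers are illustrative and are not claimed to match the converter's actual output:

```cuda
// One independent binary search per thread: thread q looks up queries[q]
// in the sorted array and records its index, or -1 when absent.
__global__ void binarySearchKernel(const int *sorted, int n,
                                   const int *queries, int *found, int m)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= m) return;                   // more threads than queries

    int key = queries[q];
    int lo = 0, hi = n - 1, pos = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;     // avoids overflow of (lo + hi)
        if (sorted[mid] == key) { pos = mid; break; }
        else if (sorted[mid] < key) lo = mid + 1;
        else                        hi = mid - 1;
    }
    found[q] = pos;
}
```

A single search is inherently sequential, so the GPU speed-up in this formulation comes from answering many queries concurrently rather than from accelerating one lookup.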

C. Analysis of the Performance on Vector Sum
Next, the analysis is demonstrated on the popular vector sum code. First, the source C program is converted into CUDA C automatically using the novel code converter proposed in this work. The result is analysed visually for the CPU [Fig. 12] and for the GPU [Fig. 13]. Thus, in light of the obtained results, this work presents the conclusions in the next section.
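For reference, the standard CUDA formulation of a vector sum, one thread per element, is sketched below. This is the textbook form of the benchmark, not the converter's literal output:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Classic CUDA vector sum: thread i computes c[i] = a[i] + b[i].
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;                 // one million elements
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);         // 1.0 + 2.0 for these inputs

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Because every element is independent, this benchmark is the best case for the loop-to-kernel conversion described in Algorithm 2, and the host-to-device copies are typically the dominant cost for small vectors.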

CONCLUSIONS
The demand for automatically converting serial code into parallel code, in order to reduce execution time and compensate for the scarcity of parallel-programming expertise in the workforce, is always a focus of research. This work deploys a novel algorithm to convert serial C code into parallel NVIDIA CUDA code and take maximum benefit from the available GPUs. The automatic conversion framework proposed and demonstrated in this work not only reduces the conversion time but also reduces the execution time of legacy serial programs by 80% across various algorithmic approaches. The converted programs produced results identical to the originals upon execution, and the framework works for all programs written following fundamental code-development guidelines. This work is to be seen as a baseline for further research and a contribution towards automatic code translation for legacy systems, providing computational support for modern developments.