Wednesday, June 5, 2019

Solution of a System of Linear Equations for INTELx64

A multi core hyper-threaded solution of a system of linear equations for INTELx64 architecture

Richa Singhal

ABSTRACT. A system of linear equations forms a very fundamental principle of linear algebra with very widespread applications in fields such as physics, chemistry and even electronics. With systems growing in complexity and the demand for ever increasing precision of results, it becomes the need of the hour to have methodologies which can solve a large system of such equations accurately with the fastest performance. On the other hand, as frequency scaling is becoming the limiting factor to achieving performance improvement, modern architectures are deploying a multi core approach with features like hyper threading to meet performance requirements. This paper targets solving a system of linear equations on a multi core INTELx64 architecture with hyper threading using the standard LU decomposition methodology. It also presents a forward seek LU decomposition approach which gives better performance by effectively utilizing the L1 cache of each processor in the multi core architecture. The sample uses as input a 4000 x 4000 matrix of double precision floating point values representing the system.

1. INTRODUCTION

A system of linear equations is a collection of linear equations over the same set of variables. It forms a very fundamental principle of linear algebra with very widespread applications in fields such as physics, chemistry and even electronics. With systems growing in complexity and the demand for ever increasing precision of results, it becomes the need of the hour to have methodologies which can solve a large system of such equations accurately with the fastest performance. On the other hand, frequency scaling is becoming the limiting factor to achieving performance improvement.
With increasing clock frequency the power consumption goes up as

P = C x V^2 x F

where P is power consumption, C is capacitance, V is voltage and F is frequency. It was largely because of this factor that INTEL had to cancel its Tejas and Jayhawk processors. A newer approach is to deploy multiple cores which are capable of processing mutually exclusive tasks of a job to achieve the requisite performance improvement. Hyper threading is another method which makes a single core appear as two by using some additional registers. Having said that, this requires that traditional algorithms which are sequential in nature be reworked and factorized so that they can efficiently utilize the processing power offered by these architectures.

This paper aims to provide an implementation of the standard LU decomposition method, adopting a forward seek methodology to efficiently solve a double precision system of linear equations over a set of 4000 variables. The proposed solution addresses all aspects of the problem, starting from file I/O to read the input system of equations, to actually solving the system to generate the required output using multi core techniques. The solution assumes that the input problem has one and only one unique solution.

2. CHALLENGES

The primary challenge is to rework the sequential LU decomposition method so that the revised framework can be decomposed into a set of independent problems which can be solved independently as far as possible. The LU decomposition output is then fed to the standard techniques of forward and backward substitution, each again using multi core techniques, to reach the final output.

Another associated challenge is cache management. A set of 4000 double precision floating point variables takes approximately 32KB of memory, and there are 4000 different equations put together, hence efficiently keeping all the data in cache becomes a challenge.
A forward seek methodology was used in LU decomposition which tries to keep the relevant data in the L1 cache before it is required to be processed. It also tries to maximise the number of operations performed on a set of data once it is in cache, so that cache misses are kept to a minimum.

3. IMPACT

With a 40 core INTELx64 machine with hyper threading, the proposed method could achieve an acceleration of 72X in performance as compared to a standard sequential implementation.

4. STATE OF THE ART

The proposed solution uses state of the art programming techniques available for multithreaded architectures. It uses the INTEL ADVANCED VECTOR EXTENSIONS (AVX) intrinsic instruction set to exploit data level parallelism. Native POSIX threads were used for threading. Efficient disk IO was made possible by mapping the input vector file to RAM directly using mmap.

5. PROPOSED SOLUTION

A system of linear equations representing the CURRENT / VOLTAGE relationship for a set of resistances can be written as

RI = V

The steps to solve this can be illustrated as:
Decompose R into L and U
Solve LZ = V for Z
Solve UI = Z for I

The resistance matrix is modelled as a 4000 x 4000 array of double precision floating point elements. The memory address is aligned so that RAM access speeds up for read and write operations.

FLOAT RES[MATRIX_SIZE*MATRIX_SIZE] __attribute__((aligned(0x1000)))

The voltage matrix is modelled as a 4000 x 1 array of double precision floating point elements, aligned in the same way.

FLOAT V[MATRIX_SIZE] __attribute__((aligned(0x1000)))

LU Decomposition

The basic model of parallel LU decomposition suggested above was adopted. Here, as we move along the diagonal of the main matrix, we calculate the factor values for the lower triangular matrix.
Simultaneously, each row operation updates elements of the upper triangular matrix.

Basic routine for the row operation

This routine is the innermost loop which updates the rows that will eventually form the upper triangular matrix. For each element of a row there is one subtraction and one multiplication operation. LOOP B designates the row major operation, while LOOP A designates the column major operation.

Basic Algorithm

SUB LUDECOM (A, N)
DO K = 1, N - 1                                   // LOOP A
    DO I = K + 1, N
        A[I, K] = A[I, K] / A[K, K]
        DO J = K + 1, N                           // LOOP B
            A[I, J] = A[I, J] - A[I, K] * A[K, J]
        END DO
    END DO
END DO
END LUDECOM

Each row major operation (LOOP B) iteration can be independently executed on a separate core. This was achieved by using POSIX threads which were non-blocking in nature. Because the sets of data are mutually exclusive, MUTEX locks are not required provided we keep the column major operation (LOOP A) sequential.

Also, for 2 consecutive elements in one row operation, 2 subtraction and 2 multiplication operations are done. Each of these pairs is done in a single step using Single Instruction Multiple Data (AVX) instructions.

Multi core Algorithm

SUB LUDECOM_BLOCK (A, K, BLOCK_START, BLOCK_END)
DO I = BLOCK_START, BLOCK_END
    A[I, K] = A[I, K] / A[K, K]
    DO J = K + 1, N
        A[I, J] = A[I, J] - A[I, K] * A[K, J]
    END DO
END DO
END LUDECOM_BLOCK

SUB LUDECOM (A, N)
DO K = 1, N - 1
    BLOCK_SIZE = (N - K) / MAX_THREADS
    Thread = 0
    WHILE (Thread < MAX_THREADS)
        P_THREAD (LUDECOM_BLOCK (A, K, Thread * BLOCK_SIZE, (Thread + 1) * BLOCK_SIZE))
        Thread = Thread + 1
    END WHILE
END DO
END LUDECOM

Forward substitution

Once LU decomposition is done, forward substitution gives matrix Z. Here again Single Instruction Multiple Data instructions are used.

LZ = V for Z

Backward substitution

After forward substitution, the final step of backward substitution gives the current matrix I. Here again Single Instruction Multiple Data instructions are used.

UI = Z for I

6. CACHE IMPROVEMENTS

On profiling, it is observed that the bulk of the processing in the above solution happens in LU decomposition.
However, if we create threads equal in number to the available cores, the result improves but not in proportion to the number of cores. A VALGRIND analysis of cache performance reveals that, because of the large size of the matrix, each row operation suffers a performance hit from cache misses.

If we observe the above solution, any jth row is processed for (j - 1) columns, so (j - 1) threads are forked for it across the iterations of the column major operation (LOOP A). The data to be processed refers to the same memory locations, but by the time the next operation or thread is forked for the same row, the corresponding data has been pushed out of the lower level caches. Thus cache misses happen.

To solve this we adopted a forward seek approach wherein we first pre-process a set of columns sequentially, thus enabling more operations on a row to be performed in the same thread. The data then remains in the lower level cache, as we do not have to wait for another thread to process the same row.

Multi core Algorithm with forward seek operation

SUB LUDECOM_BLOCK_SEEK (A, K, S, BLOCK_START, BLOCK_END)
DO I = BLOCK_START, BLOCK_END
    DO U = 1, S
        M = K + U - 1
        A[I, M] = A[I, M] / A[M, M]
        DO J = M + 1, N
            A[I, J] = A[I, J] - A[I, M] * A[M, J]
        END DO
    END DO
END DO
END LUDECOM_BLOCK_SEEK

SUB LUDECOM (A, N)
K = 1
WHILE (K < N)
    // Forward seek
    DO J = K, K + F_SEEK
        LUDECOM_BLOCK_SEEK (A, J, 0, J, J + F_SEEK)
    END DO
    // Multi core
    K = K + F_SEEK
    DO L = 1, N - 1
        BLOCK_SIZE = (N - L) / MAX_THREADS
        Thread = 0
        WHILE (Thread < MAX_THREADS)
            P_THREAD (LUDECOM_BLOCK_SEEK (A, L, F_SEEK, Thread * BLOCK_SIZE, (Thread + 1) * BLOCK_SIZE))
            Thread = Thread + 1
        END WHILE
    END DO
END WHILE
END LUDECOM

CONCLUSION

Results

For the purpose of computation, a sample 4000 x 4000 matrix of double precision floating point values was taken. Performance numbers were generated on an 8 core INTEL architecture machine (Table 4.1).

A programmer who writes implicitly parallel code does not need to worry about task division or process communication, focusing instead on the problem that his or her program is intended to solve.
Implicit parallelism generally facilitates the design of parallel programs and therefore results in a substantial improvement of programmer productivity. Many of the constructs necessary to support this also add simplicity or clarity even in the absence of actual parallelism. The example above, of list comprehension in the sin() function, is a useful feature in itself. By using implicit parallelism, languages effectively have to provide such useful constructs to users simply to support required functionality (a language without a decent for loop, for example, is one few programmers will use).

Languages with implicit parallelism reduce the control that the programmer has over the parallel execution of the program, sometimes resulting in less-than-optimal solutions. The makers of the Oz programming language also note that their early experiments with implicit parallelism showed that it made debugging difficult and object models unnecessarily awkward [2].

A larger issue is that every program has some parallel and some serial logic. Binary I/O, for example, requires support for such serial operations as Write() and Seek(). If implicit parallelism is desired, this creates a new requirement for constructs and keywords to support code that cannot be threaded or distributed.

REFERENCES

Gottlieb, Allan; Almasi, George S. (1989). Highly Parallel Computing. Redwood City, Calif.: Benjamin/Cummings. ISBN 0-8053-0177-1.

S.V. Adve et al. (November 2008). Parallel Computing Research at Illinois: The UPCRC Agenda (PDF). University of Illinois at Urbana-Champaign. "The main techniques for these performance benefits (increased clock frequency and smarter but increasingly complex architectures) are now hitting the so-called power wall. The computer industry has accepted that future performance increases must largely come from increasing the number of processors (or cores) on a die, rather than making a single core go faster."
Asanovic et al.: "Old conventional wisdom: Power is free, but transistors are expensive. New conventional wisdom is that power is expensive, but transistors are free."

Bunch, James R.; Hopcroft, John (1974). "Triangular factorization and inversion by fast matrix multiplication". Mathematics of Computation 28: 231-236. doi:10.2307/2005828. ISSN 0025-5718.

Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). Introduction to Algorithms. MIT Press and McGraw-Hill. ISBN 978-0-262-03293-3.

Golub, Gene H.; Van Loan, Charles F. (1996). Matrix Computations (3rd ed.). Baltimore: Johns Hopkins. ISBN 978-0-8018-5414-9.
