path: root/compiler/optimizing/loop_optimization.h
Commit message | Author | Age | Files | Lines
* ARM64: Support SVE VL other than 128-bit. | Artem Serov | 2021-02-05 | 1 | -1/+1
    Arm SVE register size is not fixed and can be a multiple of 128 bits. To support that the patch removes explicit assumptions on the SIMD register size to be 128 bit from the vectorizer and code generators and enables configurable SVE vector length autovectorization, e.g. extends SIMD register save/restore routines.

    Test: art SIMD tests on VIXL simulator.
    Test: art tests on FVP (steps in test/README.arm_fvp.md) with FVP arg: -C SVE.ScalableVectorExtension.veclen=[2,4] (SVE vector [128,256] bits wide)
    Change-Id: Icb46e7eb17f21d3bd38b16dd50f735c29b316427
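To make the vector-length point concrete, here is a minimal sketch of how the lane count changes with the register width. It is not ART code: the helper name `lanes_per_vector` is hypothetical, and the mapping "veclen unit = 64 bits" is inferred only from the `veclen=[2,4]` / `[128,256]` pairing in the message above.

```python
def lanes_per_vector(vector_bits: int, element_bits: int) -> int:
    """Return how many elements of element_bits fit in one SIMD register."""
    if vector_bits % 128 != 0:
        # The commit message states SVE sizes are multiples of 128 bits.
        raise ValueError("SVE vector length must be a multiple of 128 bits")
    return vector_bits // element_bits


# Assumption: one FVP veclen unit is 64 bits (2 -> 128 bits, 4 -> 256 bits).
for veclen in (2, 4):
    vector_bits = veclen * 64
    print(vector_bits, "bits:", lanes_per_vector(vector_bits, 32), "x int32")
```

A vectorizer that hard-codes 128 bits would always assume 4 int32 lanes; with a configurable length the same loop processes 8 lanes at 256 bits.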
* ART: Implement predicated SIMD vectorization. | Artem Serov | 2021-02-04 | 1 | -7/+13
    This CL brings support for predicated execution for auto-vectorizer and implements arm64 SVE vector backend.

    This version passes all the VIXL simulator-runnable tests in SVE mode with checker off (as all VecOp CHECKs need to be adjusted for an extra input) and all tests in NEON mode.

    Test: art SIMD tests on VIXL simulator.
    Test: art tests on FVP (steps in test/README.arm_fvp.md)
    Change-Id: Ib78bde31a15e6713d875d6668ad4458f5519605f
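One common use of predication is handling the loop tail without a scalar cleanup loop. The sketch below models that idea in plain Python; it is an illustration under that assumption, not the ART implementation, and `predicated_add` is a hypothetical name.

```python
def predicated_add(a, b, lanes):
    """Element-wise add, processed in vector-sized chunks under a predicate."""
    out = [0] * len(a)
    for base in range(0, len(a), lanes):
        # Governing predicate: lane i is active only while base + i is in bounds.
        pred = [base + i < len(a) for i in range(lanes)]
        for i, active in enumerate(pred):
            if active:
                out[base + i] = a[base + i] + b[base + i]
    return out


print(predicated_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], lanes=4))
```

The last chunk runs with only one active lane, so no separate remainder loop is needed.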
* ART: Refactor SIMD slots and regs size processing. | Artem Serov | 2020-04-17 | 1 | -2/+13
    ART vectorizer assumes that there is a single size of SIMD register used for the whole program. Make this assumption explicit and refactor the code.

    Note: This is a base for the future introduction of SIMD slots of size other than 8 or 16 bytes.

    Test: test-art-target, test-art-host.
    Change-Id: Id699d5e3590ca8c655ecd9f9ed4e63f49e3c4f9c
* Revert "Make compiler/optimizing/ symbols hidden." | Vladimir Marko | 2019-10-14 | 1 | -2/+1
    This reverts commit e2727154f25e0db9a5bb92af494d8e47b181dfcf.

    Reason for revert: Breaks ASAN tests (ODR violation).

    Bug: 142365358
    Change-Id: I38103d74a1297256c81d90872b6902ff1e9ef7a4
* Make compiler/optimizing/ symbols hidden. | Vladimir Marko | 2019-10-14 | 1 | -1/+2
    Make symbols in compiler/optimizing hidden by a namespace attribute. The unit intrinsic_objects.{h,cc} is excluded as it is needed by dex2oat.

    As the symbols are no longer exported, gtests are now linked with the static version of the libartd-compiler library.

    libart-compiler.so size:
      - before:
        arm: 2396152
        arm64: 3345280
      - after:
        arm: 2016176 (-371KiB, -15.9%)
        arm64: 2874480 (-460KiB, -14.1%)

    Test: m test-art-host-gtest
    Test: testrunner.py --host --optimizing --jit
    Bug: 142365358
    Change-Id: I1fb04a33351f53f00b389a1642e81a68e40912a8
* ART: ARM64: Support DotProd SIMD idiom. | Artem Serov | 2018-09-25 | 1 | -0/+6
    Implement support for vectorization idiom which performs dot product of two vectors and adds the result to wider precision components in the accumulator.

    viz. DOT_PRODUCT([ a1, .. , am], [ x1, .. , xn ], [ y1, .. , yn ]) =
         [ a1 + sum(xi * yi), .. , am + sum(xj * yj) ],
         for m <= n, non-overlapping sums,
         for either both signed or both unsigned operands x, y.

    The patch shows up to 7x performance improvement on a micro benchmark on Cortex-A57.

    Test: 684-checker-simd-dotprod.
    Test: test-art-host, test-art-target.
    Change-Id: Ibab0d51f537fdecd1d84033197be3ebf5ec4e455
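The DOT_PRODUCT definition above can be sketched as a scalar reference model. This is only an illustration: the function name is hypothetical, and it assumes the n products are split into m equal, contiguous groups (the message only says the sums are non-overlapping and m <= n).

```python
def dot_product(acc, x, y):
    """Add non-overlapping partial dot products of x and y into acc."""
    m, n = len(acc), len(x)
    if len(y) != n or m == 0 or n % m != 0:
        # Assumption for this sketch: equal-sized contiguous groups.
        raise ValueError("expected len(x) == len(y) and a multiple of len(acc)")
    group = n // m
    return [
        acc[k] + sum(x[i] * y[i] for i in range(k * group, (k + 1) * group))
        for k in range(m)
    ]


# Two accumulator lanes, four narrow elements: each lane sums two products.
print(dot_product([0, 0], [1, 2, 3, 4], [5, 6, 7, 8]))
```

In hardware the accumulator lanes are wider than the x and y elements; Python integers do not overflow, so the sketch does not model the widening explicitly.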
* Use 'final' and 'override' specifiers directly in ART. | Roland Levillain | 2018-08-28 | 1 | -1/+1
    Remove all uses of macros 'FINAL' and 'OVERRIDE' and replace them with 'final' and 'override' specifiers. Remove all definitions of these macros as well, which were located in these files:
      - libartbase/base/macros.h
      - test/913-heaps/heaps.cc
      - test/ti-agent/ti_macros.h

    ART is now using C++14; the 'final' and 'override' specifiers have been introduced in C++11.

    Test: mmma art
    Change-Id: I256c7758155a71a2940ef2574925a44076feeebf
* ART: Implement loop full unrolling. | Artem Serov | 2018-07-04 | 1 | -0/+6
    Performs whole loop unrolling for small loops with small trip count to eliminate the loop check overhead and to create more opportunities for inter-iteration optimizations.

    caffeinemark/FloatAtom: 1.2x performance on arm64 Cortex-A57.

    Test: 530-checker-peel-unroll.
    Test: test-art-host, test-art-target.
    Change-Id: Idf3fe3cb611376935d176c60db8c49907222e28a
* ART: Refactor scalar loop optimizations. | Artem Serov | 2018-07-04 | 1 | -4/+11
    Refactor scalar loop peeling and unrolling to eliminate repeated checks and graph traversals, to make the code more readable and to make it easier to add new scalar loop opts.

    This is a prerequisite for full unrolling patch.

    Test: 530-checker-peel-unroll.
    Test: test-art-target, test-art-host.
    Change-Id: If824a95f304033555085eefac7524e59ed540322
* Move instruction_set_ to CompilerOptions. | Vladimir Marko | 2018-06-25 | 1 | -4/+4
    Removes CompilerDriver dependency from ImageWriter and several other classes.

    Test: m test-art-host-gtest
    Test: testrunner.py --host --optimizing
    Test: Pixel 2 XL boots.
    Test: m test-art-target-gtest
    Test: testrunner.py --target --optimizing
    Change-Id: I3c5b8ff73732128b9c4fad9405231a216ea72465
* ART: Enable scalar loop peeling and unrolling. | Artem Serov | 2018-05-15 | 1 | -2/+2
    Turn on scalar loop peeling and unrolling by default.

    Test: 482-checker-loop-back-edge-use, 530-checker-peel-unroll
    Test: test-art-host, test-art-target, boot-to-gui
    Change-Id: Ibfe1b54f790a97b281e85396da2985e0f22c2834
* Remove some SIMD recognition code. | Aart Bik | 2018-05-01 | 1 | -8/+5
    Test: test-art-host,target
    Change-Id: I7f00315c61ed99723236283bc39a4c7fb279df47
* Step 1 of 2: conditional passes. | Aart Bik | 2018-04-26 | 1 | -2/+2
    Rationale:
    The change adds a return value to Run() in preparation of conditional pass execution. The value returned by Run() is best effort: returning false means no optimizations were applied or no useful information was obtained. I filled in a few cases with more exact information; others still just return true. In addition, it integrates inlining as a regular pass, avoiding the ugly "break" into optimizations1 and optimizations2.

    Bug: b/78171933, b/74026074
    Test: test-art-host,target
    Change-Id: Ia39c5c83c01dcd79841e4b623917d61c754cf075
* ART: Implement scalar loop peeling. | Artem Serov | 2018-04-17 | 1 | -9/+4
    Implement scalar loop peeling for invariant exits elimination (on arm64). If the loop exit condition is loop invariant then loop peeling + GVN + DCE can eliminate this exit in the loop body.

    Note: GVN and DCE aren't applied during loop optimizations.
    Note: this functionality is turned off by default now.

    Test: test-art-host, test-art-target, boot-to-gui.
    Change-Id: I98d20054a431838b452dc06bd25c075eb445960c
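The effect of peeling on a loop-invariant exit can be sketched at source level. This is a hand-written before/after illustration, not compiler output; both function names are hypothetical.

```python
_END = object()  # sentinel marking an exhausted iterator


def original(data, stop):
    total = 0
    for value in data:
        if stop:  # loop-invariant exit condition, re-tested every iteration
            break
        total += value
    return total


def peeled(data, stop):
    total = 0
    it = iter(data)
    # Peeled first iteration: the invariant exit is tested exactly once here.
    first = next(it, _END)
    if first is _END or stop:
        return total
    total += first
    # Remaining iterations: 'stop' is known to be false, so the test is gone.
    for value in it:
        total += value
    return total


print(original([1, 2, 3], False), peeled([1, 2, 3], False))
```

In the compiler the redundant test in the remaining loop body is removed by later passes (the message names GVN and DCE); the sketch simply writes the result by hand.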
* ART: Implement scalar loop unrolling. | Artem Serov | 2018-03-26 | 1 | -2/+23
    Implement scalar loop unrolling for small loops (on arm64) with known trip count to reduce loop check and branch penalty and to provide more opportunities for instruction scheduling.

    Note: this functionality is turned off by default now.

    Test: cloner_test.cc
    Test: test-art-target, test-art-host
    Change-Id: Ic27fd8fb0bc0d7b69251252da37b8b510bc30acc
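A minimal source-level sketch of unrolling by a factor of two, assuming a known, even trip count. The factor and the function names are assumptions for illustration; the message does not state which factor ART uses.

```python
def rolled(a):
    total = 0
    for i in range(len(a)):  # one loop check and branch per element
        total += a[i] * i
    return total


def unrolled_by_2(a):
    # Assumption for this sketch: the trip count is known and even.
    assert len(a) % 2 == 0
    total = 0
    for i in range(0, len(a), 2):  # half as many loop checks and branches
        total += a[i] * i
        total += a[i + 1] * (i + 1)
    return total


print(rolled([3, 1, 4, 1]), unrolled_by_2([3, 1, 4, 1]))
```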
* Vectorization of saturation arithmetic. | Aart Bik | 2018-03-15 | 1 | -0/+6
    Rationale:
    Because faster is better.

    Bug: b/74026074
    Test: test-art-host,target
    Change-Id: Ifa970a62cef1c0b8bb1c593f629d8c724f1ffe0e
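For readers unfamiliar with the term, saturation arithmetic clamps a result to the range of the element type instead of wrapping around. A scalar model for 8-bit lanes, with hypothetical helper names and not taken from ART:

```python
def sat_add_i8(a: int, b: int) -> int:
    """Signed 8-bit saturating add: clamp to [-128, 127]."""
    return max(-128, min(127, a + b))


def sat_add_u8(a: int, b: int) -> int:
    """Unsigned 8-bit saturating add: clamp to [0, 255]."""
    return max(0, min(255, a + b))


print(sat_add_i8(100, 100), sat_add_i8(-100, -100), sat_add_u8(200, 100))
```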
* Refactored optimization passes setup. | Aart Bik | 2017-11-20 | 1 | -1/+2
    Rationale:
    Refactors the way we set up optimization passes in the compiler into a more centralized approach. The refactoring also found some "holes" in the existing mechanism (missing string lookup in the debugging mechanism, or inability to set alternative name for optimizations that may repeat).

    Bug: 64538565
    Test: test-art-host test-art-target
    Change-Id: Ie5e0b70f67ac5acc706db91f64612dff0e561f83
* Alignment optimizations in vectorizer. | Aart Bik | 2017-10-27 | 1 | -11/+23
    Rationale:
    Since aligned data access is generally better (enables more efficient aligned moves and prevents nasty cache line splits), computing and/or enforcing alignment has been added to the vectorizer:

    (1) If the initial alignment is known completely and suffices, then a static peeling factor enforces proper alignment.
    (2) If (1) fails, but the base alignment allows, dynamically peeling until total offset is aligned forces proper aligned access patterns.

    By using ART conventions only, any forced alignment is preserved over suspend checks where data may move.

    Note 1: Current allocation convention is just 8 byte alignment on arrays/strings, so only ARM32 benefits. However, all optimizations are implemented in a general way, so moving to a 16 byte alignment will immediately take advantage of any new convention!!

    Note 2: This CL also exposes how bad the choice of 12 byte offset of arrays really is. Even though the new optimizations fix the misalignment, it requires peeling for the most common case: 0 indexed loops. Therefore, we may even consider moving to a 16 byte offset. Again the optimizations in this CL will immediately take advantage of that new convention!!

    Test: test-art-host test-art-target
    Change-Id: Ib6cc0fb68c9433d3771bee573603e64a3a9423ee
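Case (1) above, the static peeling factor, amounts to a small address calculation. The sketch below is an assumption-laden model, not the ART code: `static_peel` is a hypothetical name, and it assumes the object base address itself is aligned to the vector size, which Note 1 says is not yet guaranteed in general.

```python
def static_peel(data_offset, start_index, elem_size, align):
    """Scalar iterations to peel so the first vector access is aligned.

    Returns None when no whole number of elements reaches alignment.
    Assumes the object base address is already a multiple of 'align'.
    """
    misalign = (data_offset + start_index * elem_size) % align
    pad = (align - misalign) % align
    if pad % elem_size != 0:
        return None
    return pad // elem_size


# 12-byte array data offset, int elements, 16-byte vectors, loop from index 0.
print(static_peel(12, 0, 4, 16))
# The same loop with a 16-byte data offset needs no peeling.
print(static_peel(16, 0, 4, 16))
```

This mirrors Note 2: with a 12-byte offset, the most common 0-indexed int loop needs one peeled iteration, while a 16-byte offset would need none.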
* ARM: Support SIMD reduction for 32-bit backend. | Artem Serov | 2017-10-12 | 1 | -0/+1
    Support SIMD reduction (add, min, max) and SAD (for int->int only) idioms for arm (32-bit) backend.

    Test: test-art-target, test-art-host
    Test: 661-checker-simd-reduc, 660-checker-simd-sad-int
    Change-Id: Ic6121f5d781a9bcedc33041b6c4ecafad9b0420a
* ART: Use ScopedArenaAllocator for pass-local data. | Vladimir Marko | 2017-10-06 | 1 | -6/+8
    Passes using local ArenaAllocator were hiding their memory usage from the allocation counting, making it difficult to track down where memory was used. Using ScopedArenaAllocator reveals the memory usage.

    This changes the HGraph constructor which requires a lot of changes in tests. Refactor these tests to limit the amount of work needed the next time we change that constructor.

    Test: m test-art-host-gtest
    Test: testrunner.py --host
    Test: Build with kArenaAllocatorCountAllocations = true.
    Bug: 64312607
    Change-Id: I34939e4086b500d6e827ff3ef2211d1a421ac91a
* ART: Introduce compiler data type. | Vladimir Marko | 2017-09-25 | 1 | -9/+9
    Replace most uses of the runtime's Primitive in compiler with a new class DataType. This prepares for introducing new types, such as Uint8, that the runtime does not need to know about.

    Test: m test-art-host-gtest
    Test: testrunner.py --host
    Bug: 23964345
    Change-Id: Iec2ad82454eec678fffcd8279a9746b90feb9b0c
* Implement Sum-of-Abs-Differences idiom recognition. | Aart Bik | 2017-09-21 | 1 | -0/+6
    Rationale:
    Currently just on ARM64 (x86 lacks proper support), using the SAD idiom yields great speedup on loops that compute the sum-of-abs-difference operation. Also includes some refinements around type conversions.

    Speedup ExoPlayerAudio (golem run):
      1.3x on ARM64
      1.1x on x86

    Test: test-art-host test-art-target
    Bug: 64091002
    Change-Id: Ia2b711d2bc23609a2ed50493dfe6719eedfe0130
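The loop shape that the SAD idiom targets computes the sum of absolute differences of two arrays. A scalar reference sketch (the function name is hypothetical):

```python
def sum_abs_diff(a, b):
    """Scalar form of the sum-of-abs-differences loop."""
    total = 0
    for x, y in zip(a, b):
        total += abs(x - y)
    return total


print(sum_abs_diff([1, 5, 9], [4, 2, 9]))
```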
* Pass stats into the loop optimization phase. | Aart Bik | 2017-09-06 | 1 | -1/+2
    Test: market scan.
    Change-Id: I58b23b8d254883f30619ea3602d34bf93618d432
* Basic SIMD reduction support. | Aart Bik | 2017-09-05 | 1 | -11/+19
    Rationale:
    Enables vectorization of x += .... for very basic (simple, same-type) constructs. Paves the way for more complex (narrower and/or mixed-type) constructs, which will be handled by the next CL.

    This is a revert of Icb5d6c805516db0a1d911c3ede9a246ccef89a22
    and thus a revert^2 of I2454778dd0ef1da915c178c7274e1cf33e271d0f
    and thus a revert^3 of I1c1c87b6323e01442e8fbd94869ddc9e760ea1fc
    and thus a revert^4 of I7880c135aee3ed0a39da9ae5b468cbf80e613766

    PS1-2 shows what needed to change

    Test: test-art-host test-art-target
    Bug: 64091002
    Change-Id: I647889e0da0959ca405b70081b79c7d3c9bcb2e9
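A vectorized `x += ...` reduction is typically computed as per-lane partial sums followed by a final horizontal add. The sketch below models that general scheme in Python; it is an illustration of the technique, not the code ART generates, and the lane count of 4 is an arbitrary assumption.

```python
def vector_sum(a, lanes=4):
    """Sum 'a' using per-lane accumulators, then combine the lanes."""
    acc = [0] * lanes
    main = len(a) - len(a) % lanes
    # Vector loop: each lane accumulates every lanes-th element.
    for base in range(0, main, lanes):
        for lane in range(lanes):
            acc[lane] += a[base + lane]
    # Horizontal add of the lane accumulators.
    total = sum(acc)
    # Scalar cleanup loop for the remaining elements.
    for i in range(main, len(a)):
        total += a[i]
    return total


print(vector_sum(list(range(10))))
```

Reordering the additions like this is exact for integers; for floating-point data it can change rounding, which is one reason reductions need care.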
* Revert "Basic SIMD reduction support." | Nicolas Geoffray | 2017-09-02 | 1 | -19/+11
    Fails 530-checker-lse on arm64.

    Bug: 64091002, 65212948

    This reverts commit cfa59b49cde265dc5329a7e6956445f9f7a75f15.

    Change-Id: Icb5d6c805516db0a1d911c3ede9a246ccef89a22
* Basic SIMD reduction support. | Aart Bik | 2017-09-01 | 1 | -11/+19
    Rationale:
    Enables vectorization of x += .... for very basic (simple, same-type) constructs. Paves the way for more complex (narrower and/or mixed-type) constructs, which will be handled by the next CL.

    This is a revert^2 of I7880c135aee3ed0a39da9ae5b468cbf80e613766
    and thus a revert of I1c1c87b6323e01442e8fbd94869ddc9e760ea1fc

    PS1-2 shows what needed to change, with regression tests

    Test: test-art-host test-art-target
    Bug: 64091002, 65212948
    Change-Id: I2454778dd0ef1da915c178c7274e1cf33e271d0f
* Revert "Basic SIMD reduction support." | Aart Bik | 2017-08-30 | 1 | -19/+11
    This reverts commit 9879d0eac8fe2aae19ca6a4a2a83222d6383afc2.

    Getting these type check failures in some builds. Need time to look at this better, so reverting for now :-(

      dex2oatd F 08-30 21:14:29 210122 226218 code_generator.cc:115] Check failed: CheckType(instruction->GetType(), locations->InAt(0)) PrimDouble C

    Change-Id: I1c1c87b6323e01442e8fbd94869ddc9e760ea1fc
* Basic SIMD reduction support. | Aart Bik | 2017-08-30 | 1 | -11/+19
    Rationale:
    Enables vectorization of x += .... for very basic (simple, same-type) constructs. Paves the way for more complex (narrower and/or mixed-type) constructs, which will be handled by the next CL.

    Test: test-art-host test-art-target
    Bug: 64091002
    Change-Id: I7880c135aee3ed0a39da9ae5b468cbf80e613766
* Set basic framework for detecting reductions. | Aart Bik | 2017-08-08 | 1 | -8/+33
    Rationale:
    Recognize reductions in loops. Note that reductions are *not* optimized yet (we would proceed with e.g. unrolling and vectorization). This CL merely sets up the basic detection framework. Also does a bit of cleanup on loop optimization code.

    Bug: 64091002
    Test: test-art-host
    Change-Id: I0f52bd7ca69936315b03d02e83da743b8ad0ae72
* Unrolling and dynamic loop peeling framework in vectorizer. | Aart Bik | 2017-06-27 | 1 | -8/+20
    Rationale:
    This CL introduces the basic framework for dynamically peeling (to obtain aligned access) and unrolling the vector loop (to reduce looping overhead and allow more target specific optimizations on e.g. SIMD loads and stores).

    NOTE:
    The current heuristics are "bogus" and merely meant to exercise the new framework. This CL focuses on introducing correct code for the vectorizer. Heuristics and the memory computations for alignment are to be implemented later.

    Test: test-art-target, test-art-host
    Change-Id: I010af1475f42f92fd1daa6a967d7a85922beace8
* Fix loop optimization in the presence of environment uses. | Nicolas Geoffray | 2017-06-22 | 1 | -1/+4
    We should not remove instructions that have deoptimize as users, or that have environment uses in a debuggable setup.

    bug: 62536525
    bug: 33775412
    Test: 656-loop-deopt
    Change-Id: Iaec1a0b6e90c6a0169f18c6985f00fd8baf2dece
* MIPS64: ART Vectorizer | Goran Jakovljevic | 2017-05-29 | 1 | -0/+1
    MIPS64 implementation which uses MSA extension. Also extended all relevant checker tests to test MIPS64 implementation.

    Test: booted MIPS64R6 in QEMU
    Test: ./testrunner.py --target --optimizing -j1 in QEMU
    Change-Id: I8b8a2f601076bca1925e21213db8ed1d41d79b52
* Support for narrow operands in "dangerous" operations. | Aart Bik | 2017-05-24 | 1 | -1/+5
    This is a revert^2 of commit 636e870d55c1739e2318c2180fac349683dbfa97.

    Rationale:
    Under strict conditions, even operations that are sensitive to higher order bits can vectorize by inspecting the operands carefully. This enables more vectorization, as demonstrated by the removal of quite a few TODOs.

    Test: test-art-target, test-art-host
    Change-Id: Ic2684f771d2e36df10432286198533284acaf472
* Revert "Support for narrow operands in "dangerous" operations." | Nicolas Geoffray | 2017-05-23 | 1 | -5/+1
    Fails on armv8 / speed-profile

    This reverts commit 636e870d55c1739e2318c2180fac349683dbfa97.

    Change-Id: Ib2a09b3adeba994c6b095672a1e08b32d3871872
* Support for narrow operands in "dangerous" operations. | Aart Bik | 2017-05-18 | 1 | -1/+5
    Rationale:
    Under strict conditions, even operations that are sensitive to higher order bits can vectorize by inspecting the operands carefully. This enables more vectorization, as demonstrated by the removal of quite a few TODOs.

    Test: test-art-target, test-art-host
    Change-Id: I2b0fda6a182da9aed9ce1708a53eaf0b7e1c9146
* Min/max SIMDization support. | Aart Bik | 2017-05-15 | 1 | -0/+1
    Rationale:
    The more vectorized, the better!

    Test: test-art-target, test-art-host
    Change-Id: I758becca5beaa5b97fab2ab70f2e00cb53458703
* Implement halving add idiom (with checker tests). | Aart Bik | 2017-04-19 | 1 | -7/+16
    Rationale:
    First of several idioms that map to very efficient SIMD instructions. Note that the is-zero-ext and is-sign-ext are general-purpose utilities that will be widely used in the vectorizer to detect low precision idioms, so expect that code to be shared with many CLs to come.

    Test: test-art-host, test-art-target
    Change-Id: If7dc2926c72a2e4b5cea15c44ef68cf5503e9be9
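A halving add averages two narrow values without losing the carry bit, which is why the operands must be recognized as zero- or sign-extended narrow values. The scalar model below is a sketch for unsigned 8-bit operands; the helper names are hypothetical, and the `rounded` variant is included for illustration rather than taken from the message.

```python
def halving_add_u8(a: int, b: int, rounded: bool = False) -> int:
    """(a + b) >> 1 computed in wider precision, for 8-bit unsigned inputs."""
    assert 0 <= a <= 0xFF and 0 <= b <= 0xFF
    return (a + b + (1 if rounded else 0)) >> 1


def naive_halving_add_u8(a: int, b: int) -> int:
    """What goes wrong if the sum is truncated to 8 bits before the shift."""
    return ((a + b) & 0xFF) >> 1


print(halving_add_u8(255, 255), naive_halving_add_u8(255, 255))
```

The first result keeps the carry out of bit 7 and the second drops it, which is the difference a dedicated SIMD halving-add instruction preserves.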
* Implemented ABS vectorization. | Aart Bik | 2017-04-05 | 1 | -0/+1
    Rationale:
    This CL adds the concept of vectorizing intrinsics to the ART vectorizer. More can follow (MIN, MAX, etc).

    Test: test-art-host, test-art-target (angler)
    Change-Id: Ieed8aa83ec64c1250ac0578570249cce338b5d36
* ART vectorizer. | Aart Bik | 2017-03-31 | 1 | -9/+106
    Rationale:
    Make SIMD great again with a retargetable and easily extendable vectorizer.

    Provides a full x86/x86_64 and a proof-of-concept ARM implementation. Sample improvement (without any perf tuning yet) for Linpack on x86 is about 20% to 50%.

    Test: test-art-host, test-art-target (angler)
    Bug: 34083438, 30933338
    Change-Id: Ifb77a0f25f690a87cd65bf3d5e9f6be7ea71d6c1
* Pass driver to loop opt. Add new side_effects phase. | Aart Bik | 2017-03-06 | 1 | -1/+8
    Rationale:
    Break-out CL of ART Vectorizer: number 3. The purpose is making the original CL smaller and easier to review.

    Bug: 34083438
    Test: test-art-host
    Change-Id: I7cece807ee4f5fcaeae41f1deed33ac263447b77
* Complete unrolling of loops with small body and trip count one. | Aart Bik | 2017-01-13 | 1 | -3/+5
    Rationale:
    Avoids the unnecessary loop control overhead, suspend check, and exposes more opportunities for constant folding in the resulting loop body. Fully unrolls loop in execute() of the Dhrystone benchmark (3% to 8% improvements).

    Test: test-art-host
    Change-Id: If30f38caea9e9f87a929df041dfb7ed1c227aba3
* Added polynomial induction variables analysis. With tests. | Aart Bik | 2016-12-09 | 1 | -0/+3
    Rationale:
    Information on polynomial sequences is nice to further enhance BCE and last-value assignment. In this case, this CL enables more loop optimizations for benchpress' Sum (80 x speedup). Also changed rem-based geometric induction to wrap-around induction.

    Test: test-art-host
    Change-Id: Ie4d2659edefb814edda2c971c1f70ba400c31111
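A simple example of a polynomial induction is an accumulator that adds the loop counter each iteration. Its last value has a closed form, so the loop can be replaced by a direct computation. This sketch illustrates that idea only; the function names are hypothetical and the actual benchpress Sum loop is not shown in the message.

```python
def loop_sum(k: int, n: int) -> int:
    """k follows a second-order polynomial sequence in the iteration count."""
    for i in range(n):
        k += i
    return k


def closed_form_sum(k: int, n: int) -> int:
    """Last value of k without running the loop: k + 0 + 1 + ... + (n - 1)."""
    if n <= 0:
        return k
    return k + n * (n - 1) // 2


print(loop_sum(5, 100), closed_form_sum(5, 100))
```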
* Account for early exit loop. | Aart Bik | 2016-11-04 | 1 | -3/+1
    Rationale:
    last value computation is obviously only right if the loop does not have early exits; only needed if cycle leaks to outside loop in any way.

    Bug: 32633772
    Test: 623-checker-loop-regressions
    Change-Id: Id60beca4704491cff611ad12a24bfc63c09d32c3
* Improved induction variable analysis and loop optimizations. | Aart Bik | 2016-10-24 | 1 | -0/+4
    Rationale:
    Rather than half-baked reconstructing cycles during loop optimizations, this CL passes the SCC computed during induction variable analysis to the loop optimizer (trading some memory for more optimizations). This further improves CaffeineLogic from 6000us down to 4200us (dx) and 2200us to 1690us (jack). Note that this is on top of prior improvements in previous CLs. Also, some narrowing type concerns are taken care of during transfer operations.

    Test: test-art-host
    Change-Id: Ice2764811a70073c5014b3a05fb51f39fd2f4c3c
* Enable last value generation of periodic sequence. | Aart Bik | 2016-10-18 | 1 | -1/+1
    Rationale:
    This helps to eliminate more dead induction. For example, CaffeineLogic when compiled with latest Jack improves with a 1.3 speedup (2900us -> 2200us) due to eliminating first loop (second loop can be removed also, but for a later case). The current benchmarks.dex has a different construct for the periodics, however, still to be recognized.

    Test: test-art-host
    Change-Id: Ia81649a207a2b1f03ead0855436862ed4e4f45e0
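A periodic sequence is one whose value cycles through a fixed set of values, for example two variables swapped on every iteration. Its last value depends only on the trip count modulo the period. The sketch below shows the idea for period 2; it is an illustration, not the CaffeineLogic loop, and the names are hypothetical.

```python
def swap_loop(x, y, n):
    """x alternates between its initial value and y's initial value."""
    for _ in range(n):
        x, y = y, x
    return x


def last_value(x, y, n):
    """Closed-form last value of x for a non-negative trip count n."""
    return x if n % 2 == 0 else y


print(swap_loop("a", "b", 7), last_value("a", "b", 7))
```

Once the last value is available in closed form, a loop whose only effect is the periodic update becomes dead and can be removed.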
* Improved and simplified loop optimizations. | Aart Bik | 2016-10-11 | 1 | -2/+8
    Rationale:
    Empty preheader simplification has been simplified to a much more general empty block removal optimization step. Incremental updating of induction variable analysis enables repeated elimination or simplification of induction cycles.

    This enabled an extra layer of optimization for e.g. Benchpress Loop (17.5us -> 0.24us -> 0.08us). So the original 73x speedup is now multiplied by another 3x, for a total of about 218x.

    Test: 618-checker-induction et al.
    Change-Id: I394699981481cdd5357e0531bce88cd48bd32879
* Improved and simplified loop optimizations. | Aart Bik | 2016-10-07 | 1 | -3/+11
    Rationale:
    This CL merges some common cases into one, thereby simplifying the code quite a bit. It also prepares for more general induction cycles (rather than the simple phi-add currently used). Finally, it generalizes the closed form elimination with empty loops. As a result of the latter, elaborate but weird code like:

      private static int waterFall() {
        int i = 0;
        for (; i < 10; i++);
        for (; i < 20; i++);
        for (; i < 30; i++);
        for (; i < 40; i++);
        for (; i < 50; i++);
        return i;
      }

    now becomes just this (on x86)!

      mov eax, 50
      ret

    Change-Id: I8d22ce63ce9696918f57bb90f64d9a9303a4791d
    Test: m test-art-host
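The waterFall result can be checked with a small model: the last value of an empty counting loop `for (; i < bound; i++);` is `max(i, bound)`, and chaining that closed form five times gives 50. The Python below is only a sketch of that reasoning, with hypothetical function names.

```python
BOUNDS = (10, 20, 30, 40, 50)


def water_fall_loops():
    """Direct translation of the five empty counting loops."""
    i = 0
    for bound in BOUNDS:
        while i < bound:
            i += 1
    return i


def water_fall_closed_form():
    """Replace each empty loop by its last value, max(i, bound)."""
    i = 0
    for bound in BOUNDS:
        i = max(i, bound)
    return i


print(water_fall_loops(), water_fall_closed_form())
```

With constant bounds, the chained `max` folds to the constant 50, which corresponds to the `mov eax, 50` in the message.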
* Refactoring of graph linearization and linear order. | Aart Bik | 2016-10-05 | 1 | -9/+7
    Rationale:
    Ownership of graph's linear order and iterators was a bit unclear now that other phases are using it. New approach allows phases to compute their own order, while ssa_liveness is sole owner for graph (since it is not mutated afterwards).

    Also shortens lifetime of loop's arena.

    Test: test-art-host
    Change-Id: Ib7137d1203a1e0a12db49868f4117d48a4277f30
* Make it possible to pass an arena allocator to HLoopOptimization. | Nicolas Geoffray | 2016-10-05 | 1 | -0/+3
    loop_optimization_test uses memory from HLoopOptimization's allocator, which is scoped by the Run method. Fix is to pass custom allocator.

    test: m test-art-host-gtest
    Change-Id: I359330e22202519f400a26da5403eeb00f0b2db4
* Properly scope HLoopOptimization's allocator. | Nicolas Geoffray | 2016-10-05 | 1 | -1/+1
    HOptimization classes do not get their destructor called, as they are arena objects. So the scope for the optimization allocator needs to be the Run method.

    Also anticipate bisection search breakage by adding HLoopOptimization to the list of recognized optimizations.

    Change-Id: I7770989c39d5700a3b6b0a20af5d4b874dfde111