Bonus changes:
* Add an undefined Thumb instruction between the last code block and the
beginning of the PC reconstruction cells to catch such codegen problems on
the spot.
* Fix a loop formation problem to exclude nested loops.
Bug: 4320840
Change-Id: I49d3fbba0073d8c2d4a0b241258239cb952c6bdd
See http://b/issue?id=4271784 for details.
Three fixes:
1. Verify the code cache version hasn't changed between completion
of the translation and registering it in JitTable
2. When code cache full detected during translating a trace, mark
the "discard" flag on the work order.
3. [The actual cause of the bug] When doing a code cache flush,
traverse the thread list and cancel any trace selections in
progress.
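A minimal sketch of fixes 1 and 2, assuming the code cache carries a version counter that every flush increments. WorkOrder, codeCacheVersion, flushCodeCache and mayRegisterTranslation are hypothetical names, not the actual Dalvik API; the point is only that a finished translation is registered when its starting version still matches and its discard flag is clear.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical global: bumped every time the code cache is flushed. */
static unsigned codeCacheVersion = 0;

typedef struct {
    const void *dalvikPC;   /* start of the trace being compiled */
    bool discard;           /* fix 2: set when the cache filled up mid-translation */
    unsigned startVersion;  /* cache version when translation started */
} WorkOrder;

static void beginTranslation(WorkOrder *work) {
    work->startVersion = codeCacheVersion;
}

static void flushCodeCache(void) {
    codeCacheVersion++;     /* invalidates every in-flight translation */
}

/* Fix 1: only register in the JitTable if no flush happened in between. */
static bool mayRegisterTranslation(const WorkOrder *work) {
    if (work->discard)
        return false;
    return work->startVersion == codeCacheVersion;
}
```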
Change-Id: Ifea70416d7d91637fb742fc8de11044a89358caa
On SMP systems, Dalvik opcodes referencing volatile fields will be
rewritten to their _VOLATILE variant. On non-SMP systems, though,
this rewriting is not done. The JIT, however, needs to know about
volatility for all systems in order to avoid performing unsafe
optimizations. This change fixes the JIT's volatility test to check
either the _VOLATILE opcode or the volatile flag in the field access
bits, depending on SMP type.
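A sketch of the volatility test described above. ACC_VOLATILE (0x0040) is the standard Java access flag for volatile fields; the opcode enum and isVolatileAccess are hypothetical stand-ins for the JIT's own definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard Java/Dalvik access flag for volatile fields. */
#define ACC_VOLATILE 0x0040u

/* Hypothetical subset of field-access opcodes. */
typedef enum {
    OP_IGET,
    OP_IGET_VOLATILE,
    OP_IPUT,
    OP_IPUT_VOLATILE
} Opcode;

static bool isVolatileOpcode(Opcode op) {
    return op == OP_IGET_VOLATILE || op == OP_IPUT_VOLATILE;
}

/*
 * SMP builds rewrite volatile accesses to _VOLATILE opcodes, so the opcode
 * is authoritative there. Non-SMP builds skip the rewrite, so fall back to
 * the field's access flags.
 */
static bool isVolatileAccess(Opcode op, uint32_t fieldAccessFlags, bool smp) {
    if (smp)
        return isVolatileOpcode(op);
    return (fieldAccessFlags & ACC_VOLATILE) != 0;
}
```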
Change-Id: I2edde58dc25f22cba88f62c5f1a2d125473309e6
This reverts commit 11fb99d598ebe640719743a0d3bd7ed091e5be03.
On SMP systems, Dalvik opcodes referencing volatile fields will be
rewritten to their _VOLATILE variant. On non-SMP systems, though,
this rewriting is not done. The JIT, however, needs to know about
volatility for all systems in order to avoid performing unsafe
optimizations. This change fixes the JIT's volatility test to check
either the _VOLATILE opcode or the volatile flag in the field access
bits, depending on SMP type.
Change-Id: If485875c5abbf0147c3ef4b6d557faa89ed85426
In the past, it was possible to have a volatile field with a
non-volatile opcode. This is no longer the case, so this change
eliminates the volatile field flag check.
Change-Id: I1cface4e813144634b2f90732c76b0a16f08c304
Change-Id: I79cdac62baa40582bba160a04cbd4c8b2c9151a5
Fix a deadlock situation.
Bug: 4192964
Change-Id: I27f869d90d58f67e675a65444ebed6fdf2a5f518
Adapt the existing counted loop analysis and range/null check
elimination code to work with the new loop building heuristics.
Cleaned up the old ad-hoc loop builder.
Suspend polling is enabled by default for loops. The backward chaining
cell will be used in self-verification and profiling mode.
If the loop includes accesses to resolved fields/classes, abort code
generation for now and revert to the basic acyclic trace. Added
tests/090-loop-formation to make sure the JIT won't choke on such
instructions.
Change-Id: Idbc57df0a745be3b692f68c1acb6d4861c537f75
Neglected to git-add the most recent version in Change I0c0b6d17.
Change-Id: I5f9e630f652edcf70ab893ade6f559056ed31f8f
We were missing a dozen or so of the new jumbo ops in the JIT's
code generator. Interestingly, our compiler error recovery mechanism was
good enough that we didn't notice (we die on an assert build,
but silently recover and continue on a production build).
Change-Id: I0c0b6d1704c47e81b39c7dcf7d1172dbdcd29856
Change-Id: If38ce20b2ac30488118259e3b3bdfbc9f068e5c3
Fix a few miscellaneous bugs from the interpreter restructuring that were
causing a segfault on debugger attach.
Added a sanity checking routine for debugging.
Fixed a problem in which the JIT's threshold and on/off switch
wouldn't get initialized properly on thread creation.
Renamed dvmCompilerStateRefresh() to dvmCompilerUpdateGlobalState() to
better reflect its function.
Change-Id: I5b8af1ce2175e3c6f53cda19dd8e052a5f355587
Since the assembler is very robust and will recover from such problems,
adding the verbose/noisy mode will make it easier to detect overly
aggressive optimizations that don't actually work.
Example:
D/dalvikvm( 2348): Assembler abort #1 on 1
D/dalvikvm( 2348): kThumbBCond@16: delta=260
:
Instruction at 0x16 is a conditional branch:
D/dalvikvm( 2348): 0x16 (0016): beq 0x0000001a (L0xb6c0c)
:
Label at L0xb6c0c is a PC reconstruction cell:
D/dalvikvm( 2348): L0xb6c0c:
D/dalvikvm( 2348): -------- reconstruct dalvik PC : 0x401854d6 @ +0x002b
D/dalvikvm( 2348): 0x11e (011e): ldr r0, [r15pc, #0]
D/dalvikvm( 2348): 0x122 (0122): b 0x00000126 (L0xb685c)
where 0x11e - 0x16 - 4 = 260
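The arithmetic above can be checked with a short sketch. It assumes kThumbBCond is the 16-bit Thumb conditional branch (encoding T1), whose signed 8-bit immediate is scaled by 2, giving a reach of -256..+254 bytes from PC+4; a delta of 260 therefore cannot be encoded. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* In Thumb state the PC reads as the instruction address plus 4. */
static int32_t thumbBranchDelta(uint32_t pc, uint32_t target) {
    return (int32_t)(target - pc - 4);
}

/* 16-bit conditional branch: signed imm8 scaled by 2, so the delta must be
 * even and lie in -256..+254. */
static bool fitsThumbBCond(int32_t delta) {
    return (delta & 1) == 0 && delta >= -256 && delta <= 254;
}
```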
Change-Id: Icbc3dae581949f5976722e24e38f04ec882c7d79
This is a restructuring of the Dalvik ARM and x86 interpreters:
o Combine the old portstd and portdbg interpreters into a single
portable interpreter.
o Add debug/profiling support to the fast (mterp) interpreters.
o Delete old mechanism of switching between interpreters. Now, once
you choose an interpreter at startup, you stick with it.
o Allow JIT to co-exist with profiling & debugging (necessary for
first-class support of debugging with the JIT active).
o Add single-step capability to the fast assembly interpreters without
slowing them down (and, in fact, measurably improve their performance).
o Remove old "polling for safe point" mechanism. Breakouts are now achieved
by modifying the base of the interpreter handler table.
o Simplify interpreter control mechanism.
o Allow thread-granularity control for profiling & debugging
The primary motivation behind this change was to improve the responsiveness
of debugging and profiling and to make it easier to add new debugging and
profiling capabilities in the future. Instead of always bailing out to the
slow debug portable interpreter, we can now stay in the fast interpreter.
A nice side effect of the change is that the fast interpreters
got a healthy speed boost because we were able to replace the
polling safepoint check that involved a dozen or so instructions
with a single table-base reload. When combined with the two earlier CLs
related to this restructuring, we show a 5.6% performance improvement
using libdvm_interp.so on the Checkers benchmark relative to Honeycomb.
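A minimal sketch of table-base breakouts, assuming a toy two-opcode bytecode. Interp, mainTable, altTable and requestBreakout are hypothetical names, and the real mterp handlers are assembly rather than C functions; the sketch only shows that the dispatch loop carries no poll, and a break-out is requested by swapping the handler table base.

```c
/* Hypothetical two-opcode bytecode, used only to illustrate the idea. */
enum { OP_NOP, OP_HALT, NUM_OPS };

typedef struct Interp Interp;
typedef void (*Handler)(Interp *);

struct Interp {
    const unsigned char *code;
    int pc;
    int running;
    const Handler *curHandlerTable;  /* base reloaded on every dispatch */
    int safePoints;                  /* break-outs serviced so far */
};

static void opNop(Interp *s)  { s->pc++; }
static void opHalt(Interp *s) { s->running = 0; }

static const Handler mainTable[NUM_OPS] = { opNop, opHalt };

/* Every alt-table entry services the safe point, switches back to the fast
 * table, then runs the real handler for the current opcode. */
static void altSafePoint(Interp *s) {
    s->safePoints++;                 /* suspend/debugger/profiler work */
    s->curHandlerTable = mainTable;
    mainTable[s->code[s->pc]](s);
}

static const Handler altTable[NUM_OPS] = { altSafePoint, altSafePoint };

/* No explicit poll: each step is one indexed call through the table base. */
static void run(Interp *s) {
    while (s->running)
        s->curHandlerTable[s->code[s->pc]](s);
}

/* Single-threaded illustration; a real VM needs proper memory ordering. */
static void requestBreakout(Interp *s) {
    s->curHandlerTable = altTable;
}
```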
Change-Id: I8d37e866b3618def4e582fc73f1cf69ffe428f3c
When seeing a trace that ends with a backward branch, exhaustively walk all code
blocks reachable from that trace and try to identify whether there exists a
non-nested loop. If the derived loop is found to be too complex or only
acyclic code is seen, revert to the original compilation mechanism to
translate a simple trace.
This CL uses the whole-method parser/dataflow analysis framework to
identify such loops. No optimization or codegen is performed yet.
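A sketch of the back-edge test behind identifying a loop among the blocks reachable from a trace, assuming a toy CFG where each block has at most two successors. Accepting exactly one back edge is a simplified stand-in for the CL's "non-nested and not too complex" criteria, not the actual heuristic; Block and isSimpleLoop are hypothetical.

```c
#include <stdbool.h>

#define MAX_BLOCKS 16

/* Hypothetical CFG node: up to two successors, -1 means none. */
typedef struct {
    int succ[2];
} Block;

enum { WHITE, GRAY, BLACK };

/* DFS: an edge to a GRAY block (still on the DFS stack) is a back edge. */
static int countBackEdges(const Block *blocks, int b, int *color) {
    int found = 0;
    color[b] = GRAY;
    for (int i = 0; i < 2; i++) {
        int s = blocks[b].succ[i];
        if (s < 0)
            continue;
        if (color[s] == GRAY)
            found++;
        else if (color[s] == WHITE)
            found += countBackEdges(blocks, s, color);
    }
    color[b] = BLACK;
    return found;
}

/* Zero back edges: acyclic code. More than one: possibly nested or complex,
 * so fall back to compiling a simple trace. */
static bool isSimpleLoop(const Block *blocks, int numBlocks, int entry) {
    int color[MAX_BLOCKS];
    if (numBlocks > MAX_BLOCKS)
        return false;
    for (int i = 0; i < numBlocks; i++)
        color[i] = WHITE;
    return countBackEdges(blocks, entry, color) == 1;
}
```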
Bug: 4086718
Change-Id: I19ed3ee53ea1cbda33940c533de8e9220e647156
For example:
chaining cell (predicted): Ljava/lang/Object;getClass
Change-Id: Ia53340baab87d6b744fc7189b141737a4a54cc42
1) Split the original literal pool into class object literals and
constants. Elements in the class object pool have to match the special
values exactly (i.e., no +delta space optimizations) since they might be
relocated.
2) Implement dvmJitScanAllClassPointers(void (*callback)(void *)),
which is the entry routine to report all memory locations in the code cache
that contain class objects (i.e., the class object pool and predicted chaining
cells for virtual calls).
3) Major codegen changes to how/when the class object pool is populated
and how predicted chains are patched. Before this change the compiler
thread is always in the VM_WAIT state, which won't prevent GC from
running. Since the class object pointers captured by a worker thread
are no longer guaranteed to be stable at JIT time, change various
internal data structures to capture the class descriptor/loader
tuple instead. The conversion from descriptor/loader tuple to actual
class object pointers is only performed when the thread state is
RUNNING or at a GC safe point.
4) Separate the class object installation phase out of the main
dvmCompilerAssembleLIR routine so that the impact on blocking GC
requests is minimal. Add new stats to report the potential block time.
For example:
Potential GC blocked by compiler: max 46 us / avg 25 us
5) Various cleanup in the trace structure walkup code. Modified the
verbose print routine to show the class descriptor in the class literal
pool. For example:
D/dalvikvm( 1450): -------- end of chaining cells (0x007c)
D/dalvikvm( 1450): 0x44020628 (00b4): .class (Lcom/android/unit_tests/PerformanceTests$EmptyClass;)
D/dalvikvm( 1450): 0x4402062c (00b8): .word (0xaca8d1a5)
D/dalvikvm( 1450): 0x44020630 (00bc): .word (0x401abc02)
D/dalvikvm( 1450): End
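A sketch of the callback contract in item 2, using a hypothetical scanClassPointers with the same callback shape as dvmJitScanAllClassPointers. The CompiledTrace layout and countSlot visitor are assumptions; the relevant detail is that the scanner reports the address of each slot, so a GC visitor could update a relocated class pointer in place.

```c
#include <stddef.h>

/* Hypothetical predicted chaining cell: may cache a receiver class. */
typedef struct {
    void *clazz;        /* cached class of the receiver, or NULL */
    void *method;       /* predicted target method */
} PredictedChainCell;

/* Hypothetical translation layout: class literal pool plus chaining cells. */
typedef struct {
    void *classPool[4];
    size_t numClassLiterals;
    PredictedChainCell cells[4];
    size_t numCells;
} CompiledTrace;

/* Report the address of every slot that holds a class object. */
static void scanClassPointers(CompiledTrace *trace, void (*callback)(void *)) {
    for (size_t i = 0; i < trace->numClassLiterals; i++)
        callback(&trace->classPool[i]);
    for (size_t i = 0; i < trace->numCells; i++)
        if (trace->cells[i].clazz != NULL)
            callback(&trace->cells[i].clazz);
}

/* Example visitor standing in for a GC pass: just counts slots. */
static size_t visitedSlots = 0;
static void countSlot(void *slot) {
    (void)slot;
    visitedSlots++;
}
```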
Bug: 3482956
Change-Id: I2e736b00d63adc255c33067544606b8b96b72ffc
The invoke-object-init instruction pretends to be a regular invoke
that only knows how to call Object.<init>. As such it always takes
one argument, and if we use the /range version we can specify the
"this" register with 16 bits instead of only 4.
Bug 3486699
Change-Id: I9ee4700c6935beee1dcbaa583b57befd33641414
The current implementation reconstructs the leaf Dalvik frame and
punts to the interpreter, since the amount of work involved in matching
each catch block and walking through the stack frames is just not worth
JIT'ing.
Additional changes:
- Fixed a control-flow bug where a block that ends with a throw shouldn't
have a fall-through block.
- Fixed a code cache lookup bug so that method-based compilation is
guaranteed a slot in the profiling table.
- Created separate handler routines based on opcode format for the
method-based JIT.
- Renamed a few core registers that also have special meanings to the
VM or ARM architecture.
Change-Id: I429b3633f281a0e04d352ae17a1c4f4a41bab156
The key data structure for the interpreter is InterpState.
This change eliminates it, merging its data with the Thread structure.
Here's why:
In the beginning, Fadden created Thread and InterpState. And it was good.
Thread holds thread-private state, while InterpState captures data
associated with a Dalvik interpreter activation. Because JNI calls
can result in nested interpreter invocations, we can have more than one
InterpState for each actual thread. InterpState was relatively small,
and it all worked well. It was used enough that in the Arm version
a register (rGLUE) was dedicated to it.
Then, along came the JIT guys, who saw InterpState as a convenient place
to dump all sorts of useful data that they wanted quick access to through
that dedicated register. InterpState grew and grew. In terms of
space, this wasn't a big problem - but it did mean that the initialization
cost of each interpreter activation grew as well. For applications
that do a lot of callbacks from native code into Dalvik, this is
measurable. It's also mostly useless cost because much of the JIT-related
InterpState initialization was setting up useful constants - things that
don't need to be saved and restored all the time.
The biggest problem, though, deals with thread control. When something
interesting is happening that needs all threads to be stopped (such as
GC and debugger attach), we have access to all of the Thread structures,
but we don't have access to all of the InterpState structures (which
may be buried/nested on the native stack). As a result, polling for
thread suspension is done via a one-indirection pointer chase. InterpState
itself can't hold the stop bits because we can't always find it, so
instead it holds a pointer to the global or thread-specific stop control.
Yuck.
With this change, we eliminate InterpState and merge all needed data
into Thread. Further, we replace the dedicated rGLUE register with a
pointer to the Thread structure (rSELF). The small subset of state
data that needs to be saved and restored across nested interpreter
activations is collected into a record that is saved to the interpreter
frame, and restored on exit. Further, these small records are linked
together to allow tracebacks to show nested activations. Old InterpState
variables that simply contain useful constants are initialized once at
thread creation time.
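A sketch of the linked save records, assuming a hypothetical InterpSaveState holding only a PC, a frame pointer and a link; the real record's fields are not listed in this CL. The caller's interpreter frame provides the spill slot, and walking the links mirrors how a traceback could show nested activations.

```c
#include <stddef.h>

/* Hypothetical per-activation state that must survive nesting. */
typedef struct InterpSaveState {
    const unsigned short *pc;        /* Dalvik PC of this activation */
    void *curFrame;                  /* current Dalvik frame pointer */
    struct InterpSaveState *prev;    /* enclosing activation, or NULL */
} InterpSaveState;

typedef struct {
    InterpSaveState interpSave;      /* state of the running activation */
    int jitThreshold;                /* constant: set once at thread creation */
} Thread;

/* Nested entry (e.g. JNI calling back into Dalvik): spill and link. */
static void enterNested(Thread *self, InterpSaveState *spill) {
    *spill = self->interpSave;
    self->interpSave.prev = spill;
}

/* Exit: restore the enclosing activation's state. */
static void exitNested(Thread *self, const InterpSaveState *spill) {
    self->interpSave = *spill;
}

/* Count activations by following the links. */
static int countActivations(const Thread *self) {
    int n = 0;
    for (const InterpSaveState *s = &self->interpSave; s != NULL; s = s->prev)
        n++;
    return n;
}
```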
This CL is large enough by itself that the new ability to streamline
suspend checks is not done here - that will happen in a future CL. Here
we just focus on consolidation.
Change-Id: Ide6b2fb85716fea454ac113f5611263a96687356
This shifts responsibility for marking an object as "finalizable" from
object creation to object initialization. We want to make the object
finalizable when Object.<init> completes. For performance reasons we
skip the call to the Object constructor (which doesn't do anything)
and just take the opportunity to check the class flag.
Handling of clone()d objects isn't quite right yet.
Also, fixed a minor glitch in stubdefs.
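A minimal sketch of the check performed in place of calling the empty Object constructor. The flag name CLASS_IS_FINALIZABLE and the struct layouts are hypothetical, and registering the object for finalization is modeled as setting a boolean.

```c
#include <stdbool.h>

/* Hypothetical class flag marking classes that override finalize(). */
#define CLASS_IS_FINALIZABLE (1u << 0)

typedef struct { unsigned accessFlags; } ClassObject;
typedef struct { const ClassObject *clazz; bool finalizable; } Object;

/* Runs when Object.<init> would have completed: check the class flag and
 * mark the object finalizable if it is set. */
static void objectInitCompleted(Object *obj) {
    if (obj->clazz->accessFlags & CLASS_IS_FINALIZABLE)
        obj->finalizable = true;
}
```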
Bug 3342343
Change-Id: I5b7b819079e5862dc9cbd1830bb445a852dc63bf
The debugging profile mode prints out a list of the top ten traces,
followed by recompilations. In some cases, it is possible that a trace
was requested, but did not finish compiling before the run ended. If
so, that could cause the dump to fail. This CL adds a check for null
codeAddress to detect those cases.
Change-Id: I415fd94d8fa9e270f75d5114fa5cc5d993bd6997
Even though execute-inline is now a mandatory optimization, you can't be sure
the inline natives will be invoked that way. There's reflection and JNI, for
example, and there's the special case of String.equals that might be invoked
as Object.equals. This patch adds a regular native method corresponding to
each inline native, so that a corresponding libcore patch can drop its
implementations. (For example, despite the fact that we all believed last week
that the Java implementation of String.equals is never used, that turned out
not to be true: every HashMap lookup will have used it. This pair of patches
brings reality in line with our existing belief.)
Change-Id: I19e64c23bea83e91696206ca40ce4e3faf853040
The polling is expensive for now as it is done through three
instructions: ld/ld/branch. As a result, a bunch of bonus stuff has
been worked on to mitigate the extra overhead:
- Cleaned up resource flags for memory disambiguation.
- Rewrote load/store elimination and scheduler routines to hide
the ld/ld latency for the GC flag. Separated the dependency checking into
a memory disambiguation part and a resource conflict part.
- Allowed code motion for Dalvik/constant/non-aliasing loads to be
hoisted above branches for null/range checks.
- Created extended basic blocks following goto instructions so that
longer instruction streams can be optimized as a whole.
Without the bonus stuff, performance dropped by about 5-10% on some
benchmarks because of the lack of headroom to hide the polling latency
in tight loops. With the bonus stuff, the performance delta is between
+/-5% with polling code generated. With the bonus stuff and polling
disabled, the new code provides consistent performance
improvements:
CaffeineMark 3.6%
Linpack 11.1%
Scimark 9.7%
Sieve 33.0%
Checkers 6.0%
As a result, GC polling is disabled by default but can be turned on
through the -Xjitsuspendpoll flag for experimental purposes.
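A sketch of why the poll costs ld/ld/branch: the thread reaches its suspend flag through a pointer, so the check is two dependent loads and a conditional branch on the loop's back edge. ThreadState, pollForSuspend and the poll argument (standing in for -Xjitsuspendpoll) are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical thread state: load #1 fetches the flag pointer. */
typedef struct {
    volatile const int *suspendFlag;
} ThreadState;

static int suspendsHonored = 0;

static void handleSuspendRequest(ThreadState *self) {
    (void)self;
    suspendsHonored++;   /* a real VM would block here until resumed */
}

/* Load #2 reads the flag; the compare/branch skips the slow path. */
static void pollForSuspend(ThreadState *self) {
    if (*self->suspendFlag != 0)
        handleSuspendRequest(self);
}

/* A tight loop with an optional poll on its back edge. */
static long sumWithPolling(ThreadState *self, const int *v, int n, bool poll) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        sum += v[i];
        if (poll)
            pollForSuspend(self);
    }
    return sum;
}
```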
Change-Id: Ia81fc85de3e2b70e6cc93bc37c2b845892003cdb
into dalvik-dev
Bug: 3448446
Change-Id: I98e2bbc4886443ba3c27c2963d7540fcee5790bb
The invoke-direct-empty instruction was introduced to remove the
overhead of calling the empty Object constructor. We now need it
to do some extra work on behalf of object construction, so it's
appropriate to change the instruction name to match the role it
fills rather than the more general role it was hoped to fill.
No functional changes.
Bug 3342343
Change-Id: I65dd6a2c00c99581c9a19b16fe193b70642c8fbb
- Set up resource masks correctly for Thumb push/pop when LR/PC are involved.
- Preserve LR around simulated heap references under self-verification mode.
- Compact a few simple flags in ArmLIR into bit fields.
- Minor performance tuning in TEMPLATE_MEM_OP_DECODE
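A sketch of the kind of compaction described in the third bullet. The field names are hypothetical, not ArmLIR's actual members; on typical ABIs the packed form fits in one 32-bit word while the bool-plus-int form needs two.

```c
#include <stdbool.h>

/* Before: separate bools plus an int, padded to the int's alignment. */
typedef struct {
    bool isNop;
    bool insertWrapper;
    bool pcRelFixup;
    int size;                       /* instruction size in bytes */
} LirFlagsWide;

/* After: the same information packed into bit fields of one word. */
typedef struct {
    unsigned int isNop : 1;
    unsigned int insertWrapper : 1;
    unsigned int pcRelFixup : 1;
    unsigned int size : 3;          /* holds 0..7; Thumb sizes are 2 or 4 */
} LirFlagsPacked;
```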
Change-Id: Id73edac837c5bb37dfd21f372d6fa21c238cf42a
Closes a window in which the "interpret-only" template could get chained
to an existing trace while the intended translation was under construction.
Note that this CL also introduces some small, but fundamental changes in trace
formation:
1. Previously, when an exception or other trace terminating event
occurred during trace formation, the entire trace was abandoned. With this
change, we instead end the trace at the last successful instruction.
2. We previously allowed multiple attempts (perhaps by multiple threads)
to form a trace compilation request for a dalvik PC. This was done in an
attempt to allow recovery from compiler failures. Now we enforce a new rule:
only the thread that wins the race to allocate an entry in the JitTable will
form the trace request.
3. In a (probably misguided) attempt to avoid unnecessary contention, we
previously allowed work order enqueue requests to be dropped if a requester
did not acquire TableLock on the first attempt (assuming that if the trace were
hot, it would be requested again). Now we block on enqueue.
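A sketch of rule 2, assuming JitTable slots are claimed with a compare-and-swap. With C11 atomics exactly one thread can swap a slot from NULL to its Dalvik PC, and only that thread forms the trace request. The table size, hash and claimJitEntry are hypothetical; collisions and probing are ignored.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64

/* Hypothetical JitTable: each slot records the Dalvik PC that owns it.
 * Static atomics are zero-initialized, i.e. every slot starts as NULL. */
static _Atomic(const void *) jitTable[TABLE_SIZE];

static size_t hashPC(const void *dPC) {
    return (size_t)(((uintptr_t)dPC >> 1) % TABLE_SIZE);
}

/* Returns true only for the single caller that wins the race. */
static bool claimJitEntry(const void *dPC) {
    const void *expected = NULL;
    return atomic_compare_exchange_strong(&jitTable[hashPC(dPC)], &expected, dPC);
}
```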
Change-Id: I40ea4f1b012250219ca37d5c40c5f22cae2092f1
This feature has been in the code base for several releases but has never
been enabled.
Change-Id: Ia770b03ebc90a3dc7851c0cd8ef301f9762f50db
Enhanced code cache management to accommodate both trace and method
compilations. Also implemented a hacky dispatch routine for virtual
leaf methods.
A microbenchmark showed a 3x speedup in leaf method invocation.
Change-Id: I79d95b7300ba993667b3aa221c1df9c7b0583521
C++ dislikes variables named template (it is a reserved keyword).
Change-Id: I6aaf623b449bfdb0c88b9664c55824268992058d
Change-Id: I366df47ebb597a629cb50046320ee3a6230d1ed9
1) Thumb 'push' can handle lr and 'pop' can handle pc, so make use of them.
2) Thumb2 push was incorrectly encoded as stmia; it should have been
stmdb.
None of the above affect the code that we currently ship.
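A sketch of why push must be stmdb, using a hypothetical word-addressed memory model. On a full-descending stack, push decrements sp before storing (STMDB sp!) and pop loads upward and then increments (LDMIA sp!). Encoding push as STMIA would instead store at and above the current sp, overwriting live stack data and moving sp the wrong way.

```c
#include <stdint.h>

/* Tiny word-addressed memory model; sp is a word index. */
#define MEM_WORDS 16

typedef struct {
    uint32_t mem[MEM_WORDS];
    uint32_t sp;            /* stack grows toward lower indices */
} Machine;

/* push == STMDB sp!: decrement first, lowest register at lowest address. */
static void push(Machine *m, const uint32_t *regs, int n) {
    m->sp -= (uint32_t)n;
    for (int i = 0; i < n; i++)
        m->mem[m->sp + i] = regs[i];
}

/* pop == LDMIA sp!: load upward from sp, then increment. */
static void pop(Machine *m, uint32_t *regs, int n) {
    for (int i = 0; i < n; i++)
        regs[i] = m->mem[m->sp + i];
    m->sp += (uint32_t)n;
}
```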
Change-Id: I89ab46b032a3d562355c2cc3bc05fe308ba40957
I wanted the code that JITs a call to a C function extracted so that I can
potentially use it elsewhere. The functions that sometimes JIT instructions directly and
other times bail out to C can now call this, simplifying the body of the
switch. I think there's a behavioral change here with the ThumbVFP
genInlineSqrt, which previously had the wrong return value.
Tested on passion to ensure that the performance characteristics of assembler
intrinsics, C intrinsics, and library native methods haven't changed (using
the Math and Float classes).
Change-Id: Id79771a31abe3a516f403486454e9c0d9793622a
This change builds on an earlier bccheng change that allowed JIT'd code
to avoid reverting to the debug portable interpreter when doing traceview-style
method profiling. That CL introduced a new traceview build (libdvm_traceview)
because the performance delta was too great to enable the capability for
all builds.
In this CL, we remove the libdvm_traceview build and provide full-speed
method tracing in all builds. This is done by introducing "_PROF"
versions of invoke and return templates used by the JIT. Normally, these
templates are not used, and performance is unaffected. However, when method
profiling is enabled, all existing translations are purged and new translations
are created using the _PROF templates. These templates introduce a
smallish performance penalty above and beyond the actual tracing cost, but
again are only used when tracing has been enabled.
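A sketch of the template selection described above. The enum values and JitState are hypothetical, and the purge is modeled as resetting a counter. This CL only states that translations are purged when profiling is enabled, so purging on disable as well is an assumption.

```c
#include <stdbool.h>

/* Hypothetical template identifiers. */
typedef enum {
    TEMPLATE_INVOKE_METHOD,
    TEMPLATE_INVOKE_METHOD_PROF,
    TEMPLATE_RETURN,
    TEMPLATE_RETURN_PROF
} TemplateId;

typedef struct {
    bool methodTraceActive;     /* set while traceview profiling is on */
    unsigned translationCount;  /* number of live translations */
} JitState;

/* At codegen time, pick the _PROF variant only while tracing is on. */
static TemplateId selectInvokeTemplate(const JitState *s) {
    return s->methodTraceActive ? TEMPLATE_INVOKE_METHOD_PROF
                                : TEMPLATE_INVOKE_METHOD;
}

static TemplateId selectReturnTemplate(const JitState *s) {
    return s->methodTraceActive ? TEMPLATE_RETURN_PROF : TEMPLATE_RETURN;
}

/* Toggling tracing purges translations (assumed for both directions) so
 * new ones are generated with the matching templates. */
static void setMethodTracing(JitState *s, bool on) {
    if (s->methodTraceActive != on) {
        s->methodTraceActive = on;
        s->translationCount = 0;    /* stands in for a code cache flush */
    }
}
```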
Strictly speaking, there is a slight burden that is placed on invokes and
returns in the non-tracing case - on the order of an additional 3 or 4
cycles per invoke/return. Those operations are already heavyweight enough
that I was unable to measure the added cost in benchmarks.
Change-Id: Ic09baf4249f1e716e136a65458f4e06cea35fc18
* commit '45e9a9908f8874b64294dbd3e4dcfb6b76c4b6e3':
Only generate debugging LIRs in verbose mode.
This should reduce memory usage and JIT time a bit.
Affected opcodes: kArmPseudoSSARep and kArmPseudoDalvikByteCodeBoundary.
Change-Id: I18ce9338b8d258270df51a66f9dc98cd2d9dd0e8
This enables jumbo opcodes by default, and they will get used by the
current build without modification. Support has been added for arm, x86,
and the portable interpreter. x86-atom support is on the TODO list. This
commit also includes a test for the new jumbo opcodes.
Change-Id: Ic3f1b41b51645861c5196f76aaf0e96e727ea537
Change-Id: I56b52104f50d2e67115227e61e4b250e1116135d
Only register entry points dispatched through [r6+#offset] in
JitToInterpEntries.
For ARM targets, check the size of JitToInterpEntries explicitly to
make sure that its last entry is within 128 bytes of InterpState
due to the Thumb codegen constraint.
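A sketch of such an explicit size check as a compile-time assertion. It assumes the constraint comes from the 16-bit Thumb LDR (immediate) encoding, whose 5-bit word offset reaches only 0..124 bytes from the base register; InterpStateLike and the entry count are hypothetical.

```c
#include <stddef.h>

#define NUM_JIT_TO_INTERP_ENTRIES 6

/* Hypothetical layout: a few fields precede the entry table. */
typedef struct {
    const void *pc;
    void *fp;
    int jitState;
    void *jitToInterpEntries[NUM_JIT_TO_INTERP_ENTRIES];
} InterpStateLike;

/* Byte offset of the last entry point from the start of the structure. */
#define LAST_ENTRY_OFFSET \
    (offsetof(InterpStateLike, jitToInterpEntries) + \
     (NUM_JIT_TO_INTERP_ENTRIES - 1) * sizeof(void *))

/* "ldr rX, [r6, #imm]" in 16-bit Thumb reaches at most 124 bytes. */
_Static_assert(LAST_ENTRY_OFFSET <= 124,
               "JitToInterpEntries must stay within 128 bytes of r6");
```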
Change-Id: I74184115cb3a3c89afc3a5fe53685671d9cb1027
entry point.
* commit 'af5aa1f4ce7eecc1b47a4c038cebb67d33f08f18':
Don't treat dvmJitToPatchPredictedChain as a Jit-to-Interp entry point.