Author Topic: Performance Comparison WeiDU 243 - 245  (Read 973 times)

Offline StefanO

  • Planewalker
  • *****
  • Posts: 8
Performance Comparison WeiDU 243 - 245
« on: May 14, 2018, 01:19:00 AM »
I ran a couple of performance tests on a Mac.

Hardware: Up-To-Date iMac, SSD
BG2EE: fresh installed v2.3 with no other mods installed
stratagems v30 + hotfixes (only the first initialisation was measured)

32bit WeiDU v243 : 1.5m
32bit WeiDU v244 : 14.5m
32bit WeiDU v245 : 1.5m
64bit WeiDU v245 : 8.5m

Something is seriously wrong with the 244 binary. The 32 bit v245 is as fast as the v243 binary.

Edit by Wisp: I took the liberty of changing the title.
« Last Edit: May 18, 2018, 11:53:50 AM by Wisp »

Offline Argent77

  • Planewalker
  • *****
  • Posts: 165
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #1 on: May 14, 2018, 03:56:12 AM »
As I have already explained in this topic, the macOS binary for WeiDU 244 was built with experimental bytecode emulation because I upgraded my macOS machine recently. It was the only viable option to provide a binary that was compatible with both modern 64-bit systems and older 32-bit systems, but it resulted in degraded performance.

For WeiDU 245 I was able to temporarily reactivate my old macOS machine to build the 32-bit binary (without performance issues) and the new machine for building the 64-bit binary. I will probably have to stop providing 32-bit versions altogether when WeiDU's codebase is upgraded to conform to more recent OCaml compilers.

Offline StefanO

  • Planewalker
  • *****
  • Posts: 8
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #2 on: May 14, 2018, 04:11:21 AM »
... experimental bytecode emulation ...

Well, that performance mystery is explained.

Thank you very much for your continuous macOS support.

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 986
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #3 on: May 14, 2018, 11:06:28 AM »
I would be more concerned over that >5x difference between 32-bit and 64-bit, given that Apple have stated that they'll be gradually removing multiarch-support and that all programs should be moving to 64-bit binaries. That said, I suspect the performance profiles are broadly similar regardless of OS, so we're pretty much all in the same boat. You might want to test some other mods. SCS may not be representative of the larger corpus.

Offline Argent77

  • Planewalker
  • *****
  • Posts: 165
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #4 on: May 14, 2018, 11:30:32 AM »
I can't speak for the macOS version, but testing the 64-bit WeiDU binary for Windows revealed performance issues when mods had to perform a great number of in-memory operations. It was especially noticeable with SCS, which performed about 20-25% worse than the 32-bit version, and to a lesser degree with EET and Wheels of Prophecy. The majority of mods showed no difference however, or were even faster with the 64-bit binary.
« Last Edit: May 14, 2018, 11:31:27 AM by Argent77 »

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 986
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #5 on: May 14, 2018, 12:21:33 PM »
Yeah, I get about the same difference for SCS on Windows. StefanO, would you mind verifying your results? How is the time measured? By WeiDU in the debug file or some other way?

Offline Argent77

  • Planewalker
  • *****
  • Posts: 165
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #6 on: May 14, 2018, 12:48:51 PM »
I fear StefanO's results are more or less correct. I've made a test myself on my macOS virtual machine and get roughly the same results. I'll play a bit with the WeiDU build settings on macOS. Maybe OCaml included some bits of bytecode emulation into the binary, which could explain the performance difference.

Offline Argent77

  • Planewalker
  • *****
  • Posts: 165
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #7 on: May 14, 2018, 02:51:22 PM »
I didn't have any luck improving the performance of the 64-bit macOS binaries. All mods will be affected, although not as heavily as SCS. On average, installation time will increase by a factor of 1.5 to 2.5 compared to the 32-bit binary.

It might still be a local issue caused by my current OCaml compiler version (which I had to compile from source myself). If the WeiDU codebase can be upgraded to build on more recent OCaml releases, I could replace my current compiler with an official package, which may (or may not) work better.

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 986
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #8 on: May 15, 2018, 01:56:51 PM »
This prompted me to compare performance of the Linux builds I do (for the first time ever) and I see a similar difference between 32-bit and 64-bit. However, it seems to be a toolchain thing, as the 32-bit version is compiled on 32-bit Debian Jessie (OCaml 4.01 and old toolchain) while the 64-bit version is compiled on 64-bit Fedora 27 (OCaml 4.05 and recent-ish toolchain); if I compile a 32-bit WeiDU on 32-bit Fedora 27, it's a lot slower than the version compiled on Debian even though they are both 32-bit. Both Windows versions are compiled with modern-ish OCamls and toolchains (right?) and I'm guessing so is the 64-bit Mac version, while the 32-bit Mac version is probably from an old-ish toolchain (right?), same as the performant 32-bit Linux version. Next, I'll be setting up a 64-bit Debian Jessie (if I can) to attempt to see if that produces a performant 64-bit WeiDU. If so, I guess I'll have to check with the OCaml people if they know what's going on.

Edit: am I and my old Sandy Bridge Windows system the only ones to get ~500-600 seconds to install Initialise with Windows WeiDU? Like StefanO, I get ~90 seconds with the good 32-bit Linux WeiDU.
« Last Edit: May 15, 2018, 02:21:47 PM by Wisp »

Offline Argent77

  • Planewalker
  • *****
  • Posts: 165
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #9 on: May 15, 2018, 03:16:03 PM »
You could be right. The 32-bit macOS binary is compiled with OCaml 3.12.1, while the 64-bit binaries for macOS and Windows were both compiled with OCaml 4.05.

Your Windows timings aren't too far off either. I'm also getting timings of around 400-500 seconds for SCS init on Windows (with both 32/64 bit variants). Out of curiosity I have installed the same mod component with WeiDU 237 that came with the original SCS package, and it finished installation after about 50 seconds! There is definitely something not right with more recent OCaml versions.

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 986
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #10 on: May 15, 2018, 05:01:01 PM »
Yeah, I get about 80 seconds with v237, as well as with v245 compiled with OCaml 4.01.00 and 4.02.3. I guess I'll be trying out the different OCaml versions between 4.02 and 4.05 to see if it's one of them that's doing it. It could be a gcc thing or something else, too, although I suppose gcc and the rest are unlikely compared to OCaml's compiler.
« Last Edit: May 15, 2018, 05:17:54 PM by Wisp »

Offline subtledoctor

  • Planewalker
  • *****
  • Posts: 96
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #11 on: May 16, 2018, 01:47:30 PM »
EDIT - sorry, wrong thread, nothing to see here...

Offline StefanO

  • Planewalker
  • *****
  • Posts: 8
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #12 on: May 16, 2018, 01:54:45 PM »
The 64 bit v245 binary is now just as fast as the 32 bit binary. Thanks.

What OCaml compiler version did you use?

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 986
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #13 on: May 16, 2018, 03:38:25 PM »
I compile with 4.02 or 4.01. Argent77 said 4.03 exhibited a smaller performance penalty (than 4.05, which is what we used to build the affected binaries) and that it went off the cliff after that. I'll confirm and bisect.
« Last Edit: May 16, 2018, 03:47:33 PM by Wisp »

Offline Argent77

  • Planewalker
  • *****
  • Posts: 165
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #14 on: May 16, 2018, 03:58:20 PM »
64-bit macOS and Windows binaries were built with OCaml 4.03. 4.02 and earlier may be slightly faster (less than 10%), but I have noticed it only by inspecting the timings from the SCS logs.

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 986
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #15 on: May 16, 2018, 05:16:30 PM »
Heh, on 64-bit Linux, 4.03.0 reduced install time by about 5 % compared to 4.02.3 (obviously this does not mean it can't be different on other systems). I'll continue with later versions tomorrow.
« Last Edit: May 16, 2018, 05:26:01 PM by Wisp »

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 986
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #16 on: May 17, 2018, 04:07:01 PM »
Seems to be caused by a change to the implementation of hash tables. From the change log of 4.04.0:
Quote
Optimize Hashtbl by using in-place updates of its internal bucket lists. All operations run in constant stack size and are usually faster, except Hashtbl.copy which can be much slower
(And the bisect stops at seemingly related commits.) WeiDU does a number of Hashtbl.copys, notably any time variable scopes change (e.g., functions). OCaml probably would not accept this as a regression, given that the change seems to be intentional.
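For readers who do not know the pattern, here is a minimal sketch of what "copy the variable table on scope change" looks like. This is not WeiDU's actual code; `call_with_copied_scope`, `globals` and the variable names are made up for illustration. Only `Hashtbl.create`, `Hashtbl.replace`, `Hashtbl.find`, `Hashtbl.mem` and `Hashtbl.copy` are real stdlib calls.

```ocaml
(* Hypothetical model of a scoped variable environment, not WeiDU's code. *)
let call_with_copied_scope
    (parent : (string, string) Hashtbl.t)
    (body : (string, string) Hashtbl.t -> 'a) : 'a =
  (* Every call pays for one Hashtbl.copy of the whole parent scope. *)
  let local = Hashtbl.copy parent in
  body local

(* A parent scope holding one variable. *)
let globals : (string, string) Hashtbl.t = Hashtbl.create 16
let () = Hashtbl.replace globals "x" "1"

(* The body can read and overwrite "x" and add "y" in its own copy. *)
let seen_inside =
  call_with_copied_scope globals (fun env ->
      Hashtbl.replace env "x" "2";
      Hashtbl.replace env "y" "3";
      Hashtbl.find env "x")

(* The parent scope is untouched after the call returns. *)
let parent_x = Hashtbl.find globals "x"
let parent_has_y = Hashtbl.mem globals "y"
```

The copy is what gives the body its own view of the variables, and it is also why the cost of `Hashtbl.copy` is paid once per call.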
« Last Edit: May 17, 2018, 04:08:26 PM by Wisp »

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 986
Re: Performance Comparison WeiDU 243 - 245 on a Mac
« Reply #17 on: May 18, 2018, 11:48:19 AM »
Yeah, so these are the (most significant) timings of Initialise on my machine with WeiDU 245+4.03.0:
Code: [Select]
COPY                             1.209
READ_*                           4.036
eval_pe                          8.723
process_patch2                  18.244
function overhead               70.090
TOTAL                          104.303
Pretty grim, but whatever. Then we try 245+4.04.0:
Code: [Select]
COPY                             1.894
READ_*                           6.195
eval_pe                         18.201
process_patch2                  32.231
function overhead              643.688
TOTAL                          704.793
Yeah...
Breaking out the Hashtbl.copy that is used to set up the in-function variable environment (still 245+4.04.0):
Code: [Select]
COPY                             1.849
READ_*                           6.232
eval_pe                         18.207
process_patch2                  31.968
function overhead              125.434
copying hash tables            555.494
TOTAL                          741.566
So we've found the culprit.

It seems like a compelling case could be made for staying on 4.03.0 for the time being. You could possibly work around the issue with Hashtbl.copy by changing functions to enter into an empty variable scope, rather than one copied from the function's parent scope, but it'd be a breaking change, so it'd have to be opt-in (say, TP2 flag). But there is also the matter of 245+4.04.0 being slower overall than 245+4.03.0, not just on function overhead, but stuff like eval_pe and process_patch2, though maybe that time is from one or more of the other Hashtbl.copies found in WeiDU.
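A rough sketch of the two options, for anyone who wants to measure it on their own compiler. All names here (`make_env`, `enter_copied`, `enter_empty`, `time_it`) are hypothetical and the table sizes are arbitrary; this is not how WeiDU is structured, and the timings will depend on the OCaml version and the machine.

```ocaml
(* Build a parent scope with n variables; purely illustrative. *)
let make_env n =
  let t = Hashtbl.create n in
  for i = 0 to n - 1 do
    Hashtbl.replace t ("var" ^ string_of_int i) i
  done;
  t

(* Option A: current behaviour, the function body gets a copy of the parent. *)
let enter_copied parent = Hashtbl.copy parent

(* Option B: the opt-in idea, the function body starts from an empty scope. *)
let enter_empty () : (string, int) Hashtbl.t = Hashtbl.create 16

(* Print the CPU time used by f. *)
let time_it label f =
  let t0 = Sys.time () in
  f ();
  Printf.printf "%-22s %.3f s\n" label (Sys.time () -. t0)

let parent = make_env 1_000

let () =
  time_it "copied parent scope" (fun () ->
      for _i = 1 to 1_000 do ignore (enter_copied parent) done);
  time_it "fresh empty scope" (fun () ->
      for _i = 1 to 1_000 do ignore (enter_empty ()) done)

(* Why option B is a breaking change: parent variables are not visible. *)
let copied = enter_copied parent
let empty = enter_empty ()
let copied_sees_parent = Hashtbl.mem copied "var0"
let empty_sees_parent = Hashtbl.mem empty "var0"
```

The last two values show the compatibility problem: code that relies on seeing a parent variable inside a function finds it in the copied scope but not in the empty one.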

Offline Argent77

  • Planewalker
  • *****
  • Posts: 165
Re: Performance Comparison WeiDU 243 - 245
« Reply #18 on: May 18, 2018, 12:40:01 PM »
I'd rather not see WeiDU internals made visible to the user/modder (via TP2 flag or otherwise). It would be a hack at best, and difficult to get rid of after fixing the root cause. Is there any way to replace the Hashtbl.copy calls (without too much effort)?

aqrit

  • Guest
Re: Performance Comparison WeiDU 243 - 245
« Reply #19 on: May 19, 2018, 11:52:27 AM »
Disclaimer: Not knowing anything about anything...
I wonder if `Hashtbl.randomize` would help here?
This behavior reminds me of an "accidentally quadratic" issue that Rust had when resizing hash tables.

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 986
Re: Performance Comparison WeiDU 243 - 245
« Reply #20 on: May 21, 2018, 10:46:38 AM »
Quote
I'd rather not see WeiDU internals made visible to the user/modder (via TP2 flag or otherwise). It would be a hack at best, and difficult to get rid of after fixing the root cause. Is there any way to replace the Hashtbl.copy calls (without too much effort)?
I don't see it as making the internals visible. It's (probably) an optimisation that breaks current functionality. Creating a new hash table is (probably) always going to be faster than copying an existing one.

I could also lift the implementation of Hashtbl from OCaml 4.03.0 and include it in WeiDU's source tree. It seems to work well; performance of WeiDU+4.06.0 does not differ significantly from that of plain +4.03.0.

I'm trying out some ideas for implementations that don't use Hashtbl.

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 986
Re: Performance Comparison WeiDU 243 - 245
« Reply #21 on: May 21, 2018, 10:51:39 AM »
Quote
Disclaimer: Not knowing anything about anything...
I wonder if `Hashtbl.randomize` would help here?
This behavior reminds me of an "accidentally quadratic" issue that Rust had when resizing hash tables.
Hashtbl is implemented on top of Array, which is itself largely implemented in C. Up until 4.04.0, Hashtbl was an array of primitives and copying one was as simple as copying the array (which is probably just a bit of C for allocating and copying memory). After that, the implementation of Hashtbl was changed to be an array of records, and to copy it, you need to walk the array and copy each record, in addition to copying the array structure itself. The latter implementation is simply more work.
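To make the difference concrete, here is a simplified model of the two layouts. It is only loosely based on the stdlib and is not the real `Hashtbl` source; the type and function names (`old_bucket`, `new_bucket`, `copy_old`, `copy_new`, `copy_cells`) are invented for this sketch. It needs OCaml 4.03 or later because of the inline record.

```ocaml
(* Old style: bucket lists are immutable, so a copy can share them. *)
type ('a, 'b) old_bucket =
  | OEmpty
  | OCons of 'a * 'b * ('a, 'b) old_bucket

(* Copying is just copying the array of bucket heads. *)
let copy_old (buckets : ('a, 'b) old_bucket array) = Array.copy buckets

(* New style: cells are mutable and updated in place, so they cannot be
   shared between a table and its copy. *)
type ('a, 'b) new_bucket =
  | NEmpty
  | NCons of { mutable key : 'a;
               mutable data : 'b;
               mutable next : ('a, 'b) new_bucket }

(* Duplicate every cell of one bucket list. Plain recursion is used here
   for brevity, so unlike the real stdlib it is not constant-stack. *)
let rec copy_cells = function
  | NEmpty -> NEmpty
  | NCons { key; data; next } -> NCons { key; data; next = copy_cells next }

(* Copying now means walking the array and every list hanging off it. *)
let copy_new buckets = Array.map copy_cells buckets

let old_tbl = [| OCons ("x", 1, OEmpty); OEmpty |]
let new_tbl = [| NCons { key = "x"; data = 1; next = NEmpty }; NEmpty |]
let old_copy = copy_old old_tbl
let new_copy = copy_new new_tbl

(* Mutating the original must not leak into the copy. *)
let () = match new_tbl.(0) with NCons c -> c.data <- 99 | NEmpty -> ()
let copy_data = match new_copy.(0) with NCons c -> c.data | NEmpty -> -1
```

In the old layout the copy and the original point at the very same bucket lists, which is why copying was cheap. In the new layout every cell is allocated again, so the cost grows with the number of bindings in the table.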

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 986
Re: Performance Comparison WeiDU 243 - 245
« Reply #22 on: May 22, 2018, 05:25:06 PM »
Quote
I'm trying out some ideas for implementations that don't use Hashtbl.
Well, with the way LOCAL variables and macros work, any effort is more or less doomed to re-implement Hashtbl, so I'll just use the one from 4.03.0.

 
