gcc: Enable Link-Time Optimization for gcc >= 4.6

Information
Submitter:	Andreas Hansson
Repository:	gem5
Branch:	default
Bugs:
Depends On:
Reviewers
Groups:	Default
People:

Description

Changeset 9240:e47a0faa1ad0
---------------------------
gcc: Enable Link-Time Optimization for gcc >= 4.6

This patch adds Link-Time Optimization when building the fast target
using gcc >= 4.6, and adds a scons flag to disable it (--no-lto). No
check is performed to guarantee that the linker supports LTO and use
of the linker plugin, so the user has to ensure that binutils GNU ld
>= 2.21 or the gold linker is available. Typically, if gcc >= 4.6 is
available, the latter should not be a problem. Currently the LTO
option is only useful for gcc >= 4.6, due to the limited support on
clang and earlier versions of gcc. The intention is to also add
support for clang once the LTO integration matures.
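
Since the patch itself performs no linker check, a user may want to verify the requirement by hand. The sketch below is a heuristic, not part of the patch: it takes the first line printed by `ld --version` and guesses whether the linker meets the "GNU ld >= 2.21 or gold" requirement stated above. The function name `linker_supports_lto` and the parsing rule (use the last token that looks like a version number) are assumptions, and vendor-specific version strings may defeat it.

```python
import re


def linker_supports_lto(version_line):
    """Guess LTO plugin support from the first line of `ld --version`.

    Hypothetical helper: gold is accepted outright, otherwise the last
    whitespace-separated token that starts with "major.minor" is compared
    against the 2.21 minimum quoted in the patch description.
    """
    if "gold" in version_line:
        return True
    for token in reversed(version_line.split()):
        match = re.match(r"(\d+)\.(\d+)", token)
        if match:
            return (int(match.group(1)), int(match.group(2))) >= (2, 21)
    # No recognisable version number: be conservative.
    return False
```

The first line of `ld --version` could be captured with `subprocess.check_output(["ld", "--version"])` and passed to this helper.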

The same number of jobs is used for the parallel phase of LTO as the
jobs specified on the scons command line, using the -flto=n flag that
was introduced with gcc 4.6. The gold linker also supports concurrent
and incremental linking, but this is not used at this point.
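
As a rough illustration of the behaviour described in this description, the sketch below derives compile and link flags from the gcc version, the build target, and the scons job count. It is not the code from SConstruct or src/SConscript: the function name `lto_flags`, its parameters, and the choice to pass `-fuse-linker-plugin` at link time are assumptions made for the example (`-flto=n` and `-fuse-linker-plugin` are real gcc options).

```python
def lto_flags(gcc_version, num_jobs, target, no_lto=False):
    """Return (ccflags, ldflags) for a build target.

    Hypothetical helper: gcc_version is a tuple such as (4, 6), num_jobs
    is the -j value given to scons, and no_lto mirrors the flag that
    disables LTO.
    """
    # LTO is only applied to the fast target, only for gcc >= 4.6,
    # and only if the user has not opted out.
    if no_lto or target != "fast" or gcc_version < (4, 6):
        return [], []
    # Reuse the scons job count for the parallel phase of LTO.
    flag = "-flto=%d" % num_jobs
    return [flag], [flag, "-fuse-linker-plugin"]
```

For example, `lto_flags((4, 6), 8, "fast")` yields `-flto=8` for both compiling and linking, while any other target, an older gcc, or `no_lto=True` yields empty flag lists.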

The compilation and linking time is increased by almost 50% on
average, although ARM seems to be particularly demanding with an
increase of almost 100%. Also beware when using this as gcc uses a
tremendous amount of memory and temp space in the process. You have
been warned.

After some careful consideration, and plenty of discussion, the flag is
only added to the fast target, and the warning that was issued in an
earlier version of this patch is now removed. Similarly, where the flag
previously enabled LTO, LTO is now on by default and the flag has been
modified to disable it. The rationale behind this decision is that
opt is used for development, whereas fast is only used for long runs,
e.g. regressions or more elaborate experiments where the additional
compile and link time is amortized by a much larger run time.

When it comes to the return on investment, the regression seems to be
roughly 15% faster with LTO. For a bit more detail, I ran twolf on
ARM.fast, with three repeated runs, and they all finish within 42
minutes (+- 25 seconds) without LTO and 31 minutes (+- 25 seconds)
with LTO, i.e. LTO gives an impressive >25% speed-up for this case.

Without LTO (ARM.fast twolf)

real	42m37.632s
user	42m34.448s
sys	0m0.390s

real	41m51.793s
user	41m50.384s
sys	0m0.131s

real	41m45.491s
user	41m39.791s
sys	0m0.139s

With LTO (ARM.fast twolf)

real	30m33.588s
user	30m5.701s
sys	0m0.141s

real	31m27.791s
user	31m24.674s
sys	0m0.111s

real	31m25.500s
user	31m16.731s
sys	0m0.106s
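
The ">25%" figure can be checked directly from the `real` times listed above. The short script below averages the three runs of each configuration and reports both the reduction in run time and the equivalent speed-up ratio. The helper names are illustrative only; the timings are copied from the listing.

```python
import re


def to_seconds(stamp):
    """Convert a `time` style stamp such as '42m37.632s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    return int(match.group(1)) * 60 + float(match.group(2))


def mean_seconds(stamps):
    values = [to_seconds(s) for s in stamps]
    return sum(values) / len(values)


# 'real' times for ARM.fast twolf, taken from the runs above.
without_lto = ["42m37.632s", "41m51.793s", "41m45.491s"]
with_lto = ["30m33.588s", "31m27.791s", "31m25.500s"]

base = mean_seconds(without_lto)
lto = mean_seconds(with_lto)

reduction = 1.0 - lto / base   # fraction of run time saved
speedup = base / lto           # how many times faster

print("mean without LTO: %.1f s" % base)
print("mean with LTO:    %.1f s" % lto)
print("run time reduction: %.1f%%" % (100 * reduction))
print("speed-up ratio:     %.2fx" % speedup)
```

The mean run time drops by roughly 26%, which corresponds to a speed-up ratio of about 1.35x.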

Testing Done

util/regress all passing (disregarding t1000 and eio)

Issue Summary

Description	From	Last Updated	Status
I don't think I want to be nagged about this. Does it make sense just to turn it on by ...	Steve Reinhardt	July 2, 2012, 12:30 a.m.	Open

I don't think this is really a problem, but you mention that it could work with debug, yet you don't include the LTO flags in the debug binary.

Andreas Hansson May 2, 2012, 10:04 p.m. (May 2, 2012, 10:04 p.m.)

Well spotted!

It seems I spoke too soon. For some reason, when adding the link flags to the debug target, some regressions fail with rather mysterious aborts, segfaults, etc. Not many, but e.g. memtest for ALPHA and realview-timing for ARM. It works absolutely fine for fast, opt and prof though.

Description:

	+	Changeset 9086:477273b9722c
	+
		gcc: Enable Link-Time Optimization for gcc >= 4.6

		This patch adds a scons flag to indicate that compilation and linking
~		should be done using LTO. No check is used to guarantee that the
~		linked supports LTO, so the user has to ensure that binutils GNU ld >=
~		2.21 or the gold linker is available.
	~	should be done using LTO. No check is performed to guarantee that the
	~	linker supports LTO and use of the linker plugin, so the user has to
	~	ensure that binutils GNU ld >= 2.21 or the gold linker is available.

		The same number of jobs is used for the parallel phase of LTO as the
		jobs specified on the scons command line, using the -flto=n flag that
~		was introduced with gcc 4.6.
	~	was introduced with gcc 4.6. Supposedly the gold linker also supports
	+	concurrent and incremental linking, but this is not used at this
	+	point.

		Currently the LTO option is only useful for gcc >= 4.6, due to the
		limited support on clang and earlier versions of gcc. The intention is
		to also add support for clang once the LTO integration matures. The
		use of LTO is independent of the target, i.e. debug, opt, fast and
~		pro, although opt and fast are the most likely candidates.
	~	prof, although opt and fast are the most likely candidates.
	+
	+	The compilation and linking time is increased by almost 50% on
	+	average, although ARM seems to be particularly demanding with an
	+	increase of almost 100%. Also beware when using this as gcc uses a
	+	tremendous amount of memory and temp space in the process. You have
	+	been warned.
	+
	+	When it comes to the return on investment, the regression seems to be
	+	roughly 15% faster with LTO. For a bit more detail, I ran twolf on
	+	ARM.fast, with three repeated runs, and they all finish within 42
	+	minutes (+- 25 seconds) without LTO and 31 minutes (+- 25 seconds)
	+	with LTO, i.e. LTO gives an impressive >25% speed-up for this case.
	+
	+	Without LTO (ARM.fast twolf)
	+
	+	real 42m37.632s
	+	user 42m34.448s
	+	sys 0m0.390s
	+
	+	real 41m51.793s
	+	user 41m50.384s
	+	sys 0m0.131s
	+
	+	real 41m45.491s
	+	user 41m39.791s
	+	sys 0m0.139s
	+
	+	With LTO (ARM.fast twolf)
	+
	+	real 30m33.588s
	+	user 30m5.701s
	+	sys 0m0.141s
	+
	+	real 31m27.791s
	+	user 31m24.674s
	+	sys 0m0.111s

~		The compilation and linking (wall) time is increased by almost 50% for
~		the observed cases, and simulation performance improves by roughly
~		15%, i.e. for long regression runs it really helps.
	~	real 31m25.500s
	~	user 31m16.731s
	~	sys 0m0.106s

Diff:

Revision 2 (+59 -25)

	SConstruct
	src/SConscript

Description:

~		Changeset 9086:477273b9722c
	~	Changeset 9086:faeddb9fb678

		gcc: Enable Link-Time Optimization for gcc >= 4.6

		This patch adds a scons flag to indicate that compilation and linking
		should be done using LTO. No check is performed to guarantee that the
		linker supports LTO and use of the linker plugin, so the user has to
		ensure that binutils GNU ld >= 2.21 or the gold linker is available.

		The same number of jobs is used for the parallel phase of LTO as the
		jobs specified on the scons command line, using the -flto=n flag that
		was introduced with gcc 4.6. Supposedly the gold linker also supports
		concurrent and incremental linking, but this is not used at this
		point.

		Currently the LTO option is only useful for gcc >= 4.6, due to the
		limited support on clang and earlier versions of gcc. The intention is
		to also add support for clang once the LTO integration matures. The
		use of LTO is independent of the target, i.e. debug, opt, fast and
		prof, although opt and fast are the most likely candidates.

		The compilation and linking time is increased by almost 50% on
		average, although ARM seems to be particularly demanding with an
		increase of almost 100%. Also beware when using this as gcc uses a
		tremendous amount of memory and temp space in the process. You have
		been warned.

		When it comes to the return on investment, the regression seems to be
		roughly 15% faster with LTO. For a bit more detail, I ran twolf on
		ARM.fast, with three repeated runs, and they all finish within 42
		minutes (+- 25 seconds) without LTO and 31 minutes (+- 25 seconds)
		with LTO, i.e. LTO gives an impressive >25% speed-up for this case.

		Without LTO (ARM.fast twolf)

		real 42m37.632s
		user 42m34.448s
		sys 0m0.390s

		real 41m51.793s
		user 41m50.384s
		sys 0m0.131s

		real 41m45.491s
		user 41m39.791s
		sys 0m0.139s

		With LTO (ARM.fast twolf)

		real 30m33.588s
		user 30m5.701s
		sys 0m0.141s

		real 31m27.791s
		user 31m24.674s
		sys 0m0.111s

		real 31m25.500s
		user 31m16.731s
		sys 0m0.106s

Diff:

Revision 3 (+59 -25)

	SConstruct
	src/SConscript

Thanks... this is a pretty impressive speedup.

You should mention in the commit message that you reworked how flags are handled in order to make it easier to create this option.  At first I was thinking "gee, all this cleanup of the arg lists should go in a separate patch", then I realized that you actually needed to do that to have an independent flag.

Andreas Hansson July 2, 2012, 12:33 a.m. (July 2, 2012, 12:33 a.m.)
```
Will do
```

SConstruct (Diff revision 3)

I don't think I want to be nagged about this.  Does it make sense just to turn it on by default for opt, fast, and prof (or maybe just fast), then have a --no-lto or --lto=false option to disable it?

Andreas Hansson July 2, 2012, 12:33 a.m. (July 2, 2012, 12:33 a.m.)

Given the extra time needed to link, I'm inclined to leave it as a default-off option. An intermediate step would be to base the choice on an environment variable if present...

Steve Reinhardt July 2, 2012, 12:47 a.m. (July 2, 2012, 12:47 a.m.)

That's a reasonable concern.  I'm mostly concerned about the nag message; it will be annoying to see it every time I compile if I've already decided not to use lto, particularly in a situation like debug where I almost certainly don't want it.  But that leaves the question of getting people to use lto if it's not on by default and there's no message.

One counter-argument for you is that people compiling gem5.fast have already indicated that they want as-fast-as-possible execution at the expense of everything else, so it seems reasonable to me to enable lto there by default when it applies.  By extension, opt has so far been seen as "fast with seatbelts" (in my mind anyway), so that argues that lto should be on there too.  And if you're profiling, you really want to get profile information for the fastest compilation, otherwise you could be looking at artificial bottlenecks.  So it's a slippery slope, I admit.

I'm not convinced one way or another myself, just trying to reason through the issue.

There are hybrid answers possible too, like turn it on by default only in fast and prof, remind that it's an option in opt, but say nothing in debug.  I'm not sure how easy it is to code that up in the sconscript though.

Or if you feel strongly about how you're doing it now, you can tell me I'm a wimp for getting annoyed at repetitive output messages...

Andreas Hansson July 2, 2012, 1:01 a.m. (July 2, 2012, 1:01 a.m.)

It all makes sense, and I suppose the warning will get rather annoying in the long run. Perhaps go for the hybrid as you propose. The warning message could be moved to the src/SConscript where the ccflags and ldflags are being populated.

Anyone else got any suggestions?

Ali Saidi July 2, 2012, 1:03 a.m. (July 2, 2012, 1:03 a.m.)

Both of these were my fault. The issue is that GCC 4.6, O3 lto takes 45 minutes to link; so you really have to want it. If you're running 10 hour long jobs it quickly becomes worth it, but otherwise it's a bit excessive.
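
To put the trade-off in numbers, the sketch below estimates how long a simulation has to run before LTO pays for itself. It rests on two assumptions that are not established in this thread: that the full 45 minutes counts as extra build cost, and that the roughly 25% run-time reduction measured for twolf on ARM.fast carries over to the workload in question. The function names are hypothetical.

```python
def break_even_run_minutes(extra_build_minutes, runtime_reduction):
    """Run time without LTO at which the extra build cost is paid back.

    runtime_reduction is the fraction of run time saved by LTO,
    e.g. 0.25 for a 25% shorter run.
    """
    if not 0.0 < runtime_reduction < 1.0:
        raise ValueError("runtime_reduction must be between 0 and 1")
    return extra_build_minutes / runtime_reduction


def net_saving_minutes(run_minutes, extra_build_minutes, runtime_reduction):
    """Minutes saved overall (negative if LTO costs more than it saves)."""
    return run_minutes * runtime_reduction - extra_build_minutes


# Assumed inputs: 45 extra minutes of linking, 25% shorter runs.
print(break_even_run_minutes(45, 0.25))        # break-even run length
print(net_saving_minutes(10 * 60, 45, 0.25))   # a 10-hour job
print(net_saving_minutes(60, 45, 0.25))        # a 1-hour job
```

Under these assumptions the break-even point is a three-hour run: a 10-hour job comes out 105 minutes ahead, whereas a one-hour job loses 30 minutes.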

Description:

~		Changeset 9086:faeddb9fb678
	~	Changeset 9240:e47a0faa1ad0

		gcc: Enable Link-Time Optimization for gcc >= 4.6

~		This patch adds a scons flag to indicate that compilation and linking
~		should be done using LTO. No check is performed to guarantee that the
~		linker supports LTO and use of the linker plugin, so the user has to
~		ensure that binutils GNU ld >= 2.21 or the gold linker is available.
	~	This patch adds Link-Time Optimization when building the fast target
	~	using gcc >= 4.6, and adds a scons flag to disable it (--no-lto). No
	~	check is performed to guarantee that the linker supports LTO and use
	~	of the linker plugin, so the user has to ensure that binutils GNU ld
	+	>= 2.21 or the gold linker is available. Typically, if gcc >= 4.6 is
	+	available, the latter should not be a problem. Currently the LTO
	+	option is only useful for gcc >= 4.6, due to the limited support on
	+	clang and earlier versions of gcc. The intention is to also add
	+	support for clang once the LTO integration matures.

		The same number of jobs is used for the parallel phase of LTO as the
		jobs specified on the scons command line, using the -flto=n flag that
~		was introduced with gcc 4.6. Supposedly the gold linker also supports
~		concurrent and incremental linking, but this is not used at this
	~	was introduced with gcc 4.6. The gold linker also supports concurrent
	~	and incremental linking, but this is not used at this point.
-		point.
-
-		Currently the LTO option is only useful for gcc >= 4.6, due to the
-		limited support on clang and earlier versions of gcc. The intention is
-		to also add support for clang once the LTO integration matures. The
-		use of LTO is independent of the target, i.e. debug, opt, fast and
-		prof, although opt and fast are the most likely candidates.

		The compilation and linking time is increased by almost 50% on
		average, although ARM seems to be particularly demanding with an
		increase of almost 100%. Also beware when using this as gcc uses a
		tremendous amount of memory and temp space in the process. You have
		been warned.

	+	After some careful consideration, and plenty of discussion, the flag is
	+	only added to the fast target, and the warning that was issued in an
	+	earlier version of this patch is now removed. Similarly, where the flag
	+	previously enabled LTO, LTO is now on by default and the flag has been
	+	modified to disable it. The rationale behind this decision is that
	+	opt is used for development, whereas fast is only used for long runs,
	+	e.g. regressions or more elaborate experiments where the additional
	+	compile and link time is amortized by a much larger run time.
	+
		When it comes to the return on investment, the regression seems to be
		roughly 15% faster with LTO. For a bit more detail, I ran twolf on
		ARM.fast, with three repeated runs, and they all finish within 42
		minutes (+- 25 seconds) without LTO and 31 minutes (+- 25 seconds)
		with LTO, i.e. LTO gives an impressive >25% speed-up for this case.

		Without LTO (ARM.fast twolf)

		real 42m37.632s
		user 42m34.448s
		sys 0m0.390s

		real 41m51.793s
		user 41m50.384s
		sys 0m0.131s

		real 41m45.491s
		user 41m39.791s
		sys 0m0.139s

		With LTO (ARM.fast twolf)

		real 30m33.588s
		user 30m5.701s
		sys 0m0.141s

		real 31m27.791s
		user 31m24.674s
		sys 0m0.111s

		real 31m25.500s
		user 31m16.731s
		sys 0m0.106s

Diff:

Revision 4 (+33 -1)

	SConstruct
	src/SConscript

I'm good with this... 25% perf improvement is very nice.

Ship It!

Andreas Hansson Sept. 13, 2012, 1:09 a.m. (Sept. 13, 2012, 1:09 a.m.)

Can I assume that you are also ok with the preceding ones?

http://reviews.gem5.org/r/1408/

http://reviews.gem5.org/r/1409/

This change has been marked as submitted.

Status: Closed (submitted)