tests: Add test infrastructure as a Python module
Review Request #3461 - Created April 28, 2016 and submitted
| Information | |
|---|---|
| Submitter | Andreas Sandberg |
| Repository | gem5 |
| Branch | default |
| Reviewers | Default |
Changeset 11472:23a195434229
---------------------------
tests: Add test infrastructure as a Python module
Implement gem5's test infrastructure as a Python module and a run
script that can be used without scons. The new implementation has
several features that were lacking from the previous test
infrastructure such as support for multiple output formats, automatic
runtime tracking, and better support for being run in a cluster
environment.
Tests consist of one or more steps (TestUnit). Units are run in two
stages, the first a run stage and then a verify stage. Units in the
verify stage are automatically skipped if any unit run stage wasn't
run. The library currently contains TestUnit implementations that run
gem5, diff stat files, and diff output files.
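The two-stage unit model described above can be sketched roughly as follows. This is an illustrative sketch, not the actual module API: the class and attribute names (`TestUnit`, `VerifyUnit`, `ran`, `skipped`) are hypothetical.

```python
# Hypothetical sketch of the run/verify two-stage unit model.
# A verify-stage unit is skipped automatically if the run-stage
# unit it depends on never ran.
class TestUnit:
    def __init__(self, name):
        self.name = name
        self.ran = False

    def run(self):
        self.ran = True


class VerifyUnit(TestUnit):
    """Verify-stage unit that checks its run-stage dependency first."""
    def __init__(self, name, run_unit):
        super().__init__(name)
        self.run_unit = run_unit
        self.skipped = False

    def run(self):
        if not self.run_unit.ran:
            # Run stage never executed: skip rather than fail.
            self.skipped = True
            return
        super().run()
```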
Existing tests are implemented by the ClassicTest class and "just
work". New tests that don't rely on the old "run gem5 once and
diff output" strategy can be implemented by subclassing the Test base
class or ClassicTest.
Test results can be output in multiple formats. The module currently
supports JUnit, text (short and verbose), and Python's pickle
format. JUnit output allows CI systems to automatically get more
information about test failures. The pickled output contains all state
necessary to reconstruct a test results object and is mainly intended
for the build system and CI systems.
Many JUnit parsers assume that test suite names look like Java
package names, while we currently use path-like names with slashes
separating components. Test names are therefore translated according
to these rules:
* '.' -> '-'
* '/' -> '.'
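Applying those two rules in order gives a translation like the following sketch (the function name is illustrative, not taken from the patch):

```python
def junit_name(test_name):
    # Replace '.' first so dots already present in the test name
    # aren't confused with the separators introduced by the second
    # step, then turn path separators into package-style dots.
    return test_name.replace('.', '-').replace('/', '.')
```

For example, `quick/se/00.hello/arm/linux/simple-timing` becomes `quick.se.00-hello.arm.linux.simple-timing`.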
The test tool, tests.py, supports the following features:
* Test listing. Example: ./tests.py list arm/quick
* Running tests. Example:
./tests.py run -o output.pickle --format pickle \
../build/ARM/gem5.opt \
quick/se/00.hello/arm/linux/simple-timing
* Displaying pickled results. Example:
./tests.py show --format summary *.pickle
Change-Id: I527164bd791237aacfc65e7d7c0b67b695c5d17c
Signed-off-by: Andreas Sandberg <andreas.sandberg@arm.com>
Reviewed-by: Curtis Dunham <curtis.dunham@arm.com>
Overall, this looks pretty solid - much more like a "test harness" than the existing regressions. Thanks!
Given that this change is already pretty significant, are there other changes we should introduce or plan to make? For example, below I suggest clarifying test naming conventions.
Also, I can see how this change might make it easier for users to add and run their own tests, but only within the existing cumbersome tests naming/directory structure. Can we make it a little clearer how to add tests with a different naming/directory structure? Ultimately, something I'd really like is the ability to add a named group of tests that validate subsets of functionality (e.g. current "regressions" are such a group, or I'd like to have a group of tests for both cache coherence and memory consistency for different ISAs, CPU and/or GPU, and different Ruby coherence protocols).
-
tests/testing/helpers.py (Diff revision 1) -
Can you add a comment description here that describes that this is an example test for the ProcessHelper?
-
tests/testing/tests.py (Diff revision 1) -
In general, I've found the tests directory naming structure to be confusing and underdocumented. Can we add some clear comments here and make these test character names more precise?
For instance, "category" is either "quick" or "long"; it's confusing to call these test categories, since they just appear to describe the duration of the test. Maybe "duration" instead? Can you document roughly what "quick" and "long" should be?
The "mode" name is ambiguous, since gem5 code uses "mode" to describe numerous different things (e.g. full-system vs. syscall emulation mode, memory modes, system/user modes, etc.). Maybe "syscall mode" instead?
I've always found it strange to just use "benchmark" as well, since many tests come from different suites... I'm not sure what would be better here.
Finally, the name "config" should reflect that it sufficiently describes the simulated system. In general, it includes a platform name (e.g. tsunami for ALPHA or pc for X86), whether there are multiple systems, the type of the CPU cores, and the memory mode. Is there a reason that we separate out the ISA or even the OS? Maybe "system config" instead? Can you add comments with this detail?
-
tests/tests.py (Diff revision 1) -
I think this abstract filter is going to be very useful... Unfortunately, I tried running this script and changing this (these) argument, but I'm still not sure I understand how it works.
Please make sure to add notes about this argument in the comments.
-
tests/tests.py (Diff revision 1) -
Can you add a description here that is similar to the patch description? I played around with this patch, and there were some things I couldn't figure out just from the help messages (e.g. see note above).
Diff: Revision 2 (+1341)
Thanks for the updates. LGTM.
If you're willing, I still think it would be good to add comments about what might need to be changed in these files in order to add new tests.
Hi Andreas, I had a couple questions/clarifications about this patch. The idea is that this script will run the test/test suites in an ordered fashion but doesn't contain any infrastructure to launch jobs in a distributed environment, correct? You mentioned that these work in a CI environment and can be reproduced using only the .pickle files, but it seems that actually job submission is left to users.
Also, the reporting is via output files that could be scanned by some other process, but it doesn't actively push results, do I understand that correctly?
Lastly, it seems that this is intended to be used once a build or builds have finished, but wouldn't be responsible for building or updating builds, right?
It looks like a nice improvement and I just want to be sure I understand the scope and capabilities correctly.
