Pysynphot Commissioning Report
==========================

Goals and methodology
-------------------------------

The purpose of this commissioning process was to demonstrate to the
Instruments branch, and by extension other potential pysynphot users,
that the answers produced by pysynphot are at least as accurate as
those produced by synphot.

Because synphot has known deficiencies, and pysynphot was developed in
part to address those deficiencies, we began by comparing synphot and
pysynphot results, but examined cases that failed this comparison to
determine whether the differences could be explained by known
differences in the two packages. If so, the outcome was deemed
acceptable; if not, further investigation was undertaken to determine
which package was less correct, producing either a bug report and
corresponding corrective action for pysynphot, or a bug report and no
further action for synphot.


Test software
-------------

Software was developed to execute the synphot and pysynphot commands
and perform the comparisons. This software evolved over time as our
comparisons were refined, and as failed tests were traced to bugs in
the test software. The details of this effort are not included in this
report, but are recorded in the change log for the test software,
located in the subversion repository with the pysynphot code under
test/commissioning. The test sets are also stored in this repository.

Test sets
-------------

The largest test set was derived from the existing set of ETC
regression tests for ACS, NICMOS, STIS, and WFC3, as part of the
preparation for moving these ETCs to use pysynphot. (No similar set
was derived for COS because its ETC was already using pysynphot and
its regression tests had already passed.) Each ETC test case resulted
in multiple (py)synphot calls; after removing obvious duplicates in
function/obsmode/spectrum space, these became the ETC-based test set.

However, this test set was overwhelmingly large, and a smaller set was
eventually produced by selecting a representative subset for
review. 

Additional smaller test sets were defined:
   - from first principles
   - from cases with reported problems identified by INS
   - for functionality that is not exercised by the ETC, including
   more general renormalization and calcphot.effstim functionality
   - from the existing synphot regression test suite (TBD) and
   cookbook examples in the synphot manual (TBD)


Table of tests run as a function of testcase type:

              |calcphot|calcspec|countrate|specsourcerate|thermback|
Array tests:  |        |        |         |              |         |
--------------|--------|--------|---------|--------------|---------|
testthru      |   X    |        |   X     |     X        |         |
--------------|--------|--------|---------|--------------|---------|
testcrspec    |   X    |    X   |   X     |     X        |   X     |
--------------|--------|--------|---------|--------------|---------|
testcrphotlam |        |        |   X     |     X        |         |
--------------|--------|--------|---------|--------------|---------|
testcrcounts  |        |        |   X     |     X        |         |
--------------|--------|--------|---------|--------------|---------|
              |        |        |         |              |         |
Scalar tests: |        |        |         |              |         |
--------------|--------|--------|---------|--------------|---------|
testcountrate |        |        |   X     |     X        |         |
--------------|--------|--------|---------|--------------|---------|
testefflam    |   X    |        |   X     |     X        |         |
--------------|--------|--------|---------|--------------|---------|
testthermback |        |        |         |              |   X     |
--------------|--------|--------|---------|--------------|---------|

The final test set included 15701 tests. Of these, 1456 constituted
the subset that received more careful scrutiny, as did all
testcountrate and testthru tests. Additionally, any 'extreme" failures
in a given run were reviewed.

Table of test breakdown:

Total tests     15701
Repr. subset     1456
testcountrate    1769
testthru         3218


Comparing quantities
----------------------------

For each test case (synphot function/obsmode/spectrum), several
quantities were compared, depending on the synphot functionality being
tested. Initial or intermediate quantities were compared as well as
final quantities; for example, a test case involving the countrate
functionality generated tests that compared not only the final scalar
count rate, but the throughput for the specified obsmode, the spectrum
alone, and the spectrum-throughput combination, both in counts and in
photlam (the flux unit used internally by both synphot and
pysynphot). This procedure ensured that any problems could be quickly
localized and more easily investigated.

After the initial testing period, the review procedure was modified to
investigate only cases for which the throughput or count rate
comparisons failed, and review the other quantities only as part of an
ongoing investigation. This was necessary in order to reduce the
amount of work to better match available resources. (In progress.)

While comparing scalar quantities, such as the count rate and
effective wavelength, was relatively straightforward, writing tests to
compare array results from synphot and pysynphot was difficult. The
problem had two elements:

  - produce arrays to be compared on the same waveset
  - devise a comparison that was reasonably insensitive to numerical noise

   Producing arrays on the same waveset
   -----------------------------------------------------

Because pysynphot was deliberately designed to correct some of synphot's
deficiencies in waveset handling, it is clear that the spectral arrays
produced by the two software packages will be different, unless
special steps are taken to force them onto the same waveset.

testcrspec:
  This test uses the synphot.countrate task, with a box(R/2,R) where R
  is the full wavelength range over which the source spectrum is defined,
  to simulate the obsmode to produce the source spectrum unconvolved with the
  instrumental obsmode. The task uses a wavecat file to select the
  wavelength table used based on the obsmode. For this test, a special
  purpose wavecat was produced for each run, so that the wavelength
  table in the pysynphot-computed spectrum would be used.

testthru:
  This test samples the pysynphot-computed throughput array on the
  wavelength table used for the synphot-computed throughput.

testcrcounts, testcrphotlam:
  For these tests, the wavelength table from the synphot-computed
  spectrum is used as the binned wavelength set when constructing
  the pysynphot Observation.


Performing the array comparisons
----------------------------------------------

Constructing a meaningful comparison of array numerical data that will
be sensitive to significant differences, but insensitive to
insignificant differences, is a difficult task. This was the most
volatile element of the commissioning test software, as we
experimented with various approaches; details can be seen in the
revision log. This document describes the final approach used.

 - After confirming that the arrays are the same size, exclude the two
   endpoints on either end of the array from the comparison. This step
   was introduced when it was observed that the endpoints were often
   significantly discrepant due to a difference in waveset handling.

 - For each array, identify its significant elements by selecting
   elements that were greater than SIGTHRESH*the array maximum. The
   value of SIGTHRESH was tunable for test runs but is typically on
   the order of 0.01. If the two arrays have different sets of
   significant elements, a flag is set, and the test continues using
   the set of elements that were significant in the SYNPHOT array.

 - From the set of significant elements, further select the set of
   elements for which the synphot array is nonzero. Then, using this
   subset of elements, construct the discrepancy array (test-ref)/ref.

 - If any elements in the discrepancy array exceed the threshold
   THRESH (also tunable, typically set to 0.01), the test fails.

 - If more than SUPERTHRESH (typically 20%) of the elements in the
   discrepancy array exceed the threshold, the test is considered an
   "extreme failure".

 - The min, max, mean, and std of the discrepancy array were also
   reported, as were the number of 5-sigma outliers.


Identified synphot errors
--------------------------------

The following problems with synphot were either previously known, or
were identified during this commissioning process. Tests exhibiting
these problems were either deemed acceptable, or reworked to avoid the
failure so that a meaningful comparison could be performed:

- Poor performance around very narrow lines or very steep gradients
 (acceptable discrepancy)
- Spectrum differences shortward of 900A due to intersection with Vega 
 waveset (acceptable discrepancy --> modify test to avoid failing)
- Spectrum differences because synphot drops any wavelength bins
 outside the range defined by the renormalization bandpass (as above:TBD)
- Poor performance in renormalization around suddenly sparse
 wavelength tables (rework tests)
- Bug in calcspec not present in countrate (reworked tests)



Identified and repaired pysynphot errors
-----------------------------------------------------

The following problems with pysynphot were identified and fixed during
this commissioning process:

- Incorrect calculation of interpolated spectral elements (#134)
- Incorrect tapering behavior (#149)
- FITS files written to single precision by default (#146)
- Incorrect parsing of obsmode keywords that look like numbers (#125)
- Incorrect parsing of band() arguments (#126)
- Implement gal3 extinction law (#127/123) 
- Improve behavior in case of partial overlap (#157)
- Handle partial overlap cases in the etc.py interface (#158)
- Performance issues when simulating an Observation (TBD #130)
