hist - histogram csv output
I am concerned about results returned from 'hist'.
For example, for one of my images (32-bit file: core_base=0.0 core_multiplier=1.0), the following is returned from both stats and hist:
Std Deviation: 4.02613
I found the single pixel (in qview) that holds the max DN value of 2040.9, but this value is not reflected in the csv histogram table. The following is the last line of the histogram:
DN Pixels CumulativePixels Percent CumulativePercent
57.3557 4 266425 0.00150136 100
Am I misunderstanding something here?
Steps to reproduce:
I am able to reproduce the results in isis3nightly, isis3astro on astrovm1 and horace.
#1 Updated by Steven Lambright over 7 years ago
I think the program "hist" is not reporting absolute values for Minimum and Maximum all of the time. It is reporting the better of the absolute min/max or the Chebyshev min/max. This is because outliers can strongly influence a histogram and bias the results. Admittedly the outputs are misleading and the documentation needs to be fixed.
Can you provide a cube so that I can look more into this problem and potential solutions?
#2 Updated by Laszlo Kestay over 7 years ago
I agree that the output is misleading, but the solution is not to fix the documentation but to produce output that is scientifically useful. I cannot think of a scientific problem where the user wants the Chebyshev min/max when asking for a histogram. Yes, outliers strongly influence the results, hence they are important to include if one wants meaningful scientific results! The only data that should be automatically excluded are the invalid pixels.
#3 Updated by Jeff Anderson over 7 years ago
Tammy and Laz,
Let me see if I can explain the situation.
The Min/Max reported are truly the full range of the image data. Somehow we must represent a histogram of the data using a discrete number of bins. If by default we used the full range, the histogram would typically be underbinned. For example, say the user specified 255 bins (usual for histograms). Roughly 248 bins would have a count of zero, and only about 7 bins would contain any data (57/2040 = 3%). Essentially, the histogram would be mashed into a few bins.
So the program tries to help the user by picking a min/max for displaying the histogram. In fact, this is the same algorithm qview uses to pick a min/max for displaying the image. It computes the Chebyshev min/max at 14 standard deviations (average plus or minus 14 standard deviations), which by Chebyshev's inequality guarantees that at least about 99.5% of the data is encompassed in the histogram. For the histogram display minimum it uses Maximum(real image min, Chebyshev min), and for the maximum it uses Minimum(real image max, Chebyshev max). This helps make sure the data is spread nicely throughout the histogram.
Of course this isn't always what the user/scientist wants. Fortunately the "hist" program has the option to specify the min/max. The user simply must enter the real image min/max into the program to see the underbinned data. As Tammy noticed, data outside the min/max histogram range is not presented in the table. I don't believe we want to change the fundamental operation of the algorithm because it would ripple through the entire ISIS system. The data outside the min/max is included in the histogram counts.
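To make the display-range selection described above concrete, here is a minimal Python sketch. It is not the ISIS source. The function names `chebyshev_fraction` and `display_range` are hypothetical, and the use of a population standard deviation is an assumption. It only illustrates clamping the Chebyshev range to the real image range, and the lower bound that Chebyshev's inequality gives for k standard deviations.

```python
import statistics

K_SIGMA = 14.0  # number of standard deviations quoted in the comment above


def chebyshev_fraction(k):
    """Lower bound on the fraction of data within k standard deviations
    of the mean, from Chebyshev's inequality: 1 - 1/k^2."""
    return 1.0 - 1.0 / (k * k)


def display_range(values, k=K_SIGMA):
    """Return a (low, high) display range for a histogram.

    low  = max(real min, mean - k * sd)
    high = min(real max, mean + k * sd)

    Assumption: population standard deviation; the real program may differ.
    """
    avg = statistics.fmean(values)
    sd = statistics.pstdev(values)
    low = max(min(values), avg - k * sd)
    high = min(max(values), avg + k * sd)
    return low, high


if __name__ == "__main__":
    # Made-up data: many pixels near 1.0 and a single extreme outlier.
    pixels = [1.0] * 9999 + [2040.9]
    print(display_range(pixels))
    print(chebyshev_fraction(K_SIGMA))
```

With data like this, the high end of the display range falls well below the outlier, which is why the last DN in the table can be much smaller than the reported image maximum.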
However here are a few suggestions:
We could output the bin range instead of the bin center in the output file. For the first and last bins we could keep track of the count of points below/above the histogram display min/max. The only problem is that the bin size would be different for these two bins (compared to other bins), so users would need to plot and analyze their data carefully.
A second option would be to provide LDS and HDS (low display and high display) counts for data outside the histogram range. This may already be done using the HRS and LRS counts, but I will need to check this out.
Hope this makes sense.
#4 Updated by Tammy Becker over 7 years ago
I think I might need a picture of what you mean with underbin...or binning at all for a histogram...isn't every dn represented with a count??...so, if binning is a requirement and if I understand your explanation, could we have the following?
BTW, a change that would involve a 'ripple' through the entire ISIS system is a little concerning...does this mean every min/max reported is by Chebyshev rather than percent??
1) bin_option = [real image min/max; *default]
[user entered min/max]
Regardless of which the user chooses, please report the min/max values for both the real and Chebyshev options. Also, I really like your idea that when the Chebyshev min/max is selected, or the user has entered a min/max, the counts of points below/above are reported...even a LIS or HIS count would be helpful...
#5 Updated by Laszlo Kestay over 7 years ago
I understand you are trying to provide a visually pleasing/meaningful histogram. However, the answer is not to chuck data that is >14 standard deviations from the mean. Many other software packages have dealt with this issue. One solution is to go (optionally) to logarithmic bins. Another is to (optionally) bin by standard deviations. Another is to allow the user to define the bins, possibly unevenly. But the default should be to present the user with the ugly truth and then let them decide what to do with that. Having to go run a different program to determine what the actual min/max of an image is, then go type those in manually, to get a true histogram does not seem like a good solution. Again, the user of ISIS is after scientifically rigorous output, not a pretty looking plot. Do you have any science users who thought this was a good implementation?
#6 Updated by Tammy Becker over 7 years ago
I have attached two histogram csv output files. One is what I posted above and I added a second one which does not have such a high 'outlier'...
The files are on /usgs/shareall/tbecker/HistTest/
1) 5239r_cal_i3.cub (attached 5239_cal_i3_hist.csv)
2) 5026r_cal_i3.cub (attached 5026_cal_i3_hist.csv)
#7 Updated by Jeff Anderson over 7 years ago
After further discussions with Tammy, output of the range (min/max) of each bin would help address her issue. The bin range for the last line of the histogram output would have been roughly 57.354 to 2040.9. It would have been clearer that the image maximum was indeed included in the histogram and not "chucked out".
Please note this is contrary to Steven's and my previous posts. All pixels are included in the histogram. Data below/above the computed or user specified min/max are included in the first/last bins, respectively. This means the first and last bins have different widths. All the other bins have a uniform width.
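The behavior described here (out-of-range pixels counted in the first/last bins, with those two bins reported at their true, wider range) can be sketched as follows. This is an illustration only, not the ISIS implementation. `binned_counts` and `bin_ranges` are hypothetical names, and equal-width interior bins over the display range are assumed.

```python
def binned_counts(values, lo, hi, nbins):
    """Count values into nbins equal-width bins over [lo, hi].

    Values below lo go into the first bin and values above hi go into
    the last bin, so no pixel is dropped.
    """
    if nbins < 1 or hi <= lo:
        raise ValueError("need nbins >= 1 and hi > lo")
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(nbins - 1, idx))  # clamp outliers into the end bins
        counts[idx] += 1
    return counts


def bin_ranges(values, lo, hi, nbins):
    """Return one (bin_min, bin_max) pair per bin.

    The first and last bins are stretched to the real data extremes,
    so their widths can differ from the uniform interior bins.
    """
    if nbins < 1 or hi <= lo:
        raise ValueError("need nbins >= 1 and hi > lo")
    width = (hi - lo) / nbins
    ranges = [(lo + i * width, lo + (i + 1) * width) for i in range(nbins)]
    ranges[0] = (min(lo, min(values)), ranges[0][1])
    ranges[-1] = (ranges[-1][0], max(hi, max(values)))
    return ranges
```

Reporting the pair from `bin_ranges` next to each count makes it visible that the last bin extends to the image maximum, which is the clarification requested in this ticket.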
#8 Updated by Jeff Anderson over 7 years ago
I would recommend opening a new ticket in Mantis for the request of other binning options. A workaround for now is to run "fx" on the data (using log or sqrt). You would need to apply the inverse function to the bin min, max, and center columns in the CSV file.
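As a rough sketch of the inverse step in this workaround, the Python below maps bin columns from the transformed space back to DN. The row layout `(bin_min, bin_center, bin_max, count)` and the function name `invert_bins` are assumptions for illustration, since the actual CSV columns may differ. The inverse must match whatever forward function (and log base) was applied to the cube.

```python
import math


def invert_bins(rows, inverse):
    """Apply `inverse` to the bin min, center, and max of each row.

    rows    : iterable of (bin_min, bin_center, bin_max, count) tuples
              (hypothetical layout) in the transformed space
    inverse : function undoing the forward transform, e.g.
              lambda x: 10.0 ** x  for a base-10 log
              math.exp             for a natural log
              lambda x: x * x      for sqrt
    Counts are left unchanged.
    """
    return [
        (inverse(bmin), inverse(center), inverse(bmax), count)
        for bmin, center, bmax, count in rows
    ]


if __name__ == "__main__":
    log_rows = [(0.0, 0.5, 1.0, 4), (1.0, 1.5, 2.0, 1)]
    for row in invert_bins(log_rows, lambda x: 10.0 ** x):
        print(row)
```

Note that after a log transform the inverted center is the geometric mean of the inverted bin edges, not their arithmetic midpoint, and the bins are no longer of uniform width in DN.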
There are certainly options we should add to "hist" for visualizing the histogram (i.e., making a pretty plot). For instance, display the entire histogram (could be undesirable if you have a few extreme outliers) or display the histogram starting from the 0.5% to the 99.5% cumulative frequency (good for eliminating outliers). I do think you should take a moment and run "hist" in the full gui mode. You will see that it both outputs the min/max of the image and allows the user to change that min/max on subsequent runs. You don't have to run two different programs. If you would like I can show you.
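The cumulative-frequency option mentioned above could look something like this sketch. It is not an existing "hist" feature. `cumulative_range` is a hypothetical name, and the nearest-rank style index rule is one of several reasonable conventions.

```python
import math


def cumulative_range(values, low_pct=0.5, high_pct=99.5):
    """Return (low, high) DN values at the given cumulative percentages.

    Uses a simple rank rule on the sorted data (an assumption; other
    percentile conventions exist).
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no data")
    lo_idx = min(n - 1, int(n * low_pct / 100.0))
    hi_idx = max(0, math.ceil(n * high_pct / 100.0) - 1)
    return ordered[lo_idx], ordered[hi_idx]
```

Unlike a range based on standard deviations, this range is driven by pixel counts, so a single extreme outlier does not widen it.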
#9 Updated by Jeff Anderson over 7 years ago
Summary: I am holding off on making any fixes until we can document the program further. Fundamentally, we need to describe how the binning of the histogram is done. Then we need to decide if the binning is being done correctly. This is a perfect project for a new student. I will save this in my active queue until I hire a new student and will guide them through the process.
#15 Updated by Moses Milazzo almost 4 years ago
- Tracker changed from Documentation to Bug
This is being moved from a documentation issue to a bug and is being added to FY15Q3.
For scientific purposes, histograms need to represent what the user needs them to represent. Under/over binning may be required by the user, or un-equally-sized bins may be required, or an accurate representation of 16-sigma data relative to the rest of the data may be required. It should not be the default to throw away data when creating a histogram, even if those data seem to be outside of reasonableness; science is advanced on the edges of reasonableness and we need to see those data. Yes, sometimes they're truly just junk, but sometimes they tell us something important.
Obviously when this is fixed, the documentation will also need to be updated and will need to describe what is happening.