kwarray.util_robust module

Functions relating to robust statistical methods for normalizing data.

kwarray.util_robust.find_robust_normalizers(data, params='auto')

Finds robust normalization statistics for a set of scalar observations.

The idea is to estimate “fence” parameters: minimum and maximum values where anything below or above them is likely an outlier. For non-linear normalization schemes we can also estimate a likely middle and extent of the data.

Parameters:

data (ndarray) – a 1D numpy array where invalid data has already been removed
params (str | dict) – normalization params.

When passed as a dictionary valid params are:

scaling (str):
This is the “mode” that will be used in the final normalization. Currently has no impact on the Defaults to ‘linear’. Can also be ‘sigmoid’.

extrema (str):
The method for determining what the extrema are. Can be “quantile” for strict quantile clipping. Can be “adaptive-quantile” for an IQR-like adjusted quantile method. Can be “tukey” or “IQR” for an exact IQR method.

low (float): This is the low quantile for likely inliers.

mid (float): This is the middle quantile for likely inliers.

high (float): This is the high quantile for likely inliers.

Can be specified as a concise string.

The string “auto” defaults to:
dict(extrema='adaptive-quantile', scaling='linear', low=0.01, mid=0.5, high=0.9).

The string “tukey” defaults to:
dict(extrema='tukey', scaling='linear').
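
The expansion of concise strings described above can be pictured as a lookup followed by a merge. The following is a minimal sketch only: `resolve_params` and `_PRESETS` are hypothetical names that are not part of kwarray, and the idea that a partial dictionary falls back to the “auto” defaults is an assumption, not something the documentation confirms.

```python
# Hypothetical sketch: not kwarray's implementation.
_PRESETS = {
    'auto': dict(extrema='adaptive-quantile', scaling='linear',
                 low=0.01, mid=0.5, high=0.9),
    'tukey': dict(extrema='tukey', scaling='linear'),
}


def resolve_params(params='auto'):
    """Expand a concise string or a partial dict into a params dict."""
    if isinstance(params, str):
        if params not in _PRESETS:
            raise KeyError(f'unknown preset: {params!r}')
        # Copy so callers cannot mutate the preset table.
        return dict(_PRESETS[params])
    # Assumption: keys the user did not specify use the 'auto' defaults.
    resolved = dict(_PRESETS['auto'])
    resolved.update(params)
    return resolved
```

Under these assumptions, `resolve_params({'low': 0, 'high': 1.0})` keeps `mid=0.5` and `extrema='adaptive-quantile'` while overriding the two quantiles.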

Returns:

normalization parameters that can be passed to kwarray.normalize() containing the keys:

type (str): which is always ‘normalize’

mode (str): the value of params['scaling']

min_val (float): the determined “robust” minimum inlier value.

max_val (float): the determined “robust” maximum inlier value.

beta (float): the determined “robust” middle value for use in non-linear normalizers.

alpha (float): the determined “robust” extent value for use in non-linear normalizers.

Return type:

Dict[str, str | float]
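
To make the shape of the returned dictionary concrete, here is a rough sketch of the simplest case, strict quantile clipping, written with plain NumPy. The function name `quantile_normalizer_sketch`, the handling of empty input, and the way `alpha` is derived are all assumptions made for illustration; the real function may compute these values differently.

```python
import numpy as np


def quantile_normalizer_sketch(data, low=0.01, mid=0.5, high=0.9,
                               scaling='linear'):
    """Strict quantile fences; a simplified stand-in, not kwarray's code."""
    data = np.asarray(data, dtype=float).ravel()
    if data.size == 0:
        # Assumption: with no observations, nothing can be estimated.
        min_val = beta = max_val = alpha = None
    else:
        min_val, beta, max_val = (
            float(v) for v in np.quantile(data, [low, mid, high]))
        # Assumption: the "extent" is the larger distance from the middle
        # value to either fence.
        alpha = max(abs(min_val - beta), abs(max_val - beta))
    return {
        'type': 'normalize',
        'mode': scaling,
        'min_val': min_val,
        'max_val': max_val,
        'beta': beta,
        'alpha': alpha,
    }
```

The keys mirror the ones listed above, so a dictionary of this shape (minus `type`) lines up with the keyword arguments of `normalize()`.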

Note

The defaults and methods of this function are subject to change.

Todo

[ ] No (or minimal) Magic Numbers! Use first principles to determine defaults.
[ ] Probably a lot of literature on the subject.
[ ] https://arxiv.org/pdf/1707.09752.pdf
[ ] https://www.tandfonline.com/doi/full/10.1080/02664763.2019.1671961
[ ] https://www.rips-irsp.com/articles/10.5334/irsp.289/
[ ] This function is not possible to get right in every case
(this can probably be proven with an NFL theorem); it might be useful to allow the user to specify a “model” which is specific to some domain.

Example

>>> from kwarray.util_robust import *  # NOQA
>>> import numpy as np
>>> import ubelt as ub
>>> data = np.random.rand(100)
>>> norm_params1 = find_robust_normalizers(data, params='auto')
>>> norm_params2 = find_robust_normalizers(data, params={'low': 0, 'high': 1.0})
>>> norm_params3 = find_robust_normalizers(np.empty(0), params='auto')
>>> print('norm_params1 = {}'.format(ub.urepr(norm_params1, nl=1)))
>>> print('norm_params2 = {}'.format(ub.urepr(norm_params2, nl=1)))
>>> print('norm_params3 = {}'.format(ub.urepr(norm_params3, nl=1)))

Example

>>> # xdoctest: +REQUIRES(module:scipy)
>>> from kwarray.util_robust import *  # NOQA
>>> from kwarray.distributions import Mixture
>>> import ubelt as ub
>>> # A random mixture distribution for testing
>>> data = Mixture.random(6).sample(3000)

kwarray.util_robust._tukey_quantile_fence(data, clip=False)

One might wonder where the 1.5 in the above interval comes from – Paul Velleman, a statistician at Cornell University, was a student of John Tukey, who invented this test for outliers. He wondered the same thing. When he asked Tukey, “Why 1.5?”, Tukey answered, “Because 1 is too small and 2 is too large.” [OxfordShapeSpread].

References

[OxfordShapeSpread]

http://mathcenter.oxford.emory.edu/site/math117/shapeCenterAndSpread/

[YTFindOutliers]

https://www.youtube.com/watch?v=zY1WFMAA-ec
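
For readers unfamiliar with the interval that the 1.5 refers to, the standard Tukey fences place the lower bound at Q1 - 1.5 * IQR and the upper bound at Q3 + 1.5 * IQR. The sketch below shows that textbook rule with NumPy. It is not the private function itself: the name `tukey_fence_sketch` is hypothetical, and the meaning given to `clip` (restricting the fences to the observed range) is an assumption.

```python
import numpy as np


def tukey_fence_sketch(data, clip=False):
    """Textbook Tukey fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.quantile(data, [0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    if clip:
        # Assumption: clipping keeps the fences inside the observed range.
        lower = max(lower, data.min())
        upper = min(upper, data.max())
    return float(lower), float(upper)


# The value 100 lies above the upper fence, so it would be flagged.
lower, upper = tukey_fence_sketch([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```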

kwarray.util_robust._quantile_extreme_estimator(data, params)

kwarray.util_robust._custom_quantile_extreme_estimator(data, params)

kwarray.util_robust.robust_normalize(imdata, return_info=False, nodata=None, axis=None, dtype=<class 'numpy.float32'>, params='auto', mask=None)

Normalize data intensities using heuristics to help put sensor data with extremely high or low contrast into a visible range.

This function is designed with an emphasis on getting something that is reasonable for visualization.

Todo

[x] Move to kwarray and rename to robust_normalize?
[ ] Support for M-estimators?

Parameters:

imdata (ndarray) – raw intensity data
return_info (bool) – if True, return information about the chosen normalization heuristic.
params (str | dict) – Can contain the keys low, mid, high, scaling, and extrema, e.g. {‘low’: 0.1, ‘mid’: 0.8, ‘high’: 0.9, ‘scaling’: ‘sigmoid’}. See the documentation of find_robust_normalizers().
axis (None | int) – The axis to normalize over, if unspecified, normalize jointly
nodata (None | int) – A value representing nodata to leave unchanged during normalization, for example 0
dtype (type) – can be float32 or float64
mask (ndarray | None) – A mask indicating what pixels are valid and what pixels should be considered nodata. Mutually exclusive with nodata argument. A mask value of 1 indicates a VALID pixel. A mask value of 0 indicates an INVALID pixel. Note this is the opposite of a masked array.

Returns:

a floating point array with values between 0 and 1. If return_info is True, also returns information about the chosen normalization.

Return type:

ndarray | Tuple[ndarray, Any]

Note

This is effectively a combination of find_robust_normalizers() and normalize().
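
The combination described in this note, together with the mask and nodata conventions from the parameter list, can be sketched in plain NumPy. This is an illustration under stated assumptions, not the library's code: `robust_linear_sketch` is a hypothetical name, it uses strict quantiles in place of the default adaptive method, and it only covers linear scaling over the whole array.

```python
import numpy as np


def robust_linear_sketch(imdata, mask=None, nodata=None, low=0.01, high=0.9):
    """Estimate quantile fences on valid pixels, then rescale linearly."""
    imdata = np.asarray(imdata)
    if mask is not None and nodata is not None:
        raise ValueError('mask and nodata are mutually exclusive')
    if nodata is not None:
        valid = imdata != nodata
    elif mask is not None:
        # 1 means VALID here, the opposite of numpy.ma's convention.
        valid = np.asarray(mask).astype(bool)
    else:
        valid = np.ones(imdata.shape, dtype=bool)
    out = imdata.astype(np.float32)
    values = out[valid]
    if values.size:
        min_val, max_val = np.quantile(values, [low, high])
        span = max_val - min_val
        if span > 0:
            out[valid] = np.clip((values - min_val) / span, 0, 1)
        else:
            # Assumption: constant data maps to 0.
            out[valid] = 0
    # Invalid pixels keep their original values.
    return out
```

To reuse an existing NumPy masked array `marr` with this convention, the mask has to be inverted first, for example `valid = ~np.ma.getmaskarray(marr)`.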

Example

>>> # xdoctest: +REQUIRES(module:scipy)
>>> from kwarray.util_robust import *  # NOQA
>>> from kwarray.distributions import Mixture
>>> import ubelt as ub
>>> # A random mixture distribution for testing
>>> data = Mixture.random(6).sample(3000)
>>> param_basis = {
>>>     'scaling': ['linear', 'sigmoid'],
>>>     'high': [0.6, 0.8, 0.9, 1.0],
>>> }
>>> param_grid = list(ub.named_product(param_basis))
>>> param_grid += ['auto']
>>> param_grid += ['tukey']
>>> rows = []
>>> rows.append({'key': 'orig', 'result': data})
>>> for params in param_grid:
>>>     key = ub.urepr(params, compact=1)
>>>     result, info = robust_normalize(data, return_info=True, params=params)
>>>     print('key = {}'.format(key))
>>>     print('info = {}'.format(ub.urepr(info, nl=1)))
>>>     rows.append({'key': key, 'info': info, 'result': result})
>>> # xdoctest: +REQUIRES(--show)
>>> import seaborn as sns
>>> import kwplot
>>> kwplot.autompl()
>>> pnum_ = kwplot.PlotNums(nSubplots=len(rows))
>>> for row in rows:
>>>     ax = kwplot.figure(fnum=1, pnum=pnum_()).gca()
>>>     sns.histplot(data=row['result'], kde=True, bins=128, ax=ax, stat='density')
>>>     ax.set_title(row['key'])

Example

>>> # xdoctest: +REQUIRES(module:kwimage)
>>> from kwarray.util_robust import *  # NOQA
>>> import ubelt as ub
>>> import kwimage
>>> import kwarray
>>> s = 512
>>> bit_depth = 11
>>> dtype = np.uint16
>>> max_val = int(2 ** bit_depth)
>>> min_val = int(0)
>>> rng = kwarray.ensure_rng(0)
>>> background = np.random.randint(min_val, max_val, size=(s, s), dtype=dtype)
>>> poly1 = kwimage.Polygon.random(rng=rng).scale(s / 2)
>>> poly2 = kwimage.Polygon.random(rng=rng).scale(s / 2).translate(s / 2)
>>> foreground = np.zeros_like(background, dtype=np.uint8)
>>> foreground = poly1.fill(foreground, value=255)
>>> foreground = poly2.fill(foreground, value=122)
>>> foreground = (kwimage.ensure_float01(foreground) * max_val).astype(dtype)
>>> imdata = background + foreground
>>> normed, info = kwarray.robust_normalize(imdata, return_info=True)
>>> print('info = {}'.format(ub.urepr(info, nl=1)))
>>> # xdoctest: +REQUIRES(--show)
>>> import kwplot
>>> kwplot.autompl()
>>> kwplot.imshow(imdata, pnum=(1, 2, 1), fnum=1)
>>> kwplot.imshow(normed, pnum=(1, 2, 2), fnum=1)

_images/fig_kwarray_util_robust_robust_normalize_002.jpeg

kwarray.util_robust._apply_robust_normalizer(normalizer, imdata, imdata_valid, mask, dtype, copy=True)

Todo

Abstract into a scikit-learn-style Normalizer class which can fit/predict different types of normalizers.

kwarray.util_robust.normalize(arr, mode='linear', alpha=None, beta=None, out=None, min_val=None, max_val=None)

Normalizes input values based on a specified scheme.

The default behavior is a linear normalization between 0.0 and 1.0 based on the min/max values of the input. Parameters can be specified to achieve more general contrast stretching or signal rebalancing. Implements the linear and sigmoid normalization methods described in [WikiNorm].

Parameters:

arr (NDArray) – array to normalize, usually an image
out (NDArray | None) – output array. Note that we will create an internal floating point copy for integer computations.
mode (str) – either linear or sigmoid.
alpha (float) – Only used if mode=sigmoid. Division factor (pre-sigmoid). If unspecified, computed as: max(abs(old_min - beta), abs(old_max - beta)) / 6.212606. Note this parameter is sensitive to whether the input is a float or uint8 image.
beta (float) – subtractive factor (pre-sigmoid). This should be the intensity of the most interesting bits of the image, i.e. bring them to the center (0) of the distribution. Defaults to (max - min) / 2. Note this parameter is sensitive to whether the input is a float or uint8 image.
min_val – inputs lower than this minimum value are clipped
max_val – inputs higher than this maximum value are clipped.

SeeAlso:

find_robust_normalizers() - determine robust parameters for normalize to mitigate the effect of outliers.
robust_normalize() - finds and applies robust normalization parameters
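
The two modes and the roles of alpha and beta can be summarized with a small NumPy sketch. It is not the library's implementation: `normalize_sketch` is a hypothetical name, there is no `out` support, and the default `beta` is assumed to be the midpoint of the clipped range, which coincides with the documented (max - min) / 2 only when the minimum is 0. The constant 6.212606 is approximately logit(0.998), so the default `alpha` maps the range endpoints to roughly 0.002 and 0.998.

```python
import numpy as np


def normalize_sketch(arr, mode='linear', alpha=None, beta=None,
                     min_val=None, max_val=None):
    """Linear or sigmoid rescaling into [0, 1]; a simplified sketch."""
    arr = np.asarray(arr, dtype=np.float64)
    old_min = arr.min() if min_val is None else min_val
    old_max = arr.max() if max_val is None else max_val
    # Inputs outside [old_min, old_max] are clipped first.
    arr = np.clip(arr, old_min, old_max)
    span = old_max - old_min
    if mode == 'linear':
        if span == 0:
            return np.zeros_like(arr)
        return (arr - old_min) / span
    if mode == 'sigmoid':
        if beta is None:
            # Assumption: center on the midpoint of the clipped range.
            beta = old_min + span / 2
        if alpha is None:
            alpha = max(abs(old_min - beta), abs(old_max - beta)) / 6.212606
        if alpha == 0:
            alpha = 1
        # Subtract beta, divide by alpha, then squash with a logistic curve.
        return 1.0 / (1.0 + np.exp(-(arr - beta) / alpha))
    raise KeyError(mode)


x = np.array([0.0, 25.0, 50.0, 100.0])
lin = normalize_sketch(x, mode='linear')
sig = normalize_sketch(x, mode='sigmoid')
```

With these defaults the linear result spans exactly 0 to 1, while the sigmoid result stays strictly inside that interval and places the midpoint of the range at 0.5.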

References

[WikiNorm]

https://en.wikipedia.org/wiki/Normalization_(image_processing)

Example

>>> import numpy as np
>>> from math import isclose
>>> from kwarray import normalize
>>> raw_f = np.random.rand(8, 8)
>>> norm_f = normalize(raw_f)

>>> raw_f = np.random.rand(8, 8) * 100
>>> norm_f = normalize(raw_f)
>>> assert isclose(norm_f.min(), 0)
>>> assert isclose(norm_f.max(), 1)

>>> raw_u = (np.random.rand(8, 8) * 255).astype(np.uint8)
>>> norm_u = normalize(raw_u)

Example

>>> # xdoctest: +REQUIRES(module:kwimage)
>>> import kwimage
>>> arr = kwimage.grab_test_image('lowcontrast')
>>> arr = kwimage.ensure_float01(arr)
>>> norms = {}
>>> norms['arr'] = arr.copy()
>>> norms['linear'] = normalize(arr, mode='linear')
>>> # xdoctest: +REQUIRES(module:scipy)
>>> norms['sigmoid'] = normalize(arr, mode='sigmoid')
>>> # xdoctest: +REQUIRES(--show)
>>> import kwplot
>>> kwplot.autompl()
>>> kwplot.figure(fnum=1, doclf=True)
>>> pnum_ = kwplot.PlotNums(nSubplots=len(norms))
>>> for key, img in norms.items():
>>>     kwplot.imshow(img, pnum=pnum_(), title=key)

_images/fig_kwarray_util_robust_normalize_002.jpeg

Example

>>> # xdoctest: +REQUIRES(module:kwimage)
>>> arr = np.array([np.inf])
>>> normalize(arr, mode='linear')
>>> # xdoctest: +REQUIRES(module:scipy)
>>> normalize(arr, mode='sigmoid')

Benchmark

>>> # Our method is faster than standard in-line implementations for
>>> # uint8 and competitive with in-line float32, in addition to being
>>> # more concise and configurable. In 3.11 all inplace variants are
>>> # faster.
>>> # xdoctest: +REQUIRES(module:kwimage)
>>> import timerit
>>> import ubelt as ub
>>> import kwimage
>>> import kwarray
>>> ti = timerit.Timerit(1000, bestof=10, verbose=2, unit='ms')
>>> arr = kwimage.grab_test_image('lowcontrast', dsize=(512, 512))
>>> #
>>> arr = kwimage.ensure_float01(arr)
>>> out = arr.copy()
>>> for timer in ti.reset('inline_naive(float)'):
>>>     with timer:
>>>         (arr - arr.min()) / (arr.max() - arr.min())
>>> #
>>> for timer in ti.reset('inline_faster(float)'):
>>>     with timer:
>>>         max_ = arr.max()
>>>         min_ = arr.min()
>>>         result = (arr - min_) / (max_ - min_)
>>> #
>>> for timer in ti.reset('kwarray.normalize(float)'):
>>>     with timer:
>>>         kwarray.normalize(arr)
>>> #
>>> for timer in ti.reset('kwarray.normalize(float, inplace)'):
>>>     with timer:
>>>         kwarray.normalize(arr, out=out)
>>> #
>>> arr = kwimage.ensure_uint255(arr)
>>> out = arr.copy()
>>> for timer in ti.reset('inline_naive(uint8)'):
>>>     with timer:
>>>         (arr - arr.min()) / (arr.max() - arr.min())
>>> #
>>> for timer in ti.reset('inline_faster(uint8)'):
>>>     with timer:
>>>         max_ = arr.max()
>>>         min_ = arr.min()
>>>         result = (arr - min_) / (max_ - min_)
>>> #
>>> for timer in ti.reset('kwarray.normalize(uint8)'):
>>>     with timer:
>>>         kwarray.normalize(arr)
>>> #
>>> for timer in ti.reset('kwarray.normalize(uint8, inplace)'):
>>>     with timer:
>>>         kwarray.normalize(arr, out=out)
>>> print('ti.rankings = {}'.format(ub.urepr(
>>>     ti.rankings, nl=2, align=':', precision=5)))

_images/fig_kwarray_util_robust_normalize_003.jpeg