Data assimilation and retrieval systems that are based upon
maximum likelihood estimation, many of which are in operational
use, rely on the assumption that all of the errors and variables
involved follow a normal distribution. This work develops a series
of statistical tests to show that mixing ratio, temperature, wind
and surface pressure follow non-normal, or more specifically lognormal,
distributions, thus impacting the design basis of many operational
data assimilation and retrieval systems. For this study one year of
Global Forecast System 00:00 UTC 6 h forecasts is analyzed.
It has been documented several times that there are variables in the atmosphere that come from non-normal distributions (Biondini, 1976; López, 1977; Mielke et al., 1977; Toth and Szentimrey, 1990; Sauvageot, 1994; Yang and Pierrehumbert, 1994; Miles et al., 2000; O'Neill et al., 2000; Harmel et al., 2002; Stephens et al., 2002; Foster and Bevis, 2003; Zhang et al., 2003; Cho et al., 2004; Sengupta et al., 2004; Foster et al., 2006; Perron and Sura, 2013). It is shown in Fletcher (2010) that atmospheric variables may come from different probability distributions depending on the season. If this is the case, then the variables' inherent distributions could also be conditioned on large-scale climatic dynamics.
If a testing procedure can be established that determines the nature of a variable, then an appropriate analysis scheme may be chosen to suit the variable. Many numerical weather prediction centers include some form of variational data assimilation, or Kalman filtering, in their analysis and forecasting schemes, which depend on a normal distribution assumption for the error description. These centers include the Met Office (Rawlins et al., 2000), the European Centre for Medium-Range Weather Forecasts (Rabier et al., 2000), Météo-France (Fischer et al., 2005), the Meteorological Service of Canada (Gauthier et al., 2007), the Naval Research Laboratory (NRL) Atmospheric Variational Data Assimilation System-Accelerated Representer (Rosmond and Xu, 2006), and the National Centers for Environmental Prediction's Gridpoint Statistical Interpolation (Kleist et al., 2009). For more thorough reviews of variational data assimilation see Fletcher (2010) and Fletcher and Jones (2014).
In addition to operational data assimilation systems, the normal distribution assumption for the modeling of errors is also made in satellite retrieval systems, for example in the National Oceanic and Atmospheric Administration's Microwave Integrated Retrieval System (MiRS) (Boukabara et al., 2011), where a logarithmic transform is used to convert a lognormally distributed variable into a normally distributed one. This transform approach is also used in the Canadian Middle Atmosphere Model to make the state more normally distributed (Polavarapu et al., 2005). While moment statistics have been used to analyze atmospheric variables (Perron and Sura, 2013), as of this writing the authors are unaware of any testing procedure attempting to classify the statistical framework of mixing ratio, temperature, wind and surface pressure.
There have been previous studies that have shown that variables
including precipitation (Biondini, 1976; Mielke et al., 1977;
Sauvageot, 1994; Cho et al., 2004), total precipitable water (Foster
and Bevis, 2003; Foster et al., 2006), extreme temperatures (Toth and
Szentimrey, 1990; Harmel et al., 2002), cloud and radar echo
populations (López, 1977), cloud droplet size (Miles et al., 2000),
liquid water path (Sengupta et al., 2004; Stephens et al., 2002),
aerosol optical depth (O'Neill et al., 2000), tropical water vapor
(Zhang et al., 2003), and relative humidity (Yang and Pierrehumbert,
1994) do not conform to a normal distribution to describe their
behavior. A climatology of nine variables' distributional
characteristics is analyzed in Perron and Sura (2013). Some of the
studies considered spatial data while others used time-series data.
These studies used a variety of techniques to quantify the nature of
these distributions, including probability density fitting via moment
calculations.
Across a variety of disciplines it is often convenient, and somewhat innocuous, to treat measured variables as normally distributed in nature. This can misrepresent the inherent summary statistics due to a loss of information (e.g., lack of higher statistical moment information), and can be harmful within certain applications of the data. If a model, or algorithm, incorrectly assumes that a random variable is normally distributed then the properties of this distribution may skew its output.
A variable's probability distribution dictates the probabilistic solution
that is found when using data assimilation techniques. In 3-D variational
assimilation the cost function is given by
\[
J(\mathbf{x}) = \frac{1}{2}\left(\mathbf{x}-\mathbf{x}_b\right)^{\mathrm{T}}\mathbf{B}^{-1}\left(\mathbf{x}-\mathbf{x}_b\right) + \frac{1}{2}\left(\mathbf{y}-h(\mathbf{x})\right)^{\mathrm{T}}\mathbf{R}^{-1}\left(\mathbf{y}-h(\mathbf{x})\right),
\]
where $\mathbf{x}_b$ is the background state, $\mathbf{B}$ is the background error covariance matrix, $\mathbf{y}$ is the observation vector, $h$ is the observation operator, and $\mathbf{R}$ is the observation error covariance matrix.
Biases could also be introduced in data assimilation and retrieval systems that assume variables, and hence their errors, are normally distributed when they actually follow a non-normal distribution in nature. A clear example of where this can be problematic is if a computed value is physically impossible, such as relative humidity taking a negative value. This dubious value may be incorrectly incorporated into the analyses, or reset to a lower bound near zero. In either case this is certainly less desirable than solving for the correct value using an appropriate scheme that incorporates the correct underlying probability distribution. Recently, mixed normal-lognormal variational data assimilation methods have been developed in 3-D (Fletcher and Zupanski, 2006a, b, 2007), and in 4-D in Fletcher (2010). These initial full field formulations were not consistent with the current operational incremental configurations. However, a derivation and testing of a mixed multiplicative and additive incremental 3-D- and 4-D-VAR for a control vector that contains both normal and lognormally distributed variables is presented in Fletcher and Jones (2014).
Evidence of how an assimilation scheme improves based on the distribution of the observational errors is shown in Fletcher and Jones (2014). Using the Lorenz '63 chaotic model, the authors show that a lognormal-based cost function performs better than the current normal formulation given lognormal errors. Those conclusions result from tests with observations of varying accuracy and temporal sparseness, over different assimilation window lengths.
Given that there is now a mathematical framework for assimilating mixed normal-lognormally distributed variables/errors, techniques are needed that can inform the user of a mixed system when to switch between a full normal distribution-based version or a mixed normal-lognormal-based version to optimize the performance of the system and to make it consistent with the “current” observed probabilistic behavior.
Therefore, the motivation of this work is to design a set of tests that can be performed offline between cycles or windows, such that the configuration of the background error covariance matrix approximation, the cost function, the Jacobian and, if used, approximations to the Hessian can be ready for the next minimization step.
Given the motivation to detect a non-normal, specifically a lognormal, signal,
we use 1 year (2005) of Global Forecast System 00:00 UTC 6 h forecast data.
Transformation techniques are employed by operational centers for moisture (Bocquet et al., 2010); for example, the Navy Operational Global Atmospheric Prediction System (NOGAPS) previously used the logarithm of specific humidity (Eckermann et al., 2004), which is nearly equivalent to the mixing ratio (Dee and da Silva, 2003) analyzed in this study. In Fletcher and Zupanski (2007) it is shown that a logarithmic transformation finds the median in multivariate lognormal space, which is positively biased relative to the mode, or the most likely state.
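The bias noted by Fletcher and Zupanski (2007) can be illustrated numerically: for a lognormal variable, exponentiating the mean of the logarithms recovers the median exp(μ), which always exceeds the mode exp(μ − σ²). The following is a minimal sketch using NumPy; the parameter values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.8               # illustrative parameters of log(x)
x = rng.lognormal(mu, sigma, size=200_000)

# A logarithmic transform followed by averaging recovers the median exp(mu)...
median_est = np.exp(np.log(x).mean())
# ...which is positively biased relative to the mode exp(mu - sigma^2).
mode_true = np.exp(mu - sigma**2)

print(median_est, mode_true)
```

The gap between the two grows with the variance of the logarithms, which is why a log-transform retrieval can differ appreciably from the most likely state.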
In this work we propose using easily calculable statistics and hypothesis
testing to show that the variables described above show strong evidence of
a non-normal nature, or more specifically, a lognormal behavior. The
hypothesis tests considered in this paper include the Jarque–Bera,
Shapiro–Wilk, and chi-squared tests.
The remainder of this paper proceeds as follows: Sect. 2 describes the formulation of the hypothesis tests as well as the test statistics. In Sect. 3 the results of these tests are presented. In Sect. 4 conclusions and a discussion of the results of Sect. 3 are presented.
In this section the statistical methods that are used to detect a non-normal
distribution signal are presented, along with tests to determine whether the
distribution is lognormal. The random sample is assumed to be independent and
identically distributed (iid).
The samples' autocorrelation has been checked in order to verify the iid
assumption for the hypothesis tests. While there is some autocorrelation in
the samples, we attempt to minimize its effect by choosing such a small
For the Shapiro–Wilk and the Jarque–Bera tests (Hain, 2010) the following
hypotheses are defined, with
In all subsequent presentations of results a returned value of
In an attempt to combine both sets of hypotheses a new “composite test” is
defined. In this test, if both the Shapiro–Wilk and the Jarque–Bera tests
reject the null hypothesis that the data come from a normal distribution,
and the chi-squared test does not reject the hypothesis that the data come
from a lognormal distribution, then the sample is classified as lognormally
distributed.
As opposed to reporting the skewness and kurtosis of a particular time series as in Perron and Sura (2013), this information is used to make a decision about the distribution. While the structure of a hypothesis test includes a preconception about the data, multiple tests are combined simultaneously to test both directions of the normality assumption. This design reduces the chance of falsely classifying a sample as lognormally distributed. The authors are not aware of this technique having been previously applied.
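As a concrete illustration, the composite decision described above might be sketched as follows. This is an illustrative reconstruction using SciPy's `shapiro` and `jarque_bera` routines together with an equiprobable-bin chi-squared goodness-of-fit check against a fitted lognormal; the function name, bin count, and significance level are assumptions for the sketch, not the paper's exact implementation.

```python
import numpy as np
from scipy import stats

def composite_test(x, alpha=0.05):
    """Sketch of a composite test: flag a sample as lognormal when both
    normality tests reject H0 and a chi-squared goodness-of-fit test does
    not reject lognormality. Illustrative only."""
    _, p_sw = stats.shapiro(x)
    _, p_jb = stats.jarque_bera(x)
    rejects_normal = (p_sw < alpha) and (p_jb < alpha)

    # Chi-squared GOF against a lognormal fitted by maximum likelihood,
    # using 10 equiprobable bins from the empirical quantiles.
    shape, loc, scale = stats.lognorm.fit(x, floc=0.0)
    edges = np.quantile(x, np.linspace(0.0, 1.0, 11))
    obs, _ = np.histogram(x, bins=edges)
    cdf = stats.lognorm.cdf(edges, shape, loc=loc, scale=scale)
    exp_counts = len(x) * np.diff(cdf)
    chi2 = ((obs - exp_counts) ** 2 / exp_counts).sum()
    dof = len(obs) - 1 - 2          # bins - 1 - number of fitted parameters
    p_chi2 = stats.chi2.sf(chi2, dof)

    return bool(rejects_normal and (p_chi2 >= alpha))

rng = np.random.default_rng(1)
print(composite_test(rng.lognormal(0.0, 1.0, 365)))   # lognormal sample
print(composite_test(rng.normal(10.0, 1.0, 365)))     # normal sample
```

A 365-point sample mirrors the one-forecast-per-day time series used in this study.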
Let
Then the Shapiro–Wilk (SW) test statistic is given by
A thorough mathematical explanation of this statistic is presented in Hain
(2010). Razali and Wah (2011) found that the Shapiro–Wilk test
outperforms the Kolmogorov–Smirnov, Lilliefors, and
Anderson–Darling tests in power for both symmetric and non-symmetric
distributions across a range of sample sizes. The power of a test is the
probability of not committing a Type II error, which occurs when a false
null hypothesis fails to be rejected.
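In practice the Shapiro–Wilk statistic need not be assembled by hand; SciPy, for instance, exposes it as `scipy.stats.shapiro`. The following hedged example contrasts a plausibly normal sample with a lognormal one (sample size and parameters are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# W is close to 1 for plausibly normal data; small p-values reject H0.
w_norm, p_norm = stats.shapiro(rng.normal(0.0, 1.0, 365))
w_logn, p_logn = stats.shapiro(rng.lognormal(0.0, 1.0, 365))

print(f"normal sample:    W={w_norm:.3f}, p={p_norm:.3g}")
print(f"lognormal sample: W={w_logn:.3f}, p={p_logn:.3g}")
```

The skewed lognormal sample yields a markedly lower W and a p-value far below any conventional significance level.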
Clear differences between the normal and lognormal distributions include
skewness and kurtosis. Skewness determines the asymmetry of
a distribution; this statistic can be positive, negative or zero and is
related to the third moment of a random variable's probability distribution.
Kurtosis, related to the fourth moment, measures how peaked the distribution
is. Descriptions of these statistics can be found in Casella and Berger
(2002). The Jarque–Bera test combines these statistics to determine their
goodness-of-fit to a normal distribution. If the distribution is normal,
then asymptotically the Jarque–Bera (JB) test statistic has a chi-squared
distribution with two degrees of freedom.
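For illustration, the JB statistic can be assembled directly from the sample skewness S and excess kurtosis K as JB = (n/6)(S² + K²/4) and compared against the chi-squared distribution with two degrees of freedom. This sketch uses SciPy; the sample parameters are illustrative.

```python
import numpy as np
from scipy import stats

def jarque_bera_stat(x):
    """JB = n/6 * (S^2 + K^2/4), with S the sample skewness and K the
    excess kurtosis; asymptotically chi-squared with 2 dof under H0."""
    n = len(x)
    s = stats.skew(x)
    k = stats.kurtosis(x)        # Fisher definition: excess kurtosis
    return n / 6.0 * (s**2 + k**2 / 4.0)

rng = np.random.default_rng(7)
x = rng.lognormal(0.0, 0.5, 1000)
jb = jarque_bera_stat(x)
p = stats.chi2.sf(jb, df=2)
print(jb, p)
```

The same value is returned by `scipy.stats.jarque_bera`, which can serve as a cross-check.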
With the null hypothesis of the chi-squared test being that the data come
from a lognormal distribution, the test statistic compares expected and
observed frequencies across bins of the data.
Much more can be said about these hypothesis tests, but that is outside the scope of this paper. Those details are left out in favor of the application results as applied to the GFS data.
The normal and lognormal probability density functions are fitted to the data
using the maximum likelihood technique. For an independent and identically
distributed sample
For each sample point
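As a sketch of this fitting step (not the authors' code), SciPy provides maximum likelihood fits for both candidate densities; for the lognormal with the location fixed at zero, the MLE reduces to the mean and standard deviation of the logarithms. The sample below is a stand-in with illustrative parameters.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(mean=-1.0, sigma=0.6, size=365)   # stand-in for a mixing-ratio series

# Normal MLE: the sample mean and (biased) sample standard deviation.
mu_hat, sd_hat = stats.norm.fit(x)

# Lognormal MLE with location fixed at zero: shape is the standard
# deviation of log(x) and scale = exp(mean(log x)).
shape, loc, scale = stats.lognorm.fit(x, floc=0.0)

print(mu_hat, sd_hat)
print(shape, np.log(scale))    # compare with sigma=0.6 and mean=-1.0
```

Both fitted densities can then be overlaid on a histogram of the sample, as in the figures discussed below.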
The time-series hypothesis tests for mixing ratio and temperature produced numerous figures and data plots displaying the non-normal and lognormal nature of the GFS data. An overview of these results is presented along with a more detailed analysis of specific points of interest. Instead of presenting the results of the Shapiro–Wilk, Jarque–Bera, and chi-squared tests individually, only the results of the composite test, which incorporates all of the results simultaneously, are shown.
For each point of the GFS data, a forecast from each day from
1 January 2005 through 31 December 2005 makes up the random variable
A tabulated view of all of the test results can be seen in Fig. 1.
Frequencies depict how often the Shapiro–Wilk and Jarque–Bera tests reject
the null hypothesis.
Figure 2 shows the results of the composite test at 300 hPa.
To see what a sample of the data actually looks like, consider Fig. 4. These
data are at 300 hPa.
Another location of interest which experiences significant continental air
masses (Trewartha and Horn, 1971) is in central North America where tornadoes
frequently develop. Figure 5 shows the data and probability fits at
300 hPa.
Figure 6 shows the data and distribution fits for a point in the tropical
cyclone formation region in the North Atlantic at 500 hPa.
For a location near Japan at 850 hPa
Closer inspection of many more vertical levels and locations could be shown but are omitted due to limitations of space.
Similar to Fig. 1 for mixing ratio, statistical test results are presented
for temperature in Fig. 10. The composite test shows that
the non-normal and lognormal signals are much less pronounced
for temperature than for mixing ratio. However, there are still numerous
occurrences as determined by the strict hypothesis tests. Inspection of the
composite test results for 500 and 700 hPa
By looking at the results of the Shapiro–Wilk and Jarque–Bera tests, there
are occurrences where the temperature data is seen to come from a non-normal
distribution. There are 77 points out of 65 160 where the data for all of
2005 and each season is not normally distributed, i.e. the null hypothesis is
rejected for these tests on all time domains. All but one of these points are
in the Southern Hemisphere, with a majority of points falling between 500 and
1000 hPa.
While surface pressure is a positive definite random variable, the chi-squared test indicated no instances of lognormal behavior. This is a result of the data typically being right-skewed if the normal assumption is rejected.
While non-normal behavior is not as prevalent in surface pressure as in mixing ratio, its frequency can be seen in Fig. 13. Here the composite test indicates the frequency with which the Jarque–Bera and Shapiro–Wilk tests reject the null hypothesis, omitting the chi-squared test. Spatial coverage of the composite test is shown in Fig. 14.
An interesting presentation of the number of seasons in which the normality assumption is rejected by the composite test is shown in Fig. 15. Here, areas over the ocean are seen to have non-normally distributed surface pressure more often than areas over land.
Since the GFS wind data is not a positive definite random variable, the lognormal distribution is not a viable candidate to capture its shape or spread. Therefore, for wind, the composite test now reports when both the Shapiro–Wilk and Jarque–Bera tests simultaneously reject the null hypothesis that the data come from a normal distribution. Since a much more thorough review of the probability distributions of wind has been conducted by Carta et al. (2009), only a brief summary of results is presented here; these corroborate the non-normal behavior of wind that has been previously observed.
Figures 16 and 17 show the frequency that each test rejected normality as
well as where they overlap in the composite test for the
Closer inspection of the nature of the skewed and bi-modal behavior of
Given these results for mixing ratio, temperature, surface pressure, and
wind, a real-time detection method may include a moving-average that includes
the last
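One way such a moving-window detector could look is sketched below; this is illustrative only, and the window length, the choice of tests, and the decision rule are assumptions rather than the paper's specification.

```python
import numpy as np
from scipy import stats

def rolling_normality_flags(series, window=90, alpha=0.05):
    """Illustrative sketch (not the authors' implementation): re-run the
    two normality tests over a trailing window so an assimilation system
    could switch cost functions when recent data stop looking normal."""
    flags = []
    for t in range(window, len(series) + 1):
        x = series[t - window:t]
        _, p_sw = stats.shapiro(x)
        _, p_jb = stats.jarque_bera(x)
        flags.append(bool(p_sw < alpha and p_jb < alpha))  # both reject normality
    return np.array(flags)

rng = np.random.default_rng(5)
year = np.concatenate([rng.normal(5.0, 1.0, 183),      # first half: normal
                       rng.lognormal(1.0, 0.9, 182)])  # second half: lognormal
flags = rolling_normality_flags(year)
print(flags[:5], flags[-5:])
```

In such a setup the flag flips once the trailing window is dominated by the skewed regime, which is the kind of signal that could trigger a switch to a mixed cost function.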
In this section different variables, vertical levels, time domains, and locations have been presented demonstrating non-normal, lognormal, or neither behavior. Given the prevalence of non-normally distributed random variables, the necessity of checking the distributional nature of the data has been demonstrated.
Since mixing ratio and temperature have been shown to be non-normally distributed, and in many cases appear to be lognormally distributed, 3-D- and 4-D-VAR data assimilation schemes that include lognormal cost functions for both the observations and the a priori background may be required for more accurate results. This would have implications for the forecast skill of a DA system, or for a retrieval system, as the analysis state from the minimization of the mixed distribution cost function should be consistent with the probabilistic behavior of the true state. The normal assumption, while convenient and easily adaptable, may need to be more carefully considered in light of these results.
While it is true that a lognormal distribution with a small variance looks very similar to a normal distribution, the detection methods used in this paper attempt to operationally handle large amounts of data, similar to the resolution of an inner loop in incremental data assimilation schemes. It is to this end that these statistical procedures have been demonstrated in order to understand the true nature of atmospheric variables.
The time-series data clearly indicate that mixing ratio and temperature follow a lognormal distribution in certain areas. These results highlight that the normal distribution assumption is not a valid basis for data assimilation and variational-based retrieval systems, and suggest that more research is needed to study the impact of assuming a normal distribution on forecast skill, variational observational quality control, and the gross error check (Lorenc and Hammon, 1988).
Therefore this work suggests that statistical climatology tests need to be developed on a seasonal, or possibly a monthly, basis, as the distributions that are found for specific variables indicate which distribution's cost function should be used in the assimilation schemes as a function of space and time. Ideally a real-time decision about how the data are statistically structured would be made, ensuring that the correct scheme is chosen. In either case, the goal is that an objective decision methodology be available for choosing an appropriate scheme based on the nature of the data. The choice of the observational conditions under which to apply alternative Bayesian models can now be made objectively through the procedure demonstrated in this work.
Future work can consider longer time-series, more vertical levels, other atmospheric variables such as column water vapor when a boundary layer cloud is present as seen in Fletcher (2010), and other statistical methods including the Akaike information criterion (Akaike, 1974). The possible future benefit of the Akaike information criterion (AIC) is that it detects the best distribution for a random variable based on information theory which could then give guidance for what other distributions need to be included in the variational cost function. AIC balances the goodness-of-fit of a distribution while minimizing the number of model parameters.
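A sketch of how AIC could arbitrate between the two candidate distributions is given below; it is illustrative only, with both models carrying two free parameters and the sample standing in for a real time series.

```python
import numpy as np
from scipy import stats

def aic(log_likelihood, n_params):
    """AIC = 2k - 2 ln(L); lower values indicate a better trade-off
    between goodness-of-fit and model complexity."""
    return 2 * n_params - 2 * log_likelihood

rng = np.random.default_rng(11)
x = rng.lognormal(0.0, 0.7, 365)

mu, sd = stats.norm.fit(x)
aic_norm = aic(stats.norm.logpdf(x, mu, sd).sum(), 2)

shape, loc, scale = stats.lognorm.fit(x, floc=0.0)
aic_logn = aic(stats.lognorm.logpdf(x, shape, loc, scale).sum(), 2)

print(aic_norm, aic_logn)   # the lognormal fit should yield the smaller AIC
```

Extending the candidate set beyond these two densities is straightforward, which is precisely the guidance role envisioned for AIC above.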
It has been shown in Fletcher and Jones (2014) that there is a negative impact on the performance of a normal-distribution-only incremental 4-D-VAR when lognormally distributed observations are assimilated. However, when the same observations were assimilated in a lognormal-based incremental 4-D-VAR, there was no negative impact on the analysis error. Therefore, determining which distribution the observations and their errors come from is important to minimize the impact of these errors on the analysis of a DA system and the subsequent forecast. In this paper methodologies have been developed and tested with the 2005 GFS 00:00 UTC 6 h forecasts, and it has been shown that there are lognormal signals in the forecasts. This suggests a need for statistical climatologies to be developed and for these climatologies to be linked in near real-time with data assimilation and retrieval systems.
Let
Without loss of generality consider the univariate case. For a random
variable
Since this equation is an exponential, the sum
Equation (A1) can be written as
If it is assumed that
Section 3 presents results showing that atmospheric random variables can have
a non-normal, or in particular, a lognormal distribution. This would imply
that the right-hand side of Eq. (A4) would be the sum of a normal and
a lognormal distribution. An assumption such as this for the sought-after
state
This work is primarily supported by the National Science Foundation
via grant AGS-1038790 at CIRA/Colorado State University and the GFS
data were obtained from the National Climatic Data Center at
Composite results for water vapor mixing ratio for
Similar to Fig. 2, composite results for mixing ratio for
Histograms along with normal and lognormal probability
distributions for
Similar to Fig. 4 at 300 hPa
Similar to Fig. 4 at 500 hPa
Location near Japan at 850 hPa
Similar to Fig. 2, composite results for
Similar to Fig. 2, composite results for
Frequency of each test result for temperature on every time domain and atmospheric level similar to Fig. 1. There are a significant number of points where non-normal and lognormally-distributed data appear, both annually and seasonally.
Temperature data for a point near Taiwan where the Shapiro–Wilk and Jarque–Bera tests conclude that the data are non-normally distributed.
Temperature data for a point in Australia where the Shapiro–Wilk and Jarque–Bera tests conclude that the data are non-normally distributed.
Similar to Fig.
Frequency (0–4) of seasons determined to be non-normal by the composite test.
Similar to Fig. 4, histograms along with a normal probability
distribution for