Statistical Formulas


 

Here is a more detailed description of the numbers displayed in Walls dialogs -- namely the UVEs that rank loop systems, the F-ratios that rank traverses, and the best corrections that identify data errors.  This is an attempt to state exactly what's being computed by the program, not to explain fully the theory and derivations.  Information on the statistical terms and distributions can be looked up in appropriate textbooks.  How the numbers are used during data screening is discussed in Geometry Page, Traverse Page, and Data Screening Tutorial.

 

We start by letting X, Y, and Z represent the components of a measured traverse, with Vh and Vz representing their assigned error variances.  The modifier h indicates that the horizontal components (X and Y) receive identical assignments.  We'll also assume that the traverse in question is not constrained: Vh > 0 and Vz > 0.  (For a description of how Walls computes a traverse's error variance, see Variance Assignments.)

 

First, we perform a least-squares adjustment of a loop system, thereby obtaining for each traverse the final estimated displacement: x, y, z, and the theoretical variances of this estimate: vh and vz (the network variances).  Also provided by the adjustment routine are the loop counts, Nh and Nz, and the quantities actually minimized, the sums of squares of weighted residuals:

 

SSh = Sum over all unconstrained traverses of ((X - x)² + (Y - y)²) / Vh, and

 

SSz = Sum over all unconstrained traverses of (Z - z)² / Vz.

 

The unit variance estimates are then

 

UVEh = SSh / (2 · Nh)  and UVEz = SSz / Nz.
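As a concrete illustration (not the actual Walls implementation), the two sums of squares and their unit variance estimates can be computed from per-traverse records.  The record layout and function name below are hypothetical:

```python
# Illustrative sketch only: the field names and data layout are assumptions,
# not the program's internal representation.
def unit_variance_estimates(traverses, n_h, n_z):
    """traverses: iterable of dicts with measured components X, Y, Z,
    adjusted components x, y, z, and assigned variances Vh, Vz
    (all unconstrained: Vh > 0, Vz > 0).
    n_h, n_z: horizontal and vertical loop counts from the adjustment."""
    ss_h = sum(((t["X"] - t["x"])**2 + (t["Y"] - t["y"])**2) / t["Vh"]
               for t in traverses)
    ss_z = sum((t["Z"] - t["z"])**2 / t["Vz"] for t in traverses)
    uve_h = ss_h / (2 * n_h)   # SSh has 2*Nh degrees of freedom
    uve_z = ss_z / n_z         # SSz has Nz degrees of freedom
    return uve_h, uve_z
```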

 

If we assume that a traverse's error components actually have the assigned variances, and that they are also independent, normally distributed random variables with zero means, then SSh and SSz have chi-square distributions with 2·Nh and Nz degrees of freedom.  UVEh and UVEz would then have unit means and variances 1 / Nh and 2 / Nz, respectively.  Anything tending to invalidate these assumptions, such as gross data errors, would presumably inflate the UVEs to values larger than one, the significance of a given increase depending on the loop count.

 

The approach taken in Walls is less strict than this since we're mainly interested in measuring consistency.  If, for the sake of computing useful numbers, we assume only that the assigned variances are correct apart from an unknown scaling factor (i.e., the unit variance), the UVE can be interpreted as a sample variance that estimates this scaling factor.  Its expected value after data screening could be significantly different from one depending on the survey's quality (grade), skill of surveyors, and difficulty.

 

Whereas the UVEs gauge overall consistency of survey data with respect to an assumed model, the F-ratios measure relative consistency.  We might ask if a particular observation's contribution to the sum of squares is unusually large when measured against the consistency of the remaining data.  It can be shown that if a traverse is deleted from the data set, the sums of squares will decrease by

 

SEh = ((X - x)² + (Y - y)²) / (Vh - vh)  and SEz = (Z - z)² / (Vz - vz).

 

Now let's simplify our notation by dropping the modifiers (h and z) and using the variable M to denote the number of dimensions under consideration: two for the horizontal case and one for the vertical case.  The traverse's two F-ratios are then computed as

 

F = (SE / M) / ((SS - SE) / (M·(N - 1))).

 

Note that the denominator of F is the new system UVE resulting from the traverse's deletion.  While the formula for F depends on the relative sizes of variances assigned to traverses, it's clear that the variance scaling factor cancels out.  Although M also cancels out of this expression, we leave it in to illustrate that the numerator and denominator of F (under the assumptions mentioned above for SS) are independent chi-square statistics scaled to have unit means.  F, like a loop system's UVE, is a nonnegative number, possibly very large.  If it is greater than one, the traverse's presence in the data is hurting consistency (making the system UVE larger).  If it is less than one, consistency is being helped.
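The SE quantities and the resulting F-ratio follow the formulas above directly.  This sketch (with hypothetical helper names) assumes the adjustment has already supplied the adjusted components and network variances:

```python
# Illustrative sketch only; function names are assumptions for this example.
def deleted_ss_h(X, Y, x, y, Vh, vh):
    """SEh: decrease in SSh when the traverse is deleted (horizontal case)."""
    return ((X - x)**2 + (Y - y)**2) / (Vh - vh)

def deleted_ss_z(Z, z, Vz, vz):
    """SEz: decrease in SSz when the traverse is deleted (vertical case)."""
    return (Z - z)**2 / (Vz - vz)

def f_ratio(se, ss, m, n):
    """F = (SE / M) / ((SS - SE) / (M*(N - 1))).
    m: dimensions (2 horizontal, 1 vertical); n: the system's loop count."""
    return (se / m) / ((ss - se) / (m * (n - 1)))
```

Note that an F of exactly one means the traverse's contribution per degree of freedom matches the rest of the system's.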

 

Finally, the traverse's best correction is

 

Cz = - (Z - z) · Vz / (Vz - vz),  with Cx and Cy obtained similarly.
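Since all three components follow the same pattern, a single hypothetical helper suffices as a sketch:

```python
# Illustrative sketch only; the function name is an assumption.
def best_correction(measured, adjusted, V, v):
    """One component's best correction, e.g. Cz = -(Z - z) * Vz / (Vz - vz);
    Cx and Cy are obtained the same way from the horizontal quantities.
    V: assigned variance; v: network variance (V > v > 0 assumed)."""
    return -(measured - adjusted) * V / (V - v)
```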

 

Thus the key results we need for data screening depend on our adjustment algorithm having provided the network variances, vh and vz.  I'll omit the derivation of these formulas; if they were incorrect, the Walls review dialogs (e.g., float operations) wouldn't work at all.

 

Some Notes on Interpretation

The F-ratios displayed by Walls are so named because of their presumed behavior under ideal circumstances.  If there were no blunders, and if our simple statistical model (including the normality assumption) were realistic, the F-ratio computed for a traverse in a system with loop count N would have a central F-distribution with M and M·(N - 1) degrees of freedom.  You can find information on this distribution in most introductory statistics texts.  It's worth noting that the expected value of F is then M·(N - 1) / (M·(N - 1) - 2) when M·(N - 1) > 2, which is close to one for systems with large loop counts.

 

One approach to understanding what would happen to F if we were to introduce a blunder -- say a fixed vector E added to the random traverse measurement -- is to consider its behavior as a random variable with a non-central F-distribution.  A much simpler approach is to realize that perturbing the traverse by vector E simply adds -E to the traverse's best correction.  You'll notice that while the denominator in the above expression for F (the UVE after deletion) doesn't change, the numerator SE / M certainly does, and usually dramatically.  In fact, we can express SE in terms of the best correction's squared length (C² = Cx² + Cy² when M = 2, or C² = Cz² when M = 1):

 

SE = C² / (V + vº),

 

where vº = v·V / (V - v) is the corresponding network variance with the traverse discarded.  If we assume that the best correction without the blunder is of negligible size -- its expected squared length theoretically being M·(V + vº) -- then it's obvious that a blunder will inflate the F-ratio at a rate proportional to the blunder's squared length. It's also clear that for a blunder of fixed size, F will on average be larger when the quantity V + vº is smaller.  A small vº indicates that the remaining network, by virtue of its geometry and the variances assigned to it, can in principle provide a good estimate of the true displacement.  Finally, in the case where both blunder and assigned variances are fixed, F will be larger when the consistency actually achieved by the remaining network is better -- that is, when F's denominator, the UVE after detachment, is smaller.
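The identity SE = C² / (V + vº) can be checked numerically.  The variances and components below are made-up values chosen only for the check:

```python
# Numerical check of SE == C^2 / (V + v0); all values here are invented.
V, v = 1.0, 0.25          # assigned and network variance (made-up)
Z, z = 3.0, 1.5           # measured and adjusted component (made-up)
se = (Z - z)**2 / (V - v)                 # SE as defined earlier
C = -(Z - z) * V / (V - v)                # the best correction
v0 = v * V / (V - v)                      # network variance after deletion
assert abs(se - C**2 / (V + v0)) < 1e-12  # the identity holds
```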

 

While these theoretical observations suggest why the F-ratios are effective, there is certainly no critical dependence in Walls on the assumptions made about variances and distributions, which, of course, are never strictly valid anyway.  All that really matters is that if a traverse is affected by one or more blunders, its horizontal and/or vertical F-ratio is likely to be much larger than those of traverses not so affected.  By experimenting with the program you should have no trouble verifying this.

 

F-ratios have been recognized for their ability to flag outliers in measurement data; however, they aren't as popular as other methods because they're relatively hard to compute.  In our case, obtaining the necessary network variances requires the partial inversion of a very large matrix -- more than what's required by a least-squares adjustment alone.  So, for simply ranking traverses, why not use something more familiar -- like "error ratios", which are just the weighted least-squares residuals?  The main reason is that statistics based on residuals are much weaker than F-ratios: they tend to be highly correlated.  In a large network, for example, a severe blunder could easily make the good traverses nearby look worse than bad traverses some distance away.  The F-ratios, while still correlated in a strict sense, are so specific in their sensitivity that we've had no trouble identifying multiple blunders after a single data compilation.  Besides that, the F-ratios are conceptually akin to UVEs and come hand-in-hand with the best corrections.

 

Special Cases

There are two special cases that I will mention only briefly.  First, if the traverse is constrained, then V = v = 0, and trying to retrieve the statistics in the above manner would be rather futile.  The Walls adjustment module simply wasn't designed to provide the information we need in this case, even though the statistics are in fact definable in terms of the unknown quantity vº > 0.  Hence, "<FIXED>" is displayed in place of the statistics in Walls dialogs.  To obtain just the best correction, the user can temporarily float the traverse.  To obtain F as well, the user can assign an insignificant but positive variance to the traverse instead of strictly constraining it.

 

Second, when we are dealing with an isolated loop, the system's loop count is actually N = 1, again making the above formulas inapplicable.  So that a legitimate F-ratio can still be derived, the entire set of isolated loops belonging to a connected component is treated in Walls as if it were a single loop system.  The denominator in F is therefore the combined UVE for all other isolated loops.  If there are no others, then Walls simply displays the traverse's UVE (e.g., UVEz = Z² / Vz) in place of F.