Opinion for the Court filed by Chief Judge WALD.
In this action, a class of women plaintiffs alleges various forms of unlawful employment discrimination in the Foreign Service from 1976 to 1983. After a trial, the District *87 Court found that no unlawful discrimination had occurred.
See
I. Background Information
A. The Foreign Service and Its Employment Practices
The Foreign Service is our nation’s professional diplomatic corps. Members of the Service represent the interests of this nation abroad and assist the Secretary of State in the formulation of foreign policy at home. See 22 U.S.C. § 3904(1)-(2). The organization of Foreign Service personnel draws on the model of the United States military as well as the United States civil service. See S.Rep. No. 913, 96th Cong., 2d Sess. 2 (1980), U.S.Code Cong. & Admin. News 1980, p. 4419. For example, the Foreign Service is a “rank-in-person” system: members of the Service have an individualized rank which is independent of the rank of the particular job they happen to hold at any given time. H.R.Rep. No. 992, pt. 1, 96th Cong., 2d Sess. 3 (1980).
The Foreign Service also copies the military in its “up or out” personnel system. Individuals must serve a probationary period of up to five years before they can receive a career appointment in the Service. 22 U.S.C. § 3946. If at the end of that period an individual has not received a career appointment, he or she must leave the Service. Id. § 3949. (Although according to the Foreign Service Act of 1980, the term “Foreign Service Officer” refers only to members of the Service with career appointments, and those serving under a limited, probationary appointment are called “career candidates,” the parties to this lawsuit use the term “Foreign Service Officer,” or “FSO,” to refer to those serving under both career and limited appointments. To avoid confusion, we will do likewise.)
The Foreign Service assigns its officers to one of four areas of functional specialization, known as “cones”: political, economic, administrative, and consular. Officers in the political and economic cones deal with, respectively, political and economic dimensions to foreign relations and foreign policy. Officers in the administrative cone “are responsible for the support operations of U.S. embassies and consulates.”
Most FSOs applying to the Foreign Service at junior entry levels must take a written examination. Beginning in 1975, the examinations have tested applicants for aptitude in all four functional areas, and the Foreign Service has used the results of these examinations to determine a new FSO’s initial cone assignment. Id. at 1545 (¶ 15). 1 A relatively small number of individuals have entered the Service laterally *88 as mid-level FSOs. These lateral entrants bypassed the examination process and “selected, in advance, the functional field in which they wished to compete and were evaluated only for that specific cone.” Id. (¶ 17). 2
Once in the Foreign Service, individuals change specific jobs frequently; the State Department has a policy of assigning individuals to positions for a set period of time, generally two to three years. See id. at 1550 (¶ 71); H.Rep. No. 96-992, pt. 1, 96th Cong., 1st Sess. 3 (1980). Since 1975, job assignments in the Foreign Service have been made pursuant to an Open Assignment Policy, in which all members of the Service receive a list of vacant positions and submit “a bid list” indicating their preferences. These bid lists are compiled into a “bid book” from which assignment panels make their selections, after considering the interests and preferences of the bureau in which each position is located. Id. at 1550 (¶¶ 73, 74). As previously indicated, some FSOs receive “out-of-cone” assignments pursuant to this process, but in the main, job transfers are made inside the cones of initial assignment. In addition, FSOs do not necessarily receive a job position with a rank corresponding to the individual’s personal rank. Positions that have a higher rank than the individual are known as “stretch” assignments. Positions with a lower rank than the individual’s are “down-stretch” assignments. Pursuant to the Open Assignment Policy, individuals do not receive stretch or down-stretch assignments unless they bid for them, but as with any other assignment, individuals do not receive these assignments simply because they bid for them. Id. at 1551 (¶ 77).
The Foreign Service prepares annual written evaluations of its officers’ job performance. In addition to rating the actual past performance of FSOs, the evaluations rate the FSOs’ potential for future job performance.
Except for Senior members, salaries in the Foreign Service are based on a schedule established by the President which consists of nine salary classes. 22 U.S.C. § 3963. The Secretary of State assigns all Foreign Service Officers to a particular salary class. Id. § 3964. By statute, except in limited circumstances, a career candidate for appointment as a Foreign Service Officer may not be initially assigned to a salary class higher than class 4 (class 1 being the highest). Id. § 3947. Usually career candidates are placed initially in class 7 or class 8. Promotions from one salary class to another are made by the Secretary of State after receiving recommendations and rankings submitted by selection boards which evaluate the members of each class. Foreign Service Officers do not compete for promotions until the transition from class 6 to class 5; until then, they are promoted at the end of an established time period if they perform their duties satisfactorily. See Joint Appendix (“J.A.”) at 117-121; Defendant’s Post-Trial Brief at 96. 3
*89 B. The History of This Litigation
This class action began over ten years ago when appellants filed their complaint alleging that widespread discrimination against women in the Foreign Service violated Title VII of the Civil Rights Act of 1964, as amended in 1972 to cover employment discrimination in the federal government.
See 42 U.S.C. § 2000e-16. The parties subsequently resolved by consent decree all claims relating to admission into the Foreign Service. 4
The appellants’ claims of discriminatory personnel actions against women already in the Foreign Service proceeded to trial in the District Court. The parties agreed to try initially only the issue of liability, leaving appropriate remedies to a subsequent phase of the proceedings, if necessary. After trial on the liability issue, the District Court concluded that appellants “failed to show by a preponderance of the evidence any sexual discrimination by the State Department.”
This appeal followed from the District Court’s failure to find sex discrimination in seven different types of personnel practices. 5 First, the appellants claim that from 1976 to 1983, the Foreign Service discriminated against women in the initial cone assignments of entering FSOs; the State Department assigned proportionally fewer women than men to the political cone and proportionately more women than men to the consular cone. This disparity was allegedly caused by the differing scores of women and men on the Foreign Service entrance examinations, producing a disparate impact on women and men candidates in violation of Title VII. Second, women were given proportionally fewer out-of-cone assignments to the program direction cone and proportionally more out-of-cone assignments to the consular cone. Third, women were given proportionally fewer “stretch” assignments and proportionally more “downstretch” assignments than men in the same class. Fourth, women received a disproportionately low number of appointments as Deputy Chief of Mission, the position just below that of Ambassador. Fifth, in its evaluation reports, the State Department gave lower future potential ratings to women than men despite equivalent ratings for their past performance. Sixth, women received a disproportionately low number of Foreign Service Honor Awards. And seventh, the State Department promoted women from class 5 to class 4 at a lower rate than it promoted men.
With respect to each of these seven personnel practices, the appellants offered data showing a disparity between men and women, along with a statistical analysis designed to demonstrate the improbability that a disparity of that scale could result from chance. The data and analysis, they allege, provide a strong basis for inferring that this disparity was the product of unlawful discrimination. In addition, the appellants introduced nonstatistical evidence pertaining generally to the existence of a prejudicial attitude towards women in the Foreign Service from 1976 to 1983. The District Court, however, rejected the inference of unlawful discrimination in each of the seven areas.
In discounting the probative force of appellants’ statistics, the District Court said that their statistical studies rested on faulty data, or flawed methodology, or omitted a crucial variable that would explain the disparity between men and women in a nondiscriminatory way. The District Court also said that some of the statistical evidence focused on too narrow a segment of Foreign Service personnel practices. As we shall explain, the District Court’s treatment of the appellants’ evidence was in some instances contrary to law and in other respects clearly erroneous as a matter of fact.
*90 II. Title VII Claims: Two Different Theories
Under Title VII a plaintiff can rely on either of two different theories to support a claim of unlawful sex discrimination. A “disparate treatment” claim alleges that the defendant intentionally based an employment decision on the sex of the plaintiffs.
See, e.g., International Brotherhood of Teamsters v. United States,
Because these two theories are distinct, we must consider them separately. Appellants’ only disparate impact claim concerns the initial cone assignments; the other six claims involve disparate treatment and we will consider them first.
III. Legal Principles Applying to Pattern or Practice Disparate Treatment Claims
In a typical sex discrimination pattern or practice disparate treatment case, plaintiffs allege the existence of a disparity between men and women in selection rates for a particular job or job benefit and further allege that this disparity was caused by an unlawful bias against members of the disadvantaged sex, usually women. To prevail in their claim, plaintiffs must prove, by a preponderance of the evidence, that these allegations are true. Proof of the disparity itself is based upon a comparison of the proportion of those women eligible for selection who were actually selected with the corresponding proportion of eligible men who were actually selected. Plaintiffs establish a disparity disfavoring women if the evidence demonstrates that the selection rate for eligible women was less than the selection rate for eligible men. Sometimes, the disparity is expressed as the difference between the number of women actually selected and the number of women one would expect to have been selected, assuming equality in the selection rates for men and women. (If one knows the number of women eligible and the selection rate for men, one can determine, using algebra, the expected number of successful women.)
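The algebra mentioned in the parenthetical can be sketched briefly. The following is an illustration only: the function names and all of the figures are hypothetical and are not drawn from the record in this case.

```python
def selection_rate(selected: int, eligible: int) -> float:
    """Proportion of an eligible group that was actually selected."""
    if eligible <= 0:
        raise ValueError("eligible must be positive")
    return selected / eligible


def expected_women_selected(women_eligible: int,
                            men_selected: int,
                            men_eligible: int) -> float:
    """Number of women one would expect to be selected if women
    were selected at the same rate as men."""
    return women_eligible * selection_rate(men_selected, men_eligible)


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
men_eligible, men_selected = 400, 100
women_eligible, women_selected = 100, 15

expected = expected_women_selected(women_eligible, men_selected, men_eligible)
shortfall = expected - women_selected

print(f"Selection rate for men:   {selection_rate(men_selected, men_eligible):.2%}")
print(f"Selection rate for women: {selection_rate(women_selected, women_eligible):.2%}")
print(f"Expected women selected:  {expected:.1f}")
print(f"Shortfall of women:       {shortfall:.1f}")
```

On these assumed figures, the disparity can be stated either as a difference in selection rates (25% for men against 15% for women) or as a shortfall of ten women below the expected number.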
Proof that the observed disparity was caused by an unlawful bias against women need not be direct. Circumstantial evidence that the disparity, more likely than not, was a product of unlawful discrimination will suffice to prove a pattern or practice disparate treatment case.
See Teamsters,
A. Raising An Inference of Discrimination With Statistical Evidence
A disparity between the selection rates of men and women for a particular job or job benefit has one of three possible causes. See D. Baldus & J. Cole, Statistical Proof of Discrimination 291 (1980). First, the disparity may be a product of an unlawful discriminatory animus; this is *91 what plaintiffs are attempting to prove. Second, the disparity may have a legitimate and nondiscriminatory cause. For example, prior experience of a certain type may be an important factor in making certain employment decisions, and if it happened to be true that women on the average have less of this experience than men, one would expect that women would be selected less frequently. Third, the disparity may simply be a product of chance. Even if we may properly assume that, as a general rule, women and men on average are equally qualified to be selected for a particular job or job benefit, for any particular group of men and women who happen to constitute the actual pool of eligible candidates at the time the selections are made, there may be some deviation from this general rule because the actual qualifications of men and women differ from individual to individual and any particular pool of eligible candidates constitutes an inherently random collection of individuals. Thus, even if selections were made entirely on the basis of qualification, without a trace of discriminatory bias, random deviations in the selection rates for men and women may result.
A statistical analysis of a disparity in selection rates can reveal the probability that the disparity is merely a random deviation from perfectly equal selection rates. Statistics, however, cannot entirely rule out the possibility that chance caused the disparity. Nor can statistics determine, if chance is an unlikely explanation, whether the more probable cause was intentional discrimination or a legitimate nondiscriminatory factor in the selection process. See id. at 290-92.
Title VII nevertheless provides that if the disparity between selection rates for men and women is sufficiently large so that the probability that the disparities resulted from chance is sufficiently small, then a court will infer from the numbers alone that, more likely than not, the disparity was a product of unlawful discrimination — unless the defendant can introduce evidence of a nondiscriminatory explanation for the disparity or can rebut the inference of discrimination in some other way.
See Hazelwood School District v. United States,
*92
The preliminary question for a court, then, is at what point the disparity in selection rates is sufficiently large, or the probability that chance was the cause sufficiently low, for the numbers alone to establish a legitimate inference of discrimination. Although this question is crucial in Title VII litigation, the answers given by courts have been regrettably imprecise. The Supreme Court has twice stated that “[a]s a general rule for ... large samples, if the difference between the expected value and the observed number is greater than two or three standard deviations, then the hypothesis that [the disparity] was random would be suspect to a social scientist.”
Castaneda v. Partida,
This court, using different terminology, has stated that statistical evidence meeting “the .05 level of significance ... [is] certainly sufficient to support an inference of discrimination.”
Segar,
How can a 5% probability of randomness correspond both to a measurement of two standard deviations and a measurement of 1.65 standard deviations, one may reasonably ask? There is a legitimate answer: it depends on whether one is using a “one-tailed” or a “two-tailed” test of statistical significance. A disparity measuring 1.65 standard deviations corresponds to a 5% probability of randomness under a one-tailed test. A disparity measuring two standard deviations (to be more precise, 1.96 standard deviations) corresponds to a 5% probability of randomness under a two-tailed test.
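A disparity "measuring" a given number of standard deviations is the gap between the observed and expected numbers divided by the standard deviation of the expected number. The sketch below shows one way to compute it, assuming a simple binomial model of the selection process. That model, the function name, and the figures are assumptions made for illustration; the opinion does not say how the parties' experts computed their figures.

```python
import math


def disparity_in_standard_deviations(observed_women: int,
                                     total_selected: int,
                                     women_share_of_pool: float) -> float:
    """Observed minus expected number of women selected, in units of
    standard deviations, under an assumed binomial model in which each
    selection independently falls on a woman with probability
    women_share_of_pool."""
    expected = total_selected * women_share_of_pool
    std_dev = math.sqrt(total_selected
                        * women_share_of_pool
                        * (1.0 - women_share_of_pool))
    return (observed_women - expected) / std_dev


# Hypothetical figures: 100 selections from a pool that is 20% women,
# of which 12 went to women (20 would be expected).
z = disparity_in_standard_deviations(observed_women=12,
                                     total_selected=100,
                                     women_share_of_pool=0.20)
print(f"Disparity: {z:.2f} standard deviations")
```

A negative result indicates that fewer women were selected than expected; the magnitude is what is compared against figures such as 1.65 or 1.96.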
This difference between one-tailed and two-tailed tests obviously requires further explanation. It also presages the obvious question, given the substantial differences in result, of which test is the more appropriate one to use in Title VII cases. Neither this court’s opinion in Segar nor the District Court’s opinion in this case discusses the difference between “one-tailed” and “two-tailed” approaches. The Supreme Court has given us no explicit guidance on this issue. And, unfortunately, neither side to this litigation has devoted more than a single footnote each to this difficult but important issue.
See Appellants’ Reply Brief at 32 n. 38; Appellee’s Brief at 62 n. 73. For obvious reasons we, too, confront this issue with some trepidation. But appellants’ and appellee’s evidence on the underpromotion *93 of women from FSO class 5 to class 4 measures 1.88 and 1.76 standard deviations, respectively. (The difference results from the use of some different data.
See
Given the unavoidability of embarking upon a journey into the statistical maze, we begin with the terms “one-tailed” and “two-tailed”; 8 they refer to the “tails” or ends of the bell-shaped curve, which represents in graph form a “random normal distribution.” E.g., W. Curtis, Statistical Concepts for Attorneys 72-73 (1983); see Diagram 1 copied from id. In these random distributions, the area under any segment of the bell curve measures the probability of that range of results occurring randomly. Id. Furthermore, the percentage area underneath the bell curve within one standard deviation (σ) distance from the mean (μ) of a normal distribution is always the same for all normal distributions (regardless of the specific value of σ or μ, or the units in which these terms are measured). Thus, the probability of a result randomly occurring that measures within one standard deviation of the mean of the distribution (either greater or lesser than the mean) is the same for all normal distributions: 68.26%. Id. Indeed, this relationship holds true for any distance from the mean, measured in numbers of standard deviations. For example, the probability of a result occurring within two standard deviations from the mean is 95.44% and the probability of a result occurring within three standard deviations is 99.73%. See Diagram 1. Thus, for all normal distributions, the probability of randomness is directly associated with a measurement in numbers of standard deviations.
[[Image here]]
Diagram 1
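The areas under the bell curve described above can be checked numerically. The sketch below uses the error function from Python's standard library; the function name is illustrative only. The computed values differ from the figures quoted in the text only in the second decimal place, which appears to reflect rounding in the printed tables.

```python
import math


def probability_within(k: float) -> float:
    """Probability that a normally distributed result falls within
    k standard deviations of the mean, on either side."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {probability_within(k):.2%}")
```

The result does not depend on the mean or the standard deviation of the particular distribution, which is why a measurement in standard deviations translates directly into a probability.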
But for every deviation from the mean of a normal distribution, measured in a certain number of standard deviations, there are two distinct ways of referring to the *94 probability of that result occurring randomly. For example, if fewer women than expected were selected for a particular job, and this disparity measured 2.17 standard deviations, we can ascertain the probability that women by chance would be underselected to this extent or greater. This probability corresponds to the area between 2.17 standard deviations and the end of the bell curve representing the most extreme underselection of women. Standard statistical tables reveal that this probability is only 1.5%. See B. Lindgren & D. Berry, Elementary Statistics 479 (1981).
We can speak of the probability measurement associated with 2.17 standard deviations in another way, however. Although the observed disparity between the actual and expected number of women in this example was an underselection of women, there is a corresponding possibility that women might randomly be overselected such that the difference between the expected number of women selected and the number of women selected due to this random overselection also measures 2.17 standard deviations. The probability of a random deviation from the expected number of women selected with a magnitude of 2.17 standard deviations or larger, resulting from either an underselection or overselection of women, corresponds to the area under the bell curve between 2.17 standard deviations and both extremes of the curve: 3%.
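The two ways of measuring the probability associated with 2.17 standard deviations can be reproduced as follows. This is a sketch using the complementary error function from the standard library in place of a printed table; the function names are illustrative.

```python
import math


def one_tailed_probability(z: float) -> float:
    """Area under the standard normal curve beyond |z| in a single tail."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2.0))


def two_tailed_probability(z: float) -> float:
    """Area beyond |z| in both tails combined (twice the one-tailed area,
    because the normal curve is symmetric)."""
    return 2.0 * one_tailed_probability(z)


z = 2.17
print(f"One-tailed probability at {z} standard deviations: "
      f"{one_tailed_probability(z):.1%}")
print(f"Two-tailed probability at {z} standard deviations: "
      f"{two_tailed_probability(z):.1%}")
```

The one-tailed figure answers the question of how often chance alone would underselect women to this extent or more; the two-tailed figure answers how often chance alone would produce a deviation this large in either direction.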
The difference between “one-tailed” and “two-tailed” tests of statistical significance stems from these two different ways of measuring probability. If one decides (as the Segar court did) to reject the hypothesis that an observed disparity from an expected result occurred randomly only if the observed disparity falls within the range of the 5% most extreme possible disparities, one must still decide whether the 5% range should be entirely within only one of the tails of the bell curve, or instead should be divided with half of the range in each tail. Five percent of the total bell curve can be found either in the range from 1.65 standard deviations from the mean to one extreme end of the bell curve or in the area from 1.96 standard deviations to both extreme ends of the bell curve. Compare Diagrams 2 and 3, copied from V. Cangelosi, P. Taylor & P. Rice, Basic Statistics 173-74 (1979). For this reason, a 5% probability of randomness corresponds to 1.65 or 1.96 standard deviations, depending upon whether one uses a one-tailed or a two-tailed test. (Similarly, 1.65 standard deviations correspond to a 10% probability of randomness under a two-tailed test; and 1.96 standard deviations correspond to a 2.5% probability of randomness under a one-tailed test.)
[[Image here]]
Diagram 2
[[Image here]]
Diagram 3
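The relationship between the 5% level and the two cutoffs depicted in Diagrams 2 and 3 can be sketched by working in the opposite direction, from a chosen probability to the number of standard deviations. The example uses `statistics.NormalDist` from Python's standard library; the function names are illustrative. The one-tailed cutoff computes to roughly 1.645, which is conventionally rounded to 1.65.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def one_tailed_cutoff(level: float) -> float:
    """Standard deviations beyond which a single tail holds `level`
    of the total area under the curve."""
    return STANDARD_NORMAL.inv_cdf(1.0 - level)


def two_tailed_cutoff(level: float) -> float:
    """Standard deviations beyond which the two tails together hold
    `level` of the total area (half of it in each tail)."""
    return STANDARD_NORMAL.inv_cdf(1.0 - level / 2.0)


level = 0.05
print(f"One-tailed cutoff at the {level:.0%} level: {one_tailed_cutoff(level):.3f}")
print(f"Two-tailed cutoff at the {level:.0%} level: {two_tailed_cutoff(level):.3f}")
```

Because the two-tailed test splits the 5% between both ends of the curve, each tail holds only 2.5%, and the cutoff moves outward from about 1.65 to 1.96 standard deviations.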
We are now, hopefully, in a position to address whether in a Title VII case, a court should use a one-tailed or two-tailed test to determine whether statistical evidence alone should raise an inference of unlawful discrimination, recognizing that there is a difference of opinion among courts and commentators on the issue.
Compare, e.g., EEOC v. Federal Reserve Bank of Richmond,
*95 [S]tatistical texts frequently recommend the use of a one-tailed test when the only question of interest is the likelihood of a difference in one direction, e.g., when only a positive disparity between two numbers is of interest. This practice supports the use of a one-tailed test in discrimination cases, since the issue is always whether one group is favored over another. A defendant will argue, however, that both minority and majority groups [or men and women] are protected from discrimination and it is therefore inequitable to disregard the probability of outcomes that may favor either group. Since there is no clear answer to this question, the most desirable approach is an awareness of the conceptual and practical differences between the two types of tests and a consistent use of the same type of test in similar cases whenever practical. We have used two-tailed tests throughout this book.
D. Baldus & J. Cole, Statistical Proof of Discrimination 307-08 (1980) (footnote omitted). In the most recent supplement, however, the authors criticize as “unnecessarily strict” the Fourth Circuit’s decision in EEOC v. Federal Reserve Bank of Richmond to require a two-tailed approach unless “independent evidence indicates the presence of discrimination of the type being challenged.” D. Baldus & J. Cole, Statistical Proof of Discrimination 129 (1986 Cumulative Supp.) (footnote omitted). Baldus and Cole then state a preference for a legal rule that would allow a one-tailed test “if the possibility of intentional discrimination favoring the protected group represented by plaintiff [e.g., women in this case] can be ruled out as defying logic, i.e., the available evidence excluding the statistic in question gives strong support to the conclusion that the system is either nondiscriminatory or disadvantageous to the plaintiff’s group.” Id. at 129-30. In a footnote to this passage, the authors continue:
The logic underlying this statement is that if one can be certain that there was no discrimination in favor of plaintiff’s group, then any disproportionate impact would simply be interpreted as being a chance outcome in an equitable process.
Id. at 130 n. 38.
Although the latest position adopted by Baldus and Cole makes some sense, we reject its applicability to the present case. We note that some of appellants’ claims of unlawful discrimination involved complaints that women were overselected for particular kinds of jobs, e.g., consular cone and downstretch assignments. Appellants undoubtedly have the right under Title VII to object to the State Department’s selection of FSOs for these positions on the basis of sex. Such claims of discriminatory overselection, however, require a two-tailed statistical analysis. Appellants may view consular assignments as inferior to political assignments, but another class of women plaintiffs could certainly bring a Title VII claim if women were intentionally underassigned to the consular cone. Consequently, statistically significant deviations in either direction from an equality in selection rates would constitute a prima facie case of unlawful discrimination. Indeed, appellants’ own statistical expert testified that a two-tailed test was necessary in evaluating the disparity between men and women in assignments to the consular cone because the hypothesis to be tested is whether cone assignments are made without regard to sex. See Transcript (Tr.) at 1081.
We also think a two-tailed test of statistical significance should be applied to all of appellants’ discrimination claims in this case. First, Baldus and Cole originally noted the importance of consistency in evaluating statistical evidence. Second, although we by no means intend entirely to foreclose the use of one-tailed tests, we think that generally two-tailed tests are more appropriate in Title VII cases. After all, the hypothesis to be tested in any disparate treatment claim should generally be that the selection process treated men and women equally, not that the selection process treated women at least as well as or better than men. Two-tailed tests are used where the hypothesis to be rejected is that certain proportions are equal and not that one proportion is equal to or greater than *96 the other proportion. See Curtis, supra, at 119-22, 133-37.
Moreover, even if a disparity in only one direction is at issue in a particular Title VII case (e.g., only the underpromotion and not the overpromotion of women), we think that the more appropriate assessment of the probability that the contested disparity resulted from chance requires a recognition that a random disparity of equal magnitude, but in the opposite direction, is equally likely. For example, if plaintiffs in a Title VII case come into court simply with evidence that women were underselected for a particular job, and that this disparity measured 1.75 standard deviations, it is perfectly true that the probability of women being underselected to this extent or more by chance is only 4%. Under a one-tailed test of statistical significance, employing the 5% level, as this court did in Segar, this evidence alone would establish a prima facie case of disparate treatment.
But for a disparity measuring 1.75 standard deviations it is equally true that the probability of a random deviation of this magnitude or larger, either underselecting or overselecting women, is 8%. In other words, disparities of this magnitude will be consistent with the hypothesis that the selection process did not treat men and women differently in 8% of the cases. Even if in the case before the court the disparity disfavors women and not men, how can the court ignore the possibility that the case might still be one of the 8% of cases in which a fair selection process would by chance produce disparities of this magnitude or greater? Thus, we think a court should generally adopt a two-tailed approach to evaluating the probability that the contested disparity resulted from chance. Furthermore, although an 8% probability is pretty low, we do not think that it is low enough to establish by itself an inference of unlawful discriminatory animus. We think that statistical evidence must meet the 5% level referred to in Segar for it alone to establish a prima facie case under Title VII. Taken together, as we have said, a two-tailed test and a 5% probability of randomness require statistical evidence measuring 1.96 standard deviations. Consequently, if plaintiffs come into court relying only on evidence that the underselection of women for a particular job measured 1.75 standard deviations, it seems improper for a court to establish an inference of disparate treatment on the basis of this evidence alone. 9
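The standard just described, a two-tailed test at the 5% level, can be sketched as a simple check applied to the disparities mentioned in this opinion. The function names and labels below are illustrative, and the check addresses only whether the statistics alone clear the threshold.

```python
import math

SIGNIFICANCE_LEVEL = 0.05  # the 5% level referred to in Segar


def two_tailed_probability(z: float) -> float:
    """Probability of a random deviation of |z| standard deviations
    or more in either direction, under a normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def statistics_alone_suffice(z: float,
                             level: float = SIGNIFICANCE_LEVEL) -> bool:
    """True if the two-tailed probability of randomness is at or below
    the chosen level, i.e. roughly 1.96 standard deviations or more."""
    return two_tailed_probability(z) <= level


disparities = [
    ("illustrative underselection", 1.75),
    ("appellee's figure, class 5 to class 4", 1.76),
    ("appellants' figure, class 5 to class 4", 1.88),
    ("two-tailed 5% threshold", 1.96),
    ("earlier illustration", 2.17),
]

for label, z in disparities:
    print(f"{label}: {z} standard deviations, "
          f"two-tailed probability {two_tailed_probability(z):.1%}, "
          f"sufficient alone: {statistics_alone_suffice(z)}")
```

A result of `False` means only that the numbers by themselves do not raise the inference; it says nothing about whether other evidence, taken together with the disparity, would do so.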
Of course, plaintiffs in Title VII pattern and practice cases need not rely on statistical evidence alone. Because the ultimate issue in a disparate treatment case is whether the disparity resulted from unlawful discriminatory animus, plaintiffs may introduce any additional evidence which is probative on this issue. Thus, plaintiffs are in no way foreclosed from establishing an inference of discrimination simply because the contested disparity falls short of the 1.96 standard deviations mark when analyzed statistically. Obviously, to use an extreme example, if an employer admits under cross-examination that assignments for a certain position were based in large part on sex, it matters not that the observed underselection of women measures only 1.75 standard deviations. When plaintiffs in a Title VII pattern or practice case rely on evidence in addition to the evidence of the disparity itself, the issue for the trier of fact in determining whether the plaintiffs have established a prima facie case must be whether the totality of plaintiffs’ evidence (again including the evidence of the disparity itself) demonstrates that, *97 more likely than not, the disparity resulted from an unlawful discriminatory animus— just as the issue after all the relevant evidence has been introduced by both sides remains whether in light of the totality of the evidence, plaintiffs have shown that, more likely than not, the disparity resulted from discrimination. 10
B. The Applicability of Title VII to Any Personnel Action
A plaintiff may bring a Title VII claim for alleged discrimination with respect to any employment decision by an agency of the federal government. The statute itself states that “all personnel actions affecting employees or applicants for employment ... shall be made free from any discrimination based on ... sex.” 42 U.S.C. § 2000e-16. In the Foreign Service Act of 1980, Congress reiterated this requirement specifically for Foreign Service employment practices. 22 U.S.C. § 3905. 11 Moreover, in the 1980 Act, Congress specifically defined a “personnel action,” which must be free from sex discrimination, to encompass “(A) any appointment, promotion, assignment (including assignment to any position or salary class), award of performance pay or special differential, within-class salary increase, separation, or performance evaluation and (B) any decision, recommendation, examination, or ranking provided for under this chapter which relates to any action referred to in subparagraph (A).” Id. This language could hardly be more inclusive.
From this statutory language, two legal principles necessarily follow. First, appellants in this case may bring a disparate treatment claim regarding discrimination in any type of personnel decision regardless of whether or not that discrimination has an effect on other, arguably more important, personnel decisions. Thus, if the State Department has intentionally discriminated against women in certain types of assignment decisions, the State Department has violated 42 U.S.C. § 2000e-16 even if the State Department can prove that the unlawful discrimination in assignments did not adversely affect the opportunities of women for promotion in the Foreign Service.
It is beyond dispute that the State Department may not discriminate against women in making any kind of employment decision, and if the State Department breaches this requirement, appellants have a cause of action to vindicate their statutory rights. We note, as further support of our interpretation of 42 U.S.C. § 2000e-16, that the Supreme Court last Term interpreted an analogous Title VII provision applying to private employers to encompass a claim of sex discrimination for sexual harassment even if the sexual harassment caused no tangible or economic loss.
Meritor Savings Bank, FSB v. Vinson,
— U.S. -,
Second, and relatedly, if plaintiffs in a Title VII case claim discrimination in certain kinds of employment decisions, it is no defense that the government did not discriminate against women in other kinds of employment decisions. For example, if the State Department intentionally underselected women for appointment as Deputy Chiefs of Mission (DCM), the State Department has violated 42 U.S.C. § 2000e-16 even if the State Department can prove that it did not discriminate against women in assignments to five other “high visibility” positions. Appellants need not allege or prove discrimination in assignments to other “high visibility” positions in order to maintain a cause of action with respect to discrimination in DCM assignments. As the Supreme Court has stated: “Of course, Title VII provides for equal opportunity to compete for any job.”
Teamsters,
Although under 42 U.S.C. § 2000e-16 appellants must not be required to prove discrimination in employment decisions other than the ones they are specifically contesting, the government is correct in arguing that evidence of nondiscrimination in those other employment decisions may be probative of whether intentional discrimination actually occurred in the contested employment decisions. For example, if an employer can demonstrate that it did not discriminate against women at several steps of a promotional ladder, that evidence, in some circumstances, may reasonably suggest that the employer did not discriminate in the step at issue either.
But courts must be especially careful in judging the relevance of this kind of evidence lest they contravene the legal rule that under 42 U.S.C. § 2000e-16 plaintiffs need not prove discrimination in personnel actions other than those specifically at issue. The evidence supporting an inference of unlawful discrimination in certain employment decisions may be sufficiently strong that evidence of nondiscrimination in other employment decisions cannot rebut this inference. Thus, in some cases the strength of appellants’ prima facie case is so great that even if they were to agree to a stipulation that sex discrimination did not occur in other employment decisions, their evidence as to the employment decisions specifically at issue would still prove that, more likely than not, unlawful discrimination occurred.
When all the evidence raising and rebutting the inference of discrimination is statistical, according the proper deference to each legal principle is a delicate task indeed. If Title VII plaintiffs are able to muster only the most marginal inference of discrimination in only one type of job decision (e.g., the underselection of women in one promotional class measures only 1.98 standard deviations), then an inference of discrimination may be undercut by the fact that women are demonstrably not underselected in other similar job decisions. But even here courts must be wary. Evidence that the underselection of women in another similar job decision measures just below the 1.96 threshold, while not sufficient to prove discrimination, is not compelling evidence that the employer did not discriminate in this other employment decision.
Thus, when plaintiffs in a Title VII case introduce statistical evidence of an extreme disparity in the selection rates for men and women for a certain type of job, the fact that these plaintiffs have insufficient evidence to establish an inference of discrimination regarding other employment decisions should not block an inference of discrimination on the specific type of employment decision at issue. For example, if Title VII plaintiffs present evidence that the underselection of women for a particular type of job assignment measures above 3.0 standard deviations, this evidence necessarily raises an inference of discrimination *99 in these assignments regardless of the statistical evidence concerning other assignments. The likelihood that this disparity in the selection rate for men and women is merely a random deviation in a selection process that treated men and women equally is simply too low (1-in-500 using a two-tailed approach) for statistical evidence regarding other assignment decisions to rebut this evidence. In these circumstances, the Title VII defendant must present evidence directly relating to the type of assignment at issue to explain the evident disparity in a legitimate, nondiscriminatory fashion. For a district court to reject plaintiffs’ claim of discrimination in such a case on the grounds that plaintiffs failed to raise an inference of discrimination in other job assignments would effectively amount to a requirement that plaintiffs prove discrimination in employment decisions other than those specifically at issue. And, as we have said, such a requirement would directly conflict with the express provisions of 42 U.S.C. § 2000e-16.
C. Rebutting the Inference of Disparate Treatment
As we have discussed, under Title VII courts will initially infer that a disparity between men and women in selection rates for a particular job or job assignment results from unlawful discrimination if the disparity is large enough:
i.e.,
measures at least 1.96 standard deviations. But defendants in Title VII cases must be offered an opportunity to rebut this inference by showing that the disparity, albeit nonrandom in cause, resulted from some legitimate, nondiscriminatory factor. Similarly, defendants must be allowed to rebut the inference of discrimination by, alternatively, challenging the statistical calculations upon which the inference of discrimination is based. For example, the statistics may rely on faulty data, flawed computations, or improper methodologies. A recent Supreme Court opinion provides courts with some guidance on how to treat attempts to attack an inference of discrimination based on statistical evidence alone.
See Bazemore v. Friday,
— U.S. -,
In Bazemore, the United States District Court for the Eastern District of North Carolina was presented with statistical evidence that black employees of the North Carolina Agricultural Extension Service received substantially lower salaries than white employees working in the same job positions. The District Court determined that “the statistical evidence of plaintiffs standing alone and without further explanation probably suffices to make out a prima facie showing of discrimination in salaries.” Civil Action No. 2879, Mem. Op. at 47 (August 22, 1982). The defendants in Bazemore, however, argued that plaintiffs’ statistics failed to account for several factors, any of which would provide a legitimate, nondiscriminatory explanation for the salary disparities. Id. at 48. The District Court agreed with the defendants, holding that because defendants had demonstrated that these other factors might have caused the salary disparities, defendants successfully rebutted plaintiffs’ inference of disparate treatment:
Having thoroughly considered all of the evidence bearing on the salary issue and the contentions of the parties based thereon, the court has concluded that if it be assumed that plaintiffs made out a prima facie case on this issue, it has only been by virtue of the plaintiffs’ statistical evidence ...; that because of their failure to include many of the vital factors necessary to be considered in fixing salaries the probative force of these statistics has been so substantially undermined that they cannot sustain a finding of purposeful discrimination in salaries ...; that the defendants have not only “articulated” plausible reasons for the seeming salary disparities, but have satisfied the court of the validity of their explanations.... It follows that plaintiffs have failed to establish by a preponderance of the evidence that the Extension Service has discriminated against black employees in the matter of salaries.
Id. at 54-55 (citation and footnotes omitted).
*100
The Fourth Circuit affirmed this determination by the District Court in
Bazemore. See
The Supreme Court reversed. In a unanimous opinion for the Court, Justice Brennan responded to the Fourth Circuit’s “plainly incorrect” approach to statistical evidence:
Importantly, it is clear that a [statistical] analysis that includes less than “all measurable variables” may serve to prove a plaintiff’s case. A plaintiff in a Title VII suit need not prove discrimination with scientific certainty; rather his or her burden is to prove discrimination by a preponderance of the evidence.
Elsewhere in the opinion, Justice Brennan makes plain that the determination by the District Court whether discrimination exists or not “is subject to the clearly erroneous standard of appellate review.”
Id.
at 3008. While the Supreme Court remanded the case to the Fourth Circuit to definitely determine whether “based on the entire evidence in the record,” the District Court’s decision had been clearly erroneous, the Justices did declare, “we think that consideration of the evidence makes a strong case for finding the District Court clearly erroneous.”
Id.
at 3010-11 (footnote omitted). Rather than viewing the inclusion of “pre-Act” salaries in the statistical study as rendering the study fatally flawed, the Supreme Court stated that “evidence of pre-Act discrimination is quite probative.”
Thus,
Bazemore
instructs lower courts to be cautious about dismissing plaintiffs’ statistical studies as not probative simply because defendant offers some nondiscriminatory explanation for the disparities shown. Implicit in the
Bazemore
holding is the principle that a mere conjecture or assertion on the defendant’s part that some missing factor would explain the existing disparities between men and women generally cannot defeat the inference of discrimination created by plaintiffs’ statistics. To be sure, as the Supreme Court acknowledged in
Bazemore,
there may be a few instances in which the relevance of a factor to the selection process is so obvious that the defendants, by merely pointing out its omission, can defeat the inference of discrimination created by the plaintiffs’ statistics.
See
This court, even before Bazemore, had explicitly endorsed the same principle, most recently in a situation where the government attempted to rebut the inference of discrimination arising from evidence that blacks in the Drug Enforcement Agency were paid less and promoted less rapidly than whites. The government argued that blacks were less likely than whites to have an extra year of “specialized experience” over and above minimal qualifications. We rejected the argument because the DEA failed to introduce any evidence to substantiate its assertion:
Since DEA has presented no admissible evidence that black agents are more likely than white agents to lack a second year of requisite experience, plaintiffs’ failure to account for this variable does not dilute the force of their statistical analysis; ... absent any reason to conclude that the omitted factor correlates with race, the omission of this variable will not affect the validity of the race coefficient in the plaintiffs’ regression analysis.
Segar,
IV. A Review of the Disparate Treatment Claims in This Case
Having discussed the applicable legal principles, we now address the specific disparate treatment claims at issue in this case. Supreme Court precedent has made plain the appropriate standard for reviewing a district court’s determination that employment decisions were not the product of an unlawful discriminatory animus. We can reverse this factual finding only if it is clearly erroneous in light of all the evidence in the record or if it rests on legal error.
See Bazemore v. Friday,
— U.S. -,
A. Promotions and Evaluations
The Secretary of State argues that appellants’ claim of “class-wide promotion discrimination lie[s] at the heart of this case.” Appellee’s Brief at 58. We agree.
Appellants claim that the State Department discriminated against women in promoting FSOs from class 5 to class 4 from 1976 to 1983. According to the government’s own evidence, fewer women than expected were actually promoted to class 4 during that time period, given the number of promotion-eligible women in class 5. The government’s own statistical analysis, whose methodology the District Court found to be more accurate than appellants’, concluded that the discrepancy between the actual and expected number of women promoted measured 1.76 standard deviations.
See
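As an illustration of how a gap between the "actual" and "expected" number of women promoted can be stated in standard deviations, the following is a minimal sketch under a simple binomial model. The model, the function name, and every figure in it are assumptions chosen for round arithmetic; they are not taken from the record, and the opinion does not say that either party's expert computed the 1.76 figure this way.

```python
import math

def binomial_disparity(observed, pool_size, expected_rate):
    """Return (expected count, standard deviation, disparity in SD units).

    Assumes each of `pool_size` selections independently goes to a woman
    with probability `expected_rate` when the process is sex-neutral.
    """
    expected = pool_size * expected_rate
    sd = math.sqrt(pool_size * expected_rate * (1.0 - expected_rate))
    return expected, sd, (observed - expected) / sd

# Hypothetical figures, NOT from the record: 400 promotions, women make up
# 20 percent of the promotion-eligible pool, and 66 women are promoted.
expected, sd, z = binomial_disparity(observed=66, pool_size=400, expected_rate=0.20)
print(f"expected={expected:.1f}, sd={sd:.2f}, disparity={z:.2f} standard deviations")
```

With these invented inputs the expected number of women promoted is 80, one standard deviation is 8, and the shortfall of 14 works out to 1.75 standard deviations. A negative sign indicates underselection.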
For the reasons set forth in Part III. A., we do not think this evidence alone is sufficient to prove an intent to discriminate against women. Appellants at trial, however, relied on additional evidence to prove a discriminatory motive. Appellants first point to evidence in the record of a general prejudicial attitude against women within the Foreign Service during this time period and argue that this evidence supports the proposition that the discrepancy between the actual and expected number of women promoted to class 4 results from a prejudicial attitude against women that violates Title VII.
This evidence includes statements made upon cross-examination by the defense witness, Benjamin Reid, who was Undersecretary of State for Management from 1977-1981. Reid testified that the Foreign Service, as a result of traditionally being “white, male, and Ivy League,” had “set ways of doing things” and that although during his tenure the Foreign Service “had come a long way,” it nevertheless “still had a long way to go” at the time he left in correcting these biased attitudes. Tr. at 3279-80. Similarly, the appellants introduced into evidence a report written in 1977 by a committee within the State Department asserting that “both attitudinal resistance to equal employment opportunity and discriminatory behavior are still widespread in the Department.” Plaintiffs’ Exhibit 29 at 6. The appellants also introduced into evidence a report published in 1984 by the Women’s Research and Education Institute of the Congressional Caucus for Women’s Issues, which stated that “ ‘what some identify as traditional elitist attitudes have [worked] to limit severely employment opportunities for women and minorities [in the Foreign Service].’ ” Plaintiffs’ Exhibit 88 at 10 (quoting a 1981 report prepared by the U.S. Commission on Civil Rights).
More specifically, as proof that the underpromotion of women FSOs from class 5 to class 4 resulted from a prejudicial attitude against women, the appellants relied upon evidence that the State Department believed that women FSOs had less potential for advancement than men FSOs even though men and women FSOs performed their duties with the same skill. A random sample of the evaluation reports for over 400 FSOs in classes 5 and 6 revealed that although “there was no significant difference in the
performance
ratings of men and women, ... the disparity between men and women [in their
potential
ratings] measured 2.49 standard deviations.”
The relevance of this evidence to whether the underpromotion of women from class 5 to class 4 resulted from a discriminatory attitude against women is obvious. As the State Department itself asserted and the District Court expressly found, competitive promotion decisions in the Foreign Service were based primarily on an “assessment of the officer’s potential to perform at the next higher level.”
The District Court, however, never considered the evidence of a discriminatory attitude about the potential of women derived from the evaluations in deciding whether appellants had proved, by a preponderance of all the evidence, discriminatory intent in the decisions pertaining to promotions from class 5 to class 4. Rather, the District Court offered the following grounds for rejecting the evidence relating to the evaluation reports:
In view of the finding that female FSO’s are promoted equally with male and given the same job opportunities, the Court finds that plaintiffs’ analysis of the disparity on potential ratings does not establish that the [evaluation reports] of female FSO’s are discriminatory in any fashion.
In our view this reasoning puts the cart before the horse. The District Court cannot determine that the State Department did not discriminate against women in promotions from class 5 to class 4 until it considers whether or not all the evidence demonstrates a biased attitude towards women and their capabilities. It cannot reject relevant evidence of discriminatory intent on the basis of a conclusion that no discrimination occurred without reference to the relevant evidence. To rule otherwise would convert Title VII into a Catch-22: in order to establish a promotional disparate treatment claim, a plaintiff must prove discriminatory intent; but she cannot offer proof of discriminatory intent in the form of disparate ratings between men and women as to their potential unless she has already established a promotional disparate treatment claim. We hold that appellants were entitled, as a matter of law, to have the District Court consider evidence in the ratings of a discriminatory attitude about the potential of women when evaluating appellants’ disparate treatment claim concerning promotions from class 5 to class 4. Conversely, it was an error of law for the District Court to “reason” backwards and dismiss appellants’ claim that the disparity in potential ratings was a violation of Title VII on the grounds that the court had already determined that the State Department did not discriminate against women in promoting FSOs from class 5 to class 4.
Thus, we reverse both the District Court’s decision that the State Department did not discriminate against women in evaluating the potential of FSOs and its decision that there was no discrimination shown in promoting FSOs from class 5 to class 4. Following the command of
Pullman-Standard v. Swint,
Upon remand the District Court must consider whether, on the basis of the existing record, the evidence pertaining to the disparity in potential ratings, together with the nonstatistical evidence of a generally hostile attitude against women in the Foreign Service and the statistical evidence of the disparity in class 5 to class 4 promotions, is sufficient proof that, more likely than not, the underpromotion of women from class 5 to class 4 was based on discrimination. The evidence in the record cutting the other way is the failure of the appellants’ statistical evidence to make out even a prima facie case that the State Department discriminated against women at other grades of the promotional process. Of course, as we have pointed out, appellants need not prove discrimination in these other promotion decisions in order to prevail in their disparate treatment claim concerning promotions from class 5 to class 4. Indeed, it is quite plausible that a discriminatory attitude about women and their potential for further advancement might affect promotions only at a mid-level step, like the transition from class 5 to class 4. First of all, as we discussed in Part I. A, supra, the promotions in the junior ranks (classes 7 and 8) were noncompetitive. Second, the Secretary’s own statistical analysis showed that fewer women than one would expect were actually promoted from class 6 to class 5, although his study indicated that this disparity was just as likely to be a random deviation in a nondiscriminatory system as a symptom of discrimination. See Defendant’s Exhibit 8A, Table 1, Model 2. Finally, one might surmise that those women who survive a discriminatory bias in critical mid-level promotion decisions have demonstrated such superior skill and aptitude that they would encounter less resistance to advancement in upper level positions.
Despite all these considerations, the District Court is entitled to determine for itself on remand whether the government’s evidence of nondiscrimination at other promotional levels is sufficient to outweigh the appellants’ evidence, which as we have said includes three distinct elements: the disparity itself measuring 1.76 standard deviations, testimony and documented evidence of a general bias against women in the State Department, and the specific evidence as to discriminatory attitudes about the potential of women FSOs for future advancement, revealed in the evaluation reports of class 5 and 6 FSOs.
With respect to the evaluation reports, we note that the District Court committed a further error of law. In discussing the appellants’ statistical analysis of the potential ratings for men and women, the court stated that:
The methodology utilized by plaintiffs’ expert ... fails to allow for one vital characteristic, that being female FSO’s have less time in class than males. This inexperience would account for the lower potential ratings when compared with males who have more time in class....
While the actual performance of males and females may not be reflected by this inexperience, a subjective judgment on the potential capacity of an FSO may certainly be affected by such inexperience resulting from less time in class.
There was, in fact, no evidence whatsoever introduced at trial on which the District Court could rely to base its assumption that despite equivalence in actual performance officers with less experience would be viewed as having lower potential than those with more experience. See Appellants’ Brief at 42. Moreover, the District Court’s assumption is counterintuitive: if officers with less experience managed to perform at the same level as officers with more experience, one would expect that the less experienced officers would be seen as quick learners with more, not less, potential. In any event, the District Court was not entitled to rely on mere conjecture to undercut the probative force of appellants’ statistics. See, supra, Part III.C. On remand, in deciding whether appellants’ evidence concerning the evaluation reports demonstrated a bias against women, the *105 District Court shall not rely upon any unsupported hypotheses, such as the relatively lower number of years experience of women in grade.
We note further that, even if the rating evidence proves insufficient to prove a discriminatory motive in promotions, appellants are entitled, as a matter of law, to bring an independent claim of disparate treatment with respect to the evaluation reports themselves. As we have seen, the Foreign Service Act of 1980 specifically includes any “evaluation” as a “personnel action” that must be free from discrimination. In light of this express statutory language, we cannot but read the words “all personnel actions” in 42 U.S.C. § 2000e-16 as encompassing such a claim. Thus, under Title VII, the State Department may not discriminate against women in their evaluations regardless of any demonstrated effect the evaluations ultimately can be shown to have on promotion opportunities. We need not now consider what remedy might be appropriate for discriminatory evaluations; the parties bifurcated the issues of liability and remedies.
To recapitulate, insofar as the District Court required appellants to prove discrimination in promotions in order to prove discrimination in evaluation reports, the District Court erred as a matter of law in two significant respects. First, the District Court unreasonably rejected a major portion of appellants’ evidence that the promotion decisions at issue were infected with a discriminatory motive. Second, the District Court deprived appellants of their right under Title VII to bring a disparate treatment claim as to evaluations, regardless of how those evaluations might affect other employment decisions. Consequently, we remand to the District Court both the issue of whether the State Department discriminated against women in its decisions concerning promotions from class 5 to class 4 and the issue of whether it discriminated in its evaluations of the future “potential” of women FSOs.
B. Assignments
Appellants brought disparate treatment claims with respect to various types of Foreign Service assignment decisions. We consider first appellants’ claim that the State Department discriminated against women in “out-of-cone” assignments by overassigning women to positions in the consular cone and by underassigning women to the “prestigious” program direction cone.
1. Out-of-cone assignments
The District Court found that appellants’ evidence disclosed the following facts about out-of-cone assignments to the consular cone:
a) Between 1976 and 1983, 40.4 percent of all out-of-cone assignments received by women in the political cone were to consular positions, while only 15.5 percent of the out-of-cone assignments received by men in the political cone were to consular positions. This difference [measures 5.84 standard deviations and therefore the probability of a disparity of this magnitude or greater (either overselecting or underselecting women) resulting by chance is less than one in one hundred million]. 16
b) For the same time period, the plaintiffs’ statistics show 22.9 percent of all out-of-cone assignments received by women in the economic cone were to consular positions, while only 11.6 percent of all out-of-cone assignments received by men in that cone were to consular *106 positions. This difference measures 2.68 standard deviations [which means the probability of women being randomly overassigned or underassigned to this degree or greater is 0.74 percent]. 17
c) During the same time period, plaintiffs’ analysis indicated that 50.8 percent of all out-of-cone assignments received by women in the administrative cone were to the consular cone while only 33.2 percent of all out-of-cone assignments received by men were to the consular cone. This difference measures 2.62 standard deviations [which means that the probability of a disparity of this magnitude or greater resulting by chance is 0.88 percent]. 18
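The bracketed probabilities in findings (a) through (c) can be checked with a short calculation. The sketch below assumes the disparities are measured against a standard normal curve, which is the usual reading of a "two-tailed" probability but is not spelled out in the findings themselves; the function name is illustrative only.

```python
import math

def two_tailed_probability(std_devs):
    """Chance of a deviation at least this large, in either direction,
    under a standard normal curve (an assumed model, see above)."""
    return math.erfc(abs(std_devs) / math.sqrt(2.0))

# Standard-deviation figures quoted in findings (a), (b), and (c).
for z in (5.84, 2.68, 2.62):
    p = two_tailed_probability(z)
    print(f"{z:.2f} standard deviations -> {p:.2e} ({p * 100:.2f} percent)")
```

On this assumption, 2.68 and 2.62 standard deviations correspond to roughly 0.74 and 0.88 percent, matching the bracketed figures, and 5.84 standard deviations falls below one in one hundred million.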
The [plaintiffs’ statistical] analysis does not account for the unique feature of the FSO’s bidding, or requesting, their assignments pursuant to the Open Assignment Policy. A more accurate analysis would measure the requests by the FSO’s, as the observations made by plaintiffs’ expert may result as much from the function of requesting different assignments as the assignment of FSO’s.
Id. at 1554 (¶ 101). On this basis, the District Court found appellants’ statistical evidence “unconvincing” and concluded that appellants had failed to prove sex discrimination in out-of-cone assignments to the consular cone. Id. at 1560 (¶ 22).
It is true, as the District Court pointed out, that assignments are made in part pursuant to the bid lists submitted by members of the Foreign Service. But as the District Court acknowledged, bid lists were only one element of the assignment process, and the selection boards based their assignment decisions in larger measure on the perceived needs of the bureaus to which the assignments were made. See, supra, Part I.A. Moreover, the Secretary submitted no evidence showing that more women than men preferred out-of-cone assignments to the consular cone. Appellants’ Brief at 55. The Secretary, on appeal, concedes as much.
The Secretary, however, would have us affirm the District Court’s decision on the grounds that “an analysis which ignores ‘preference’ ... is simply not probative on this issue.” Appellee’s Brief at 55. This argument, however, is precluded by the Supreme Court’s Bazemore decision. According to Bazemore, appellants’ statistical evidence concerning out-of-cone assignments to the consular cone is probative of discrimination despite the fact that it did not include individual preferences as a possible explanatory factor. There was no basis in the record on which the District Court could assume that women indicated preferences for consular work more frequently than men did. Consequently, the District Court contravened the dictates of Bazemore by refusing to credit the appellants’ statistical evidence. Under Bazemore and Segar, the District Court is not entitled to dismiss plaintiffs’ statistical evidence on mere conjecture. 19
*107
As a result of this legal error, “unless the record permits only one resolution of the factual issue,” we must remand the issue of out-of-cone assignments to the District Court.
Pullman-Standard,
With respect to out-of-cone assignments to the program direction cone, the District Court found that appellants’ evidence showed that “38.5 percent of all out-of-cone assignments received by men in the political cone were to senior program direction cone positions, while only 14.6 percent of the out-of-cone assignments received by women in the political cone were to program direction cone positions.”
Appellants’ evidence also demonstrated that “12.4 percent of the out-of-cone assignments received by men in the consular cone were to program direction positions, while only 6.6 percent of the out-of-cone assignments received by women in the consular cone were to program direction positions.”
The appellants argued that this underassignment of women to program direction cone positions from the political and consular cones resulted from the discriminatory belief within the Foreign Service that women were unsuitable for prestigious leadership-track positions. It is unclear from the District Court’s opinion why the District Court rejected this argument, and found, to the contrary, that the State Department did not discriminate against women in assignments from the political and consular cones to the program direction cone. The District Court did observe that “Defendant’s expert produced an analysis indicating that, as to those men and women who did attain transfer to the Program Direction cone, there was no disparity in the amount of time spent in class before attaining the transfer.”
Despite the District Court’s concession that appellee’s rebuttal evidence could not be “dispositive,” it offered no other basis for rejecting appellants’ claim of discrimination in out-of-cone assignments to the program direction cone positions. Specifically, it did not mention individual preference as a possible nondiscriminatory explanation for the disparity between men and women in their selection rates for these positions, probably because there was absolutely no evidence in the record indicating that women preferred assignment to the “prestigious” program direction cone less than men.
Thus, we conclude that the District Court failed to articulate any sufficient grounds for rejecting appellants’ proof of discrimination in out-of-cone assignments to the program direction cone. The sole basis offered by the government was properly found by the court to be insufficient. It cited no other basis in the record for its decision, and we can find none. Therefore, we reverse and remand the issue for reconsideration, on the basis of the existing record. The inference of discrimination raised by the significant disparities between men and women given out-of-cone assignments to these “prestigious” positions is thus far unrebutted. Unless the District Court can find a valid basis supported in the record for rejecting the inference of discrimination, it must rule in favor of the appellants on this claim.
2. Stretch and Downstretch Assignments
The appellants also claim that the State Department discriminated against women in “stretch” and “down-stretch” assignments. The evidence that appellants introduced at trial in support of this claim included the following statistics. First, between 1976 and 1981, “32.2% of the women in Class 4 were given downstretch assignments, while only 17.6% of the men in that class were given down-stretch assignments.”
Second, “20.8% of the women in Class 5 received down-stretch assignments, while only 14.2% of the men received them. This difference measures 4.04 standard deviations.”
Third, 19.9% of the women in class 7 received down-stretch assignments, whereas only 14.3% of the men in class 7 did. This disparity measured 2.39 standard deviations, which corresponds to a (two-tailed) probability value of about 1.6%. See Plaintiffs’ Exhibit 57; Elementary Statistics, supra n. 8, at 479.
Fourth, with respect to stretch assignments, only 19.1% of women in class 4 received stretches, whereas 28.4% of the men in class 4 did. This underselection of women measured 3.74 standard deviations, which means that the probability of either an underselection or overselection of women of this magnitude or larger resulting from chance is about one in 5,000. See Plaintiffs’ Exhibits 57, 168.
Fifth, only 31.6% of women in class 5 received stretch assignments, whereas 37.7% of the men in class 5 did. This disparity measured 2.79 standard deviations, which corresponds to a (two-tailed) probability value of 0.52%. See Plaintiffs’ Exhibit 57; Elementary Statistics, supra, n. 8, at 479.
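To show how a gap between two selection rates, such as those recited above, translates into a number of standard deviations, the following sketch uses a pooled two-proportion comparison. This is one common method and is offered only as an assumption; the opinion does not state which method plaintiffs' expert used, and the head counts below are invented for illustration because the recited figures give percentages without the underlying counts.

```python
import math

def two_proportion_disparity(selected_a, total_a, selected_b, total_b):
    """Difference between two selection rates, in standard-deviation units,
    using a pooled estimate of the overall selection rate."""
    rate_a = selected_a / total_a
    rate_b = selected_b / total_b
    pooled = (selected_a + selected_b) / (total_a + total_b)
    std_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    return (rate_a - rate_b) / std_error

# Hypothetical counts, NOT from the record: 40 of 200 women (20 percent)
# and 240 of 800 men (30 percent) receive the assignment in question.
z_small = two_proportion_disparity(40, 200, 240, 800)

# Same two rates with twice as many officers in each group.
z_large = two_proportion_disparity(80, 400, 480, 1600)

print(f"smaller groups: {z_small:.2f} standard deviations")
print(f"larger groups:  {z_large:.2f} standard deviations")
```

The second call keeps the same 20 and 30 percent rates but doubles the group sizes, and the disparity grows in standard-deviation terms. This is why two percentage gaps of similar size can yield quite different standard-deviation measures depending on how many officers are in each class.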
The appellants argued that this overassignment of women to downstretch positions and underassignment of women to stretch positions resulted from unlawful sexist attitudes in the Foreign Service. As additional evidence to support their contention, the appellants pointed to a 1977 report prepared within the State Department, which stated that stretch assignments “are not commonly given to those in EEO categories,” meaning women and minorities. *109 Plaintiffs’ Exhibit 29 at 6. The District Court nonetheless rejected the appellants’ claim, offering several reasons for its decision. These reasons, however, do not support the District Court’s decision. All but one are erroneous as a matter of law, and the other is a clearly erroneous finding of fact.
First, the District Court stated that appellants had failed to show that the overassignment of women to downstretch positions and underassignment of women to stretch positions adversely affected the opportunities of these women for promotion.
See
Second, the District Court concluded that appellants’ statistical evidence was “of little value in persuading that discrimination existed in assigning stretch and down-stretches” because, in part:
Plaintiffs’ expert, by analyzing the situation class by class, appears to ignore cross-class competition for any given assignment. For example, an officer vying for a Class 4 stretch position may compete against officers from at least Classes 6, 5, 4, and 3.
While it is absolutely true that officers in any given class will be competing against officers from other classes, it is also absolutely irrelevant to the point of appellants’ evidence. Appellants are trying to demonstrate, for example, that women in class 5 are less likely than men in class 5 to stretch into assignments labelled class 4 or higher, and that this disparity results from a widespread prejudice within the Foreign Service that women are less able than men despite their equivalent rank. Given this purpose, it is entirely irrelevant that officers from other classes may compete with men and women in class 5 for those assignments that are stretches for officers in class 5. Appellants are not interested in comparing how well the men and women in class 5 compete against officers in another class. They are only interested, and properly so, in how similarly situated men and women compete against each other.
It was an error of law for the District Court to reject the probative value of appellants’ statistical evidence because of this irrelevant factor of “cross-class competition.” Certainly, the Supreme Court’s decision in Bazemore stands for the proposition that the “missing factor” identified by the District Court as a reason for discounting statistical proof of disparate treatment must at least be relevant to the point of the statistics. In Bazemore itself, the Supreme Court noted that “certain conclusions of the District Court are inexplicable in light of the record.”
There, the District Court complained about the inclusion of the County Chairman in the petitioners’ regression analysis, fearing that the fact that they were disproportionately white would skew the salary statistics to show whites earning more than blacks. Yet, because the regressions controlled for job title, adding County Chairman as a variable in the regression would simply mean that the salaries of white County Chairmen would be compared with those of nonwhite County Chairmen.
Id. In this case, the District Court’s reliance on the omission of “cross-class competition” as a basis for rejecting appellants’ evidence of discrimination in stretch and downstretch assignments is similarly “inexplicable.”
Third, the District Court found appellants’ statistics concerning stretch and downstretches to be “flawed” in another respect. The data from which the statistical analysis was made was tabulated in terms of the total number of years each FSO served in a stretch or a downstretch assignment rather than in terms of the number of such assignments. The District Court found that this methodology “does *110 not accurately reflect the number of assignments given out by the Foreign Service.”
Finally, the District Court found that “Plaintiffs’ analysis did not allow for the preference of the individual FSO.”
*111 3. Deputy Chief of Mission Assignments
Appellants also claim that the State Department discriminated against women in selecting Deputy Chiefs of Mission. The Deputy Chief of Mission (DCM) is the second in command, directly below the Ambassador, at each American embassy. As the District Court found, appellants introduced evidence showing that only “nine women were appointed DCM between 1972 and 1983, out of a total of 586 appointments.”
Plaintiffs’ expert calculated that the expected number of women appointed during that period, based on the number of women in the grade levels from which DCM’s were chosen, is 26.8. The difference between the actual and expected number of women measures 3.54 standard deviations.
Id. The probability of a disparity this large or larger, either favoring or disfavoring women for the DCM position, resulting by chance in a selection process that did not differentiate between men and women, is about one in 2,500. Given this extremely low probability, this evidence, standing alone, raises a strong inference of disparate treatment.
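The arithmetic behind a figure of this kind can be sketched with a simple binomial model. This is an assumption made for illustration: the record as quoted here does not say which model plaintiffs' expert used, and the function name `binomial_z` is hypothetical. The sketch treats each of the 586 appointments as an independent selection in which the chance of choosing a woman equals the expected share (26.8 out of 586).

```python
import math

def binomial_z(observed: float, n: int, expected: float) -> float:
    """Number of standard deviations separating the observed count from the
    expected count, assuming n independent selections (binomial model)."""
    p = expected / n                    # assumed per-appointment chance of selecting a woman
    sd = math.sqrt(n * p * (1 - p))     # binomial standard deviation of the count
    return (observed - expected) / sd

z = binomial_z(observed=9, n=586, expected=26.8)
print(f"disparity: {z:.2f} standard deviations")
```

Under this simplified model the disparity comes out at roughly three and a half standard deviations below expectation, in line with, though not identical to, the 3.54 figure reported by the expert; a year-by-year or grade-by-grade pooling method would account for a small difference.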
The District Court offered several reasons for concluding that the State Department did not discriminate against women in DCM assignments. All of these reasons are erroneous as a matter of law. First, the District Court found this evidence “unconvincing” because appellants were unable to show “statistically significant disparit[ies]” in the selection rates for five other “high visibility positions.”
Once more, we reiterate that under 42 U.S.C. § 2000e-16 appellants are not required to prove sex discrimination in assignments to six different types of jobs in order to establish discrimination in assignments to a single position. We have, however, also said that evidence of nondiscrimination in some jobs may be probative of whether discrimination occurred in selections for another kind of job. Adherence to both these legal rules may be difficult at times. But in this case it is clear that the District Court contravened the first of these two legal rules. Here, appellants introduced evidence showing that the underselection of women for DCM positions was so extreme that the chance of women being randomly underselected or overselected to this degree or greater was only one in 2,500. Not even a stipulation that the State Department did not discriminate against women in assignments to five other kinds of “high visibility” positions could defeat the inference of disparate treatment raised by this evidence. A defendant must produce other evidence directly relating to the job at issue to rebut this inference of discrimination. In this case, the District Court rejected appellants’ strong inference of disparate treatment in part because appellants did not generate an inference of discrimination in five other types of assignments. This was legal error.
Second, the District Court stated:
Plaintiffs’ analysis of the number of women ... in DCM positions failed to adequately consider the bottom-entry nature of the Foreign Service. It failed to allow for the time necessary for the large number of female FSO’s presently in the service to advance to the higher ranks.
Third, the District Court found that “[plaintiffs’] statistical analysis is of little significance in that it encompasses the period 1972 through 1983, while the relevant time period for this case is 1976 to 1983.”
Thus, the three reasons the District Court gave for rejecting appellants’ strong inference of disparate treatment in DCM assignments are inadequate as a matter of law. On appeal, the Secretary suggests an alternative nondiscriminatory explanation for the underselection of women to this position: more women might have been appointed Ambassador instead. Appellee’s Brief at 57. We note that the District Court made no such finding and the only evidence in the record to which the Secretary directs us is a statement by a single witness that perhaps this fact might explain the underselection of women for DCM positions. Tr. at 1766. We think that the proper course under Pullman-Standard is to remand the issue to the District Court for further factfinding, on the basis of the existing record.
C. The Superior Honor Award
The appellants also claim that the State Department discriminated against women in granting the Superior Honor Award to Foreign Service Officers. As the District Court found, appellants presented the following evidence:
4.8% of the award recipients were females, although 10.1% of the Class 1 through 5 FSO’s during the time period were females. These results indicate that twice as many women would be expected to receive the Superior Honor Award as actually received it. The difference measures 3.1 standard deviations.
Once again, the reasons that the District Court gave for rejecting appellants’ discrimination claim are contrary to law. First, the District Court stated that appellants failed to show how “the failure of women to receive the Superior Honor Award affected the opportunity for promotion.” Id. (¶ 49). Appellants, however, are entitled to bring a sex discrimination claim under 42 U.S.C. § 2000e-16 with respect to personnel decisions involving awards regardless of how these decisions affect promotions. As we have seen, the Foreign Service Act of 1980 specifically includes “any ... award of performance pay or special differential” as among the personnel actions that must be free from sex discrimination, and we do not construe “all personnel actions” in 42 U.S.C. § 2000e-16 to have a lesser scope.
Second, the District Court rejected appellants’ claim involving the Superior Honor Award as “unconvincing” because the appellants were unable to produce equivalent evidence with respect to other State Department Honor Awards. But as with the evidence concerning the DCM assignments, appellants’ evidence concerning the Superior Honor Award is sufficiently strong to withstand even a stipulation that the State Department did not discriminate against women in granting other types of Honor Awards. To rebut the inference of discrimination here, the State Department was required to present evidence explaining the extreme disparity between the numbers of men and women receiving the Superior Honor Award.
Third, the District Court discredited appellants’ evidence because the District Court thought that appellants’ statistical “analysis was based on a faulty assumption that all female FSO’s were equally qualified for the Superior Honor Award.”
Moreover, because the State Department did not offer any explanation for the disparity between men and women in receiving *114 the Superior Honor Award, we must order the District Court to uphold appellants’ claim of discrimination on this issue. We need not address what kind of remedy might be appropriate, as only issues of liability are properly before the court at this time.
V. Initial Cone Assignments: The Claim Involving the Disparate Impact Theory
Appellants characterize their claim concerning initial cone assignments as both a disparate treatment and a disparate impact claim. This characterization, unfortunately, lacks a certain degree of clarity and may indicate some confusion on the appellants’ part. Perhaps this confusion stems from the fact that the initial cone assignments involve two distinct groups of FSOs: those that took entrance exams and those that did not. See, supra, Part I & n. 2. It appears that appellants wish to bring a disparate treatment claim on behalf of both these groups and a disparate impact claim on behalf of the exam-takers. The appellants introduced statistical evidence of a disparity in initial cone assignments for which the pool was both the exam-takers and the nonexam-takers. Appellants’ Brief at 22. This study was based on data supplied by the State Department. Id. The appellants also introduced statistical evidence of a disparity in the initial cone assignments for the exam-takers alone. Id. at 24. This study, by contrast, was based on data supplied by the Educational Testing Service (ETS), which administers the Foreign Service entrance exams and monitored the test results. Id. (The appellants apparently did not introduce any evidence regarding the nonexam-takers alone.) We do not believe, however, that in this case the appellants can pursue both a disparate treatment and a disparate impact claim with respect to the exam-takers’ initial cone assignments. We will explain our reasons for this conclusion.
To apply the disparate treatment theory to the evidence concerning exam-takers, the appellants must allege and prove that the observed, nonrandom disparities were caused by intentional discrimination against women. To apply the disparate impact theory, the appellants must allege and prove that the disparities were caused by a “facially neutral” selection criterion that disadvantaged women more than men. Here, the appellants point to the political functional field portion of the Foreign Service Entrance Examinations. They have introduced evidence that from 1975 to 1980 men received higher scores than women on this test and that statistical analysis rejects the hypothesis that this disparity was a random sample of the deviation that would normally occur if men and women tested equally.
See
Of course, the appellants might have presented alternative claims: e.g., the disparity in initial cone assignments was caused either by discriminatory intent, or by the results of the entrance examinations. Nothing in Title VII or the Federal Rules of Civil Procedure prevents appellants from pursuing alternative claims or theories, even if they are mutually inconsistent. 21 But in this case appellants seem to argue only that the results of the entrance examinations caused the disparity in initial cone assignments; they make no explicit charge of discriminatory intent. Indeed, appellants introduced an additional regression analysis study (also based on the ETS data) which showed that the test *115 scores were the one and only factor that explained the disparity in initial cone assignments. 22 At trial, appellants’ expert witness, who had conducted the statistical study, testified that with respect to “the exam takers, the reason you see this pattern [of disparity in initial cone assignments] is because of their test scores.” Tr. at 3402. The appellants argued to the District Court that this evidence demonstrates that “[t]he adverse impact of the functional field test causes the disparities in cone assignment observed by Dr. Siskin [the expert witness].... [T]est scores on the functional field test were determinative of cone assignments.” Plaintiffs’ Post-Trial Brief at 33. They repeat this argument on appeal. Appellants’ Brief at 35. Because appellants have specifically identified the examinations, and not intent, as causing the disparity in initial cone assignments of the exam-takers, we will treat their claim concerning this disparity as relying solely on the disparate impact theory. 23
Once over that initial hurdle, the resolution of appellants’ disparate impact claim seems straightforward. The only basis which the District Court gave for rejecting appellants’ statistical evidence that correlated test scores with initial cone assignments was that these statistics were “flawed and inconclusive.”
Plaintiffs’ analysis of exam takers is flawed and inconclusive in establishing disparate impact in cone assignments. It was established that the expert’s determination of total FSO hires for the year 1981 was incorrect. Plaintiffs’ expert at times had difficulty identifying the cone at hire of the FSO’s and chose to delete those officers from the analysis, along with any FSO’s not assigned to the four major cones. Though the expert disclaimed the significance of those actions, the Court is not persuaded.
Id. at 1546 (¶ 29). Unfortunately, this finding of fact is itself flawed. Although the District Court is correct in saying that there was some confusion about the correct data for 1981 in some of appellants’ statistics, this confusion did not involve the specific statistical studies relevant to the disparate impact claim involving the entrance examination: the data which were supplied by ETS. There was no dispute about the accuracy of these data. The confusion over the 1981 numbers arose from data supplied by the State Department’s employment records. The State Department data were used in appellants’ statistical studies involving both exam-takers and nonexam-takers, and this evidence was unnecessary for the disparate impact claim involving exam-takers only. 24
*116
Because the ETS data on which the disparate impact claim relies do not include the “flaw” referred to by the District Court, this finding of fact must be reversed as clearly erroneous. Indeed, the State Department makes no attempt to support this finding of fact. Instead, the State Department suggests that preference, and not the results from the functional field portions of the entrance examinations, explains the disparity in the initial cone assignments of male and female exam-takers. It is not at all clear from the opinion that the District Court adopted this argument. The District Court refers to the existence of a study that the State Department introduced in support of this argument, but makes no evaluation of the study.
Notably, the one obvious defense that the State Department never raised was that there was a legitimate “business” necessity for the test. Indeed, the District Court specifically found that “[d]efendant did not rely on a showing that the political functional field test was job related.”
Conclusion
We have reviewed the District Court’s decision in this case in detail and have concluded that it committed a number of legal errors and made several clearly erroneous findings of fact. Consequently, we reverse the judgment of the District Court and remand this action for further proceedings not inconsistent with this opinion. With respect to a number of the appellants’ claims, we have held that the determination of liability under Title VII requires further factfinding by the District Court, to be conducted on the basis of the existing record. See C. Wright & A. Miller, Federal Practice and Procedure § 2577 (1971). We offer no views at this point on any issues relating to the remedies phase of this litigation.
It is so ordered.
Notes
. Before 1975, the Foreign Service tested each applicant in only one of the four functional areas, and required the applicant to select the cone in which he or she wished to be tested. Defendant’s Post-Trial Brief at 40. From 1975 to 1979, applicants were admitted into the Service on the basis of general test scores alone; the results of the functional field cone tests were used to make initial cone assignments. Since 1980, admission has depended upon overall performance on the functional field tests, but applicants must achieve a certain cut-off score on the particular cone test in order to be eligible for appointment to that cone. Id. at 43.
. Another relatively small group has entered the junior ranks of the Foreign Service without going through the examination process. Before 1984, minority applicants who entered the Foreign Service through the Affirmative Action Junior Officer Program were not required to take the entrance examinations. Similarly, the Mustang Program, which allows State Department employees not in the Foreign Service to become members of the Service, has not used the examination. Individuals who have entered the Service pursuant to these programs have received initial cone assignments based on their background and experience.
. The decision to grant a career candidate tenure as a Foreign Service Officer is made independently of the promotion process. Tenure decisions are made by the Secretary of State pursuant to 22 U.S.C. § 3946, which provides that the Secretary’s decisions shall be based on the recommendations of special tenure boards. See Defendant’s Post-Trial Brief at 100-02; see also Daniels v. Wick,
. The Junior Applicant Consent Decree settled all claims involving entry-level decisions into the junior ranks of the Foreign Service. The Mid-Level Applicant Consent Decree settled all issues of lateral entry into the Foreign Service.
. The appellants have not appealed all issues raised at trial.
. As the quotation from Segar reflects, the statistical analysis must focus “on the appropriate labor pool” in order to properly establish a prima facie case of discrimination. If a statistical analysis of selection rates is premised on a faulty calculation of the number of men and women who are eligible for selection, as a result, for example, of a misunderstanding of the eligibility criteria, the statistical conclusions lose much of their probative force. If, for instance, to be eligible for a promotion from assistant professor to professor at a particular university a person must have seven years experience and a Ph.D. degree, a statistical study which defines the number of women and men eligible for this promotion as those with seven years experience, overlooking the requirement of a Ph.D. degree, might lead to skewed results, for there might well be some reason why more female than male assistant professors had not achieved a Ph.D. degree after seven years of teaching. “In order to ensure that a plaintiff’s methodology has eliminated the common nondiscriminatory explanation of a lack of qualifications, this circuit has developed a requirement that statistical evidence of disparities account for the minimum objective qualifications for the position at issue.” Segar v. Smith,
. The “standard deviation” is a unit of measurement that allows statisticians to measure all types of disparities in common terms. Technically, a “standard deviation” is defined as “a measure of spread, dispersion, or variability of a group of numbers equal to the square root of the variance of that group of numbers.” D. Baldus & J. Cole, Statistical Proof of Discrimination 359 (1980) (emphasis in original). The “variance” of the group of numbers is computed by subtracting the “mean,” or average, of all the numbers from each number in the group, “squaring the resulting difference, and computing the mean of these squared differences.” Id. at 361.
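The definition just quoted can be followed step by step in a short sketch. The function names and the sample numbers are hypothetical and chosen only to make the arithmetic easy to check; the computation divides by the full count of numbers, as the quoted definition ("the mean of these squared differences") describes.

```python
import math

def variance(values: list[float]) -> float:
    """Mean of the squared differences between each number and the mean."""
    mean = sum(values) / len(values)
    squared_differences = [(v - mean) ** 2 for v in values]
    return sum(squared_differences) / len(squared_differences)

def standard_deviation(values: list[float]) -> float:
    """Square root of the variance, per the quoted definition."""
    return math.sqrt(variance(values))

# Hypothetical group of numbers: the mean is 5, the squared differences
# sum to 32, so the variance is 32 / 8 = 4 and the standard deviation is 2.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(sample), standard_deviation(sample))
```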
. The discussion of statistics in this portion of the opinion relies on the following sources: D. Baldus & J. Cole, Statistical Proof of Discrimination (1980 & 1986 Supp.); W. Curtis, Statistical Concepts for Attorneys (1983); W. Dixon & F. Massey, Jr., Introduction to Statistical Analysis (4th ed. 1983); B. Lindgren & D. Berry, Elementary Statistics (1981) [hereinafter cited as Elementary Statistics]; R. Wehmhoefer, Statistics in Litigation (1985).
We are not expert statisticians and we discuss statistics only insofar as necessary to give a comprehensible explanation of our view of the proper application of Title VII law to the facts of this case. Nor do we pretend to cover all of the issues that relate to the use of statistics in a Title VII case. For example, we note that there are various methods for deriving a "test statistic” measured in numbers of “standard deviations”: the z-test, the t-test, etc. We have no opinion on the choice of these methodologies as this case does not call them into question. Similarly, we are aware that our discussion of statistics requires sufficiently “large” samples in order to be accurate; we have avoided the “small sample problem” because apparently none of the claims on appeal here involves small samples.
. In any event, given the language of the Supreme Court in Castaneda and Hazelwood, we do not believe that we can allow the threshold at which statistical evidence alone raises an inference of discrimination to be lower than 1.96 standard deviations, whether one views this number as signifying a 5% probability of randomness using a two-tailed approach or a 2.5% probability of randomness using a one-tailed approach. If plaintiffs in Title VII cases are ever to be allowed to establish a prima facie case by evidence of disparity measuring lower than 1.96 standard deviations, this decision under the current law must be made by the Supreme Court (or Congress). Cf. Meier, Sacks & Zabell, “What Happened in Hazelwood,” reprinted in, M. DeGroot, S. Fienberg & J. Kadane, Statistics and the Law 15 (1986) (adopting 1.96 standard deviations as the threshold for Title VII cases even under the assumption that one should use a one-tailed test in Title VII litigation).
. In this respect, we follow the approach to statistical evidence adopted in
Craik v. Minnesota State University Bd,
Statistical evidence showing less marked discrepancies [than two standard deviations] will not alone establish something other than chance is causing the result, but we shall consider it in conjunction with all the other relevant evidence in determining whether the discrepancies were due to unlawful discrimination.
This approach follows Baldus and Cole in viewing disparities between 1.65 and 1.96 standard deviations as falling into an "intermediate" zone. See Baldus & Cole (Supp.) at 131-32. Numbers in this intermediate range go some of the way toward establishing a prima facie case of discrimination, but they cannot make the distance on their own. But cf., Meier, Sacks & Zabell, supra n. 9, at 12 (the appropriate intermediate zone falls between 1.96 and 2.33 standard deviations).
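The three ranges described in this note can be summarized in a brief sketch. The function name and the zone labels are hypothetical shorthand, and the treatment of a disparity falling exactly on a boundary is an assumption; the note itself specifies only the 1.65 and 1.96 figures.

```python
def classify_disparity(standard_deviations: float) -> str:
    """Place a disparity in one of the three ranges described in the note.
    Boundary handling (>= rather than >) is an assumption of this sketch."""
    magnitude = abs(standard_deviations)
    if magnitude >= 1.96:
        # Statistical evidence alone may raise an inference of discrimination.
        return "sufficient alone"
    if magnitude >= 1.65:
        # Intermediate zone: weighed together with all other relevant evidence.
        return "intermediate"
    # Too small to go any distance toward a prima facie case on its own.
    return "below intermediate zone"

for sd in (3.54, 1.80, 1.20):
    print(sd, "->", classify_disparity(sd))
```

The sketch captures only the numerical sorting; whether a disparity in the intermediate zone ultimately supports a finding of discrimination depends on the other evidence in the record.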
. 22 U.S.C. § 3905 states explicitly that “all personnel actions ... shall be made in accordance with merit principles," which excludes sex or race as a permissible criterion for a job action. See H.R.Rep. No. 992, pt. 1, 96th Cong., 2d Sess. 8 (1980). Furthermore, this section goes on to direct the Secretary of State to "prescribe such rules as may be necessary to ensure that members of the Service, as well as applicants for appointments in the Service ... are free from discrimination on the basis of ... sex.” 22 U.S.C. § 3905(b). The statute also states that this section does not extinguish any rights under Title VII. Id. § 3905(e).
. Because the Supreme Court was sharply divided on a separate issue in the Bazemore case, the Supreme Court’s unanimous opinion on this issue comes in the unusual form of a concurring opinion. The Court issued a short per curiam opinion stating:
We hold, for the reasons stated in the opinion of Justice BRENNAN, ... the Court of Appeals erred in disregarding petitioners' statistical analysis because it reflected pre-Title VII salary disparities, and in holding that petitioners’ regressions were unacceptable as evidence of discrimination.
. As the Supreme Court said in Bazemore, "[w]hether, in fact, [plaintiffs’ statistics will] carry the plaintiffs’ ultimate burden will depend in a given case on the factual context of each case in light of all the evidence presented by both the plaintiff and the defendant.” This statement contemplates that defendants generally must introduce evidence to support their attack on plaintiffs’ statistics. Mere conjectures and assertions usually will not suffice.
We note also that leading commentators support this corollary to the Bazemore rule. Baldus and Cole emphasize that "when otherwise relevant evidence is challenged on methodological grounds, the burden should normally be on the challenger (a) to present credible evidence that the statistical proof is defective and (b) to present a plausible explanation of how the asserted flaw is likely to bias the results against his or her position.” D. Baldus & J. Cole, Statistical Proof of Discrimination at vii (1986 Supp.).
. Other opinions of this court are in accord.
See Trout v. Lehman,
. This statistical evidence was further supported by additional statistics demonstrating a disparity in the potential ratings between men and women who achieved exactly the same performance rating. For example, men with performance ratings of “6” received, on average, higher potential ratings than the women who received performance ratings of “6.” This disparity measured 2.55 standard deviations.
. The District Court actually said, “This difference produces a standard deviation of 5.84, and therefore is likely to be the product of chance less than once in 1,000,000."
. The 0.74% probability mentioned in text reflects a two-tailed approach. The District Court, again, apparently used a one-tailed approach. The District Court stated that the (one-tailed) probability was 5 in 1000, or 0.5%, but our reading of the standard tables reveals a slightly lower one-tailed probability of 0.37%. See Elementary Statistics, supra n. 8, at 479.
. Again, the 0.88% probability reflects a two-tailed approach. A one-tailed probability value for 2.62 standard deviations is 0.44%. Elementary Statistics, supra n. 8, at 479.
. The State Department’s approach here is remarkably similar to the defendant's rejected approach in Bazemore:
Respondents’ strategy at trial was to declare simply that many factors go into making up an individual employee’s salary; they made no attempt that we are aware of -- statistical or otherwise -- to demonstrate that when these factors were properly organized and accounted *107 for there was no significant disparity between the salaries of blacks and whites.
. On the contrary, the Supreme Court found this evidence "quite probative."
. We have no occasion to rule today that with respect to a particular disparity (like initial cone assignments) a disparate treatment claim and a disparate impact claim are mutually inconsistent. As this court has previously recognized, a disparate treatment claim can turn into a disparate impact claim if a defendant rebuts an allegation of discriminatory intent by claiming that a facially neutral selection criterion caused a disparity in selections.
See Segar,
. This study considered the effect of the following variables on initial cone assignments: level of educational attainment, major field of study, functional test scores, and sex.
See
. Appellants’ confusion over the difference between a disparate treatment and a disparate impact claim is illustrated by the following assertion in their brief: “[Plaintiffs’ expert] found that test scores substantially correlate with or explain cone assignments.... Thus, there can be no doubt that plaintiffs have established a disparate treatment [claim] in cone assignment.” Plaintiffs’ Post-Trial Brief at 22. As discussed in text, this evidence supports a disparate impact, and not a disparate treatment, claim. Appellants at times, incorrectly, suggest that they can maintain a disparate treatment claim simply by demonstrating a disparity in initial cone assignments. See, e.g., Appellants’ Brief at 22. But, as discussed in text, a disparate treatment claim must prove both a disparity and discriminatory intent, even if proof of intent is circumstantial and the disparity itself raises an inference of intent.
See, e.g., Teamsters,
. Because we have concluded that appellants have properly presented only a disparate impact claim regarding the initial cone assignments of the exam-takers, the only remaining disparate treatment claim involves the initial cone assignments of those who did not take the entrance examinations. As we have mentioned, however, the appellants presented no independent statistical evidence to show that the State Department intentionally discriminated against women in this group of nonexam-takers. The data which included this group also included the exam-takers, but as any study based on this data is drastically overinclusive with respect to the nonexam-takers, we do not believe this evidence *116 can create even a prima facie case of discrimination. Consequently, we affirm the District Court’s decision insofar as appellants failed to prove disparate treatment in the initial cone assignments of the nonexam-taker group.
. See, supra, n. 1.
. We note, however, that the statistical analysis on which the appellants' disparate impact claim was based covered only those applicants who took the examinations between 1975 and 1980 and were subsequently hired between 1976 and 1983. Apparently, there was not sufficient data from those who took the entrance examinations after 1980 and who were thereafter hired in the relevant time period, for a meaningful statistical analysis to be conducted about the effect of these examinations. Therefore, the determination of liability under the disparate impact theory can extend only to those who took the examinations between 1975 and 1980.
