LEVIN H. CAMPBELL, Circuit Judge.

For many years applicants for the position of fire fighter in the cities and towns of Massachusetts have had to pass a written multiple-choice test (the “test”), administered by the Massachusetts Division of Civil Service. The Division appeals from a district court decision holding the test insufficiently related to a fire fighter’s duties to justify its disproportionate impact upon black and Spanish surnamed applicants and ordering a preference to be given members of those minorities, in future hiring, to remedy past discrimination. 371 F.Supp. 507 (D.Mass.1974).

Two actions brought against Boston, its Fire Commissioner and Massachusetts Civil Service officials were consolidated in the district court. The first was brought late in 1972 by the Boston Chapter, N.A.A.C.P., Inc., and by black and Spanish surnamed individuals under 42 U.S.C. §§ 1981, 1983, and the Fourteenth Amendment.1 Plaintiffs alleged that standards and procedures for recruiting and hiring fire fighters had the foreseeable effect of discouraging minority employment. The test, a swim requirement, and the disqualification of those with felony records were all challenged. A second action was brought early in 1973 by the Attorney General of the United States under Title VII of the Civil Rights Act of 1964, 42 U.S.C. § 2000e et seq., as amended by the Equal Employment Opportunity Act of 1972. Both suits sought not only orders forbidding the challenged practices but also remedial hiring of enough minority individuals to offset past discrimination.

The district court held a hearing at which evidence was introduced concerning the alleged discriminatory hiring practices and the disproportionate racial impact of the test. After the hearing the parties stipulated that it would be treated as one on the merits of the testing issue, but would cover only the “preliminary injunction stage” of the recruiting challenge. Objections to the felony disqualification and the swim test were not pressed at the hearing, but have not been abandoned. The district court’s opinion and judgment enjoined use of the test in its current form, ordered Boston and its Fire Commissioner to engage in additional recruiting of minorities, and awarded minorities a preference in hiring to ameliorate the effects of past discrimination. Boston and its Fire Commissioner took no appeal from the court’s adverse rulings.

In Castro v. Beecher, 459 F.2d 725, 732 (1st Cir. 1972), we held that an employer may use a means of selection having a “racially disproportionate impact” only if he can show “that the means is in fact substantially related to job performance”. See Griggs v. Duke Power Co., 401 U.S. 424, 91 S.Ct. 849, 28 L.Ed.2d 158 (1971). The approach is thus two-pronged: those challenging an employment test must establish its disproportionate impact by demonstrating that, for whatever reason, it is more of a hurdle for minority members than for others; once this is shown, the test’s proponents acquire a burden of justification and must “prove that the disproportionate impact was simply the result of a proper test demonstrating lesser ability of black and Hispanic candidates to perform the job satisfactorily”. Vulcan Society v. CSC, 490 F.2d 387, 392 (2d Cir. 1973).

Some courts, including the court below, describe the showing plaintiffs must make as a “prima facie case” of “racial discrimination”. We use “racially disproportionate impact” because it is a neutral and seemingly more accurate description. A means of selection may disqualify proportionally more minority candidates than others and thus have a racially disproportionate impact, yet not be discriminatory in the constitutional sense. In Castro, for example, we approved a high school diploma requirement for police even while recognizing a disparity between blacks and Spanish surnamed candidates and others in respect to a high school education.2 We thought a high school education was a “bare minimum for successful performance of the policeman's responsibilities”. Castro, supra 459 F.2d at 735. But we disapproved a paper-and-pencil test which also bore more heavily on blacks and Spanish than others because it was not proven “convincingly” that there was a “fit between the qualification and the job”. Id. at 732.

Plaintiffs usually meet their initial burden by demonstrating that minority candidates have a higher test failure rate; defendants are then put to their proof of job-relatedness. Here, however, the district court found inadequate the only available sampling showing how blacks and Spanish have fared on the test,3 although it found much evidence that blacks and Spanish have held disproportionately few jobs in the fire departments of the major Massachusetts cities where most of them reside.4 Until recently relatively few minority members applied for fire fighting jobs, resulting in a very small sample from which to draw conclusions about their comparative test performance.

The district court concluded that the census figures, especially those for Boston and Springfield, when used “in support of the meager exam statistics”, established a prima facie case of the test’s discriminatory effect. The court correctly noted that

“such a finding is not determinative of the issue but merely shifts the burden to the defendant to justify the use of the exam. This is a burden a public employer should not be unwilling to assume.” 371 F.Supp. at 514.

We need not decide whether census figures showing a gross disproportionality in the employment of black and Spanish surnamed fire fighters and others are enough, standing alone, to shift the burden of justification to defendants. In Castro, when dealing with a relatively innocuous height requirement, we declined to impose a burden of justification upon defendants in the absence of any evidence that the height requirement adversely affected minority candidates. On the other hand, the present test, given for more than half a century, is a far more salient selection device, and it can be argued that a showing of significant disproportionality in minority employment, coupled with even minimal proof of a higher minority failure rate, is enough to shift to the Division of Civil Service the burden of justification.5 Cf. McDonnell Douglas Corp. v. Green, 411 U.S. 792, 802, 93 S.Ct. 1817, 36 L.Ed.2d 668 (1973). Disproportionate impact or prima facie discrimination are simply labels that aid in singling out qualifications which it is reasonable to ask an employer to justify; “complete mathematical certainty” is not required. Vulcan Society, supra 490 F.2d at 393. When widespread minority underemployment is shown to exist in a given occupation, primary selection devices should not be immunized from study by placing an unrealistically high threshold burden upon those with least access to relevant data. This seems especially so when the small size of the sample may be traceable to the test’s discouraging effect as well as to unequal recruitment practices.6

But we do not decide on this issue alone. What in our view conclusively tips the scale in plaintiffs’ favor is the uncontroverted testimony, from experts called by both sides, that black and Spanish surnamed candidates typically perform more poorly on paper-and-pencil tests of this type. See Cooper & Sobol, Seniority & Testing Under Fair Employment Laws: A General Approach to Objective Criteria of Hiring & Promotions, 82 Harv.L.Rev. 1589, 1640 (1969). (Cf. Castro, supra, where black and Spanish police candidates were shown to have performed more poorly than did whites on a test of similar design.) In light of the expert testimony, we cannot say that the district court was clearly erroneous in its ultimate fact finding that plaintiffs had established a prima facie case.7 The burden thus shifted to the defendants to justify the test by showing that it was job-related.

That Massachusetts did not intentionally discriminate is immaterial. Title VII proscribes “standardized testing devices which, however neutral on their face, operated to exclude many blacks who were capable of performing effectively in the desired positions”. McDonnell Douglas Corp., supra, 411 U.S. at 806, 93 S.Ct. at 1826. See Griggs, supra, 401 U.S. at 431-432, 91 S.Ct. 849. Equitable relief under the Civil Rights Act and the Fourteenth Amendment requires no proof of malice or “fault”. “The result, not the specific intent, is what matters”. Rozecki v. Gaughan, 459 F.2d 6, 8 (1st Cir. 1972). See Inmates of the Suffolk County Jail v. Eisenstadt, 494 F.2d 1196 (1st Cir. 1974), application for cert. filed, No. 73-1992 (U. S. July 8, 1974). The question is whether the test denied applicants equal protection of the laws by creating “built-in headwinds” for those who, although qualified to perform the job, cannot pass the test. If it did, the inequality may be remedied without regard to official malice, specific intent, or actionable neglect.8

The district court found that defendants had not met their burden. The defendants’ major task in doing so was to show a substantial relationship between the test results and job performance. A test fashioned from materials pertaining to the job (here, from a preliminary fire fighters’ manual) superficially may seem job-related. But what is at issue is whether it demonstrably selects people who will perform better the required on-the-job behaviors after they have been hired and trained. The crucial fit is not between test and job lexicon, but between the test and job performance.

In fairness to the state, we must not forget that civil service tests were instituted to replace the evils of a subjective hiring process. Little will be gained by minorities if courts so discourage the use of tests that the doors to political selection are reopened. Moreover, a test, even one the cutoff of which does not demonstrably predict job performance, may serve worthwhile goals in gross by sifting from the pool of potential applicants those without enough motivation even to try to acquire the skills the test demands, and by discarding some few candidates who take the test but whose mental ability is so low that they are obviously unsuitable. Finally, it is virtually impossible for an employer to justify to a mathematical certainty every selection device. Controlled experiments in which applicants are hired without regard to their test scores may be impractical in many cases, and so those who fail the test often are not available to be evaluated on the job. If they are not, it is impossible to prove that they would have performed less competently.

Nonetheless, we think that the judgment of the district court is amply supported by the record, and we agree with it. Too many doubts persist concerning the validity of this test, the format of which has persisted for years, to make a convincing case for its unaltered use in fire departments notable for the absence of minority employees. Although perfect tests are goals as illusory as perfect schools or perfect courts, the evidence justifies compelling defendants to attempt to fashion a more sensitive test, one that will not needlessly serve as a “built-in headwind” to competent minority members, depriving both them and the Commonwealth of an opportunity for which they are qualified.

The test in recent years has had two parts, one of twenty-five questions covering current events, spelling, vocabulary and arithmetic, and the other of seventy-five questions taken verbatim from the “Red Book”, a fire fighter’s preliminary manual available from state officials. A score of 70 is required for qualification.9 Some questions call for considerable verbal skill: an applicant is asked whether “condense” means “(a) conduct (b) expand (c) evaporate (d) contract” and to recognize that “pressurised” rather than “bouyancy” was misspelled. The second part asks questions such as: “the name given to the fire fighter who carries the play pipe end of the hose up the ladder is (a) engineman (b) pumpman (c) nozzleman (d) hoseman”. In some instances, questions on the second part required knowledge of apparently obsolete equipment.

The test was not professionally developed. The civil service examiner who wrote the latest versions was without professional training in employment or psychological testing, and did not consult with anyone who had such training. No analysis of required job skills was conducted, and the author (who was not skilled in fire fighting) relied on the civil service “poster” setting forth the principal “duties” of a fire fighter, and on the Red Book. After a test was given, civil service personnel would analyze 50 to 100 answer sheets and, in later versions, eliminate questions that were too easy (because most people got them right) or too hard (because most answers were wrong).10 The “passing” score of 70 was selected arbitrarily, without reference to the expected or actual distribution of scores.

The general questions in the first part of the test have nothing to do with firefighting, and plaintiffs’ expert11 was of the opinion that they were not useful even as an index of general intelligence. Defendants may plan to delete such questions in the future; in any event, unless the correlations discussed below between the overall test scores and the performance of certain groups of tasks by selected fire fighters may be said to validate the first-part questions, there is no evidence at all that they are job-related.

The second-part questions deal with fire fighting, yet there is a difference between memorizing (or absorbing through past experience) fire fighting terminology and being a good fire fighter. If the Boston Red Sox recruited players on the basis of their knowledge of baseball history and vocabulary, the team might acquire authorities like John Kieran but who could not bat, pitch or catch. The test does not examine traits seemingly more relevant to a fire fighter's performance such as agility, stamina, quick thinking under pressure, poise, mechanical aptitude and the ability to work with others. Experts for both sides agreed that verbal memory is not a very important attribute for the job. And unlike the motor vehicle rules covered in a driver’s test, it seems unessential whether the candidate absorbs the tested vocabulary before or after acceptance. Nomenclature and similar matters can be mastered during training and on the job. Testing them before acceptance puts a premium on ability to memorize terms that, at the time, contain only abstract meaning.

Because the test measures a type of achievement not especially relevant to the job, one would not expect it to predict job success. It may be, however, that neither the test designer nor the court can understand how the score on the test successfully predicts. Perhaps, for example, the questions about machines are indirect indicators of mechanical aptitude. Recognizing these possibilities, the district court felt that careful scrutiny of any claimed predictive value was in order. We agree, and now seek to answer whether, against all odds, the test has been shown to be an actual predictor of on-the-job success.

Defendants attempted to establish job-relatedness through an expert engaged to conduct a validation study. See generally the test validation Guidelines of the EEOC, 29 C.F.R. § 1607. The expert stated that, in his opinion, his study established that the test was valid under the Guidelines which, according to Griggs, supra, 401 U.S. at 434, 91 S.Ct. 849, should be treated as “expressing the will of Congress”.

The validation study conducted by defendants’ expert had two parts. The first part sampled “objective” measures of job performance, such as all of the chores involved in raising a ladder and properly evolving the fire hose and its connections. These chores were further broken down into discrete components, and observers of the test subjects’ performance gave either credit or no credit for each component. The total score of each subject was then added and compared to his test score both on an overall basis and on a chore-by-chore basis. It was found that the performance on the test “predicted” (was correlated to) the performance scores on a number of the chores.

The second part of the validation study involved rating of fire fighters by their superiors on twelve “subjective” performance scales, including such diverse measures as “holding up under pressure” and “overall job effectiveness”. None of these twelve scales, nor the combination of the twelve scales together, correlated significantly to the scores on the test.12

Defendants’ expert contended that the test was valid in spite of its failure to correlate to the overall measures of job performance, and suggested that the Guidelines were satisfied by the statistically significant correlation between the test and two of the objective measures.13 The district court, however, concluded that the correlations were “barely significant” and concluded that defendants had not demonstrated that the test was “in fact substantially related to job performance”. It stated that:

“The fact that only two significant correlations were found between the exam and components of job performance — components which represent only a fraction of those duties a firefighter encounters — and the fact that those correlations were only minimally significant does not constitute ‘convincing’ evidence of job relatedness.” 371 F.Supp. at 517-518.

The district court’s conclusion was not erroneous. Substantial doubt is cast on the test’s validity by its failure to correlate at all with any of the subjective ratings or with the overall objective ratings. Only two of the tasks on the objective portion of the study correlated with the test, and the correlation there was not impressive. These rather meager signs of validity do not convince us, nor did they convince the district court, that the test is, as Castro demanded, “substantially related to job performance”. The state has not come forward with “convincing facts establishing a fit between the qualification and the job”. Id. 459 F.2d at 732. We do not think the district court was bound to find either compliance with the Guidelines or “job-relatedness” in a more general sense. The Guidelines are not satisfied by just any correlation to any facet of job performance. They require close scrutiny of “the use of a single test as a sole selection device . . . when that test is valid against only one component of job performance”. § 1607.5(c). The test here, and the validation study performed by defendants’ expert, do not survive close scrutiny.

We have already indicated several difficulties with the test: it was not professionally developed; its content does not appear to be job related; the cutoff score of 70 is arbitrary; the validation study reveals no correlation to overall measures, either subjective or objective, and only minimal correlation to two individual objective tasks. Another significant defect in the evaluation (a defect that from defendants’ point of view is perhaps unavoidable, although see Guidelines § 1607.9(b)) is that the groups evaluated consisted entirely of fire fighters who had passed the test. No one who had received a score lower than 70 was included. The validation study was therefore “concurrent” rather than the more useful “predictive” type. See United States v. Georgia Power Co., 474 F.2d 906 (5th Cir. 1973) (validation study unacceptable because concurrent). Since no one scoring less than 70 was evaluated, there is no evidence that those failing would be less capable as fire fighters than those passing. Although there is probably some score that would reflect substantially deficient motivation or ability to understand or communicate, so that such an applicant would be unsuitable even for a job that does not emphasize paper-and-pencil skills, we do not know where that point may be.

The data which the study did produce is said to reveal that fire fighters who passed and got high scores were superior in a few ways to fire fighters who passed with lower scores. We are then asked to infer that applicants who fail the test would have performed the same tasks more poorly still. The inference is a difficult one to draw. A very high passing score might indicate a special motivation or knowledge. On the other hand, the differences reflected by test scores in the range of 50 to 80 might be altogether negligible. We cannot tell. But a correlation within the range of 70 to 100 can easily be produced by data that would indicate no significance in differences in the 50 to 70 range. Particularly when the data are concurrent, the correlations should be more striking than they were here.14
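The statistical point in the preceding paragraph can be illustrated with a small sketch. Every number below is invented for illustration and none comes from the record; `hypothetical_performance` is an assumed rating function in which job performance is unrelated to test score except for a bonus above a score of 90. Under that assumption the passers show a clearly positive correlation while the failers show essentially none, so a correlation observed only among passers says little about those who scored below 70.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def hypothetical_performance(score):
    # Invented job-performance rating (an assumption, not record data):
    # score-independent variation around 10, plus a bonus that appears
    # only above a test score of 90.
    variation = 1.0 if score % 2 == 0 else -1.0
    bonus = 0.5 * max(0, score - 90)
    return 10.0 + variation + bonus

passing = list(range(70, 101))  # scores a concurrent study can observe
failing = list(range(50, 70))   # scores absent from such a study

r_passing = pearson(passing, [hypothetical_performance(s) for s in passing])
r_failing = pearson(failing, [hypothetical_performance(s) for s in failing])

print(f"correlation among passers (70-100): {r_passing:.2f}")
print(f"correlation among failers (50-69):  {r_failing:.2f}")
```

The sketch shows only that such a pattern is possible; it does not suggest that the actual data looked this way.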

We also find it significant that the defendants never attempted to demonstrate that alternatives to the test are unavailable. The Guidelines require such proof, § 1607.3. In its absence, we hesitate to approve a test of racially disproportionate impact and slight, if any, validity.

Finally, although we do not find them necessary to the result we reach here, we have doubts about the way in which the validation study was performed. The sample for the study was selected from a single administration of the test. But on average those hired for civil service jobs are selected in their order on the roster of eligibles, which is ranked from highest to lowest (with the exception of veterans) in test score. From any given list the highest ranking persons have been hired the earliest. Therefore, the sample for the study would seem biased; those with higher scores also had additional experience on the job. Could it be that the few observed correlations are explained by on-the-job experience rather than test score?

Moreover, no attempt was made to explain the effects, if any, of filtering that occurred even after the examination. Those who passed the test but would have made poor fire fighters may have been filtered from the sample by the time of the study because of voluntary resignations, release for unsuitability in training or on the job, failure of the physical, etc. Those remaining in the sample of the study are a select group indeed, while those who failed the test but would perhaps have made good fire fighters are entirely missing from the sample.

Even if the test is minimally valid, we might also doubt its use as an absolute cutoff. A test seemingly should receive no more weight in the selection process than its validity warrants. Use of a minimally valid test as an absolute cutoff is questionable even if more limited uses of the test are acceptable.

Finally, no effort has been made to include minorities in the sample of the study. The Guidelines so require, §§ 1607.4(a), 1607.5(b)(5). The Georgia Power court has agreed. The challenge leveled at the test is that it has a disproportionate and unwarranted impact on minority group members; it is difficult to quantify this impact — or to disprove it — unless some of those who were allegedly adversely affected are included in the study. We understand the practical difficulties, particularly in a job population almost devoid of minorities. But we urge that consideration at least be given to the problem as a new test is evolved.

There are, in sum, too many problems with the test for us to approve it here. The district court was correct in ruling that defendants had not discharged their burden.

III

There remains for discussion the remedy selected by the district court. In Castro we required the district court, as a means of ameliorating the continuing effects of past discrimination, to institute a program of color-conscious relief that included priority pooling of minorities, with choices made from the priority pool until it was exhausted. See 365 F.Supp. 655 (D.Mass.1973) (consent decree on remand).

The relief ordered in the instant case follows this pattern. The court first enjoined use of the eligibles list from the most recent test and enjoined any further administration of any similar test until it was validated. This was the proper course under our decision in Castro. Cf. Vulcan Society, supra; Bridgeport Guardians, Inc. v. Bridgeport CSC, 482 F.2d 1333 (2d Cir. 1973). The court then created four eligibility groups: in Group A will be all black and Spanish surnamed applicants who took and failed any previous test, but who pass any new and valid test.15 In Group B are all persons on the current eligibility list. Group C will contain all black and Spanish surnamed persons who do not belong in Group A but who pass a new examination and are otherwise qualified. Group D will be composed of all other persons who pass the new examination. Any Massachusetts community subject to the Civil Service law and having a minority population of one percent or more must submit requisitions to the Division of Civil Service for any fire fighter openings they may seek to fill. The Division is then to certify candidates to such departments by means of a matching procedure designed to ensure that each Group receives proportional representation in accordance with their qualifications. Groups A and B are to be given initial preference on a one-to-one basis, and the other Groups are to be drawn upon as A and B are exhausted. The ratios to be implemented for Boston and Springfield are slightly different than those for the rest of the state, in recognition of their larger minority populations. In all cases new eligibility lists from successive entrance tests shall be used to replenish Groups C and D. The decree remains in force, for each local fire department, until that department attains sufficient minority fire fighters to have a percentage on the force approximately equal to the percentage of minorities in the locality. After that point the locality is freed from the decree and may make appointments on any nondiscriminatory basis. In no case must any unqualified minority person be appointed; if no qualified applicants are available, none will be appointed.16
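The one-to-one draw described above can be pictured with a rough sketch. It is an illustration under stated assumptions, not the decree's actual procedure: the function `certify`, the pool names, and the candidate labels are hypothetical; the sketch assumes that Group C backfills Group A and Group D backfills Group B, that the turn passes to the other pool when one is empty, and it ignores the different ratios ordered for Boston and Springfield.

```python
from collections import deque

def certify(groups, openings):
    """Return candidates in certification order for a number of openings."""
    # Assumption: Group C backfills Group A, and Group D backfills Group B.
    minority_pool = deque(groups["A"] + groups["C"])
    general_pool = deque(groups["B"] + groups["D"])
    certified = []
    minority_turn = True
    while len(certified) < openings and (minority_pool or general_pool):
        preferred, fallback = (
            (minority_pool, general_pool) if minority_turn
            else (general_pool, minority_pool)
        )
        # Assumption: when the preferred pool is empty, the other pool is used.
        source = preferred if preferred else fallback
        certified.append(source.popleft())
        minority_turn = not minority_turn
    return certified

# Hypothetical candidate labels, alternating one-to-one between the pools.
example = {"A": ["a1"], "B": ["b1", "b2", "b3"], "C": ["c1"], "D": ["d1"]}
print(certify(example, 5))  # ['a1', 'b1', 'c1', 'b2', 'b3']
```

Only qualified candidates are assumed to be placed in any group, consistent with the statement that no unqualified person need be appointed.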

Defendants contend that the color-conscious relief imposed by the district court is unconstitutional. The argument is without merit. The relief goes no further than to eliminate the lingering effects of previous practices that bore more heavily than was warranted on minorities.17 This court has stated:

“[I]t would be consistent with the goal of equal opportunity to give first priority to members of a minority that had previously been denied equal opportunity, if those members were otherwise as qualified as were qualified members of the majority population.” Associated General Contractors of Massachusetts, Inc. v. Altshuler, 490 F.2d 9, 18 (1st Cir. 1973), cert. denied, 416 U.S. 957, 94 S.Ct. 1971, 40 L.Ed.2d 307 (1974).

See also, e. g., Morrow v. Crisler, 491 F.2d 1053 (5th Cir. en banc 1974) (withholding quota relief is abuse of discretion); NAACP v. Allen, 493 F.2d 614, 618-620, nn. 7-10 (5th Cir. 1974) (collecting cases); United States v. Masonry Contractors Ass’n of Memphis, Inc., 497 F.2d 871 (6th Cir. 1974); Carter v. Gallagher, 452 F.2d 315 (8th Cir. en banc 1974); Rios v. Enterprise Ass'n Steamfitters Local 638, 501 F.2d 622 (2d Cir. 1974). The goal of color blindness, so important to our society in the long run, does not mean looking at the world through glasses that see no color; it means only that all colors are moral equivalents, to be treated on an equal basis. We believe that our society is well served by taking into account color in the fashion used, and carefully limited in extent and duration, by the district court.

Defendants next contend that, even if the use of color-conscious relief is not forbidden by the Constitution, it is prohibited by Title VII itself, which states in § 703(j), 42 U.S.C. § 2000e-2(j):

“Nothing contained in [Title VII] shall be interpreted to require preferential treatment to any individual . . . because of race, color, religion, sex, or national origin of such individual on account of an imbalance which may exist with respect to the total number or percentage of persons of any race, color, religion, sex, or national origin ... in comparison with the total number of [sic] percentage of persons of such race, color, religion, sex, or national origin in any community. . . . ”

Defendants contend in effect that, whatever may be the case when a court undertakes to correct the continuing effects of previous intentional racial discrimination, § 703(j) conclusively withdraws from the district court the power to grant color-conscious relief when the discrimination is unintentional. Although this argument has recently found eloquent support in the dissent of Judge Paul Hays in Rios, supra, we do not accept it.18 We agree with the majority in Rios that relief undertaken in order to redress past discrimination, whether or not intentional, is permitted. Section 703(j) deals only with those cases in which racial imbalance has come about completely without regard to the actions of the employer. And the dispute over the intent of the framers of Title VII is largely ancient history.19 See generally Comment, The Philadelphia Plan: A Study in the Dynamics of Executive Power, 39 U.Chi.L.Rev. 732 (1972). Title VII was amended in 1972 and the legislative debates at that time, particularly the failure of Congress to pass the Dent Amendment, which would have foreclosed all affirmative action plans and racial balance relief, lend support to an inference that Congress ratified the power of the courts to impose color-conscious relief of the sort that had been approved in several cases at the time the attempts to amend the amendments to Title VII failed. For the history of this episode, see id. at 747-60. We believe that the history of the 1971-72 amendments, and failures to amend, together with the weight of judicial authority, warrant our rejection of defendants’ Title VII claim.

Edward D. Kalman, Assistant Attorney General, with whom Robert H. Quinn, Attorney General, and Walter H. Mayo III, Assistant Attorney General, were on brief, for appellants. David L. Rose, Attorney, Department of Justice, with whom J. Stanley Pottinger, Assistant Attorney General, James Gabriel, United States Attorney, and James M. Fallon, Attorney, Department of Justice, were on brief, for United States, appellee. Patrick J. King, with whom Thomas A. Mela was on brief, for N.A.A.C.P. et al., appellees.

The other arguments raised by defendants have been examined and found to be without merit. The judgment is

Affirmed.

ON MOTION IN OPPOSITION TO TAXATION OF COURT COSTS

PER CURIAM.

In Boston Chapter, N.A.A.C.P., Inc. v. Beecher, 504 F.2d 1017 (1st Cir., 1974), this court affirmed the district court’s decision, 371 F.Supp. 507 (D.Mass.1974), finding a state-administered entrance examination for firemen to have a racially discriminatory impact, and ordering injunctive relief. Under Rule 39(a) of the Federal Rules of Appellate Procedure, costs of appeal are to be taxed against the losing party unless otherwise ordered. Appellees N.A.A.C.P. et al. filed a bill of costs for $49.50, and appellants director and commissioners of the state civil service division oppose a taxation of these costs against them on the ground that it would constitute an award of money against the state in violation of the eleventh amendment. While the amount in controversy here is relatively insignificant, the issue raised is of importance to the everyday practice before the federal courts.

Appellants rely on Edelman v. Jordan, 415 U.S. 651, 94 S.Ct. 1347, 39 L.Ed.2d 662 (1974), in which the Supreme Court held that the eleventh amendment barred a federal district court from awarding retroactive benefits under a federal-state public aid program. The Court in Edelman distinguished what it found in effect to be a money judgment payable out of the state treasury directly, from the prospective injunction upheld in Ex parte Young, 209 U.S. 123, 28 S.Ct. 441, 52 L.Ed. 714 (1908), even though in the latter case the necessity for state officials to shape their conduct to the Court’s mandate might have had a substantial “ancillary effect” on the state treasury. 415 U.S. at 668, 94 S.Ct. 1347. Since Edelman, courts of appeals have divided on whether a federal court may award attorneys’ fees against an unconsenting state, an issue which is analytically similar to the question of awards of costs. An award of attorneys’ fees was upheld in Class v. Norton, 505 F.2d 123 (2d Cir., 1974). Before Edelman, the same result was reached in Gates v. Collier, 489 F.2d 298 (5th Cir., 1973), and in Sims v. Amos, 340 F.Supp. 691 (M.D.Ala.1972), aff’d mem., 409 U.S. 942, 93 S.Ct. 290, 34 L.Ed.2d 215 (1973). However, two other courts of appeals, attaching little precedential weight to the Supreme Court’s summary affirmance without opinion in Sims, have cited Edelman as requiring the contrary conclusion. Jordan v. Gilligan, 500 F.2d 701 (6th Cir., 1974); Skehan v. Board of Trustees, 501 F.2d 31 (3d Cir., 1974).

As the Court in Edelman observed, 415 U.S. at 667, 94 S.Ct. 1347, the line between permissible relief and that barred by the eleventh amendment will not always be clear. And we acknowledge that an award of court costs cannot be neatly categorized as either prospective or retroactive. An award of costs does operate as a direct levy on the state’s general revenues and as a form of compensation to the winning party. On the other hand, costs are not awarded for accrued liability, but rather are assessed for certain litigation expenses in accordance with the generally mechanical provisions of Rule 39. The precise purposes behind this rule and the prior traditional practice of taxing costs are not clear, but no doubt an award to some degree serves to discourage litigiousness and frivolous claims, as well as to induce compliance with the rulings of the court. In this sense allocation of costs is an incident to the court’s jurisdiction and judgment in the main action.

In Fairmont Creamery Co. v. Minnesota, 275 U.S. 70, 48 S.Ct. 97, 72 L.Ed. 168 (1928), there is support for the view that the power to tax court costs is to be characterized as an incident to the hearing and not within the scope of a state’s sovereign immunity if the federal court’s exercise of jurisdiction over the state party in the action in the main is valid. The Court in Fairmont held that sovereign immunity did not protect a state against a judgment for costs in litigation before the Supreme Court on an appeal from the final judgment of the highest state court. While acknowledging that a sovereign is immune from costs awarded by its own courts, the Court stated that a state loses some of its sovereign character when it is forced to appear before the Supreme Court under the Supremacy Clause on a matter of federal and constitutional law. According to the Court the “incidents of the hearing are those which attach to the regular jurisdiction of this Court,” and the awarding of costs was one such incident that had been “the invariable practice” in judgments against a state in both civil and criminal cases, before that Court. Id. at 77, 48 S.Ct. at 100. Courts in this circuit have awarded costs and attorneys’ fees against state officials when the state was the real party in interest. E. g., Hoitt v. Vitek, 361 F.Supp. 1238, 1255 (D.N.H.1973), aff’d, 495 F.2d 219 (1st Cir. 1974).

In view of the above considerations and precedents, we conclude that the eleventh amendment does not immunize appellants from being taxed for costs.

. This was brought and certified as a class action. The first certified class was “All black or Spanish-surnamed persons who have applied for the position of firefighter in any fire department . . . subject to Massachusetts Civil Service law, but have not become eligible for appointment under existing requirements”. The other class included all who never applied because they were deprived of information concerning fire fighter employment opportunities.

. Cf. Berkelman v. Unified School Dist., 501 F.2d 1264, pp. 1265-1268 (9th Cir. 1974).

. Applicants taking the August 1971 test were given the option of identifying their race, color or national origin. 84% did so. Out of 33 individuals self-identified as black or Spanish, 13 or 39% passed. The court felt the statistics were “obviously meager” and declined to find that “in themselves” they established a prima facie case, although, insofar as they went, the statistics showed a disparity, because of 3089 Caucasians, 1737 or 56% passed. Cf. Chance v. Board of Examiners, 458 F.2d 1167 (2d Cir. 1972); Carter v. Gallagher, 452 F.2d 315 (8th Cir. 1971).

. 1970 census figures show that in Boston, with a 1970 black population of over 16%, black fire fighters made up less than 1% of the force; Springfield’s black population exceeded 12% but blacks made up less than 0.2% of the force. And, as the district court pointed out, the combined black and Spanish minority in Boston is today probably closer to 23% than to 16%. Cambridge shows a lesser discrepancy, although still a very sizable one. New Bedford and Worcester, with smaller minority populations, reflect the least disparity. Census data as to Spanish surnamed individuals show that nearly 3% of Boston’s population, but only 0.1% (two individuals) of the fire department, fit within that category. Springfield has no Spanish surnamed fire fighters, although Spanish comprise 3.4% of its population. Cambridge, New Bedford and Worcester had substantial disparities.

We find no error in the district court’s reliance on statistics pertaining to the cities proper rather than to metropolitan regions. Cf. Town of Milton v. CSC, Mass.1974, 312 N.E.2d 188. We note that even within the Greater Boston area, with a black population in 1970 of 4.6%, the disparity is substantial; the same is true in Springfield. See also Associated General Contractors of Massachusetts, Inc. v. Altshuler, 361 F.Supp. 1293 (D.Mass.1973), aff’d, 490 F.2d 9 (1st Cir. 1973), cert. denied, 416 U.S. 957, 94 S.Ct. 1971, 40 L.Ed.2d 307 (1974).

It would, of course, be unreasonable to expect perfect correlation between ethnic groupings and the holders of a particular job, but when there is only one minority fireman out of 475 in a city like Springfield with a large black or Spanish surnamed population, the imbalance is obvious.

. Mayor of the City of Philadelphia v. Educational Equality League, 415 U.S. 605, 94 S.Ct. 1323, 39 L.Ed.2d 630 (1974), is not to the contrary. The Court emphasized that in many cases it is proper to infer discriminatory intent or impact from racial statistics. Id. 415 U.S. at 619-621, 94 S.Ct. at 1333. See also Fiss, A Theory of Fair Employment Laws, 38 U.Chi.L.Rev. 235, 270-81 (1971).

. Plaintiffs argue that the vice of the test is as much its tendency to discourage minority members from applying as that minority applicants may be expected to perform less well. If so, this is a reason for not holding the smallness of the sample against plaintiffs. Moreover, the sample of those minorities who do take the test is self-selected; and, for all we know, only especially motivated and competent minority members took the test. The 39% passing rate might be higher than the rate would be from a random sample. In any event where, as here, the minority sample is disproportionately low, it is both dangerous to rely too heavily on the figures and unfair to ignore them entirely.

. Any written test in English would seem obviously harder for persons whose native tongue is Spanish. Other evidence at trial supported the view that the test bore more heavily on minorities. For example, the court found that Boston fire fighters were often recruited by relatives and friends. Minority applicants, lacking access to such sources of aid and advice, might find it harder to absorb the technical subject matter covered in the test. They could try to memorize the “Red Book”, but such information is doubtless more easily absorbed by those exposed to fire fighters and firehouse routines. Such considerations do not, of course, invalidate the test; they are merely additional reasons for inquiring into its utility as a bona fide predictor of job performance.

. The standard would, of course, be different were we considering a damages award against individual officials. Cf. Palmigiano v. Mullen, 491 F.2d 978, 980 (1st Cir. 1974).

. Applicants with a score of 70 or above were rated for prior training and experience, and this rating was incorporated into a composite rating. The composite was determined 70% by the test and 30% by the training and experience score. A composite score under 70 resulted in rejection. Thereafter the applicant was required to pass a medical, meet certain strength requirements, and demonstrate good moral character (including an absence of felony convictions). Applicants meeting these further requirements are then ranked by their composite scores from 100 down to 70. However, disabled veterans are ranked ahead of veterans, and veterans ahead of non-veterans regardless of score. The list of eligibles thus created continues in force until exhausted or for two years. Previous lists have all expired by exhaustion when all eligibles were offered positions.

Localities may set their own additional education and residence requirements. Fifteen departments, not including Boston, require a high school diploma or equivalency. All localities require the applicant to be a resident.
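The rating and ranking rules described in this footnote can be illustrated with a short Python sketch. It is an illustration only, not drawn from the record: the applicant names, field names, and scores are hypothetical, and it assumes the 70%/30% weighting operates as a simple weighted average. The medical, strength, and character requirements are omitted.

```python
from dataclasses import dataclass

# Preference tiers as described in the footnote: a lower value ranks first.
TIER = {"disabled_veteran": 0, "veteran": 1, "non_veteran": 2}

@dataclass
class Applicant:
    name: str        # hypothetical identifier
    test: int        # written test score
    experience: int  # training-and-experience rating
    status: str      # one of the keys in TIER

def composite(a: Applicant) -> float:
    # Assumption: 70% test and 30% training/experience, as a weighted average.
    return (70 * a.test + 30 * a.experience) / 100

def eligibles(applicants):
    # Only a test score of 70 or above is rated; a composite under 70 is rejected.
    passed = [a for a in applicants if a.test >= 70 and composite(a) >= 70]
    # Veteran tier controls first; composite score orders applicants within a tier.
    return sorted(passed, key=lambda a: (TIER[a.status], -composite(a)))

pool = [
    Applicant("A", 95, 90, "non_veteran"),
    Applicant("B", 72, 80, "veteran"),
    Applicant("C", 71, 70, "disabled_veteran"),
    Applicant("D", 65, 100, "non_veteran"),  # fails the test cutoff
    Applicant("E", 70, 60, "veteran"),       # composite falls below 70
]
ranked = [a.name for a in eligibles(pool)]
```

In this hypothetical pool the applicant with the highest composite score (A) is ranked last among the eligibles, because the veteran tiers take precedence over score.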

. Because the questions often changed it seems possible that the same person in successive administrations of the test would not receive the same or a comparable score. This is a problem in testing even when the questions are constant. See American Psychological Ass’n, Standards for Educational and Psychological Tests & Manuals 25-32 (1966), incorporated by reference in EEOC Guidelines, 29 C.F.R. § 1607.5(a). No reliability study of the tests has ever been conducted.

. The district court committed no abuse of discretion in permitting plaintiffs’ expert to testify. She was a psychologist with distinguished credentials who was engaged in a fire fighters’ test validation study. The qualifications of experts are largely matters within the district court’s discretion. See Texas Instruments, Inc. v. Branch Motor Express Co., 432 F.2d 564, 566 (1st Cir. 1970).

. The correlation between the test and all supervisory ratings was r=0.078. Were this statistically significant, it would indicate that the grade “explained” only approximately 0.6% of all observed variance in fire fighters’ on-the-job performance.
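The 0.6% figure is consistent with the usual reading of the squared correlation coefficient as the share of variance explained. As a worked check, offered as an illustration and not as part of the record:

```latex
r^2 = (0.078)^2 = 0.006084 \approx 0.6\%
```

The same arithmetic applied to a correlation of 0.3 gives 0.09, or 9% of the observed variation.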

. The objective portion of the study produced several correlations that were statistically significant (likely to occur by chance in fewer than five of one hundred similar cases) and practically significant (correlation of ± 0.3 or higher, thus, explaining 9% or more of the observed variation). Of the seven statistically significant correlations, four were not practically significant. Of the remaining three correlations, only two were correlations between the test and an objective measure; the third was a correlation between the composite mark and an objective measure.

Defendants’ expert also found that there was no statistically or practically significant correlation between test scores and all objective criteria taken together.

Thus of all possible measures, the test score produced a meaningful correlation only to air mask operation and to the subject’s performance as loop man in the evolution of the hose line.

. Another way to make the same point is to say that a test should be valid for the purpose for which it is used. Apparently nearly all those who score over 70 and are placed on an eligibles list are eventually offered employment. Thus, the test is not chiefly used to choose between a score of 70 and a score of 95, and it is not particularly helpful to know that the applicant with a score of 95 is apt to excel. The test’s main use is to choose between the applicant with, for example, a 65 and the applicant with a 71; but because the validation study is concurrent, we do not know whether it is valid in that range.

. Group A shares with Groups C and D the criterion of passing a new and valid test. It is possible that no such satisfactory examination can be devised and validated within a reasonable time. If this should occur, the district court may of course amend the decree, should a party so request, to provide for placement into groups based on alternative non-discriminatory criteria.

. Nor will the order result in the firing of those who have received provisional appointments pursuant to the current, but now invalid, eligibility list. Those temporary appointees will be placed at the top of the Group B list, and their appointments made permanent on a one-to-one basis with permanent appointments from Groups A and C. None will be fired.

. Theoretically it might be possible for defendants to demonstrate that, even though the test has an unjustified racially disproportionate impact, the disparity in resulting employment is primarily accounted for by neutral means. In the case of many northern cities, this could come about because blacks and Spanish surnamed individuals had arrived in Boston in large numbers only in recent years, so that those fire fighters hired prior to the recent increase in minority population were “justifiably” predominately white. The data support the contrary hypothesis, however. In spite of the increased percentage of minorities in the Boston area, the Boston Fire Department has since 1960 appointed 805 fire fighters, only six (0.745%) of whom were black and two (0.248%) of whom were Spanish surnamed.

. Compare Blumrosen, Strangers in Paradise : Griggs v. Duke Power Co. and the Concept of Employment Discrimination, 71 Mich.L.Rev. 59 (1972), with Wilson, A Second Look at Griggs v. Duke Power Company : Ruminations on Job Testing, Discrimination & The Role of the Federal Courts, 58 Va.L.Rev. 844 (1972) ; Developments in the Law — Employment Discrimination & Title VII of the Civil Rights Act of 1964, 84 Harv.L.Rev. 1109 (1971) ; Note, Employment Testing: The Aftermath of Griggs v. Duke Power Company, 72 Colum.L.Rev. 900 (1972).

. In the context of this suit the dispute might also be an unnecessary one. Only the suit by the United States is brought under Title VII. The NAACP’s § 1983 suit may be completely free of whatever limitations Title VII may contain. In any case, there is no need for us to decide the matter here.