Proper Scientific Controls for ESP Experimentation

To some extent, psi research -- which legitimately falls within the domain of psychology -- has suffered by borrowing the methods of psychology. Psychological methods, in general, have not been designed to test for scientific anomalies and extraordinary claims. While there can be little doubt that psi research studies are generally far more carefully scrutinized than other studies in psychology, this, in itself, is not sufficient reason for supposing that psi's existence has been scientifically established. If anything, a close examination of the methodological rigor used in psi research might tend to create doubts about the validity of a great deal of research in psychology -- as well as other behavioral and social sciences.

The usual logic is to infer an ESP effect only when all other alternatives have been ruled out -- not by post hoc analysis, but by the design of the experiment. The reason for this is that ESP is not an established process; it is only a label we assign to certain anomalous findings. To verify the anomaly, it is necessary to rule out sources of error absolutely.
 

Randomization

In theory, if ESP targets are chosen "at random," there is no logical way that the next target in a sequence can be predicted. In a table of random numbers, all of the numbers occur with a frequency approximating chance expectation. Furthermore, all sequences of numbers (i.e., doublets, triplets, etc.) also occur with a frequency approximating chance. We can be certain that a table of random numbers meets the various statistical tests that have been devised for randomness. However, since such tables are published and generally available in libraries there is always a risk that "random" target sequences could be predicted by someone who obtained access to the random number table being used in a particular study. (A standard precaution against this possibility in psychological and psi research is to randomly select the entry point into a random number table.)
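
For illustration, the precaution can be made concrete in a short sketch. The code below is a minimal illustration, not any laboratory's actual procedure; the table contents, dimensions, and digit-to-symbol mapping are stand-ins. It selects an unpredictable entry point into a digit table and reads off a target sequence from that point, wrapping around the table.

```python
import secrets

# A stand-in digit table; in practice the digits would come from the
# published random number table actually used in the study.
TABLE = [[secrets.randbelow(10) for _ in range(50)] for _ in range(50)]

def draw_targets(table, n_targets, n_symbols=5):
    """Read digits from an unpredictable entry point, wrapping around the
    table, and map each digit onto one of n_symbols equally likely targets."""
    flat = [digit for row in table for digit in row]
    start = secrets.randbelow(len(flat))          # unpredictable entry point
    return [flat[(start + i) % len(flat)] % n_symbols
            for i in range(n_targets)]

print(draw_targets(TABLE, 25))                    # e.g., a 25-trial target list
```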

Psi researchers can generate their own random number sequences, without reference to a published table of random numbers, by using an electronic or mechanical Random Event Generator (REG). The best of these devices rely on random sources which are quantum mechanical in nature, such as electronic white noise or radioactive decay. Due to the uncertainty principle in quantum mechanics, the output of such devices is theoretically unpredictable; they are thus the most random sources known to nature.

Ironically, it is entirely possible that a truly random source will provide a short-term output that fails the statistical tests for randomness mentioned above. Thus, a genuinely random target sequence can be problematic if it mimics the properties of a non-random sequence. A test subject who receives trial-by-trial feedback in such a situation might make inferences that happen to match the random output. Beyond-chance scoring in such an experiment would not necessarily be due to ESP. Therefore, the target sequence in an ideal experiment must not only be derived from a true random source; it must also pass post hoc tests for randomness. Such tests will also detect more serious sources of bias that may develop within electronic or mechanical apparatus designed to generate random events.
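
The post hoc tests referred to here can be as simple as chi-square goodness-of-fit checks on symbol and doublet frequencies. The sketch below illustrates the idea under the assumption of a five-symbol target alphabet; the sequence is simulated and merely stands in for real REG output, and this is far from a complete battery of randomness tests.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(1)
seq = rng.integers(0, 5, 2000)                     # stand-in REG output

counts = np.bincount(seq, minlength=5)             # single-symbol frequencies
print(chisquare(counts))                           # uniform expected by default

pairs = seq[: (len(seq) // 2) * 2].reshape(-1, 2)  # non-overlapping doublets
pair_counts = np.bincount(pairs[:, 0] * 5 + pairs[:, 1], minlength=25)
print(chisquare(pair_counts))
```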

Non-random target sequences may not necessarily, in themselves, account for an alleged ESP score. The main concern is that subjects will learn the characteristics of previous target sequences and use this information to infer the characteristics of future target sequences. This, of course, requires periodic feedback on targets. Or subjects may simply have a personal bias towards some targets that coincidentally matches the patterns that emerge in a non-random sequence. (In fact, this can occur in target sequences which are truly random as well, creating a -- generally short-term -- false impression of beyond-chance psi scoring.) This can be detected if the features of the target sequence are matched or correlated with non-random patterns in the subject's calls -- other than those intended for each target. (It is entirely possible that a subject's successful ESP calls will be unrelated to the non-random features in a target sequence.) Whether or not a non-random feature can account for an alleged ESP effect, it is a disquieting situation when departures from randomness are severe and of unknown origin. If a critical procedure such as randomization has broken down, this raises the possibility that there were other breakdowns, or failures to carry out the experimental plan. At least one case of experimenter fraud (involving British mathematician S. G. Soal) was traced through the detection of non-random target sequences.
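
One simple check along these lines -- offered here as an illustration rather than a standard parapsychological routine -- is to compare the hit rate against the intended targets with the "hit rates" obtained when calls are matched against displaced targets. Comparable rates at every lag point to patterning or shared bias rather than trial-specific ESP. The data below are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.integers(0, 5, 1000)        # stand-in target sequence
calls = rng.integers(0, 5, 1000)          # stand-in subject calls

def hit_rate(calls, targets, lag=0):
    """Fraction of calls matching the target `lag` trials ahead."""
    if lag:
        calls, targets = calls[:-lag], targets[lag:]
    return float(np.mean(calls == targets))

for lag in range(4):
    print(f"lag {lag}: hit rate {hit_rate(calls, targets, lag):.3f}")
```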

In early ESP studies, randomization was achieved through manual shuffling of card decks. It is now generally agreed that this is not adequate; it could be argued that such informal randomization invalidates any ESP experiment. If unbalanced decks are used, psi-like artifacts can easily emerge.

Randomization procedures, or REG test machines, must be thoroughly described in experimental reports. If random number tables are used, the application procedures must be fully described. This is particularly important when untrained assistants are asked to generate target sequences.

Electronic random event generators should employ a switching system to correct for possible systematic bias. However, there is always the possibility that the output bias could, in some manner, correlate with the switching sequence itself. This would occur if the REG bias oscillates. Is there any way to prevent this occurrence? Schmidt, at a 1974 research meeting at the Foundation for Research on the Nature of Man, suggested doing this by incorporating the Rand Corporation random number tables into such a switching system.
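
The logic of such a switching system can be sketched briefly. In the illustration below (a minimal sketch with made-up numbers, assuming a binary REG), each raw bit is XORed with a "switch" bit. If the switch bits come from a prepared random table, as Schmidt suggested, a constant or oscillating hardware bias cannot stay in step with them and cancels on average, provided the switch bits themselves are unbiased.

```python
import numpy as np

rng = np.random.default_rng(2)
raw_bits = (rng.random(10_000) < 0.52).astype(int)   # REG with a 52% bias
switch_bits = rng.integers(0, 2, 10_000)             # stand-in for table bits

# XORing inverts the scoring on the trials where the switch bit is 1,
# so any bias in the raw output averages out in the corrected stream.
corrected = raw_bits ^ switch_bits
print("raw mean:", raw_bits.mean(), "corrected mean:", corrected.mean())
```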

When targets are obtained from a random event generator (REG) rather than from a predetermined sequence, then it is critically important to include control runs -- in a systematic fashion (i.e., a counterbalanced sequence of control and experimental runs). In spite of this importance, there is no way to theoretically ensure that such control runs are immune from psychokinetic influence. This uncertainty places an absolute limit on the degree of precision which is possible in psi research using REGs. Undoubtedly, it will also provide sufficient grounds for some skeptics to refuse to accept any REG data as evidential of psi functioning.

As REGs are continually being modified or replaced, there is little standardization. This makes it all the more important to introduce systematic controls, especially as a guard against short-term generator bias.

One means of systematically controlling for generator bias is to randomly pair control and experimental trials. This is quite easily accomplished when one has an REG interfaced to a computer.
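
A sketch of what such random pairing might look like in software follows; the function name and trial labels are illustrative only.

```python
import random

def paired_schedule(n_pairs, seed=None):
    """Build a trial schedule in which each experimental trial is randomly
    paired with a control trial, so that short-term generator drift
    affects both conditions about equally."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_pairs):
        pair = ["experimental", "control"]
        rng.shuffle(pair)
        schedule.extend(pair)
    return schedule

print(paired_schedule(4))
# e.g. ['control', 'experimental', 'experimental', 'control', ...]
```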

To what extent can one make allowances for non-random target sequences and salvage an experiment which is flawed in this respect? This is an extremely important question because (a) pure randomness is an ideal which can never truly be reached in the real world; and (b) valid random procedures may, in fact, produce target sequences which in retrospect do not appear random -- e.g., the occasion at Monte Carlo when black came up 32 times in a row. In other words, a genuinely random sequence of sufficient length will have many subsequences which do not appear random.

Failure to record actual target sequences is a severe shortcoming in any ESP experiment.

A crucial issue in examining target sequences generated by an REG for ESP experimentation is whether non-random sequences may have been produced by PK. There is no practical way to control for this factor other than by using pseudorandom REGs or by referring to a random number table. (Even then, some sort of macro-PK might theoretically be involved.)

During PK experiments with REGs, control tests must check for temporal stability of the random sources during the course of the experiment. Such randomness tests should be conducted in the actual experimental environment with all peripheral equipment attached. As a precaution against PK influence on the control tests, experiments could be designed with various blinds to prevent both subjects and experimenters from knowing when and how these control tests will be run. Ultimately, of course, at least one experimenter who designs the study will have to know the arrangements. A further control would be to have this individual be someone who has no particular history of manifesting psi in REG testing situations. (However, all of these controls will ultimately lead to uncertainties.)

Control studies must also specify the physical and construction parameters of the experimental apparatus, so that the possibility of environmental influences can be assessed.

It should be mentioned that some Schmidt generators have been tested by generating sequences of over a million trials, and have shown no evidence of either short- or long-term bias. Hence the problem is not a severe one with a well-designed generator which has been thoroughly tested.
 

Sensory Leakage 

A standard rival hypothesis to the hypothesis of ESP is that sensory leakage occurred and that the receiver was knowingly or unknowingly cued by the sender or by an intermediary between the sender and receiver. As early as 1895, psychologists described "unconscious whispering" in the laboratory and were even able to show that senders in telepathy experiments could give auditory cues to their receivers quite unwittingly. Ingenious use of parabolic sound reflectors made this demonstration possible. Many researchers in the early years of experimental psychology and psi research gave early warnings on the dangers of unintentional cueing. The subtle kinds of cues described by these early workers are just the kind psychologists have come to realize mediate the experimenter expectancy effects found in laboratory settings.

In designing experiments to prevent sensory leakage, experimenters cannot assume that there are no tricksters present among the subjects. Precautions must be taken that would prevent the most skilled of tricksters or magicians from succeeding in obtaining normal sensory information about the targets.

Experimental reports must clearly describe the relative location of subjects and targets. 

If visual targets are within proximity of the subject, they must be in an opaque container, unopened until the subject's responses have been recorded. The opaqueness of the container should be objectively assessed and the container should be kept well out of the subject's reach. Subjects must not be present at any time while the target materials are being prepared.

If the subject is allowed access to the container, it must be made "fraudproof." This is no easy task. Chemist George Price suggested, in his 1955 Science article advocating the presumption of fraud, that a metal container be used with a cover welded on and photomicrographs taken of the welds. Even with such precautions, a clever subject using advanced technology could, if left unobserved, devise ways of penetrating such a container. For day-to-day research, such precautions are not practical. Thus it is simply better not to allow subjects to have unobserved access to containers holding target materials.

Particular care must be taken, when computerized REG devices are used, that subjects do not remain unobserved with the computer. James Davis, of the Institute for Parapsychology in Durham, North Carolina, has observed that a subject who has access to the computer, knows the data format, and has sufficient programming knowledge might subvert experimental precautions.


[Photograph: James Davis, a parapsychologist who was instrumental in detecting experimenter fraud in 1974 at the Foundation for Research on the Nature of Man in Durham, North Carolina]

It is also unclear whether computers themselves might provide sensory cues regarding target information -- for example, subtle audio or electromagnetic signals that could aid a sensitive individual in distinguishing different targets.

In telepathy experiments one must exclude the possibility that the subject learns about the targets indirectly, through cues from the agent. Thus any individual with information regarding the targets must not be within auditory or visual range of the subject. This often means that putting subject and agent in adjacent rooms is insufficient; sound isolation must also be ensured. [Electromagnetic shielding between rooms can also prevent leakage through the use of radio transmission between agent and percipient -- an issue which may more appropriately come under the heading of cheating.] In some buildings, sounds travel quite readily between distant rooms. Furthermore, any communication between subject and agent with regard to the timing of trials (e.g., an intercom or knocking on the walls) may inadvertently contain sensory information regarding ESP targets. If "ready" signals are used, they should operate only from the percipient's room to the agent's room, so that cueing is eliminated.

Precognitive testing, where targets are not selected until some time after the subject's calls are recorded, allows the tightest control over conditions that might otherwise contribute to sensory leakage.

Experimenter cueing can be eliminated by keeping experimenters blind as to target order. This means that experimenters cannot administer their own ESP tests; or, if they do, they must be shielded from all sensory contact with subjects. In order to ensure experimenter blindness with regard to targets, experimenters should have no sensory contact with individuals who are aware of the target order. Otherwise it is possible to hypothesize a chain of nonverbal communication.

In free response experiments with independent judges, it is also essential that the judges be shielded from all sensory cues just as if they were subjects. In addition, judges must be provided with no sensory cues whatsoever regarding the order of the various calls made by subjects. Such cues must be edited out of any transcripts provided to judges. If such cues are provided, judges may succeed in time-ordering both the target and the response sets, thus contaminating the judging process with additional logically derived information.

Providing feedback in a "closed deck" target pool situation reduces, in effect, the degrees of freedom of the final target (even if the total pool is unknown to the subject) and thus may be a source of experimental contamination. This may occur if subjects avoid producing imagery related to any targets for which imagery has already been provided.

Handling cues are also best avoided. Both judges and percipients may detect creases, marks, smudges, temperature differences or other artifacts that result if actual targets have been handled and then mixed in with targets from a pool for judging. Handling cues may also result when targets placed in envelopes are opened and then resealed or placed in new envelopes, as has sometimes been done.

In studies where ESP scoring is correlated with other factors, such as personality variables, it is important that subjects be given no feedback on their ESP scores prior to other testing. Otherwise it is possible that the personality score is influenced by feedback from the ESP test. 
 

Subject Cheating

Fraud is encouraged in studies where incentives are offered (e.g., employment, publicity) for high ESP or PK scores. Some researchers deal with this issue by never working with well-known "psychics" and by insisting that all subjects participate anonymously in their studies. These precautions, however, are not sufficient to preclude fraud.

During an eight-month period in 1983 and 1984, researchers at the Department of Psychology, University of Edinburgh, in Scotland, conducted twenty experimental sessions with a teenager who claimed to possess macro-PK abilities. Although many phenomena were apparently observed, none of these manifested under well-controlled conditions. Eventually, researchers suspected fraud and set up hidden cameras, which succeeded in revealing blatantly fraudulent activities. When confronted with this evidence of his deception, the subject denied that his activities had been fraudulent. However, several months later he confessed that he was a practicing magician "who had wished to see if it were possible for a magician to pose successfully as a psychic in a laboratory."

While, ultimately, he did not succeed in fooling the researchers, he did manage to take up approximately 60 hours of their time. Psi researchers are somewhat vulnerable to this type of invasion, because -- in attempting to establish conditions conducive to the alleged phenomena they wish to investigate -- they attempt to establish good rapport and thus avoid treating experimental subjects with suspicion.

Human nature is often unpredictable; this is sometimes the case with regard to fraudulent and criminal activity, especially when it occurs in the absence of apparent motive. Some subjects may simply get a kick out of fooling experimenters (especially when researchers have claimed fraudproof conditions). In one clever incident, during a research study conducted at Harvard University, the agent and percipient were 100 feet apart, with four closed doors separating their rooms. The student subjects used a confederate hidden in one of the rooms to aid in passing a signal. The system was so successful that the percipient was able to guess the correct color of all 52 cards in a deck of playing cards. The researcher, George Estabrooks, was initially fooled, even though he knew that cheating would be attempted.

A basic precaution, when subjects are provided with feedback regarding targets, is that subjects' calls be clearly recorded and submitted to the experimenter before feedback is provided.

When REGs or other testing equipment is used, precautions must be carefully instituted to prevent subjects from resetting counters or, in any other uncontrolled manner, manipulating the parameters of the device.

In developing precautions against radio transmission of target information to an ESP percipient, care must be taken to objectively assess the electromagnetic shielding of laboratory rooms at the time of the experiment. Shielding characteristics often change over time, as modifications are made to experimental rooms (e.g., for insertion of cables or ventilation ducts).

The possibility of such cheating is obviously lessened when controls are such that no friends or associates of the percipient are allowed access to ESP target information (e.g., by participating as an agent in an ESP test).

Another form of control against the possibility of cheating is to have the subject and the agent observed (or videotaped) continuously during the experimental period. Some experimenters have attempted to control for fraud by locking subjects into experimental rooms during testing. Naturally, this control would be of little use if the subject is skilled at picking locks.

Readers interested in developing a more detailed knowledge of the many ways in which fraud may be committed to create the illusion of psychic functioning will want to study the literature of professional magic. 
 

Recording Error

The possibility of unconscious errors in the recording of experimental data has been recognized in psi research since the 1930s. In a meta-analysis of 139,000 recorded observations in 21 psychology studies, Harvard researcher Robert Rosenthal found that about 1% of all observations were in error. Of the errors committed, twice as many favored the hypothesis as opposed it. Thus such errors are a real but small factor in studies involving manual recording of data -- a net bias on the order of 0.33% of all observations.

When recording ESP and PK target and response sequences, automated equipment that cannot be tampered with provides the best insurance against recording errors. If targets and responses are recorded manually, this must be done by individuals who are blind (or unaware) as to the correct targets, in order to preclude unconscious errors.

When automated equipment is used, good experimental design requires that it be subjected to periodic tests during the course of an experiment, to ensure that it functions as intended.

When computers are used to record psi targets and responses, it is important that the paper printout be kept in its original continuous condition, in order to prevent misplacement of some records. As a precaution against the printout becoming severed, the paper sheets should be prenumbered. A further precaution is to keep duplicate records in magnetic storage.
 

Classification and Scoring Errors

When different experimental conditions (i.e., high-aim and low-aim PK tests) are built into the design, automated recording equipment should be programmed to carefully distinguish between conditions.

When experimental subjects are divided into various classifications, this must not be done on the basis of post hoc knowledge of the subjects' psi scores. The classification must be blind and would ordinarily be completed before psi testing is conducted. Ideally, the basis for classification into high and low scorers should be made public in full detail to the research community before any testing begins. Fixed response alternatives or a preplanned scoring system are essential for such classifications.

In all aspects of data handling, such as computing various statistical parameters from raw data, automated analysis is preferable. If analysis is done manually, it is advantageous that the scorer or statistician be blind as to the various experimental conditions and hypotheses -- in order to avoid inadvertent bias.
 

Statistical/Methodological Violations

Some critics claim that experiments should compare psi scores with scores obtained in some sort of control conditions, rather than simply with the expected statistical means. The reason for this is not that the predictions of probability theory are being called into question -- but rather because a control group may tend to benefit from any otherwise undetected, non-psi sources of information related to methodological flaws, recording errors or sensory leakage. The problem with this approach, however, is that it is not possible to devise a control condition that would be identical to the experimental condition in every way except that it would eliminate possible psi communication. 

In situations where subjects were instructed not to use psi, as in the Princeton RNG studies, it was possible (and seems probable) that subjects used PK to dampen the variability of the RNG so that scores were unusually close to the theoretical mean. This dilemma would seem to place a limit on the degree of precision which is possible to attain in psi research. 

Trial-by-trial feedback, given in studies using a "closed" ESP target sequence (e.g., a deck of cards), violates the condition of independence assumed by most standard statistical tests. A clever card counter, like those who sometimes beat the odds at blackjack in Las Vegas, can increase the certainty of his or her guesses on the final cards in a deck. Independence, of course, is not violated when using an "open" sequence in which targets are chosen from a much larger universe. In such situations, card-counting strategies would be useless.
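
A small simulation makes the point. In the sketch below (an illustration, not drawn from any published study), a guesser with no ESP receives trial-by-trial feedback on a closed deck of 25 cards (5 each of 5 symbols) and simply calls whichever symbol remains most plentiful among the unseen cards; the average score rises well above the chance expectation of 5 hits.

```python
import random
from collections import Counter

def counting_run(rng):
    deck = list("ABCDE") * 5
    rng.shuffle(deck)
    remaining = Counter(deck)
    hits = 0
    for card in deck:
        guess = max(remaining, key=remaining.get)   # call the modal symbol
        hits += (guess == card)
        remaining[card] -= 1                        # feedback reveals the card
        if remaining[card] == 0:
            del remaining[card]
    return hits

rng = random.Random(0)
scores = [counting_run(rng) for _ in range(10_000)]
print(sum(scores) / len(scores))   # well above the chance expectation of 5
```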

Multiple responses to a single target cannot be evaluated using statistical tests that assume independence of responses. In psi research this error is known as the stacking effect. It often occurs with informal classroom or media tests of ESP. If there is a sufficient number of responses, the data can be analyzed by the Greville method, which accounts for the stacking effect. While the stacking effect is a theoretical possibility, empirical tests have shown that, for multiple-response data in typical forced-choice ESP tests, it makes little practical difference whether the results are analyzed by the usual binomial formula or by the more appropriate Greville method. The situation is quite different in remote-viewing and other free-response experiments, where the number of targets is generally small. It then becomes critical to control for response bias.
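
The stacking effect itself is easy to demonstrate in simulation. The sketch below illustrates the problem rather than the Greville correction: many simulated subjects guess the same target sequence while sharing a response bias, and the pooled hit totals, while at chance on average, are far more variable than the binomial model assumes. All parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n_subjects, n_trials, n_repeats = 200, 25, 2000
totals = []
for _ in range(n_repeats):
    targets = rng.integers(0, 5, n_trials)            # one shared sequence
    biased = rng.random((n_subjects, n_trials)) < 0.4 # shared bias toward symbol 0
    other = rng.integers(1, 5, (n_subjects, n_trials))
    calls = np.where(biased, 0, other)
    totals.append(int((calls == targets).sum()))

totals = np.array(totals)
n = n_subjects * n_trials
print("observed variance of totals:", totals.var())
print("variance assuming independence:", n * 0.2 * 0.8)
```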

In remote-viewing experiments, violations of independence have arisen even when a single judge is asked to rank a small number of targets against an equal number of responses. The judge may, under these conditions, be influenced in assigning a rank or rating to a given target by the memory of how he or she assigned ranks or ratings to other targets.

The assumptions used in most statistical tests are violated if optional stopping is used. This could occur by limiting the number of experimental trials at the experimenter's option (particularly after receiving feedback as to success rates) or by optionally limiting the number of experimental subjects. The Princeton method of converting free-response information to binary data avoids this criticism.
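
The consequences of optional stopping are easy to see in a simulation. In the sketch below (block size and stopping limits are arbitrary), a pure-chance guesser is tested in blocks and the "experiment" stops as soon as the cumulative result looks significant; the false-positive rate climbs far above the nominal 5 percent.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(4)
n_experiments, block, max_blocks, p_chance = 1000, 25, 40, 0.2
false_positives = 0
for _ in range(n_experiments):
    hits = trials = 0
    for _ in range(max_blocks):
        hits += rng.binomial(block, p_chance)      # pure-chance guessing
        trials += block
        # peek at the running one-sided p-value and stop on "significance"
        if binomtest(hits, trials, p_chance, alternative="greater").pvalue < 0.05:
            false_positives += 1
            break
print(false_positives / n_experiments)             # far above 0.05
```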

Statistical tests are not accurate if researchers are free to censor data which does not support their hypotheses. This sometimes occurs by the use of post hoc decisions as to whether a study will be reported as an informal, preliminary demonstration or as part of an experiment. Ideally all formal psi experiments should be registered, in advance, specifying the total number of trials, runs, subjects, etc. Then all data from those experiments should be reported, regardless of the outcome.

All planned statistical tests must be announced in advance to enable the research community to clearly distinguish between the main analysis and post hoc analysis. When multiple analyses are used statistical tests must take into account the increased possibility of a Type I error -- i.e., the greater likelihood that one of the many tests will attain a probability level of less than .05.
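
For instance, if twenty independent analyses are each tested at the .05 level, the chance that at least one reaches "significance" by luck alone is roughly 64 percent; a Bonferroni-type adjustment of the per-test level is one simple, if conservative, remedy. A minimal calculation:

```python
# Familywise error rate for m independent tests at level alpha,
# and the corresponding Bonferroni-adjusted per-test level.
alpha, m = 0.05, 20
print("familywise error rate:", 1 - (1 - alpha) ** m)    # about 0.64
print("Bonferroni-adjusted per-test level:", alpha / m)  # 0.0025
```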

University of California statistician Jessica Utts has pointed out that psi researchers often place too great an emphasis on the probability level of experiments. 


[Photograph: Jessica Utts]

This is a mistake, as the validity, magnitude, and reliability of a possible psi effect have very little to do with probability levels. When a great many trials are involved (as in the Princeton RNG studies), a very weak effect can yield astronomically small probability values. This misunderstanding also results in much confusion with regard to replication in psi research. Utts recommends the use of confidence intervals and power analyses in order to determine the sample sizes necessary to attain significant results in particular experimental designs.
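
As a rough illustration of the sample-size question (a sketch using a simple normal approximation, not a calculation taken from Utts' paper), detecting a 51 percent hit rate against a 50 percent chance baseline with 80 percent power at the one-sided .05 level requires on the order of fifteen thousand trials:

```python
from scipy.stats import norm

# Approximate sample size for detecting p1 against a chance rate p0
# (normal approximation to the binomial; the effect size is illustrative).
p0, p1, alpha, power = 0.50, 0.51, 0.05, 0.80
z_alpha, z_beta = norm.ppf(1 - alpha), norm.ppf(power)
n = ((z_alpha * (p0 * (1 - p0)) ** 0.5 + z_beta * (p1 * (1 - p1)) ** 0.5)
     / (p1 - p0)) ** 2
print(round(n))   # roughly 15,000 trials for a one-percentage-point edge
```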
 

Reporting Failures

Experiments must be reported in sufficient detail so that other researchers may attempt independent replication. All details of the protocol should be included in such reports. When journal space does not allow publication of full details, they must be made available in unpublished documentation, with provisions for its release to interested researchers.
 

Experimenter Fraud

Fudging or "tidying" of data, or outright fraud by experimenters, is a factor that, given the history and controversies of psi research, must be accounted for in experimental planning and reporting. Within the past two decades, two major psi researchers were caught by their colleagues in acts of fraud.

While it is impossible to absolutely preclude experimenter fraud in any field of science -- it is possible to design studies in such a fashion that if fraud were to occur it could not logically be attributed to a single deviant scientist, but would rather have to be explained as the result of a conspiracy among members of a research team. The basic methodology for accomplishing this level of safeguard is to ensure that researchers work together in teams, always observing and double-checking each other's work.

Other safeguards against experimenter fraud include sharing data with outside research teams and establishing various double- and triple-blind procedures, so that individual researchers do not possess the data necessary to commit fraud.
 


References

. Much of the material included under this heading is based upon a paper by psychologist Charles Akers titled "Methodological Criticisms of Parapsychology," in Stanley Krippner (Ed.), Advances in Parapsychological Research, Vol. 4. Jefferson, NC: McFarland, 1984, pp. 112-164.

. W. K. Feller, "Statistical Aspects of ESP," Journal of Parapsychology, 4, 1940, 271-298.

. L. Zusne & W. H. Jones, Anomalistic Psychology: A Study of Extraordinary Phenomena of Behavior and Experience. Hillsdale, NJ: Erlbaum, 1982.

. This constructive suggestion was made by C. E. M. Hansel in a paper commented upon by fellow skeptic Ray Hyman. See Hansel, "A Critical Analysis of H. Schmidt's Psychokinesis Experiments," Skeptical Inquirer, 5, 1981, 26-33.

. R. Wilson, "Deviations from Random in ESP Experiments," International Journal of Parapsychology, 8, 1966, 387-395.

. James W. Davis & Charles Akers, "Randomization and Tests for Randomness," Journal of Parapsychology, 42, 1974, 393-407. 

. F. C. C. Hansen & A. Lehmann, "Ueber Unwillkurliches Flustern," Philosophische Studien, 11, 1895, 471-530.

. J. L. Kennedy, "Experiments on 'Unconscious Whispering,'" Psychological Bulletin, 35, 1938, 526 (Abstract). 

. J. L. Kennedy, "A Methodological Review of Extra-Sensory Perception," Psychological Bulletin, 36, 1939, 59-103. 

. A. Moll, Hypnotism (4th ed.). New York: Scribner, 1898.

. G. M. Stratton, "The Control of Another Person by Obscure Signs," Psychological Review, 28, 1921, 301-314.

. L. Warner & M. Raible, "Telepathy in the Psychophysical Laboratory," Journal of Parapsychology, 1, 1937, 44-51.

. Robert Rosenthal, "Clever Hans: A Case Study of Scientific Method," in O. Pfungst, Clever Hans. New York: Holt, Rinehart & Winston, 1965.

. Robert Rosenthal, Experimenter Effects in Behavioral Research. New York: Appleton-Century-Crofts, 1966.

. George R. Price, "Science and the Supernatural," Science, 122, 1955, 359-367. 

. James W. Davis, "A Developmental Program for the Computer Based Extension of Parapsychological Research and Methodology," Journal of Parapsychology, 38, 1974, 69-84.

. Deborah Delanoy, "Work With a Fraudulent PK Metal-Bending Subject," in D. H. Weiner & R. L. Morris (eds.), Research in Parapsychology 1987. Metuchen, NJ: Scarecrow Press, 1988, pp. 102-105.

. George H. Estabrooks, Spiritism. New York: Dutton, 1947. 

. For an authoritative overview and annotated bibliography, see Earle J. Coleman, Magic: A Reference Guide. New York: Greenwood Press, 1987.

. J. L. Kennedy & H. F. Uphoff, "Experiments on the Nature of Extra-Sensory Perception: III. The Recording Error Criticism of Extra-Chance Scores," Journal of Parapsychology, 3, 1939, 226-245.

. F. D. Sheffield, R. S. Kaufman & J. B. Rhine, "A PK Experiment at Yale Starts a Controversy," Journal of the American Society for Psychical Research, 46, 1952, 111-117.

. Robert Rosenthal, "How Often Are Our Numbers Wrong?" American Psychologist, 33, 1978, 1005-1008.

. James W. Davis, "The Stacking Effect: Its Practical Significance in Parapsychology," Journal of Parapsychology, 42, 1978, 67. 

. Betty M. Humphrey, "Further Work With Dr. Stuart on Interest Ratings and ESP," Journal of Parapsychology, 13, 1949, 151-165.

. Brenda J. Dunne, et al., Precognitive Remote Perception.

. Jessica Utts, "Successful Replication Versus Statistical Significance," in L. A. Henkel & R. E. Berger (eds.), Research in Parapsychology 1988. Metuchen, NJ: Scarecrow Press, 1989, pp. 44-48.
