Results: Descriptive and Inferential Statistics
30 Conducting Your Analyses
Even when you understand the statistics involved, analyzing data can be a complicated process. It is likely that for each of several participants, there are data for several different variables: demographics such as sex and age, one or more independent variables, one or more dependent variables, and perhaps a manipulation check. Furthermore, the “raw” (unanalyzed) data might take several different forms—completed paper-and-pencil questionnaires, computer files filled with numbers or text, videos, or written notes—and these may have to be organized, coded, or combined in some way. There might even be missing, incorrect, or just “suspicious” responses that must be dealt with. In this section, we consider some practical advice to make this process as organized and efficient as possible.
Prepare Your Data for Analysis
Whether your raw data are on paper or in a computer file (or both), there are a few things you should do before you begin analyzing them. First, be sure they do not include any information that might identify individual participants and be sure that you have a secure location where you can store the data and a separate secure location where you can store any consent forms. Unless the data are highly sensitive, a locked room or password-protected computer is usually good enough. It is also a good idea to make photocopies or backup files of your data and store them in yet another secure location—at least until the project is complete. Professional researchers usually keep a copy of their raw data and consent forms for several years in case questions about the procedure, the data, or participant consent arise after the project is completed.
Next, you should check your raw data to make sure that they are complete and appear to have been accurately recorded (whether it was participants, yourself, or a computer program that did the recording). At this point, you might find that there are illegible or missing responses, or obvious misunderstandings (e.g., a response of “12” on a 1-to-10 rating scale). You will have to decide whether such problems are severe enough to make a participant’s data unusable. If information about the main independent or dependent variable is missing, or if several responses are missing or suspicious, you may have to exclude that participant’s data from the analyses. If you do decide to exclude any data, do not throw them away or delete them because you or another researcher might want to see them later. Instead, set them aside and keep notes about why you decided to exclude them because you will need to report this information.
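A check like the one just described can be partly automated once the responses are in a computer file. The following is a minimal sketch, not a prescribed procedure: the ratings, the `find_problems` helper, and the use of `None` for a missing response are all assumptions made up for illustration.

```python
# Hypothetical raw responses: participant ID -> rating on a 1-to-10 scale.
# None stands for a missing or illegible response (an assumed convention).
raw_ratings = {1: 7, 2: 12, 3: None, 4: 3}

SCALE_MIN, SCALE_MAX = 1, 10


def find_problems(ratings, low, high):
    """Return {participant_id: reason} for missing or out-of-range responses."""
    problems = {}
    for pid, value in ratings.items():
        if value is None:
            problems[pid] = "missing"
        elif not low <= value <= high:
            problems[pid] = f"out of range ({value})"
    return problems


problems = find_problems(raw_ratings, SCALE_MIN, SCALE_MAX)
for pid, reason in problems.items():
    # Flag only; the raw data are left untouched so they can be reviewed later.
    print(f"Participant {pid}: {reason}")
```

Note that the sketch only reports problems. Deciding whether a participant's data are unusable remains a judgment call, and the flagged responses stay in the file.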
Now you are ready to enter your data in a spreadsheet program or, if they are already in a computer file, to format them for analysis. You will most likely use a statistical analysis program like SPSS, Stata, or R to create your data file. (Data files created in one program can usually be converted to work with other programs.) The most common format is for each row to represent a participant and for each column to represent a variable (with the variable name at the top of each column). A sample data file is shown in Table 30.1. The first column contains participant identification numbers. This is followed by columns containing demographic information (sex and age), independent variables (mood, four self-esteem items, and the total of the four self-esteem items), and finally dependent variables (intentions and attitudes). Categorical variables can usually be entered as category labels (e.g., “M” and “F” for male and female) or as numbers (e.g., “0” for negative mood and “1” for positive mood). Although category labels are often clearer, some analyses might require numbers. SPSS allows you to enter numbers but also attach a category label to each number.
Table 30.1 Sample data file
ID | SEX | AGE | MOOD | SE1 | SE2 | SE3 | SE4 | TOTAL | INT | ATT |
1 | M | 20 | 1 | 2 | 3 | 2 | 3 | 10 | 6 | 5 |
2 | F | 22 | 1 | 1 | 0 | 2 | 1 | 4 | 4 | 4 |
3 | F | 19 | 0 | 2 | 2 | 2 | 2 | 8 | 2 | 3 |
4 | F | 24 | 0 | 3 | 3 | 2 | 3 | 11 | 5 | 6 |
If you have multiple-response measures—such as the self-esteem measure in Table 30.1—you could combine the items by hand and then enter the total score in your spreadsheet. However, it is much better to enter each response as a separate variable in the spreadsheet and use the software to combine them (e.g., using the “Compute” function in SPSS). Not only is this approach more accurate, but it allows you to detect and correct errors, to assess internal consistency, and to analyze individual responses if you decide to do so later.
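To illustrate letting software combine the items, here is a minimal sketch that recomputes the self-esteem total from the four item columns of Table 30.1 and compares it with the TOTAL column. The list-of-dictionaries layout and the `total_score` helper are assumptions chosen for illustration; a statistics package would do the same thing with its own compute function.

```python
# Item and total columns from Table 30.1, one dictionary per participant (row).
rows = [
    {"ID": 1, "SE1": 2, "SE2": 3, "SE3": 2, "SE4": 3, "TOTAL": 10},
    {"ID": 2, "SE1": 1, "SE2": 0, "SE3": 2, "SE4": 1, "TOTAL": 4},
    {"ID": 3, "SE1": 2, "SE2": 2, "SE3": 2, "SE4": 2, "TOTAL": 8},
    {"ID": 4, "SE1": 3, "SE2": 3, "SE3": 2, "SE4": 3, "TOTAL": 11},
]

ITEMS = ["SE1", "SE2", "SE3", "SE4"]


def total_score(row, items):
    """Sum the item responses for one participant."""
    return sum(row[item] for item in items)


# IDs of participants whose entered total disagrees with the computed total.
mismatches = [row["ID"] for row in rows if total_score(row, ITEMS) != row["TOTAL"]]
print("Rows with inconsistent totals:", mismatches)
```

Because the individual items are kept, a total that was added up incorrectly by hand would show up in `mismatches` and could be traced back to the item responses.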
Preliminary Analyses
Before turning to your primary research questions, there are often several preliminary analyses to conduct. For multiple-response measures, you should assess the internal consistency of the measure. Statistical programs like SPSS will allow you to compute statistics such as Cronbach’s α, which may sound familiar if you’ve taken the ‘survey’ track in this course.
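For readers who want to see what the software is computing, the sketch below applies the standard formula for Cronbach's α (the number of items, the sum of the item variances, and the variance of the total scores) to the four self-esteem items in Table 30.1. The `cronbach_alpha` function is a hypothetical helper written for this example, and four participants are far too few for a meaningful estimate; the data are used only because they are already at hand.

```python
from statistics import variance

# Self-esteem item responses from Table 30.1; each list holds one item's
# scores for participants 1 through 4.
item_scores = {
    "SE1": [2, 1, 2, 3],
    "SE2": [3, 0, 2, 3],
    "SE3": [2, 2, 2, 2],
    "SE4": [3, 1, 2, 3],
}


def cronbach_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance)."""
    columns = list(items.values())
    k = len(columns)
    # Total score for each participant (sum across items).
    totals = [sum(scores) for scores in zip(*columns)]
    item_variance_sum = sum(variance(column) for column in columns)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


alpha = cronbach_alpha(item_scores)
print(f"Cronbach's alpha = {alpha:.2f}")
```

For these four rows the result is about .83, but with so few participants the number should be read as a demonstration of the arithmetic rather than as evidence about the measure.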
Next, you should analyze each important variable separately. (This step is not necessary for manipulated independent variables, of course, because you as the researcher determined what the distribution would be.) Make histograms for each one, note their shapes, and compute relevant descriptive statistics. Be sure you understand what these statistics mean in terms of the variables you are interested in. For example, a distribution of satisfaction ratings on a 1-to-10-point scale might be unimodal and negatively skewed with a mean of 8.25 and a standard deviation of 1.14. What this means is that most participants rated themselves fairly high on the satisfaction scale, with a small number rating themselves noticeably lower.
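As a rough illustration of this step, the sketch below prints a text histogram and a few descriptive statistics for a set of satisfaction ratings. The ratings are invented for the example (they are not data from any study), and `sample_skewness` is a hypothetical helper that uses a simple moment-based definition of skewness; statistical packages may apply a slightly different correction.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical satisfaction ratings on a 1-to-10 scale (invented for illustration).
ratings = [9, 8, 9, 10, 8, 9, 7, 9, 8, 5, 9, 8]


def sample_skewness(values):
    """Moment-based skewness: the mean cubed z-score, using the population SD."""
    n = len(values)
    m = mean(values)
    sd_pop = (sum((x - m) ** 2 for x in values) / n) ** 0.5
    return sum(((x - m) / sd_pop) ** 3 for x in values) / n


# Text histogram: one asterisk per participant at each scale point.
counts = Counter(ratings)
for score in range(1, 11):
    print(f"{score:>2} | {'*' * counts[score]}")

print(f"M = {mean(ratings):.2f}, SD = {stdev(ratings):.2f}, "
      f"skewness = {sample_skewness(ratings):.2f}")
```

A negative skewness value corresponds to the pattern described above: most ratings bunched near the top of the scale with a tail of lower ratings.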
Now is the time to identify outliers, examine them more closely, and decide what to do about them. You might discover that what at first appears to be an outlier is the result of a response being entered incorrectly in the data file, in which case you only need to correct the data file and move on. Alternatively, you might suspect that an outlier represents some other kind of error, misunderstanding, or lack of effort by a participant. For example, in a reaction time distribution in which most participants took only a few seconds to respond, a participant who took 3 minutes to respond would be an outlier. It seems likely that this participant did not understand the task (or at least was not paying very close attention). In situations like this, it can be justifiable to exclude the outlying response or participant from the analyses. If you do this, however, you should keep notes on which responses or participants you have excluded and why, and apply those same criteria consistently to every response and every participant. When you present your results, you should indicate how many responses or participants you excluded and the specific criteria that you used. And again, do not literally throw away or delete the data that you choose to exclude. Just set them aside because you or another researcher might want to see them later.
Keep in mind that outliers do not necessarily represent an error, misunderstanding, or lack of effort. They might represent truly extreme responses or participants.
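One way to apply an exclusion criterion consistently is to write it down as code. The sketch below is only an example under stated assumptions: the reaction times are invented, and the rule used (flag values more than 3 scaled median absolute deviations from the median) is one common convention, not a rule given in this chapter. Flagged values are set aside with a reason, not deleted.

```python
from statistics import median

# Hypothetical reaction times in seconds, keyed by participant ID.
reaction_times = {1: 2.1, 2: 3.4, 3: 2.8, 4: 1.9, 5: 180.0, 6: 3.0, 7: 2.5}


def flag_outliers(values, threshold=3.0):
    """Return the entries lying more than `threshold` scaled MADs from the median."""
    center = median(values.values())
    mad = median(abs(v - center) for v in values.values())
    scaled_mad = 1.4826 * mad  # conventional scaling so MAD is comparable to an SD
    if scaled_mad == 0:
        return {}  # no spread, so nothing can be flagged by this rule
    return {pid: v for pid, v in values.items()
            if abs(v - center) / scaled_mad > threshold}


flagged = flag_outliers(reaction_times)

# Keep a record of what was excluded and why; keep the rest for analysis.
exclusion_log = {pid: "more than 3 scaled MADs from the median" for pid in flagged}
retained = {pid: v for pid, v in reaction_times.items() if pid not in flagged}

print("Excluded:", exclusion_log)
print("Retained:", retained)
```

Whatever criterion you adopt, the same function is applied to every participant, and `exclusion_log` gives you the counts and reasons you will need to report. The original `reaction_times` are never modified.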
Planned and Exploratory Analyses
Finally, you are ready to answer your primary research questions. When you designed your study, you might have hypothesized that a particular relationship exists in the data. In this case, you would conduct a planned analysis: a test of a relationship that your hypothesis led you to expect. For example, if you expected a difference between group or condition means, you can compute the relevant group or condition means and standard deviations, make a bar graph to display the results, and compute Cohen’s d. If you expected a correlation between quantitative variables, you can make a line graph or scatterplot (be sure to check for nonlinearity and restriction of range) and compute Pearson’s r.
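The two statistics named above can be computed directly from raw scores. The following is a minimal sketch with invented data; `cohens_d` (using the pooled standard deviation) and `pearson_r` are hypothetical helper names, and in practice you would normally rely on your statistics package.

```python
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Difference between group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2
                  + (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


def pearson_r(x, y):
    """Pearson's r: covariation of x and y divided by the product of their spreads."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / (ss_x * ss_y) ** 0.5


# Invented scores for two conditions and for two quantitative variables.
treatment = [6, 7, 5, 8, 7]
control = [4, 5, 3, 5, 4]
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 60, 57, 70, 74]

print(f"Treatment: M = {mean(treatment):.2f}, SD = {stdev(treatment):.2f}")
print(f"Control:   M = {mean(control):.2f}, SD = {stdev(control):.2f}")
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
print(f"Pearson's r = {pearson_r(hours_studied, exam_score):.2f}")
```

The numbers do not replace the graphs: a scatterplot is still the way to spot nonlinearity or restriction of range that a single value of r would hide.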
Once you have conducted your planned analyses, you can move on to examine the possibility that there are relationships in the data that you did not hypothesize. This would be an exploratory analysis, an analysis that you undertake without an existing hypothesis. These analyses will help you explore your data for other interesting results that might provide the basis for future research (and material for the discussion section of your paper).
However, it is important to differentiate planned from exploratory analyses when writing the results and discussion sections of your report. This is because complex sets of data are likely to include “patterns” that occurred entirely by chance, and every time you do another unplanned analysis on these data, you increase the likelihood that one of these chance patterns will appear to be a real pattern. Mistaking a chance pattern for a real one is referred to as a Type I error (covered in your statistics class). Thus results discovered while doing exploratory analyses (one statistician has called this a “fishing expedition”) should be viewed skeptically and replicated in at least one new study before being presented. But if you do find interesting relationships you did not expect in the data, explain that they might be worthy of additional research. It is also important to note that a growing number of academics regard such results skeptically and recommend that, if at all possible, you preregister your study. This involves writing a detailed plan with your hypotheses, planned analyses, and other relevant choices, posting it somewhere that is accessible to anyone before you collect your data and perform your analyses, and subsequently sticking to that plan.
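A short calculation shows how quickly the risk grows. The sketch below assumes that each analysis uses a significance level of .05 and that the analyses are independent of one another, which is a simplification (analyses on the same data set are usually related). The function name `familywise_error_rate` is chosen for this example.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one Type I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


# How the chance of at least one false positive grows with unplanned tests.
for n_tests in (1, 5, 10, 20):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests:>2} tests: chance of at least one false positive = {rate:.2f}")
```

Under these assumptions, ten unplanned tests already carry roughly a 40% chance of turning up at least one "pattern" that is due to chance alone.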
Understand Your Descriptive Statistics
Before starting with inferential statistics (a set of techniques for deciding whether the results for your sample are likely to apply to the population), make sure you understand your descriptive statistics. Although inferential statistics are important, beginning researchers sometimes forget that their descriptive statistics really tell “what happened” in their study. For example, imagine that a treatment group of 50 participants has a mean score of 34.32 (SD = 10.45), a control group of 50 participants has a mean score of 21.45 (SD = 9.22), and Cohen’s d is an extremely strong 1.31. Although conducting and reporting inferential statistics (like a t test) would certainly be a required part of any formal report on this study, it should be clear from the descriptive statistics alone that the treatment worked. Or imagine that a scatterplot shows an indistinct “cloud” of points and Pearson’s r is a trivial −.02. Again, although conducting and reporting inferential statistics would be a required part of any formal report on this study, it should be clear from the descriptive statistics alone that the variables are essentially unrelated. The point is that you should always be sure that you thoroughly understand your results at a descriptive level first, and then move on to the inferential statistics.
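The Cohen's d in the treatment example can be checked from the reported means and standard deviations alone. The sketch below assumes the pooled-standard-deviation version of the formula; `cohens_d_from_summary` is a hypothetical helper written for this check.

```python
def cohens_d_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d from group means, standard deviations, and sample sizes."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / pooled_var ** 0.5


# Values from the example: treatment group first, then control group.
d = cohens_d_from_summary(34.32, 10.45, 50, 21.45, 9.22, 50)
print(f"Cohen's d = {d:.2f}")
```

The difference of 12.87 points is about 1.3 pooled standard deviations, which matches the value of 1.31 given in the example.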