Homework 1

Problem 1: Provide summary statistics of the demographic variables (Age, Height, Gender) using numerical and/or graphical presentations as appropriate. Indicate the reason(s) for your choice of summary statistics/graphics for each variable.

Problem 2: Show how height varies among the three treatment groups using side-by-side boxplots.

Problem 3: Use a paired t-test to compare PLUC.pre and PLUC.post. Why is it OK to use a paired t-test for this comparison?

Problem 4: Test the Null Hypothesis that boys and girls are equally above average on the LWAS measure. Draw a graph to show the results of this test.

============

R Cheat Sheet

The following is an annotated recipe for doing the above homework with R. These instructions assume that you have saved the default dataset (default.csv) on your computer desktop:

1.) Start the R graphical user interface (GUI). You will have a window for entering commands. In the window, the '>' symbol is the command prompt where you type your commands.

2.) Use the R change directory command to set your default directory to your desktop. In R for Windows the change dir option is in the File menu. On Mac, it's in the Misc menu. (This is the step I keep forgetting!!)

3.) Type the command:

   > df = read.csv(file="default.csv")

and press return. This command calls a function (named read.csv) to open a file in the current working directory (your Desktop after step 2) and read its contents. The contents are interpreted and stored in an object called a "data frame" (in this case called 'df'). The '=' tells R to assign the output of the read.csv function to 'df', and because read.csv creates a data frame, df will be a data frame.

A data frame is roughly analogous to a spreadsheet. It has a set of elements that correspond to the columns of the spreadsheet, and each element has a name, like a column header. Each element contains all the numbers (or other things like words) that would be stored in the rows of the spreadsheet within that column.
In R, each of these column-like elements is called a vector. A vector is simply an ordered series of numbers or characters. To refer to an individual element of a data frame, use the syntax df$name (assuming 'df' is the name of the data frame and 'name' is the name of the element, or column).

4.) A good way to see what the read.csv function did is to ask for a summary of the contents of 'df'. This is done with the command summary(df). That is, type 'summary(df)' (without the quotes) at the '>' prompt and press return. R will give you a summary of each element in 'df'. You'll see something like the following (and more):

         sid          age          sex         hgt             grp
    s01    : 1   Min.   : 48.00   f:30   Min.   :38.99   Min.   :1
    s02    : 1   1st Qu.: 64.75   m:30   1st Qu.:43.57   1st Qu.:1
    s03    : 1   Median : 84.00          Median :46.69   Median :2
    s04    : 1   Mean   : 90.42          Mean   :47.99   Mean   :2
    s05    : 1   3rd Qu.:115.25          3rd Qu.:53.18   3rd Qu.:3
    s06    : 1   Max.   :143.00          Max.   :58.09   Max.   :3
    (Other):54

Note that for non-numeric elements (like sid and sex above) you get a frequency count of up to six items. For numeric elements (like age, hgt, and grp) you get a numerical summary: the minimum, 1st quartile, median, mean, 3rd quartile, and maximum.

5.) Note that in step 4 some of the numeric elements (grp, shades, ped) are treated like interval or ratio data even though they are really intended to be category designations. For instance, children belong to one of three groups: group 1, group 2, or group 3. Although we used numbers to indicate the groups, the numbers are really nothing more than names for the different groups. R does not know that this is how we intended to use the numbers 1, 2, and 3 for grp. Fortunately, we can tell R to treat the grp element as a categorical vector (analogous to the way 'sex' was treated) and not a numerical vector.
We do this by telling R that the element grp in data frame df is a 'factor', using a command of the form:

   > df$grp = factor(df$grp)

This command passes the grp element of df (i.e., df$grp) to the factor function, which converts the numerical data to categorical data. The 'df$grp = ' part of the command causes the result of the factor function to be reassigned to the grp element of df. If we did not reassign it that way, the factor function would just print its result in the window and leave the element df$grp unchanged. There are two other elements (df$Shades and df$Ped) that will need to be redefined in the same way so that R functions will know how to handle them correctly.

As an alternative to using typed commands to do this, we can also use Rcmdr. In Rcmdr, first make sure the data frame is loaded by clicking the "" button and selecting "df" from the list. Then, select "Data->Manage variables in active dataset->Convert numeric variables to factors". In the list on the left labeled "Variables (pick one or more)", select a variable (say, grp) and click OK. You will get a warning dialog that says the variable already exists and asks if it's OK to overwrite it. Click the Yes button. You will then get another dialog listing each numerical value for 'grp' and giving you a place to enter the name for that factor level. Enter a name for each group (e.g., Placebo, SupplA, SupplB). Note that names should not contain spaces or special characters. This tells R we have a factor called "grp" with levels "Placebo", "SupplA", and "SupplB". You can repeat this process to convert Shades and Ped to factors and assign names to their levels.

Note that when we used the command line approach, we did not assign names to each level of 'grp', but we could have. The command to do that will be displayed in the Rcmdr Script Window. For the example above, the command would be similar to:

   > df$grp = factor(df$grp, labels=c("Placebo", "SupplA", "SupplB"))

6.) 
The summary(df) command already answers homework question 1 in large part. You might also want to describe the distributions of age and hgt using histograms. In Rcmdr, select Graphs->Histogram..., then choose the variable you want to plot from the list. Note that you can set the number of histogram bins explicitly, or let R choose (leave the number of bins set to ""). You can also choose to view Frequency counts, Percentages, or Densities. We recommend using Frequency counts.

Here's how you would do this using commands in R for age (hgt works the same way):

   > hist(df$age)

Pretty easy compared to Excel! This is the simplest form of the hist command. For something a little nicer, try:

   > hist(df$age, main="Histogram of Age", xlab="Age in months")

If you want to learn about all the additional 'niceties' that you can add to the hist command, try the R command ?hist (i.e., type '?hist' at a command prompt and press return).

If you would like to save a graph, in either Windows or Mac, click on the title bar of the graph window (to be sure it's the active window) and select Edit->Copy to copy the graph to the clipboard. In Windows, the graph will be a Windows metafile that you can paste directly into a Word document and resize without any loss in quality. In Mac OS X, you need to start the Preview program and select "File->New from clipboard" to get a PDF copy of the R image. Preview will then let you save the file as a PDF, and Word can import the PDF, but resizing does not work as well.

7.) For homework question 2, in Rcmdr, select Graphs->Boxplot..., then click on 'age' in the list of variables and click the button labeled 'Plot by groups'. In the dialog box that appears, select 'grp' and click OK, then click OK in the Boxplot dialog as well. The data will be plotted.
You can do the same in the regular R command window (not using Rcmdr) with the following command (note that a similar command is shown in the Rcmdr Script Window if you followed the steps above):

   > boxplot(age ~ grp, data=df)

In R, a statement like "age ~ grp" translates loosely to "age separated by, or predicted by, grp". The command tells the boxplot function to look for elements called age and grp in a data frame called df. The example here plots age; homework question 2 asks about height, so use hgt in place of age for your answer. As with the hist function, there are many additional tweaks you can apply to make a more elaborate graph with labels and so forth. Read about them using the ?boxplot command.

8.) For homework question 3, in Rcmdr, select Statistics->Means->Paired t-test. In the dialog box that pops up, select PLUC.pre as the "First variable" and PLUC.post as the "Second variable". Leave the "Two-sided" button selected and the Confidence level at .95, then click OK. Note the command that Rcmdr generated in the "Script Window" after you clicked OK. You can try this command and variants of it in the regular R command window if you want to run a t-test by hand.

9.) For homework question 4, see if you can figure this one out yourself!
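
For readers who prefer typed commands throughout, here is a minimal sketch that strings together the command-line versions of steps 2, 3, 4, 5, and 8. It rests on several assumptions: the real default.csv is not reproduced here, so the script first writes a small made-up file with the same column names to a temporary folder; setwd() is used as the typed equivalent of the change-directory menu item; and the t.test call is only what the Rcmdr-generated command should roughly resemble, not a copy of it. One more caveat: in R 4.0 and later, read.csv leaves text columns as plain character strings unless stringsAsFactors=TRUE is given, so that argument is included to get the frequency counts for sid and sex shown in step 4.

```r
# Hypothetical stand-in for default.csv. Every value below is invented
# for illustration; only the column names follow the handout.
set.seed(1)
n = 12
fake = data.frame(
  sid      = sprintf("s%02d", 1:n),
  age      = round(runif(n, 48, 143)),
  sex      = rep(c("f", "m"), length.out = n),
  hgt      = round(runif(n, 39, 58), 2),
  grp      = rep(1:3, length.out = n),
  PLUC.pre = round(rnorm(n, mean = 50, sd = 10), 1)
)
fake$PLUC.post = round(fake$PLUC.pre + rnorm(n, mean = 2, sd = 3), 1)
write.csv(fake, file.path(tempdir(), "default.csv"), row.names = FALSE)

# Step 2 (typed version): change the working directory with setwd().
# With the real file you would point this at your Desktop instead.
old_wd = setwd(tempdir())

# Step 3: read the file into a data frame. stringsAsFactors=TRUE makes
# text columns (sid, sex) factors, as in the summary shown in step 4.
df = read.csv(file = "default.csv", stringsAsFactors = TRUE)
setwd(old_wd)  # go back to where we started

# Step 4: summarize every element of the data frame.
print(summary(df))

# Step 5: treat grp as categorical and name its levels.
df$grp = factor(df$grp, labels = c("Placebo", "SupplA", "SupplB"))
print(summary(df$grp))

# Step 8: paired t-test, two-sided, 95% confidence level.
res = t.test(df$PLUC.pre, df$PLUC.post, paired = TRUE,
             alternative = "two.sided", conf.level = 0.95)
print(res)
```

With the real data, drop the block that builds 'fake' and start at the setwd() line. The hist and boxplot commands from steps 6 and 7 can be added after step 5 unchanged.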