Home Work 1

Problem 1: Provide summary statistics for the demographic variables (Age, Height, Gender) using numerical and/or graphical presentations as appropriate. Indicate the reason(s) for your choice of summary statistics/graphics for each variable.

Problem 2: Show how height varies among the three treatment groups using side-by-side boxplots.

Problem 3: Name three things we look for in numerical data summarization.

Problem 4: Calculate the mean, median, standard deviation, quartiles and percentiles of the following data set: {13, 7, 15, 14, 9, 10, 12, 14}

Problem 5: Describe the 68-95-99.7 rule for the normal distribution.

Problem 6: Describe the relationship between the variables Age and Height, and between PLUC.pre and PLUC.post, using scatterplots.

============

R Cheat Sheet

The following is an annotated recipe for doing the above homework with R. These instructions assume that you have saved the default dataset (default.csv) on your computer desktop:

1.) Get and install R from the main website.

2.) Start the R graphical user interface (GUI). You will have a window for entering commands. In the window, the '>' symbol is the command prompt where you type your commands.

3.) Use the R change directory command to set your working directory to your desktop. In R for Windows, the change dir option is in the File menu. On Mac, it's in the Misc menu.

4.) Type the command:

    > df = read.csv(file="default.csv")

and press return. This command calls a function (named read.csv) to open a file in the current working directory (your desktop after step 3) and read its contents. The contents are interpreted and stored in an object called a "data frame" (in this case named 'df'). The '=' tells R to assign the output of the read.csv function to 'df', and because read.csv creates a data frame, df will be a data frame. A data frame is roughly analogous to a spreadsheet. It has a set of elements that correspond to the columns of the spreadsheet, and each element has a name, like a column header.
Each element contains all the numbers (or other things, like words) that would be stored in the rows of the spreadsheet within that column. In R, each of these column-like elements is called a vector. A vector is simply an ordered sequence of values of the same kind, such as numbers or character strings. To refer to an individual element of a data frame, use the syntax df$name (where 'df' is the name of the data frame and 'name' is the name of the element, or column).

5.) A good way to see what the read.csv function did is to ask for a summary of the contents of 'df'. This is done with the command:

    > summary(df)

That is, type 'summary(df)' (without the quotes) at the '>' prompt and press return. R will give you a summary of each element in 'df'. You'll see something like the following (and more):

          sid          age           sex        hgt             grp
     s01    : 1   Min.   : 48.00   f:30   Min.   :38.99   Min.   :1
     s02    : 1   1st Qu.: 64.75   m:30   1st Qu.:43.57   1st Qu.:1
     s03    : 1   Median : 84.00          Median :46.69   Median :2
     s04    : 1   Mean   : 90.42          Mean   :47.99   Mean   :2
     s05    : 1   3rd Qu.:115.25          3rd Qu.:53.18   3rd Qu.:3
     s06    : 1   Max.   :143.00          Max.   :58.09   Max.   :3
     (Other):54

Note that for non-numeric elements (like sid and sex above) you get a frequency count of up to six items. For numeric elements (like age, hgt and grp) you get a numerical summary including the mean, median, min, max, and the 1st and 3rd quartiles. (In R 4.0 and later, read.csv reads text columns as plain character strings by default, and summary then reports only their length and class. To get the frequency counts shown above, read the file with read.csv(file="default.csv", stringsAsFactors=TRUE).)

6.) Note that in step 5 some of the numeric elements, like grp, shades, and ped, are treated like interval or ratio data even though they are really intended to be category designations. For instance, children belong to one of three groups: group 1, group 2, or group 3. Although we used numbers to indicate the groups, the numbers are really nothing more than names for the different groups. R does not know that this is how we intended to use the numbers 1, 2, and 3 for grp. Fortunately, we can tell R to treat the grp element as a categorical vector (analogous to the way 'sex' was treated) and not a numerical vector.
We do this by telling R that the element grp in data frame df is a 'factor', using a command of the form:

    > df$grp = factor(df$grp)

This command passes the grp element of df (i.e., df$grp) to the factor function, which converts the numerical data to categorical data. The 'df$grp = ' part of the command causes the result of the factor function to be reassigned to the grp element of df. If we did not reassign it that way, the factor function would just print its result in the window and leave the element df$grp unchanged. There are two other elements (df$Shades and df$Ped) that will need to be redefined in the same way so that R functions will know how to handle them correctly. R names are case-sensitive, so spell each element name exactly as it appears in the summary(df) output.

7.) The summary(df) command already answered homework question 1 in large part. You might additionally want to describe the distributions of age and hgt using histograms. Here's how you would do this in R for age (hgt works the same way):

    > hist(df$age)

Pretty easy compared to Excel! This is the simplest form of the hist command. For something a little nicer, try:

    > hist(df$age, main="Histogram of Age", xlab="Age in months")

If you want to learn about all the additional 'niceties' that you can add to the hist command, try the R command ?hist (i.e., type '?hist' at a command prompt and press return). If you would like to save a graph, in either Windows or Mac OS X, you can click on the title bar of the graph window and select Edit->Copy to copy the graph to the clipboard. In Windows, the graph will be a Windows metafile that you can paste into a Word document and resize without any loss in quality. In Mac OS X, the image will be a bitmap and should not be resized.

8.) For homework question 2, which asks about height, use the following command:

    > boxplot(hgt ~ grp, data=df)

In R, a statement like "hgt ~ grp" translates loosely to "hgt separated by, or predicted by, grp". The command tells the boxplot function to look for elements called hgt and grp in a data frame called df.
As with the hist function, there are many additional tweaks you can apply to make a more elaborate graph with labels and so forth. Read about them using the ?boxplot command.

9.) For question 4, you need to put the list of numbers into a vector so that R can operate on them. The command to do this is:

    > v = c(13, 7, 15, 14, 9, 10, 12, 14)

In this command, we created a new variable, a vector called 'v', and assigned the numbers 13, 7, ... 14 to it. In case you are wondering, c(...) is a function that simply concatenates all the elements of a comma-separated list into an R vector. Now, the summary command (i.e., summary(v)) will get you most of the answer to this homework question. To get the standard deviation, try:

    > sd(v)

Forget about the 'percentiles' part of Q4.

10.) For homework question 6, a scatterplot plots the values of one vector against the values of another vector. Naturally, R has a nice function for doing this:

    > plot(df$PLUC.pre, df$PLUC.post)

The result will be a very scattered plot indeed! This tells us that there is very little in the way of a consistent relationship between PLUC at the onset of the study and PLUC at the end of the study. In other words, the amount of PLUC a subject had at the beginning of the study had little to do with the amount of PLUC they had at the end of the study. Just to expand on this part of the homework (since it's so easy in R), try plotting hgt versus age in the same way:

    > plot(df$age, df$hgt)

Notice that there is a very consistent pattern relating age and height: the older a subject is, the taller they tend to be. The relationship is not perfect, however; around any given age, there is a range of heights that subjects may have. We will talk a lot more about this in the next class.
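As a cross-check on step 9, the following sketch recomputes the Q4 summaries one piece at a time instead of relying on summary(v) alone. It is only one possible way to do this. The names n, v_mean, v_median, manual_sd and v_quart are made up for illustration, and the quartiles use R's default interpolation rule.

```r
# The Q4 data set, entered exactly as in step 9.
v = c(13, 7, 15, 14, 9, 10, 12, 14)

# Number of observations.
n = length(v)

# Mean and median from R's built-in functions.
v_mean = mean(v)       # 11.75
v_median = median(v)   # 12.5

# Sample standard deviation written out from its definition:
# square root of the sum of squared deviations divided by (n - 1).
# This should agree with the built-in sd(v), which is about 2.82.
manual_sd = sqrt(sum((v - v_mean)^2) / (n - 1))

# First quartile, median, and third quartile using R's default rule.
# These are 9.75, 12.5 and 14 for this data set.
v_quart = quantile(v, probs = c(0.25, 0.5, 0.75))

print(c(mean = v_mean, median = v_median, sd = sd(v), manual_sd = manual_sd))
print(v_quart)
```

Textbooks use several slightly different conventions for quartiles, so a value worked out by hand (for example, taking the median of the lower half of the sorted data) may differ a little from what R reports. The help page ?quantile describes the available rules.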