
# Working with Data Frames: Step Eight in Learning R Programming for Free

I hope you are enjoying the “Learning R Programming for Free” series; here are links to the previous segments (Step One, Step Two, Step Three, Step Four, Step Five, Step Six, Step Seven) to provide some helpful background.

In the previous installment, we learned to work with vectors and how to use functions to complete operations on the vectors.

In this discussion, we look at the data frame. A data frame is a table, or two-dimensional array-like structure, in which each column holds the values of one variable and each row holds one set of values, one per column. All values within a column share a data type, such as numeric, factor, or character, but different columns can have different types. Data frames are built from vectors (numeric, character, or logical), factors, numeric matrices, lists, or other data frames.
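As a minimal sketch of how vectors become the columns of a data frame, the example below builds a small one by hand. The vector names and values are made up for illustration and are not part of “mtcars”.

```r
# Hypothetical example vectors (illustrative values only).
model  <- c("Alpha", "Beta", "Gamma")   # character vector
mpg    <- c(21.5, 30.1, 18.2)           # numeric vector
manual <- c(TRUE, FALSE, TRUE)          # logical vector

# Each vector becomes one column; all vectors must have the same length.
cars_df <- data.frame(model, mpg, manual, stringsAsFactors = FALSE)

str(cars_df)   # shows the data type of each column
dim(cars_df)   # number of rows, then number of columns
```

Setting stringsAsFactors = FALSE keeps the “model” column as character in every R version; before R 4.0 the default would have converted it to a factor.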

Base R includes a built-in data frame named “mtcars.” Let’s look at its contents:

# List the entire data set:

mtcars

Test if mtcars is a data frame:

> is.data.frame(mtcars)

[1] TRUE

# List a few rows from “mtcars”:
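One way to do this, as a minimal sketch, is with the base R functions “head()” and “tail()”. The variable name first_rows is only illustrative.

```r
# head() returns the first six rows by default; a second argument changes the count.
first_rows <- head(mtcars, 3)
first_rows

# tail() does the same from the end of the data frame.
tail(mtcars, 3)
```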

The first line of the table is called the “header” and it contains the column names. Each horizontal line below the header is a “data row.” A data row begins with the name of the row and is followed by the row data. Each data member of a row is called a “cell” just like in your favorite spreadsheet tool.

To retrieve data in a cell, we provide the row and column coordinates in single square brackets [ ], separated by a comma: the row comes first, then the column. To select more than one row or column, we pass a vector of positions or names built with c().
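The bracket notation can be sketched as follows with the built-in “mtcars” data. The variable names on the left are only illustrative.

```r
# Single cell by position: row 1, column 2 (the cyl value of the Mazda RX4).
cell_by_position <- mtcars[1, 2]              # 6

# The same cell by row name and column name.
cell_by_name <- mtcars["Mazda RX4", "cyl"]    # 6

# Several cells at once: rows 1 and 3, columns mpg and hp.
block <- mtcars[c(1, 3), c("mpg", "hp")]
block

# Leaving a coordinate empty selects everything along that dimension.
first_row  <- mtcars[1, ]       # one whole row (still a data frame)
mpg_column <- mtcars[, "mpg"]   # one whole column (a numeric vector)
```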

Let’s introduce some useful searching techniques. Suppose we want to find the car with the worst fuel  mileage:
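A sketch of one way to write this search: compare the “mpg” column with its minimum and keep the rows where the comparison is TRUE. The variable name worst is only illustrative.

```r
# Keep every row whose mpg equals the smallest mpg in the data set.
worst <- mtcars[mtcars$mpg == min(mtcars$mpg), ]
worst   # Cadillac Fleetwood and Lincoln Continental, both at 10.4 mpg
```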

And which car has the best fuel mileage?
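The same pattern, sketched with the maximum instead of the minimum. The variable name best is only illustrative.

```r
# Keep every row whose mpg equals the largest mpg in the data set.
best <- mtcars[mtcars$mpg == max(mtcars$mpg), ]
best   # Toyota Corolla, at 33.9 mpg
```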

Notice how we used “min” and “max” to filter the search.

Let’s look at some more sophisticated searches using “which”:

Which cars have the worst fuel mileage?
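As a sketch, “which()” turns a logical test into row positions, which can then be used inside the square brackets. The variable names and the 15 mpg cutoff are assumptions chosen for illustration.

```r
# which() returns the positions where the condition is TRUE.
worst_rows <- which(mtcars$mpg == min(mtcars$mpg))
worst_rows   # 15 16

# Use those positions to pull the matching rows and a few columns.
mtcars[worst_rows, c("mpg", "cyl", "hp")]

# A looser search: every car under an arbitrary cutoff of 15 mpg.
low_mpg <- mtcars[which(mtcars$mpg < 15), ]
rownames(low_mpg)
```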

Are we annoyed yet, that the car brand/model column has no label?  Let’s fix that:

mydf <- cbind(rownames(mtcars), mtcars)

rownames(mydf) <- NULL

colnames(mydf) <- c("brand","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb")

Now we can work with a copy of “mtcars” in which every column has a name: mydf.

Let’s look at the Porsche.  Using grep, we can now search our new “brand” column:

mydf[grep("porsche", mydf$brand, ignore.case=TRUE),]

MTCARS Data Set Columns:

| Index | Column | Description |
|---|---|---|
| [, 1] | mpg | Miles/(US) gallon |
| [, 2] | cyl | Number of cylinders |
| [, 3] | disp | Displacement (cu.in.) |
| [, 4] | hp | Gross horsepower |
| [, 5] | drat | Rear axle ratio |
| [, 6] | wt | Weight (1000 lbs) |
| [, 7] | qsec | 1/4 mile time |
| [, 8] | vs | Engine (0 = V-shaped, 1 = straight) |
| [, 9] | am | Transmission (0 = automatic, 1 = manual) |
| [,10] | gear | Number of forward gears |
| [,11] | carb | Number of carburetors |

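Let’s add a new row using “rbind”. We need a few more high-performance cars in our list, so let’s add one. The “vs” value is entered as NA (unknown) so that the column stays numeric:

df2 <- data.frame(brand="Porsche 911 Turbo S", mpg=21, cyl=6, disp=231, hp=580, drat=3.44, wt=3.528, qsec=10.5, vs=NA, am=0, gear=7, carb=0)

This appends the object “df2” as a new row with all of the values:

mydf3 <- rbind(mydf, df2)

When we do this, we need to be careful that the new row has the same column names as the existing data frame and that each value has the same data type as its column. A value of the wrong type, such as an empty string in the numeric “vs” column, makes R convert the whole column to character.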
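The effect of a mismatched type can be sketched with a small stand-in data frame. The names base_df, bad_row, and good_row are hypothetical, and only three columns are used to keep the example short.

```r
# A one-row stand-in for the existing data; vs is numeric here.
base_df <- data.frame(brand = "Porsche 914-2", mpg = 26.0, vs = 0,
                      stringsAsFactors = FALSE)

# New row with an empty string where a number is expected.
bad_row <- data.frame(brand = "Porsche 911 Turbo S", mpg = 21, vs = "",
                      stringsAsFactors = FALSE)

# New row with NA for the unknown value instead.
good_row <- data.frame(brand = "Porsche 911 Turbo S", mpg = 21, vs = NA,
                       stringsAsFactors = FALSE)

bad_result  <- rbind(base_df, bad_row)
good_result <- rbind(base_df, good_row)

class(bad_result$vs)    # "character": the whole column was converted
class(good_result$vs)   # "numeric": NA fits into a numeric column
```

Checking the result with “class()” or “str()” right after “rbind” is a quick way to catch this kind of silent conversion.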

In this final section, we will discuss sorting our data.

Which cars have the best fuel mileage but have the quickest ¼ mile time?

# sort by quarter mile time, let’s include the Turbo S:

> sorted_cars <- mydf3[order(mydf3$qsec),]

> sorted_cars

Let’s sort by quarter mile time and cylinders.  Ascending order is the default:

sorted_cars <- mydf3[order(mydf3$qsec, mydf3$cyl),]

Now, sort by “mpg” (ascending) and “cyl” (descending). To sort a numeric column in descending order, add the “-” sign in front of the column name:

sorted_cars <- mydf3[order(mydf3$mpg, -mydf3$cyl),]
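To come back to the question of best fuel mileage combined with the quickest quarter mile, here is a sketch using the built-in “mtcars” data, so it runs on its own. The variable names are illustrative.

```r
# Highest mpg first; among cars with equal mpg, the quickest quarter mile first.
ranked <- mtcars[order(-mtcars$mpg, mtcars$qsec), ]
head(ranked[, c("mpg", "qsec")], 4)
# Lotus Europa and Honda Civic tie at 30.4 mpg; the Lotus is listed first
# because its quarter-mile time is lower.

# The minus sign only works on numeric columns. For character data,
# the decreasing argument of order() is one alternative.
by_name_desc <- mtcars[order(rownames(mtcars), decreasing = TRUE), ]
head(rownames(by_name_desc), 3)
```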

Before we go, let’s look at the “attach” and “detach” functions and the pros and cons of “attach.” If we use attach, we can refer to the columns in our data set as if they were local variables, so we do not need to prefix our columns with the data set name.

Let’s repeat the R statements from our sort examples, but add attach:

attach(mtcars)

sorted_cars <- mtcars[order(qsec),]

So we see that if we use “attach()”, we can build the sorted_cars object without prefixing “qsec” with “mtcars$.”

The “attach” function works well with single data sets, but can become problematic when we have more than one data set we want to work with. What if we load two data sets and each data set has common column names, such as “product_id?”

df_pline1 <- data.frame(product_id="1234", product_desc="Mini Wombat", sku="1234-83838")

df_pline2 <- data.frame(product_id="567", product_desc="Surf Shirt", sku="9912-12345")

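In this case, the shorthand from “attach” becomes unreliable, because both product lines have exactly the same column names. R does not stop with an error; it reports that the earlier columns are masked, and an unprefixed name such as “product_id” then refers to whichever data set was attached last.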
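A short sketch of what happens when both product lines are attached. The variable seen_desc is hypothetical and only records which value an unprefixed name resolves to.

```r
df_pline1 <- data.frame(product_id = "1234", product_desc = "Mini Wombat",
                        sku = "1234-83838", stringsAsFactors = FALSE)
df_pline2 <- data.frame(product_id = "567", product_desc = "Surf Shirt",
                        sku = "9912-12345", stringsAsFactors = FALSE)

attach(df_pline1)
attach(df_pline2)   # R prints a message that the df_pline1 columns are masked

# The unprefixed name now resolves to the most recently attached data set.
seen_desc <- product_desc
seen_desc                 # "Surf Shirt"

# The explicit prefix is always unambiguous.
df_pline1$product_desc    # "Mini Wombat"

# Clean up the search path.
detach(df_pline2)
detach(df_pline1)
```

With several data sets in play, keeping the explicit prefix avoids having to remember which one was attached last.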

To remove the data set from R’s search path when we are done, use “detach()”:

detach(mtcars)

In our next installment, we will continue with R by looking at how to plot using “ggplot()”.