How to subset data in R

How to subset a certain column

data.frame$variable.name

or

data.frame[ , column.number]

or

data.frame[ , "variable.name"]

All three options do the same thing: they select a single column. For example, suppose we have a data frame survey with 1,000 observations, each described by three variables: gender, age, and marital status. Then survey$age returns the column named “age” for all 1,000 observations.
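As a minimal sketch of the three forms, assume a small made-up data frame called survey with five rows instead of 1,000 (the values and the column name marital.status are invented for illustration):

```r
# Hypothetical toy data; the values are made up for illustration.
survey <- data.frame(
  gender = c("male", "female", "female", "male", "female"),
  age = c(34, 19, 62, 71, 45),
  marital.status = c("married", "single", "widowed", "married", "single")
)

by_dollar <- survey$age        # dollar-sign notation
by_number <- survey[, 2]       # age is the second column
by_string <- survey[, "age"]   # column name given as a string

# Each form returns the same numeric vector of ages.
print(identical(by_dollar, by_number))
print(identical(by_dollar, by_string))
```

Selecting by name is usually easier to read than selecting by number, and it keeps working if the columns are later reordered.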

How to subset a certain observation in the data

data.frame$variable.name[row.number]

This is called dollar-sign notation.

data.frame[row.number, column.number]

You can also choose several rows and/or columns by using “:”. For example, survey[1:20, 2] returns the age of the first twenty observations. This method of subsetting is called row-and-column notation.

data.frame[row.number, "variable.name"]

For example, survey[1:20, "age"] will give you the same result as the previous option.

Both row-and-column notation and dollar-sign notation are widely used. Which one to use depends on your preference.
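The sketch below tries both notations on a made-up five-row version of survey (the values are invented for illustration, so the row ranges are shorter than the 1:20 used above):

```r
# Hypothetical toy data; the values are made up for illustration.
survey <- data.frame(
  gender = c("male", "female", "female", "male", "female"),
  age = c(34, 19, 62, 71, 45),
  marital.status = c("married", "single", "widowed", "married", "single")
)

third_age  <- survey$age[3]       # dollar-sign notation: age of the 3rd observation
same_cell  <- survey[3, 2]        # row-and-column notation: row 3, column 2
first_ages <- survey[1:3, "age"]  # ages of the first three observations
first_rows <- survey[1:3, ]       # empty column position keeps every column

print(third_age)
print(first_ages)
```

Leaving the column position empty, as in the last line, returns whole rows as a data frame rather than a single vector.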

How to subset observations that have specific characteristics

data.frame$variable.name == "specific.characteristic"

For example, survey$gender == "male" produces a vector of TRUE and FALSE values, where TRUE indicates that the person is male.

or

data.frame$variable.name > some.number

The other comparison operators (>=, <=, <, etc.) work the same way. For example, survey$age > 30 is TRUE for every person older than 30.
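To see what these comparisons return, here is a small sketch on a made-up five-row survey (the values are invented for illustration):

```r
# Hypothetical toy data; the values are made up for illustration.
survey <- data.frame(
  gender = c("male", "female", "female", "male", "female"),
  age = c(34, 19, 62, 71, 45)
)

is_male <- survey$gender == "male"  # TRUE FALSE FALSE TRUE FALSE
over_30 <- survey$age > 30          # TRUE FALSE TRUE TRUE TRUE

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches.
n_male <- sum(is_male)
print(n_male)
```

Each comparison yields one TRUE or FALSE per observation, in the same order as the rows of the data frame.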

If you want to extract just the observations that have a specific characteristic (for example, only the men in the sample), you can use subset().

For example,

subset(survey, survey$gender == "male")

will return a data frame containing only the men from our initial data frame.
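A short sketch of this call on a made-up five-row survey (the values are invented for illustration). It also shows that subset() accepts the bare column name, since it evaluates the condition inside the data frame:

```r
# Hypothetical toy data; the values are made up for illustration.
survey <- data.frame(
  gender = c("male", "female", "female", "male", "female"),
  age = c(34, 19, 62, 71, 45)
)

men       <- subset(survey, survey$gender == "male")  # form used in the text
men_short <- subset(survey, gender == "male")         # bare column name also works

print(men)
print(identical(men, men_short))
```

The result keeps all columns and the original row labels, so you can still tell which observations were selected.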

You can also use the logical operators & and | in subsetting.

The & is read “and”.

subset(survey, survey$gender == "male" & survey$age > 60)

will give you the data for men over the age of 60.

The | is read “or”.

subset(survey, survey$age < 21 | survey$age > 60)

will give you the data for people under the age of 21 or over the age of 60, so both groups end up together in your subset.
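Both operators can be tried on a made-up five-row survey (the values are invented for illustration, so only a few rows match each condition):

```r
# Hypothetical toy data; the values are made up for illustration.
survey <- data.frame(
  gender = c("male", "female", "female", "male", "female"),
  age = c(34, 19, 62, 71, 45)
)

# "and": both conditions must hold for a row to be kept.
older_men <- subset(survey, survey$gender == "male" & survey$age > 60)

# "or": a row is kept if at least one condition holds.
young_or_old <- subset(survey, survey$age < 21 | survey$age > 60)

print(older_men)
print(young_or_old)
```

In this toy data the "and" condition keeps only the 71-year-old man, while the "or" condition keeps the observations aged 19, 62, and 71.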
