Basics of Measurement


Science is based on objective observation of the changes in variables. The greater our precision of measurement the greater can be our confidence in our observations. Also, measurements are always less than perfect, i.e., there are errors in them. The more we know about the sources of errors in our measurements the less likely we will be to draw erroneous conclusions. This discussion presents some of the terms and operations that are a part of measurement.

The first terms to define are the Scales of Measurement. There are four scales of measurement, and being able to discern which scale applies is paramount in selecting the correct research design and analysis tools. The scales are nominal, ordinal, interval, and ratio.

A nominal scale is a set of categories that have no set order or hierarchy of values. A simple nominal scale is used in the variable Treatment, where we have two categories: 1) Subjects get treated, or 2) subjects do not get treated. There is no order to this scale. The categories just exist, and we use them to define a variable.

An ordinal scale is a set of categories that have order, but where we do not know the distance between the categories, and where the distance between one pair of categories may be different from the distance between another pair. An example would be a simple scale for hardness, where 1 = scratch with fingernail, 2 = scratch with penny (copper), and 3 = scratch with a diamond (carbon). With this scale we can grade items depending on their hardness into three categories that range from soft to hard. However, the increase in hardness from my fingernail to a penny is much smaller than the increase in hardness from the penny to the diamond. Thus this scale will let us order items, but it will not let us get an exact measurement, i.e., we can say that a piece of iron is harder than a piece of wood because the penny will scratch the wood but not the iron, but we cannot say "how much harder" the iron is.

An interval scale has order and equal distances between each category. Thus, a ruler or a thermometer uses an interval scale. The ruler uses the inch or the millimeter, and the thermometer uses degrees. Each inch or degree is the same size, so the difference between a 12-inch table and a 24-inch table is exactly the same as the difference between a 24-inch table and a 36-inch table. Interval scales let us say how much longer or hotter, or whatever, one thing is compared to another thing.

Finally, a ratio scale is an interval scale that has a true zero. Inches are a ratio scale, but the Fahrenheit and Celsius scales are interval. If an item is zero inches long then it's not there, thus zero inches truly means zero. Because of this true zero, we can say that a table 24 inches wide is exactly twice as wide as a table 12 inches wide. If the temperature is zero degrees Celsius, then water may freeze but your heat pump can still heat your house. Why? Because there is still some warmth in air that is zero degrees Celsius. The Kelvin scale for temperature is a ratio scale. Why?
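One way to see the difference between an interval scale and a ratio scale is to compare temperatures in Celsius and in Kelvin. The following is only a sketch: the function name celsius_to_kelvin and the two example temperatures are illustrative assumptions, not part of the original discussion.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin, whose zero is absolute zero."""
    return celsius + 273.15

# Two illustrative (hypothetical) temperatures in degrees Celsius.
cool, warm = 10.0, 20.0

# Differences are meaningful on an interval scale and agree on both scales.
diff_celsius = warm - cool
diff_kelvin = celsius_to_kelvin(warm) - celsius_to_kelvin(cool)

# Ratios are only meaningful when the scale has a true zero.
naive_ratio = warm / cool  # looks like "twice as hot", but is misleading
true_ratio = celsius_to_kelvin(warm) / celsius_to_kelvin(cool)

print(f"difference: {diff_celsius:.1f} C, {diff_kelvin:.1f} K")
print(f"ratio in Celsius: {naive_ratio:.2f}, ratio in Kelvin: {true_ratio:.3f}")
```

The difference is the same ten degrees on either scale, but the ratio changes when the zero point moves, which suggests why ratios should be reserved for scales with a true zero.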

Types of Variables and Descriptive Statistics

You are already familiar with independent, dependent, and control variables. These are names we give to variables depending on how they are used in a study. The same variable can, in different situations, be an independent, dependent, or control variable. When we measure a variable, be it independent, dependent, or control, we classify the variable as either continuous or categorical.

1. Continuous variables can take on numerical values (1,2,3, ... ,N), where there are equal units of measurement between the numerical values. This means that the distance between 1 and 2 is the same as between 2 and 3. Continuous variables are measured using either interval scales or ratio scales. Continuous variables can be analyzed by getting the mean and the variance. The mean is the average value of a set of scores.

The variance tells us how the variable changes across subjects. The variance is the average squared deviation around the mean. This value is hard to relate to the mean because it is based on squared values of x. If we take the square root of the variance we get the standard deviation. The standard deviation is, roughly, the typical deviation of the scores around the mean, expressed in the same units as the scores; this is easier to interpret (really!).

Another measure of dispersion is Range. The range of a variable is the distance between the minimum and maximum values the variable takes.

2. Categorical variables also take on numerical values, but the measurement scale we use is the nominal scale. For example, we might have the variable called religious preference. We would have several categories: Christian, Jewish, Muslim, and Buddhist. For convenience we can number each category 1, 2, 3, and 4 respectively, but the numbers have no meaning, i.e., being a 1 is not better or worse than being a 3.

We can count the frequencies in each category, but we cannot get the mean, or standard deviation of a nominal variable. We can compute the mode of a categorical variable. The mode is the category with the greatest frequency.

Independent variables (IVs) are often categorical. When we do a study comparing two different treatments, we will have two groups of subjects; one group gets the first treatment and the other group gets the second. This study has one IV (treatments) with two categories (treatment 1 and treatment 2).

3. Ordinal variables are a third type of variable. They are classified as either categorical or continuous depending on one's preference and how they are used. An ordinal variable is one that is measured using an ordinal scale. For example, if we arrange ten people from the tallest to the shortest, we can number the tallest as 1, the next tallest as 2, and so on until the shortest is numbered as 10. An ordinal scale is different from an interval scale in that there are NOT equal units of measurement between the numerical values.

Strictly speaking, you cannot compute a meaningful mean for an ordinal variable, because the ranks (1, 2, 3, etc.) are not equally spaced. This means that the difference between ranks 1 and 2 may be larger (or smaller) than the difference between ranks 3 and 4.

Attitudes are often measured with a rating scale. For example, we might ask someone to rate their preference for ice cream on this 5-point scale:

Love   Like   Neutral   Dislike   Hate
1      2      3         4         5

If we decide there are equal distances between each rank (i.e., the intervals are equal), we can treat the rating scale as an interval scale and compute means and standard deviations, and researchers often do. This assumption is not entirely safe, because if the intervals are not really equal then it is still an ordinal scale no matter what we assume.

If you do not want to assume the intervals are equal you can compute the median rank. The median rank is the rank that falls in the middle of the distribution of ranks. For example: If we have 20 people rate their preference for ice cream (where 1 = "I hate ice cream" and 5 = "I love ice cream") the data might look like this:

1 2 2 2 3 3 3 4 4 4 4 4 5 5 5 5 5 5 5 5

The median rank is 4, because the 10th and 11th ratings in the ordered list are both 4. The mode for these data is 5. The mean is 3.8 and the standard deviation is 1.3.
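The figures for the ice cream ratings can be checked with Python's standard statistics module. This is a minimal sketch; the variable names are illustrative, and it assumes the sample formulas (dividing by n - 1), since that choice reproduces a standard deviation of 1.3, whereas the population formula would round to 1.2.

```python
import statistics

# The 20 ice cream ratings from the example (1 = hate, 5 = love).
ratings = [1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5]

median_rank = statistics.median(ratings)        # middle of the ordered ratings
modal_rank = statistics.mode(ratings)           # most frequent rating
mean_rating = statistics.mean(ratings)          # meaningful only if intervals are assumed equal
sample_variance = statistics.variance(ratings)  # squared deviations divided by n - 1
sample_sd = statistics.stdev(ratings)           # square root of the sample variance
rating_range = max(ratings) - min(ratings)      # distance between maximum and minimum

print(median_rank, modal_rank, mean_rating, round(sample_sd, 1), rating_range)
```

The median and mode are safe to report for ordinal data; the mean, variance, and standard deviation rest on the equal-interval assumption discussed above.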

Properties of Distributions

Many human characteristics, such as height, are distributed approximately symmetrically. If we measure the heights of a large number of people in inches and plot them so that height in inches is along the bottom axis and frequency is along the vertical axis, we will get a roughly symmetrical, bell-shaped distribution. This bell-shaped distribution is often called a normal distribution. The normal curve is useful because it has several well-known properties. Data distributed normally are measured using an interval or ratio scale. Thus, you can compute the mean and standard deviation. Also, certain statistical procedures, called parametric tests, can be used with normally distributed data. With a symmetrical distribution the mean, median, and mode all fall approximately at the same point. If our data fall into a normal distribution, about 68% of the values lie between the mean minus one standard deviation (sd) and the mean plus one sd. It is this property that aids us in using the standard deviation to understand the variability in the scores.

We can compare two distributions if we know their means and standard deviations (sd). For example, suppose we have two sets of test scores for the research class. Test A has a mean of 20 and a sd of 9, and Test B has a mean of 21 and a sd of 3. The means tell us that overall the two sets of scores are similar. The standard deviations tell us that Test A was easier for some and harder for others than Test B. We can say this because Test A has a very large standard deviation and Test B a rather small one. For Test A, about 68% of the scores lie between 11 and 29, while for Test B, about 68% of the scores lie between 18 and 24. A researcher would say Test A had more variability than Test B.
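The one-sd ranges for the two tests can be reproduced with a few lines of Python. This is a sketch only: the helper one_sd_band is a hypothetical name, and NormalDist models an idealized normal curve, which real test scores only approximate.

```python
from statistics import NormalDist

def one_sd_band(mean, sd):
    """Return the (low, high) range one standard deviation around the mean."""
    return mean - sd, mean + sd

test_a = one_sd_band(20, 9)  # Test A: mean 20, sd 9
test_b = one_sd_band(21, 3)  # Test B: mean 21, sd 3

# Share of an idealized normal distribution that falls inside Test A's band.
curve_a = NormalDist(mu=20, sigma=9)
low, high = test_a
share_within_one_sd = curve_a.cdf(high) - curve_a.cdf(low)

print(test_a, test_b, round(share_within_one_sd, 3))
```

The share inside the band does not depend on the particular mean or sd, which is why the same "about 68%" figure applies to both tests.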

The table below summarizes the scales of measurement and some of their distinguishing characteristics.

Summary of Scales of Measurement

            How used in a study?
Scale       Categorical   Continuous   Characteristics
Nominal     Yes           No           Can compute Mode only. Frequencies data. All IVs in differences studies are nominal and categorical.
Ordinal     Yes           Sometimes    Can compute Median or Mode if used as a categorical variable or Mean if assumed to be continuous. Data are ranks.
Interval    No            Yes          Can compute Mean, Median or Mode as desired. Measurement in inches, pounds, number of items answered correctly, or percentages.
Ratio       No            Yes

Reliability and Validity of Measurement

When we decide to study a variable we need to devise some way to measure it. Some variables are easy to measure and others are very difficult. For example, measuring your eye color is easy (blue, brown, grey, green, etc.), but measuring your capacity for creativity is very difficult (for example, can you compose a sonnet that is both original and profound?).

We try to develop the best measures we can whenever we are doing research. A good measuring instrument or test is one that is reliable and valid. We will look at test validity first.

Test Validity refers to the degree to which our measuring strategy (instrument, machine, or test) measures what we want to measure. This sounds obvious, right? Well, sometimes it is and sometimes it is not. For example, what is a valid measure of height (a ruler?), weight (a scale?), intelligence (an IQ test?), attitude towards God (going to church/not going to church?), or mathematical ability (find the length of the hypotenuse of a right triangle?)? As you can see, some variables can be difficult to measure.

A valid measure is one that accurately measures the variable you are studying. There are four ways to establish that your measure is valid: content, construct, predictive, and concurrent validity.

1. Content validity is established if your measuring instrument samples from the areas of skill or knowledge that compose the variable, i.e., if a test on addition has a good selection of 2 + 2 type problems then it is probably valid.

2. Construct validity is based on designing a measure that logically follows from a theory or hypothesis. For example: suppose creativity is defined as the ability to find original solutions to problems. I design a test for creativity where subjects are to list as many uses for a paper clip as possible. I designate subjects who list more than 30 uses as creative. I have developed a test with construct validity. The test is valid to the extent that the task (uses for a paper clip) is a logical application of my theory about creativity. If my theory is wrong or if my measure is not a logical application of the theory, then the measure is not valid.

3. Predictive validity refers to the ability of my measure to separate subjects who possess the attribute I am studying from those who do not. If I design a test of aptitude for flying an airplane, it has predictive validity if subjects who score high learn to fly, and if subjects who score low crash.

4. Concurrent validity is used when a valid measure exists for your variable but you want to design another measure that is perhaps easier to use or faster to take. Suppose you design a short test for manual dexterity to replace a much longer one. In this case you have subjects take both the old and new tests. Your new test has concurrent validity if the subjects earn similar scores on both tests. Concurrent and predictive validity are similar.

Reliability is the consistency with which our measure measures. If you cannot get the same answer twice with your measure it is not reliable. A ruler is reliable. You and I can use a ruler to measure this page and we will both conclude that it is 8.5 inches by 11 inches. A measuring strategy can be reliable and not valid, but if the instrument is not reliable it is also not valid.

Problems with reliability occur when we are measuring more abstract variables. For example, when measuring the skill of a diver, we use several judges, who apply standards to each type of dive. The judges often do not agree exactly on the rating of each dive. But, if the judges are all pretty close to each other (say 8.5, 8.5, 8.0, and 9.0) we conclude that they are able to apply the standards of a good dive to the diver's performance, and that our measure is reliable. Our measure in this case has two components: 1) the standards for a good dive, and 2) training the judges to apply the standards in the same way.
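How close the judges are to each other can be summarized with the same descriptive statistics used earlier. The sketch below is only a rough check on agreement, not a formal inter-rater reliability coefficient, and the variable names are illustrative.

```python
import statistics

# The four judges' scores from the diving example.
judge_scores = [8.5, 8.5, 8.0, 9.0]

mean_score = statistics.mean(judge_scores)
score_range = max(judge_scores) - min(judge_scores)  # gap between highest and lowest judge
score_sd = statistics.stdev(judge_scores)            # sample sd (divides by n - 1)

print(f"mean {mean_score:.2f}, range {score_range:.1f}, sd {score_sd:.2f}")
```

A small range and standard deviation relative to the scoring scale are consistent with judges applying the standards in a similar way.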

Measurement is never exact. If you and I measured this page with a ruler divided into 100ths of an inch, I might say it is 8.51 inches wide and you might say it is 8.49 inches wide. At some point our measures always break down and errors creep into our data. This is when the concept of Error of Measurement becomes important.

In order to be able to use any measure we need to know its error of measurement. Error of measurement refers to the difference between the measurement we obtain and the "true" value of the variable. Question: Where do you get the "true" measure if all measuring methods produce errors? Answer: "True" measures cannot be obtained, but they can be estimated.

In Chapter 8 - Interpreting Correlations we computed the correlation to estimate the reliability of a test. The correlation coefficient (rxy) computed in Chapter 8 was .88. This value means you can predict one test score from the second and that the error of prediction is fairly low. We would conclude that this test is reliable. Unless the correlation coefficient is 1.00 (or -1.00), there is some error in the prediction. The degree of error can be calculated.

For the data in the Chapter 8 example the Standard Error of Measurement (Smeas) is .62. What does this mean? The Smeas is the expected standard deviation of scores for any person who takes a large number of parallel tests. If a person took many parallel tests about Mars, then our Smeas of .62 is the standard deviation of those test scores around the true score of that person's knowledge, i.e., the mean of many administrations of parallel tests is a close estimate of their true score. Since our example is based on a ten item test and the scores are the number of items answered correctly, then if someone got 7 on the test, we can use the Smeas to calculate a range. The person's true ability will probably lie inside this range. Earlier we mentioned that the range lying one standard deviation above and one standard deviation below the mean encompassed approximately 68% of the scores. If we add and subtract the Smeas from the person's observed score, the resulting range will capture approximately 68% of the person's possible scores from multiple testings. Thus, for a person with a score of 7.0, their true score has a good probability of lying between 6.38 and 7.62. If we wanted to be very confident that the person's true score was in the range we can add and subtract two Smeas, and this range will encompass approximately 95% of the possible scores. Finally, we can add and subtract three Smeas, and the range will capture more than 99% of the possible scores.
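The three ranges described above can be worked out as follows. This is a minimal sketch using the score of 7.0 and the Smeas of .62 from the example; the function name true_score_band is hypothetical, and the coverage figures assume the errors are roughly normally distributed.

```python
def true_score_band(observed, smeas, k=1):
    """Return the range observed +/- k standard errors of measurement."""
    return observed - k * smeas, observed + k * smeas

OBSERVED = 7.0  # score on the ten item test from the example
SMEAS = 0.62    # standard error of measurement quoted in the text

# k = 1, 2, 3 correspond to roughly 68%, 95%, and more than 99% coverage.
for k in (1, 2, 3):
    low, high = true_score_band(OBSERVED, SMEAS, k)
    print(f"+/- {k} Smeas: {low:.2f} to {high:.2f}")
```

Each additional Smeas widens the range, so greater confidence comes at the cost of a less precise statement about the true score.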

The larger the Smeas the more error there is in our measuring instrument. If there is too much error in our measuring instrument then it will not provide us with useful data. A good measuring strategy is reliable and, because it is reliable, it has a small amount of error in its observations.