10/25/2008

Measurement in educational research and assessment

I. Preestablished Instruments and Archival Data
- Data can be numerical (e.g., an attendance rate), verbal (e.g., an interview transcript), or graphic (e.g., a picture).

- Types of instruments: preestablished and self-developed.
Preestablished Instruments
Preestablished instruments (standardized instruments): already-made and piloted measuring tools, including tests, observational rating scales, questionnaires, and scoring protocols for interviews.
- Characteristics of standardized instruments:
● a fixed set of questions or stimuli; a fixed time frame; a fixed set of instructions and identified responses; measurement of specific outcomes; subjected to extensive research, development, and review;
● performance can be compared to a referent such as a norm group, a standard or criterion, or an individual’s own performance
> Norm-referenced tests: comparing a student’s performance with that of a norm group (e.g., TOEFL)
> Criterion-referenced tests: comparing against a predetermined standard of performance
> Self-referenced tests: measuring an individual student’s performance over time to see if it improves or declines when compared with past performance.
- Five types of preestablished measures: achievement, aptitude, personality, attitude or interest, and behavior.
1. achievement tests—measuring what has already been learned
2. aptitude tests—predicting what one can do or how one will perform in the future
3. personality tests—measuring self-perception or personal characteristics, traits, or behaviors
4. attitude or interest scale—assessing attitude toward a topic or interests in areas
5. behavior rating scales—used to diagnose problems and to rate the frequency or intensity of behaviors
Archival Data
Archival data: data that have already been collected by an individual teacher, a school, or a district rather than by a researcher. Examples of archival data are student absenteeism, graduation rates, suspensions, standardized state test scores, and teacher grade-book data.

II. Scales of Measurement
Specifically, there are four levels at which variables can be measured quantitatively, called levels or scales of measurement (see the sketch after this list).
● Nominal scales—measure variables that are categories or classes, with no inherent order.
● Ordinal scales—number different levels or ranks. The distances between ranks are not necessarily equal; this limits the types of statistical tests that can be applied to the data and also limits the conclusions that can be drawn about differences between persons at different ranks.
● Interval scales—number different levels in which the distances, or intervals, between the levels are equal. There is no true zero score, so to say that a score of zero means the absence of something is not entirely accurate.
● Ratio scales—(considered to produce the most precise data) include the properties of nominal, ordinal, and interval scales and also a true zero point.
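
A minimal Python sketch (the variables and values are hypothetical) showing how the scale of measurement constrains which summary statistics are meaningful:

```python
from collections import Counter
from statistics import mean, median

# Hypothetical data at each scale of measurement
nominal  = ["bus", "car", "walk", "bus"]   # categories only: counting is meaningful
ordinal  = [1, 2, 2, 3]                    # ranks: order matters, distances unknown
interval = [68, 72, 75, 75]                # equal intervals, but no true zero (e.g., test scores)
ratio    = [0, 2, 5, 5]                    # true zero point (e.g., days absent)

print(Counter(nominal).most_common(1))     # mode is the only sensible "average" for nominal data
print(median(ordinal))                     # the median respects order without assuming equal distances
print(mean(interval), mean(ratio))         # means require interval- or ratio-level data
```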

III. Summarizing Data Using Descriptive Statistics
Descriptive statistics are used to show the patterns in the data.
Frequency Distribution
- Distribution: describes the range of scores and their overall frequencies. Scores may also be displayed in a graph (see the sketch after this list). For data at the ordinal scale of measurement or above, the points are usually connected by a line, and the graph is referred to as a frequency polygon. If the data are categorical, the graph is called a histogram.
> A normal distribution—bell-shaped and symmetrical, with the highest point on the curve and the most frequent scores clustered in the middle of the distribution; it occurs when large groups are randomly sampled and measured.
> Skewed distributions: asymmetrical, meaning the scores are distributed differently at the two ends.
+ a negatively skewed distribution—most of the scores are high, but a small number of scores are low.
+ a positively skewed distribution—most of the scores are low, but a small number of scores are high (Fig. 4.3, p. 78).
- The outliers in a skewed distribution pull the “tail” of the distribution out in that direction.
> Bimodal distributions have two clusters of frequent scores, appearing as two humps in the distribution (Fig. 4.4, p. 79).
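
A short Python sketch (the scores are hypothetical) that tallies a frequency distribution and shows how a single outlier skews it:

```python
from collections import Counter

scores = [70, 75, 75, 80, 80, 80, 85, 85, 90, 20]  # hypothetical test scores; 20 is an outlier

freq = Counter(sorted(scores))        # frequency of each score, in score order
for score, count in freq.items():
    print(f"{score}: {'*' * count}")  # a crude text frequency graph

# The single low score (20) pulls the tail to the left: a negatively skewed distribution.
```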
What is a Typical or Average Score? Measures of Central Tendency
Three common measures are the mode, mean, and median.
Mode: the score occurring most frequently. If the distribution is asymmetrical, the mode may not be a precise estimate of central tendency.
Mean: arithmetical average of a set of scores. In a skewed distribution, the mean can be misleading.
Median: the score that divides a distribution exactly in half when scores are arranged from highest to lowest. It is a stable measure of the central tendency of a set of scores even when a few outlier scores in the distribution differ greatly from the rest (see the sketch below).
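
A minimal Python comparison (using the hypothetical scores from the frequency sketch above) of the three measures, showing how the outlier misleads the mean but not the median:

```python
from statistics import mean, median, mode

scores = [70, 75, 75, 80, 80, 80, 85, 85, 90, 20]  # hypothetical; one low outlier (20)

print(mode(scores))    # 80   - the most frequent score
print(median(scores))  # 80.0 - the middle score, stable despite the outlier
print(mean(scores))    # 74.0 - pulled downward by the outlier, hence misleading here
```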

Measures of Variability
1. Range: the difference between the highest and lowest scores in a distribution. It is not a stable or precise measure of variability because it can be affected by a change in just one score.
2. Standard Deviation: the average distance between each of the scores in a distribution and the mean. One way to describe variability is to consider how far each score is from that center score. It is called the standard deviation because it represents the average amount by which the scores deviate from the mean. The larger the number, the more variable the scores in the distribution are (Fig. 4.5, p. 82). The standard deviation is widely used in part because of its relationship to the normal distribution, visually represented as the normal curve displayed in Fig. 4.6, which forms the basis for comparing individual scores with those of a larger group.
3. Normal Curve: One useful feature of the normal curve is that a fixed percentage of scores always falls between the mean and given distances above and below the mean: about 68% of scores fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three. These distances are described as how many standard deviations above or below the mean a score falls (see the sketch below).
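
A small Python sketch (hypothetical scores) computing the range and standard deviation and checking the normal-curve rule of thumb:

```python
from statistics import mean, stdev

scores = [62, 68, 70, 71, 74, 75, 77, 79, 80, 84]  # hypothetical test scores

rng = max(scores) - min(scores)       # range: unstable, since one score can change it
m, sd = mean(scores), stdev(scores)   # mean and sample standard deviation

within_1sd = [s for s in scores if abs(s - m) <= sd]
print(f"range={rng}, mean={m:.1f}, sd={sd:.1f}")
print(f"{len(within_1sd)}/{len(scores)} scores lie within 1 SD of the mean")
# For a normal distribution, about 68% of scores fall within 1 SD of the mean.
```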

Types of Scores Used to Compare Performance
Raw scores: one cannot know what a raw score represents without knowing more about the mean and standard deviation of the distribution of scores.
Percentile ranks: scores that indicate the percentage of persons scoring at or below a given score. So a percentile rank of 82 means that 82% of persons scored at or below that score.
Standard scores: tell how far a score is from the mean of a distribution in terms of standard-deviation units. Examples of standard scores are z scores and college entrance exam scores (Fig. 4.6).
z score = (score − mean) / standard deviation
Stanines: stanine scores are a type of standard score that divides a distribution into nine parts, each of which includes about one half of a standard deviation.
Percentage: Some types of data may be distorted when percentages are used. One should always examine both the total number and the percentage of persons when comparing measures.
Grade-Equivalent Scores: A grade-equivalent score is reported in years and months. So 3.4 means third grade, fourth month. A grade-equivalent score reports the grade placement for which that score would be considered average. This means that the average student at the reported grade level could be expected to get a similar score on this test.
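
A brief Python sketch (the mean, standard deviation, and raw score are hypothetical) converting a raw score into the comparison scores above; the percentile rank is approximated by assuming a normal distribution:

```python
from statistics import NormalDist

mean_, sd = 74.0, 6.5   # hypothetical mean and standard deviation of the distribution
raw = 80                # hypothetical raw score

z = (raw - mean_) / sd                      # standard (z) score
pct_rank = NormalDist().cdf(z) * 100        # percentile rank, assuming normality
stanine = max(1, min(9, round(z * 2 + 5)))  # nine bands, each about half an SD wide

print(f"z = {z:.2f}, percentile rank ≈ {pct_rank:.0f}, stanine = {stanine}")
```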
Use of Correlation Coefficients in Evaluating Measures
Correlations: measures of the relationship between two variables. A correlational relationship is summarized using a descriptive statistic called a correlation coefficient. Regardless of its sign (+ or −), the size of the coefficient shows how strong the relationship between the variables is.
- A positive correlation coefficient: as one variable increases, the other also increases.
- A negative correlation coefficient: as one variable increases, the other decreases (see the sketch below).
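
A self-contained Python sketch (the paired scores are hypothetical) computing a Pearson correlation coefficient from its definition:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

hours_studied = [1, 2, 3, 4, 5]        # hypothetical paired data
test_scores   = [60, 65, 70, 78, 85]

print(f"r = {pearson_r(hours_studied, test_scores):.2f}")  # near +1: strong positive relationship
```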

IV. Evaluating the Quality of Educational Measures: Reliability and Validity
Reliability: the consistency of scores—obtaining approximately the same score on repeated measurements
Validity: whether what the instrument “claims” to measure is truly what it is measuring—accuracy
Reliability
- Even the most reliable test would not produce the exact same score every time. This difference is referred to as measurement error. Factors that affect reliability include personal characteristics, variations in the test setting, variations in the administration and scoring of the test, and variations in participant responses due to guessing.
- Standard error of measurement: SEM = SD × √(1 − r), where SD is the standard deviation of the scores and r is the reliability coefficient. For example, if SEM = 2.7 and the observed score is 70, the student who received a score of 70 might score about 2.7 points higher or lower on retaking the test (see the sketch below).
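
A minimal Python sketch applying the SEM formula; the SD and reliability values are hypothetical, chosen to reproduce the SEM of 2.7 in the example above:

```python
from math import sqrt

sd, reliability = 9.0, 0.91   # hypothetical standard deviation and reliability coefficient
sem = sd * sqrt(1 - reliability)

observed = 70
print(f"SEM = {sem:.1f}")  # 2.7
print(f"likely score band: {observed - sem:.1f} to {observed + sem:.1f}")  # 67.3 to 72.7
```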
- Stability or Test-Retest Reliability (consistency across time): the same instrument is administered twice, with an average wait time of four to six weeks between administrations.
- Equivalent-Form Reliability (consistency across forms): two equivalent forms of the instrument are administered and the scores correlated.
- Internal Consistency Reliability (consistency within the instrument): the common method is split-half reliability, in which scores on one half of the instrument are correlated with scores on the other half. The Spearman-Brown prophecy formula is then applied to correct for the halved length—the instrument must be long enough. Another approach is to examine the correlations between each item and the overall score on the instrument, so that a given score consistently measures the same amount of knowledge (see the sketch below).
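
A compact Python sketch of split-half reliability with the Spearman-Brown correction (the item scores are hypothetical; statistics.correlation requires Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical item scores: rows = students, columns = six items (0 = wrong, 1 = right)
items = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1],
]

odd_half  = [sum(row[0::2]) for row in items]  # each student's score on odd-numbered items
even_half = [sum(row[1::2]) for row in items]  # each student's score on even-numbered items

r_half = correlation(odd_half, even_half)  # split-half correlation
r_full = (2 * r_half) / (1 + r_half)       # Spearman-Brown prophecy formula
print(f"split-half r = {r_half:.2f}, corrected full-test r = {r_full:.2f}")
```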
Validity
Does the instrument measure what it is designed to measure? When constructing a test or using a standardized instrument, validity is the single most important characteristic.
- Content Validity:
1. sampling validity: whether the instrument samples the full range of the content area
2. item validity: examining each individual item to determine whether it measures the content area
- Criterion-Related Validity: examining the relationship between a measure and some criterion. It reflects the degree to which scores on two different measures are correlated.
1. concurrent validity: examines the degree to which one test correlates with another taken in the same time frame (the new and the old test produce similar results).
2. predictive validity: used to predict future performance. The test is administered; after a waiting period, a criterion measure is taken. If the correlation between the first test result and the later criterion is high, the first test is considered to have criterion-related validity in predicting future success.
- Construct Validity: involves a search for evidence that an instrument is accurately measuring an abstract trait or ability. Constructs are nonobservable traits that are inferred from observable variables. Construct validity might include aspects of content, concurrent, and predictive validity. Questions that might be involved in establishing construct validity include:
> Does the measure clearly define the meaning of the construct?
> Can the measure be analyzed into component parts or processes that are appropriate to the construct?
> Is the instrument related to other measures of similar constructs, and is it not related to instruments measuring things that are different?
> Can the measure discriminate between (or identify separately) groups that are known to differ?

Finding Preestablished Measures
One source that many researchers use is the Mental Measurements Yearbook (MMY) (Box 4.3, Web Sites with Information on Preestablished, Standardized Tests, p. 97).

Criteria for Selecting Preestablished Instruments
A particular instrument may have a high degree of reliability and validity, but for a variety of reasons it may not be suitable for the population you intend to study. A researcher should also examine past studies to see what specific instruments other researchers in similar areas have employed. Finally, do not settle for the first measure that you find.

Reliability and Validity in Archival Data
It is important to consider possible inaccuracies that the data may contain. A researcher should validate or double-check the raw data. The researcher might conduct interviews with the individuals who originally collected the data. The researcher would also want to examine the instruments used to collect the data and any information about their piloting and administration. Where possible, another method of collecting the same data should also be employed as a cross-check.
