Geography 360, Fall 2003: Problem Set Seven
In this Problem Set, we will
practice the calculation and interpretation of χ²,
NND (nearest neighbor analysis), ANOVA, and correlation
tests. In order to prepare for the final exam, be sure to identify the
situations (the types of data available and types of questions or hypotheses
posed) in which you would utilize each of the tests: are you testing observed
values assigned to categories against expected values based on random
distribution, average distributions, or modeled outcomes (χ²)?;
are you assessing the randomness of a spatial
distribution (nearest neighbor analysis)?; are you comparing more than
two samples in order to make a statement about the possibility of their being
drawn from the same population (ANOVA)?; are you testing for a
relationship between two variables without assuming any causality (correlation;
remember that you need to determine whether the population for each variable
is normally distributed, the sample is random, and the data are interval or
ratio in order to choose between the parametric test, Pearson’s product moment,
and the non-parametric test, Spearman’s rank correlation)?;
are you testing for a causal or directional relationship
with the goal of calculating the value of a dependent variable at a given value
of an independent variable (regression)? Also, remember that each of the
tests of significance that we perform is based on the normal distribution. A
value is considered significant when it is one of a very limited number
(e.g., < 5%, < 1%, etc.) of possible outcomes and is, thus, less likely to be
due to the uncertainty inherent in sampling. The question asked differs by
test: in χ², is the difference between observed and expected values large
enough to imply an effect not modeled (e.g., not random, not based on
probability, etc.)?; in nearest neighbor, is the average nearest neighbor
distance in the observed distribution significantly different from that of a
random spatial distribution?; in ANOVA, is the between-group variance great
enough to limit the possibility of all samples being drawn from the same
population?; in correlation, is the relationship between the variables
significantly different from a random, or uncorrelated, distribution?; in
regression, is the explained variation in the dependent variable a
significantly large proportion of the total variation in the dependent variable?
- For the χ² test, first calculate the
significance (p-value) of the coin flips we all participated in during the
first lecture on probability. Recall that the class produced a frequency
distribution in which no one had zero heads in five coin flips, 7 had one head,
3 had two heads, 9 had three heads, 3 had four heads, and 1 had five
heads. (Because the test requires expected values of at least 5 in at
least 4/5ths of the categories, we will cheat and
multiply our values by seven to get more appropriate category values. Note
that this is not proper procedure, but in the interest of the
problem set… So the new distribution is [0, 49, 21, 63, 21, 7].) Compare
this with the expected random distribution of 4.9 with no heads, 25.9 with
one, 49.0 with two, 49.0 with three, 25.9 with four, and 4.9 with five. Is
our distribution significantly different from the random distribution? In
this case the null hypothesis would be that the distributions were
similar; the alternative hypothesis would be that the distributions are
different and there is the possibility of loaded coins or misreporting of
data.
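The χ² calculation for the coin flips can be sketched in a few lines of Python. This is only a minimal illustration, not a required method: the observed and expected counts are copied as printed in the problem, the names `chi_square` and `CRITICAL_05` are made up for this sketch, and the critical value is the tabled χ² value for 5 degrees of freedom at the 5% level rather than an exact p-value.

```python
# Observed counts (class results multiplied by seven) and the expected
# counts quoted in the problem, for 0 through 5 heads in five flips.
observed = [0, 49, 21, 63, 21, 7]
expected = [4.9, 25.9, 49.0, 49.0, 25.9, 4.9]


def chi_square(obs, exp):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


statistic = chi_square(observed, expected)
df = len(observed) - 1   # six categories give 5 degrees of freedom
CRITICAL_05 = 11.07      # tabled chi-square value for df = 5, alpha = 0.05

print(f"chi-square = {statistic:.2f}, df = {df}")
if statistic > CRITICAL_05:
    print("reject the null hypothesis at the 0.05 level")
else:
    print("fail to reject the null hypothesis at the 0.05 level")
```

For an exact p-value, look the statistic up in a χ² table or use a statistics package.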
- For the nearest neighbor analysis, I will ask
you to do a little data gathering on your own. The state of Wisconsin can be divided into several
geophysical regions (e.g., the driftless region, the sand plain region,
the ridge and valley region, etc.).
Consider how we might use the nearest neighbor calculation to determine
if the physical geography of these regions might impact the settlement
pattern in each. By selecting a
county from each region (for example, Monroe County for the driftless, Clark
for the sand plain, Jefferson for the ridge and valley, etc.) we can calculate
the nearest neighbor R-value (the ratio of the observed mean nearest neighbor distance
to the mean distance expected in a random distribution of the same density) for named places in each. (Note that I suggested counties without
large urban centers in which suburban development would bias the
calculation toward a clustered distribution. Another option would be to use a historical map for this
project.) Values closer to 1 would
suggest a random distribution and little influence of physical or other
factors on the distribution. How
would you interpret values closer to zero (clustered) or closer to 2.149
(dispersed), relative to influences on settlement location? Calculate the nearest neighbor
statistic for at least three counties (consider including an example from
northern Wisconsin as well).
Remember that you will need to calculate area of county for density
and think about how you will deal with cities, towns, or other places that
are located closer to a named place outside vs. inside of the county you
are examining. Do you see
differences in the values? Can you
think of explanations for these differences? Now, calculate the significance of the R-values for each
county. Are any of the settlement
patterns significantly different from a random distribution? Finally, you can also use your
calculations to run a χ² or an ANOVA test. For the χ²
test, your calculated R value can be the observed value for each category
(county) and the expected value would be an R value calculated for a
combined sample (remember to adjust the density calculation for the larger
sample size – total number of points divided by combined area of
counties). Does this test suggest
that there is a significant difference among the counties? For the ANOVA, use the nearest neighbor
distances for each place in the counties as a variable score. Each county is now a sample of several
distances. Does an ANOVA analysis
suggest that the county settlement patterns are different? (Be aware as you interpret these tests
that the ANOVA analysis does not consider the role of density. How does this affect your
interpretation of the results?)
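The R-value and its significance can be sketched as follows. This assumes the commonly used Clark and Evans form of the statistic (expected mean distance 1 / (2√density), standard error 0.26136 / √(n · density)); check it against the formula given in lecture. The function name `nearest_neighbor_stats` and the four coordinates are hypothetical, chosen only to show the mechanics, and are not real county data.

```python
import math


def nearest_neighbor_stats(points, area):
    """Return (R, z) for (x, y) points inside a region of the given area."""
    n = len(points)
    # Distance from each point to its closest neighbor.
    nn = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    r_obs = sum(nn) / n                      # observed mean nearest neighbor distance
    density = n / area                       # points per unit area
    r_exp = 1 / (2 * math.sqrt(density))     # expected mean distance if random
    se = 0.26136 / math.sqrt(n * density)    # standard error of the expected distance
    return r_obs / r_exp, (r_obs - r_exp) / se


# Hypothetical places (km) in a 2 km x 2 km study area.
places = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
R, z = nearest_neighbor_stats(places, area=4.0)
print(f"R = {R:.3f}, z = {z:.2f}")
```

A z-score beyond ±1.96 would indicate a pattern significantly different from random at the 5% level (two-tailed). Note that this sketch only measures distances among the points supplied, so a place whose true nearest neighbor lies outside the county is treated as if that neighbor did not exist.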
- To practice the ANOVA test, I will ask you to
analyze some of the data that I collected for my dissertation. (The
following provides some context to the question that the data and test are
expected to address. If it makes no sense, you may still conduct the test
and interpret the results. If you choose to skip the explanation, you may
do that as well.) In economics, conventional trade theory suggests
that free trade among and within countries will benefit all by encouraging
more efficient use of factors of production (in agriculture this would
include labor, capital, and the land on which the crops are cultivated).
In particular, trade theorists argue that farmers will learn to specialize
in that crop which they produce most efficiently in order to trade for
crops produced more efficiently elsewhere. Some have challenged the claims
of uniform benefits based on the potential negative ecological impacts
associated with monoculture production. Conventional economists have
suggested, however, that many times the more efficient means of production
is also the most environmentally benign. In order to test the potential
economic impact of a regional free trade agreement in South America, I
collected insect samples from fields of locally important cash crops,
those in which farmers would potentially specialize. I wanted to determine
if any particular crop (and its associated management type) would exhibit consistently
higher (or lower) insect biodiversity levels, which would theoretically
indicate a production system that was more (or less) sustainable. The four
cash crops tested were monocrop yerba mate (a South American tea),
interplanted yerba mate, maize, and soybeans. Does the ANOVA
suggest that the crop types may represent distinct populations (in terms
of number of species)? Which crops appear to be more sustainable?
| Crop type | Number of species (one value per sample) |
| Yerba mate, M | 23, 26, 22, 22, 23, 19, 24, 22, 25, 34, 26, 24, 28, 20, 34 |
| Yerba mate, I | 19, 34, 20, 26, 26, 27, 24, 29, 45, 28, 26, 24 |
| Maize | 28, 16, 25, 14, 20, 17, 11 |
| Soybeans | 27, 29, 15, 21, 26, 25, 17, 18 |
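A one-way ANOVA on the species counts can be sketched like this. It is a minimal illustration using only the Python standard library; the function name `one_way_anova` is made up for this sketch, and it stops at the F ratio and degrees of freedom, which you would then compare with a tabled F value (or pass to a statistics package) to get the p-value.

```python
# Number of species per sample, copied from the table above.
samples = {
    "Yerba mate, M": [23, 26, 22, 22, 23, 19, 24, 22, 25, 34, 26, 24, 28, 20, 34],
    "Yerba mate, I": [19, 34, 20, 26, 26, 27, 24, 29, 45, 28, 26, 24],
    "Maize": [28, 16, 25, 14, 20, 17, 11],
    "Soybeans": [27, 29, 15, 21, 26, 25, 17, 18],
}


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of each score around its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


f_ratio, df_between, df_within = one_way_anova(list(samples.values()))
for crop, counts in samples.items():
    print(f"{crop}: mean = {sum(counts) / len(counts):.2f}")
print(f"F = {f_ratio:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

The group means printed alongside F help with the second question, since a significant F only says that the samples differ, not which crops are higher or lower.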
- Another means to practice the ANOVA test is to
compare statistics representing places in distinct regions. Go to the US
Census Bureau page and pick a variable that is collected and summarized by
state. Divide the states into regions (you can use those created by the
Census Bureau [Northeast, South, etc.] or a grouping of your choosing, but
try to provide a justification for your grouping). State your null and
alternative hypotheses. Now, determine if the variable scores for the
regions are significantly different such that they would represent
different populations, at least as far as the variable of interest is
concerned. (What is your p-value? Discuss its implications in regard to
the null hypothesis.)
- A straightforward use of correlation analysis
is common in biogeographical research where a characteristic of a plant or
animal is measured along a continuum. For example, the geographer might
perceive that a particular tree species assumes larger dimensions at sea
level than in nearby mountains. Data on tree diameters could be collected
at varying altitudes to test for such a correlation. What is the null hypothesis
in this case? And the alternative hypothesis?
| Altitude | 500 | 550 | 650 | 700 | 750 | 850 | 900 | 1000 | 1150 | 1200 |
| Tree Diameter | 11.7 | 7.0 | 11.3 | 5.5 | 6.9 | 9.3 | 6.8 | 5.8 | 8.3 | 5.7 |
a. Calculate the Pearson’s product moment coefficient. (Remember, you will first need to
calculate the mean for each variable, and then the differences from the mean.)
What is the p-value for the correlation? Would you feel confident in rejecting
the null hypothesis?
b. Recalculate the correlation coefficient, this time using the Spearman’s rank correlation
coefficient. What is your p-value in this case? Would it change your
interpretation of the correlation?
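Both coefficients in parts a and b can be sketched with the Python standard library. This is an illustration under stated assumptions: the helper names `pearson`, `ranks`, and `spearman` are made up here, the ranking helper assumes no tied values (true for this data set), and significance is approximated by converting r to a t statistic with n − 2 degrees of freedom and comparing it with the tabled two-tailed 5% value for 8 degrees of freedom.

```python
import math

altitude = [500, 550, 650, 700, 750, 850, 900, 1000, 1150, 1200]
diameter = [11.7, 7.0, 11.3, 5.5, 6.9, 9.3, 6.8, 5.8, 8.3, 5.7]


def pearson(x, y):
    """Pearson's product moment coefficient from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest). Assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman(x, y):
    """Spearman's rank coefficient: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


r = pearson(altitude, diameter)
rs = spearman(altitude, diameter)
n = len(altitude)
t = r * math.sqrt((n - 2) / (1 - r ** 2))   # t statistic for testing r against zero
T_CRITICAL_05 = 2.306                       # tabled two-tailed value, df = 8, alpha = 0.05

print(f"Pearson r = {r:.3f}, t = {t:.2f} (critical value {T_CRITICAL_05})")
print(f"Spearman rs = {rs:.3f}")
```

The significance of the Spearman coefficient is usually read from a table of critical values for small samples, so it is left out of the sketch.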
- For additional practice with calculating the correlation
coefficients, try making some spurious correlations similar to that
presented during lecture (i.e., between population growth rate and
economic growth rate). Some interesting data on so-called developing
countries is found at the US Agency for International Development (AID)
website. Select two variables (e.g., education spending, literacy rate,
birth rate, etc.) and ten countries which represent a range of values for
both variables. (Remember if you use the democracy ratings, these are
perhaps better interpreted as ordinal data and should be used only in the
Spearman’s rank correlation test. Why?) Calculate the respective
correlation coefficients. Can you identify any amazing, humorous, or other
correlations?