Geography 360, Fall 2003:

Problem Set Seven

 

In this problem set, we will practice the calculation and interpretation of χ² (chi-square), NND (nearest neighbor analysis), ANOVA, and correlation tests. To prepare for the final exam, be sure you can identify the situations (the types of data available and the types of questions or hypotheses posed) in which you would use each test. Are you testing observed values assigned to categories against expected values based on a random distribution, average distributions, or modeled outcomes (χ²)? Are you assessing the randomness of a spatial distribution (nearest neighbor analysis)? Are you comparing more than two samples in order to make a statement about the possibility of their being drawn from the same population (ANOVA)? Are you testing for a relationship between two variables without assuming any causality (correlation)? For correlation, remember that you need to determine whether the population for each variable is normally distributed, the sample is random, and the data are interval or ratio in order to choose between the parametric test (Pearson’s product-moment) and the non-parametric test (Spearman’s rank correlation). Are you testing for a causal or directional relationship with the goal of calculating the value of a dependent variable at a given value of an independent variable (regression)? Also, remember that each of the tests of significance that we perform is based on the normal distribution.
If a value is considered significant, it is one of a very limited number (e.g., < 5%, < 1%, etc.) of possible outcomes and, thus, less likely to be due to the uncertainty inherent in sampling. What "significant" means depends on the test. In χ², is the difference between observed and expected values large enough to imply an effect not modeled (e.g., not random, not based on probability, etc.)? In nearest neighbor analysis, is the average nearest neighbor distance in the observed distribution significantly different from that of a random spatial distribution? In ANOVA, is the between-group variance large enough to limit the possibility of all samples being drawn from the same population? In correlation, is the relationship between the variables significantly different from a random, or uncorrelated, distribution? In regression, is the explained variation in the dependent variable a significantly large proportion of the total variation in the dependent variable?

 

  1. For the χ² test, first calculate the significance (p-value) of the coin flips we all participated in during the first lecture on probability. Recall that the class produced the following frequency distribution for five coin flips: no one had zero heads, 7 had one head, 3 had two heads, 9 had three heads, 3 had four heads, and 1 had five heads. (Because the test requires expected values of at least 5 in at least 4/5ths of the categories, we will cheat and multiply our values by seven to get more appropriate category values. Note that this is not proper procedure, but in the interest of the problem set… So the new distribution is [0, 49, 21, 63, 21, 7].) Compare this with the expected random distribution of 4.9 with no heads, 25.9 with one, 49.0 with two, 49.0 with three, 25.9 with four, and 4.9 with five. Is our distribution significantly different from the random distribution? In this case the null hypothesis would be that the distributions are similar; the alternative hypothesis would be that the distributions are different and that there is the possibility of loaded coins or misreporting of data.
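As a rough check on the hand calculation, the χ² statistic for the scaled coin-flip counts can be sketched in plain Python. The helper name `chi_square_statistic` is hypothetical, and the critical value 11.07 (df = 5, α = 0.05) is assumed from a standard χ² table rather than computed here.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Counts for 0..5 heads, multiplied by seven as described in the problem.
observed = [0, 49, 21, 63, 21, 7]
# Expected counts as given in the problem statement.
expected = [4.9, 25.9, 49.0, 49.0, 25.9, 4.9]

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1   # number of categories minus one
critical_05 = 11.07      # assumed tabled value for df = 5, alpha = 0.05

print(f"chi-square = {chi2:.2f}, df = {df}")
if chi2 > critical_05:
    print("Reject the null hypothesis at the 0.05 level")
else:
    print("Fail to reject the null hypothesis at the 0.05 level")
```

If SciPy happens to be installed, `scipy.stats.chi2.sf(chi2, df)` should return the corresponding p-value; the sketch above avoids that dependency and relies on the tabled critical value instead.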

 

  2. For the nearest neighbor analysis, I will ask you to do a little data gathering on your own.  The state of Wisconsin can be divided into several geophysical regions (e.g., the driftless region, the sand plain region, the ridge and valley region, etc.).  Consider how we might use the nearest neighbor calculation to determine whether the physical geography of these regions affects the settlement pattern in each.  By selecting a county from each region (for example, Monroe County for the driftless, Clark for the sand plain, Jefferson for the ridge and valley, etc.), we can calculate the nearest neighbor R-value (the ratio of the observed mean nearest neighbor distance to the mean distance expected for a random distribution of the same density) for named places in each.  (Note that I suggested counties without large urban centers, in which suburban development would bias the calculation toward a clustered distribution.  Another option would be to use a historical map for this project.)  Values closer to 1 would suggest a random distribution and little influence of physical or other factors on the distribution.  How would you interpret values closer to zero (clustered) or closer to 2.149 (dispersed), relative to influences on settlement location?  Calculate the nearest neighbor statistic for at least three counties (consider including an example from northern Wisconsin as well).  Remember that you will need to calculate the area of each county to obtain density, and think about how you will deal with cities, towns, or other places that are located closer to a named place outside the county you are examining than to one inside it.  Do you see differences in the values?  Can you think of explanations for these differences?  Now, calculate the significance of the R-values for each county.  Are any of the settlement patterns significantly different from a random distribution?  Finally, you can also use your calculations to run a χ² or an ANOVA test.
For the χ² test, your calculated R-value can be the observed value for each category (county), and the expected value would be an R-value calculated for a combined sample (remember to adjust the density calculation for the larger sample size: total number of points divided by combined area of the counties).  Does this test suggest that there is a significant difference among the counties?  For the ANOVA, use the nearest neighbor distance for each place in the counties as a variable score.  Each county is now a sample of several distances.  Does an ANOVA suggest that the county settlement patterns are different?  (Be aware as you interpret these tests that the ANOVA does not consider the role of density.  How does this affect your interpretation of the results?)
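The R-value and its significance can be sketched as follows, assuming the usual formulas for a random pattern: an expected mean distance of 1 / (2·√density) and a standard error of 0.26136 / √(n·density). The function name `nearest_neighbor_stats` and the grid of points are hypothetical illustrations, not county data, and the sketch ignores the edge problem of neighbors lying outside the study area.

```python
import math


def nearest_neighbor_stats(points, area):
    """Return (R, z) for a list of (x, y) points within a study area."""
    n = len(points)
    # Distance from each point to its nearest neighbor inside the study area.
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    observed_mean = sum(nearest) / n
    density = n / area
    # Expected mean nearest neighbor distance for a random pattern.
    expected_mean = 1.0 / (2.0 * math.sqrt(density))
    # Standard error of the mean distance under randomness.
    std_error = 0.26136 / math.sqrt(n * density)
    r_value = observed_mean / expected_mean
    z_score = (observed_mean - expected_mean) / std_error
    return r_value, z_score


# Hypothetical example: a 4 x 4 grid of places, 1 unit apart, in 16 square units.
grid = [(x + 0.5, y + 0.5) for x in range(4) for y in range(4)]
r_value, z_score = nearest_neighbor_stats(grid, area=16.0)
print(f"R = {r_value:.3f}, z = {z_score:.2f}")
```

A z-score beyond ±1.96 would indicate a pattern significantly different from random at the 0.05 level (two-tailed). For the county exercise, the coordinates would come from your map measurements and the area from the county's published area, in matching units.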

 

  3. To practice the ANOVA test, I will ask you to analyze some of the data that I collected for my dissertation. (The following provides some context for the question that the data and test are expected to address. If it makes no sense, you may still conduct the test and interpret the results. If you choose to skip the explanation, you may do that as well.) In economics, conventional trade theory suggests that free trade among and within countries will benefit all by encouraging more efficient use of the factors of production (in agriculture these include labor, capital, and the land on which the crops are cultivated). In particular, trade theorists argue that farmers will learn to specialize in the crop that they produce most efficiently in order to trade for crops produced more efficiently elsewhere. Some have challenged the claims of uniform benefits based on the potential negative ecological impacts associated with monoculture production. Conventional economists have suggested, however, that the more efficient means of production is often also the most environmentally benign. In order to test the potential ecological impact of a regional free trade agreement in South America, I collected insect samples from fields of locally important cash crops, those in which farmers would potentially specialize. I wanted to determine whether any particular crop (and its associated management type) would exhibit consistently higher (or lower) insect biodiversity levels, which would, theoretically, indicate a production system that was more (or less) sustainable. The four cash crops tested were monocrop yerba mate (a South American tea), interplanted yerba mate, maize, and soybeans. Does the ANOVA suggest that the crop types may represent distinct populations (in terms of number of species)? Which crops appear to be more sustainable?

 

Crop type        Number of species
Yerba mate, M    23, 26, 22, 22, 23, 19, 24, 22, 25, 34, 26, 24, 28, 20, 34
Yerba mate, I    19, 34, 20, 26, 26, 27, 24, 29, 45, 28, 26, 24
Maize            28, 16, 25, 14, 20, 17, 11
Soybeans         27, 29, 15, 21, 26, 25, 17, 18
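A one-way ANOVA on the species counts above can be sketched in plain Python as follows. The function name `one_way_anova` is a hypothetical helper; it returns the F ratio and the two degrees of freedom, which you would then compare against a tabled F value.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the individual scores around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Number of species per sample, from the table in the problem.
crops = {
    "Yerba mate, M": [23, 26, 22, 22, 23, 19, 24, 22, 25, 34, 26, 24, 28, 20, 34],
    "Yerba mate, I": [19, 34, 20, 26, 26, 27, 24, 29, 45, 28, 26, 24],
    "Maize": [28, 16, 25, 14, 20, 17, 11],
    "Soybeans": [27, 29, 15, 21, 26, 25, 17, 18],
}

f_ratio, df_between, df_within = one_way_anova(list(crops.values()))
print(f"F = {f_ratio:.2f} with ({df_between}, {df_within}) degrees of freedom")
for name, counts in crops.items():
    print(f"{name}: mean = {sum(counts) / len(counts):.2f}")
```

If SciPy is available, `scipy.stats.f_oneway` applied to the four lists should give the same F ratio along with a p-value. The group means printed at the end help with the follow-up question about which crops appear more sustainable.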

 

 

 

 

 

 

 

 

 

  4. Another way to practice the ANOVA test is to compare statistics representing places in distinct regions. Go to the US Census Bureau page and pick a variable that is collected and summarized by state. Divide the states into regions (you can use those created by the Census Bureau [Northeast, South, etc.] or a grouping of your choosing, but try to provide a justification for your grouping). State your null and alternative hypotheses. Now, determine whether the variable scores for the regions are significantly different, such that they would represent different populations, at least as far as the variable of interest is concerned. (What is your p-value? Discuss its implications in regard to the null hypothesis.)

 

  5. A straightforward use of correlation analysis is common in biogeographical research, where a characteristic of a plant or animal is measured along a continuum. For example, a geographer might perceive that a particular tree species assumes larger dimensions at sea level than in nearby mountains. Data on tree diameters could be collected at varying altitudes to test for such a correlation. What is the null hypothesis in this case? And the alternative hypothesis?

 

Altitude         500    550    650    700    750    850    900    1000   1150   1200
Tree Diameter    11.7   7.0    11.3   5.5    6.9    9.3    6.8    5.8    8.3    5.7

a. Calculate Pearson’s product-moment correlation coefficient. (Remember, you will first need to calculate the mean for each variable, and then the differences from the mean.) What is the p-value for the correlation? Would you feel confident in rejecting the null hypothesis?

 

b. Recalculate the correlation, this time using Spearman’s rank correlation coefficient. What is your p-value in this case? Would it change your interpretation of the correlation?
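Both coefficients for the altitude and tree diameter data can be sketched in plain Python as follows. The helper names are hypothetical, the rank function assumes no tied values (true for these data), and the t statistic uses the common conversion t = r·√(n − 2) / √(1 − r²), to be compared against a tabled t value with n − 2 degrees of freedom.

```python
import math

altitude = [500, 550, 650, 700, 750, 850, 900, 1000, 1150, 1200]
diameter = [11.7, 7.0, 11.3, 5.5, 6.9, 9.3, 6.8, 5.8, 8.3, 5.7]


def pearson_r(x, y):
    """Pearson's r from the deviations of each variable about its mean."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def ranks(values):
    """Rank 1 (smallest) to n; assumes there are no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman's rank coefficient: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def t_statistic(r, n):
    """Convert a correlation coefficient to a t value with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = pearson_r(altitude, diameter)
rs = spearman_rs(altitude, diameter)
t = t_statistic(r, len(altitude))
print(f"Pearson r = {r:.3f}, t = {t:.2f} (df = {len(altitude) - 2})")
print(f"Spearman rs = {rs:.3f}")
```

If SciPy is available, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` should return the same coefficients together with p-values. For a sample this small, the significance of Spearman's coefficient is usually read from a table of critical values for the given n.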

 

  6. For additional practice calculating the correlation coefficients, try making some spurious correlations similar to the one presented during lecture (i.e., between population growth rate and economic growth rate). Some interesting data on so-called developing countries can be found at the US Agency for International Development (AID) website. Select two variables (e.g., education spending, literacy rate, birth rate, etc.) and ten countries that represent a range of values for both variables. (Remember, if you use the democracy ratings, these are perhaps better interpreted as ordinal data and should be used only in the Spearman’s rank correlation test. Why?) Calculate the respective correlation coefficients. Can you identify any amazing, humorous, or other correlations?