1, Data description
Introduction to the agpop data file: every five years the U.S. government conducts a census of agriculture and collects data on all farms in the 50 states. The data file contains 3041 records, each describing a U.S. county or a county-equivalent unit. There are 4 regions (region/rnum), 50 states (state/snum), and 3041 counties (county/cnum).
The variables we use are:
county/cnum, state/snum, region/rnum,
Cultivated land area of each county in 1992 (acres92), cultivated land area of each county in 1987 (acres87),
Number of farms in each county in 1992 (farms92).
The target variable is the cultivated land area in 1992 (acres92).
2, Stratified random sampling
1. Sampling requirements: use "region" as the stratification variable and draw a simple random sample of 75 units from each stratum.
(1) Define some variables involved in stratified sampling.
> library(sampling) #Provides strata() and getdata()
> library(survey) #Provides svydesign(), svymean() and svytotal()
> data=read.csv("E:/experiment/Sampling technical data file.csv",header=TRUE,sep=",")
> N=nrow(data) #Population size
> Nh=table(data$region) #Number of population units in stratum h
> Wh=Nh/N #Weight of stratum h
> L=length(unique(data$region)) #Number of strata
> nh=rep(75,L) #Number of sample units per stratum
Operation result:
> N=nrow(data);N #Population size
[1] 3041
> Nh=table(data$region);Nh #Number of population units in stratum h
  NC   NE    S    W
1049  211 1368  413
> Wh=Nh/N;Wh #Weight of stratum h
        NC         NE          S          W
0.34495232 0.06938507 0.44985202 0.13581059
> L=length(unique(data$region));L #Number of strata
[1] 4
> nh=rep(75,L);nh #Number of sample units per stratum
[1] 75 75 75 75
Result interpretation:
The population of 3041 counties is divided into four strata (NC, NE, S and W) with stratum weights 0.34495232, 0.06938507, 0.44985202 and 0.13581059, and 75 units are sampled from each stratum.
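The stratum weights can be checked by hand from the stratum sizes printed above. The following is a minimal sketch in base R; the sampling fraction fh is a name introduced here for illustration and does not appear in the original script.

```r
# Stratum sizes as reported in the output above
Nh <- c(NC = 1049, NE = 211, S = 1368, W = 413)
N  <- sum(Nh)                    # population size
Wh <- Nh / N                     # stratum weights W_h = N_h / N
nh <- c(NC = 75, NE = 75, S = 75, W = 75)
fh <- nh / Nh                    # sampling fraction in each stratum (illustrative name)
print(round(cbind(Nh, Wh, nh, fh), 5))
```

With equal allocation the small NE stratum is sampled at a much higher rate than the large S stratum, which is why the sampling weights differ between strata later on.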
(2) Call the stratified sampling function "strata". The first parameter is the population data set (sorted by the stratification variable here), the second parameter is the name of the stratification variable, the third parameter is the vector of sample sizes per stratum (in the order in which the strata appear in the data), and the fourth parameter is the sampling method used within each stratum (the available methods are "srswor", "srswr", "poisson" and "systematic").
> st=sampling::strata(data[order(data$region),],"region",nh,"srswor") #Call the stratified sampling function
Operation result:
> st=strata(data[order(data$region),],"region",nh,"srswor");st #Call the stratified sampling function
   region ID_unit       Prob Stratum
32     NC      32 0.07149666       1
44     NC      44 0.07149666       1
55     NC      55 0.07149666       1
59     NC      59 0.07149666       1
65     NC      65 0.07149666       1
66     NC      66 0.07149666       1
76     NC      76 0.07149666       1
78     NC      78 0.07149666       1
85     NC      85 0.07149666       1
94     NC      94 0.07149666       1
......
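Result interpretation:
Simple random sampling without replacement (srswor) is applied within each stratum. For each selected unit the output gives its stratum label (region), its row position in the sorted population data (ID_unit), its inclusion probability (Prob, here 75/1049 = 0.07149666 for NC) and the stratum index (Stratum).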
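To make the mechanics concrete, the following base R sketch draws a simple random sample without replacement inside each stratum of a small made-up population. It illustrates the idea only and is not the implementation used by the sampling package; the data frame pop, its columns and the seed are all hypothetical.

```r
set.seed(1)  # hypothetical seed, only so the sketch is reproducible

# Hypothetical toy population with three strata of different sizes
pop <- data.frame(
  id     = 1:60,
  region = rep(c("A", "B", "C"), times = c(30, 20, 10)),
  y      = round(runif(60, 100, 1000))
)
pop <- pop[order(pop$region), ]   # sort once and reuse this ordering everywhere

nh <- c(A = 5, B = 5, C = 5)      # sample size per stratum

# Draw nh[h] row positions without replacement inside each stratum
rows_by_stratum <- split(seq_len(nrow(pop)), pop$region)
picked <- unlist(lapply(names(rows_by_stratum), function(h) {
  sample(rows_by_stratum[[h]], size = nh[[h]], replace = FALSE)
}))

# Extract the sampled rows from the SAME sorted data frame
smp <- pop[picked, ]

# Under SRSWOR the inclusion probability of a unit in stratum h is n_h / N_h
Nh <- table(pop$region)
smp$Prob <- as.numeric(nh[smp$region] / Nh[smp$region])
print(table(smp$region))
```

Because the selected positions refer to rows of the sorted data frame, the sample must be extracted from that same sorted data frame.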
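(3) Call the function "getdata" to extract the sample data. ID_unit refers to row positions in the sorted data frame that was passed to "strata", so the same sorted data frame must be passed to "getdata"; passing the unsorted data would return rows that do not belong to the sampled strata.
> data.strata=getdata(data[order(data$region),],st) #Extract the sample data from the same sorted data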
Operation result:
> data.strata
                county cnum state snum acres92 acres87 acres82 farms92 farms87 farms82 largef92 largef87
32       ALCORN COUNTY   32    TX   43  443027  393949  458988     297     296     341      107      103
44    ALLEGHENY COUNTY   44    IL   14  203428  204191  222486     724     808     933       46       33
55      AMHERST COUNTY   55    NE   29   96093  120598  119849     388     475     482       24       24
59     ANDERSON COUNTY   59    KS   16  328094  331397  325328     254     285     277      112      122
65 ANDROSCOGGIN COUNTY   65    OK   36  157105  149021  153082     866     894     964       26       11
66     ANGELINA COUNTY   66    MI   22   22488   22196   37326     303     334     439        3        2
76      ARANSAS COUNTY   76    MI   22   31427   31103   39867     178     178     229        3        0
78       ARCHER COUNTY   78    MN   23  168073  178100  182984     563     640     740       21       18
Result interpretation:
The output lists the survey variables (acres92, acres87, farms92 and so on) of the selected sample units.
2. Estimates
(1) Define the sample weight variable pw. The weight of each sample unit is the reciprocal of its inclusion probability.
> pw=1/st$Prob
Operation result:
> pw=1/st$Prob;pw
 [1] 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667
[13] 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667
[25] 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667
......
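Result interpretation:
The output is the sampling weight of each sample unit. For the NC units shown the weight is 1049/75 = 13.986667, so each sampled county represents about 14 counties of its stratum.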
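A quick consistency check on such weights is that they add up to the population size. The sketch below rebuilds the weights from the stratum sizes reported earlier rather than from the drawn sample; w_h and pw_all are names introduced here for illustration.

```r
Nh <- c(NC = 1049, NE = 211, S = 1368, W = 413)   # population stratum sizes
nh <- c(NC = 75, NE = 75, S = 75, W = 75)         # sample sizes per stratum

w_h    <- Nh / nh                 # weight shared by every sampled unit in stratum h
pw_all <- rep(w_h, times = nh)    # one weight per sampled unit (300 values)

print(w_h)
print(sum(pw_all))                # the weights sum to the population size N
```

If the sum of the weights in a real sample differs from N, the weights and the sample rows are probably misaligned.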
(2) Define the fpc variable. For each sample unit it is the total number of population units in the stratum to which the unit belongs.
> fpc=as.numeric(table(data$region)[data.strata$region])
Operation result:
> fpc=as.numeric(table(data$region)[data.strata$region]);fpc
 [1] 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049
[26] 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049
......
Result interpretation:
The output is, for each sample unit, the total number of population units in its stratum (1049 for the NC units shown).
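The expression used for fpc relies on indexing a table by stratum label. A small sketch with made-up stratum sizes shows the lookup; the conversion with as.character is a defensive choice added here, because indexing with a factor uses its integer codes rather than its labels.

```r
# Hypothetical population: 4 units in NC, 2 in NE, 3 in S
Nh <- table(c(rep("NC", 4), rep("NE", 2), rep("S", 3)))

# Stratum label of each sampled unit (hypothetical sample)
sample_region <- c("NC", "NC", "S", "NE")

# Look up the population size of each unit's stratum by name
fpc <- as.numeric(Nh[as.character(sample_region)])
print(fpc)
```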
(3) Add the weight variable and fpc variable to the data set of the sample unit
> agstrat=as.data.frame(cbind(data.strata,pw,fpc))
Operation result:
> agstrat=as.data.frame(cbind(data.strata,pw,fpc));agstrat
             county cnum state snum acres92 acres87 acres82 farms92 farms87 farms82 largef92 largef87 largef82 smallf92
32    ALCORN COUNTY   32    TX   43  443027  393949  458988     297     296     341      107      103      127       12
44 ALLEGHENY COUNTY   44    IL   14  203428  204191  222486     724     808     933       46       33       36       69
55   AMHERST COUNTY   55    NE   29   96093  120598  119849     388     475     482       24       24       25       62
59  ANDERSON COUNTY   59    KS   16  328094  331397  325328     254     285     277      112      122      114        8
Result interpretation:
The output is the sample data set with the weight variable pw and the fpc variable appended.
(4) Call the svydesign function to define the sampling design and attach the sample data. The id parameter specifies the cluster variable; if there are no clusters, use ~0 or ~1. The strata parameter specifies the stratification variable, the weights parameter specifies the weight variable, the data parameter specifies the data set of sampled units, and the fpc parameter specifies the fpc variable.
> dstrat<-svydesign(id=~1,strata = ~region,weights=~pw,data = agstrat,fpc=~fpc)
Operation result:
> dstrat<-svydesign(id=~1,strata = ~region,weights=~pw,data = agstrat,fpc=~fpc);dstrat
Stratified Independent Sampling design
svydesign(id = ~1, strata = ~region, weights = ~pw, data = agstrat, fpc = ~fpc)
(5) View the sampling design and sampling results. "Probabilities" summarizes the inclusion probabilities of the sample units.
> summary(dstrat)
Operation result:
> summary(dstrat)
Stratified Independent Sampling design
svydesign(id = ~1, strata = ~region, weights = ~pw, data = agstrat, fpc = ~fpc)
Probabilities:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.05482 0.06733 0.12655 0.16584 0.22506 0.35545
Stratum Sizes:
           NC NE  S  W
obs        75 75 75 75
design.PSU 75 75 75 75
actual.PSU 75 75 75 75
Population stratum sizes (PSUs):
  NC   NE    S    W
1049  211 1368  413
Data variables:
 [1] "county"   "cnum"     "state"    "snum"     "acres92"  "acres87"  "acres82"  "farms92"  "farms87"  "farms82"  "largef92"
[12] "largef87" "largef82" "smallf92" "smallf87" "smallf82" "rnum"     "region"   "ID_unit"  "Prob"     "Stratum"  "pw"
[23] "fpc"
Result interpretation:
Each stratum contributes 75 sample units. The inclusion probabilities range from 0.05482 (stratum S, 75/1368) to 0.35545 (stratum NE, 75/211), and their mean over the 300 sample units is 0.16584.
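The probability summary can be reproduced from the stratum sizes alone, since every unit in stratum h has inclusion probability 75/N_h. The sketch below is a hand check in base R and does not use the survey package; prob is a name introduced here.

```r
Nh <- c(NC = 1049, NE = 211, S = 1368, W = 413)   # population stratum sizes
nh <- 75                                          # sample size in every stratum

# Inclusion probability of each of the 4 * 75 = 300 sampled units
prob <- rep(nh / Nh, each = nh)
print(summary(prob))
```

The minimum, median, mean and maximum obtained this way agree with the values printed by summary(dstrat) above.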
(6) Simple estimation of the mean of the target variable (acres92) and of its standard error. The first parameter specifies the target variable and the second parameter specifies the survey design object. If the deff parameter is set to TRUE, the design effect of the sampling design is included in the output; the default value is FALSE.
> svymean(~acres92,dstrat,deff=TRUE)
Operation result:
> svymean(~acres92,dstrat,deff=TRUE)
          mean     SE   DEff
acres92 318856  34340 1.3562
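Result interpretation:
From the output, the simple estimate of the mean of acres92 is 318856 with a standard error of 34340. The design effect is 1.3562, so the variance under this equal-allocation stratified design is about 1.36 times the variance of a simple random sample of the same size. Because the sample is drawn at random, these figures change from run to run.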
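For reference, the textbook stratified estimator of the mean and its standard error under simple random sampling without replacement within strata can be written in a few lines of base R. This is a sketch of the standard formulas, not the code of the survey package; the function name strat_mean and the toy data are hypothetical.

```r
# Stratified mean and standard error under SRSWOR within strata.
# y: sampled values; stratum: stratum label of each value;
# Nh: named vector of population stratum sizes.
strat_mean <- function(y, stratum, Nh) {
  stratum <- as.character(stratum)
  h    <- names(Nh)
  N    <- sum(Nh)
  nh   <- tapply(y, stratum, length)[h]   # sample size per stratum
  ybar <- tapply(y, stratum, mean)[h]     # sample mean per stratum
  s2   <- tapply(y, stratum, var)[h]      # sample variance per stratum
  Wh   <- Nh / N                          # stratum weights
  est  <- sum(Wh * ybar)                  # weighted mean of stratum means
  v    <- sum(Wh^2 * (1 - nh / Nh) * s2 / nh)  # includes the fpc factor 1 - nh/Nh
  c(mean = est, SE = sqrt(v))
}

# Hypothetical toy data, not the agpop sample
y       <- c(10, 12, 14, 20, 22, 24, 26)
stratum <- c("a", "a", "a", "b", "b", "b", "b")
Nh      <- c(a = 30, b = 70)
res <- strat_mean(y, stratum, Nh)
print(res)
```

The factor 1 - nh/Nh is the finite population correction, which is the reason the design needs the stratum sizes supplied through fpc.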
(7) Simple estimation of the population total of the target variable (acres92) and of its standard error. The first parameter specifies the target variable and the second parameter specifies the survey design object. If the deff parameter is set to TRUE, the design effect of the sampling design is included in the output; the default value is FALSE.
> svytotal(~acres92,dstrat,deff=TRUE)
Operation result:
> svytotal(~acres92,dstrat,deff=TRUE)
            total        SE   DEff
acres92 969642162 104429224 1.3562
Result interpretation:
From the output, the simple estimate of the population total of acres92 is 969642162 with a standard error of 104429224, and the design effect is again 1.3562.
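The estimated total and the estimated mean are linked through the population size N = 3041, which gives a simple check on the two outputs. The symbols below are notation introduced here for illustration, and the numerical check uses the rounded values printed by svymean, so it agrees with the svytotal output only approximately.

```latex
\hat{Y}_{st} = N\,\bar{y}_{st},
\qquad
\mathrm{SE}(\hat{Y}_{st}) = N\,\mathrm{SE}(\bar{y}_{st})

% With N = 3041 and the rounded svymean values:
3041 \times 318856 \approx 9.696 \times 10^{8},
\qquad
3041 \times 34340 \approx 1.044 \times 10^{8}
```

Multiplying by the constant N scales the variance of both the stratified design and the reference simple random sample by the same factor, which is consistent with the design effect being 1.3562 in both outputs.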