Stratified random sampling in sample surveys

Posted by hessodreamy on Sun, 16 Jan 2022 04:21:09 +0100

1, Data description

Introduction to the agpop data file: the U.S. government conducts a census of agriculture every five years and collects data on all farms in the 50 states. The data file contains 3041 records, each describing the farms of one U.S. county or county equivalent. There are 4 regions (region/rnum), 50 states (state/snum), and 3041 counties (county/cnum).
The variables we use are:
county/cnum, state/snum, region/rnum,
cultivated land area of each county in 1992 (acres92), cultivated land area of each county in 1987 (acres87),
number of farms in each county in 1992 (farms92).
The target variable is the cultivated land area in 1992 (acres92).

2, Stratified random sampling

1. Sampling requirements: use "region" as the stratification variable and draw a simple random sample of 75 units from each stratum.

(1) Define some variables involved in stratified sampling.

> data=read.csv("E:/experiment/Sampling technical data file.csv",header=TRUE,sep=",")
> N=nrow(data) #Population size
> Nh=table(data$region) #Population size of stratum h
> Wh=Nh/N #Weight of stratum h
> L=length(unique(data$region)) #Number of strata
> nh=rep(75,L) #Number of sample units per stratum

Operation result:

> N=nrow(data);N #Population size
[1] 3041
> Nh=table(data$region);Nh #Population size of stratum h
  NC   NE    S    W 
1049  211 1368  413 
> Wh=Nh/N;Wh #Weight of stratum h
        NC         NE          S          W 
0.34495232 0.06938507 0.44985202 0.13581059 
> L=length(unique(data$region));L #Number of strata
[1] 4
> nh=rep(75,L);nh #Number of sample units per stratum
[1] 75 75 75 75

Result interpretation:
The population of 3041 counties is divided into four strata (NC, NE, S, W) with stratum weights 0.34495232, 0.06938507, 0.44985202 and 0.13581059, and 75 units are sampled from each stratum.
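As a quick check of the weights, the stratum counts printed above can be typed in directly. This is a minimal base-R sketch that does not need the data file; the variable names simply follow the ones used in this post.

```r
# Stratum population sizes copied from the table(data$region) output
Nh <- c(NC = 1049, NE = 211, S = 1368, W = 413)
N  <- sum(Nh)          # population size
Wh <- Nh / N           # stratum weights W_h = N_h / N
print(N)
print(round(Wh, 8))
print(sum(Wh))         # the weights of all strata add up to 1
```

The largest stratum (S) carries about 45% of the population, while each stratum contributes the same 75 units to the sample.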

(2) Call the stratified sampling function "strata" from the sampling package. The first parameter is the population data set (sorted by the stratification variable here), the second parameter is the stratification variable, the third parameter is the number of sample units in each stratum, and the fourth parameter is the sampling method within each stratum (the available methods are "srswor", "srswr", "poisson" and "systematic").

> library(sampling)
> st=sampling::strata(data[order(data$region),],"region",nh,"srswor") #Call the stratified sampling function

Operation result:

> st=sampling::strata(data[order(data$region),],"region",nh,"srswor");st #Call the stratified sampling function
     region ID_unit       Prob Stratum
32       NC      32 0.07149666       1
44       NC      44 0.07149666       1
55       NC      55 0.07149666       1
59       NC      59 0.07149666       1
65       NC      65 0.07149666       1
66       NC      66 0.07149666       1
76       NC      76 0.07149666       1
78       NC      78 0.07149666       1
85       NC      85 0.07149666       1
94       NC      94 0.07149666       1
......

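Result interpretation:
The data set is stratified by region and each stratum is sampled with the "srswor" method (simple random sampling without replacement). Each output row is one selected unit: ID_unit is its row position in the sorted population, Prob is its inclusion probability, and Stratum is the index of its stratum.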
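To make the idea concrete, here is a base-R sketch of stratified simple random sampling without replacement. It is only an illustration of the technique and not the implementation inside the sampling package; the toy population, the column names and n_per_stratum are all hypothetical.

```r
# Hypothetical toy population with three strata of unequal size (not agpop)
set.seed(1)
pop <- data.frame(id = 1:60,
                  region = rep(c("A", "B", "C"), times = c(30, 20, 10)))
n_per_stratum <- 5

# Row positions grouped by stratum
rows_by_stratum <- split(seq_len(nrow(pop)), pop$region)

# Within each stratum, pick n_per_stratum distinct rows at random
picked <- unlist(lapply(rows_by_stratum, function(rows) {
  rows[sample.int(length(rows), n_per_stratum)]
}), use.names = FALSE)

smp <- pop[picked, ]
# Inclusion probability of a unit = sample size / population size of its stratum
smp$Prob <- n_per_stratum / as.numeric(table(pop$region)[smp$region])
print(table(smp$region))
```

Indexing with sample.int avoids the well-known pitfall of sample(x, n) when x has length one.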

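(3) Call the function "getdata" to extract the sample data. Its first argument should be the same sorted data set that was passed to "strata", because ID_unit refers to row positions in that data set.

> data.strata=getdata(data[order(data$region),],st)  #Extract the sample data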

Operation result:

> data.strata
                  county cnum state snum acres92 acres87 acres82 farms92 farms87 farms82 largef92 largef87
32         ALCORN COUNTY   32    TX   43  443027  393949  458988     297     296     341      107      103
44      ALLEGHENY COUNTY   44    IL   14  203428  204191  222486     724     808     933       46       33
55        AMHERST COUNTY   55    NE   29   96093  120598  119849     388     475     482       24       24
59       ANDERSON COUNTY   59    KS   16  328094  331397  325328     254     285     277      112      122
65   ANDROSCOGGIN COUNTY   65    OK   36  157105  149021  153082     866     894     964       26       11
66       ANGELINA COUNTY   66    MI   22   22488   22196   37326     303     334     439        3        2
76        ARANSAS COUNTY   76    MI   22   31427   31103   39867     178     178     229        3        0
78         ARCHER COUNTY   78    MN   23  168073  178100  182984     563     640     740       21       18

Result interpretation:
The output contains the observed values of every variable for the selected sample units.

2. Estimates

(1) Define the sample weight variable pw: the weight of each sample unit is the reciprocal of its inclusion probability.

> pw=1/st$Prob

Operation result:

> pw=1/st$Prob;pw
  [1] 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667
 [13] 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667
 [25] 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667 13.986667......

Result interpretation:
The output is the weight of each sample unit. The units shown belong to stratum NC, so each weight is 1049/75 = 13.986667.
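One useful property of these weights is that they add up to the population size, since each sampled unit in stratum h stands for N_h/n_h population units. A small base-R sketch using only the stratum counts shown earlier (no data file or package is assumed):

```r
Nh <- c(NC = 1049, NE = 211, S = 1368, W = 413)  # stratum population sizes
nh <- rep(75, length(Nh))                        # sample size per stratum

pw_h <- Nh / nh                 # weight shared by every unit in stratum h
pw   <- rep(pw_h, times = nh)   # one weight per sampled unit, 300 in total

print(round(pw_h, 6))
print(sum(pw))                  # equals the population size
```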

(2) Define the fpc variable: for each sample unit it is the population size of the stratum to which the unit belongs.

> fpc=as.numeric(table(data$region)[data.strata$region])

Operation result:

> fpc=as.numeric(table(data$region)[data.strata$region]);fpc
  [1] 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049
 [26] 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049......

Result interpretation:
The output is, for each sample unit, the population size of its stratum. The units shown belong to stratum NC, which contains 1049 counties.
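Supplying the stratum population size lets the variance calculation apply the finite population correction 1 - n_h/N_h in each stratum. The following base-R sketch computes those factors for this design from the stratum counts alone; the names sampling_fraction and correction are chosen here for illustration.

```r
Nh <- c(NC = 1049, NE = 211, S = 1368, W = 413)  # stratum population sizes
nh <- 75                                         # sample size in every stratum

sampling_fraction <- nh / Nh        # share of each stratum that is sampled
correction <- 1 - sampling_fraction # finite population correction factor

print(round(sampling_fraction, 4))
print(round(correction, 4))
```

The correction matters most in NE, where more than a third of the counties end up in the sample.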

(3) Add the weight variable and fpc variable to the data set of the sample unit

> agstrat=as.data.frame(cbind(data.strata,pw,fpc))

Operation result:

> agstrat=as.data.frame(cbind(data.strata,pw,fpc));agstrat
                  county cnum state snum acres92 acres87 acres82 farms92 farms87 farms82 largef92 largef87 largef82 smallf92
32         ALCORN COUNTY   32    TX   43  443027  393949  458988     297     296     341      107      103      127       12
44      ALLEGHENY COUNTY   44    IL   14  203428  204191  222486     724     808     933       46       33       36       69
55        AMHERST COUNTY   55    NE   29   96093  120598  119849     388     475     482       24       24       25       62
59       ANDERSON COUNTY   59    KS   16  328094  331397  325328     254     285     277      112      122      114        8

Result interpretation:
The output is the sample data set with the weight variable pw and the fpc variable appended.

(4) Call the svydesign function from the survey package to define the sampling design. The id parameter specifies the cluster variable; when there is no clustering it is written as ~0 or ~1. The strata parameter specifies the stratification variable, the weights parameter the weight variable, the data parameter the data set of sampled units, and the fpc parameter the fpc variable.

> library(survey)
> dstrat<-svydesign(id=~1,strata = ~region,weights=~pw,data = agstrat,fpc=~fpc)

Operation result:

> dstrat<-svydesign(id=~1,strata = ~region,weights=~pw,data = agstrat,fpc=~fpc);dstrat
Stratified Independent Sampling design
svydesign(id = ~1, strata = ~region, weights = ~pw, data = agstrat, 
    fpc = ~fpc)

(5) View the sampling design and the sample. In the output, Probabilities summarizes the inclusion probabilities of the sample units.

> summary(dstrat)

Operation result:

> summary(dstrat)
Stratified Independent Sampling design
svydesign(id = ~1, strata = ~region, weights = ~pw, data = agstrat, 
    fpc = ~fpc)
Probabilities:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05482 0.06733 0.12655 0.16584 0.22506 0.35545 
Stratum Sizes: 
           NC NE  S  W
obs        75 75 75 75
design.PSU 75 75 75 75
actual.PSU 75 75 75 75
Population stratum sizes (PSUs): 
  NC   NE    S    W 
1049  211 1368  413 
Data variables:
 [1] "county"   "cnum"     "state"    "snum"     "acres92"  "acres87"  "acres82"  "farms92"  "farms87"  "farms82"  "largef92"
[12] "largef87" "largef82" "smallf92" "smallf87" "smallf82" "rnum"     "region"   "ID_unit"  "Prob"     "Stratum"  "pw"      
[23] "fpc"

Result interpretation:
The summary confirms the design: 75 units were sampled from each of the four strata, whose population sizes are 1049, 211, 1368 and 413. The inclusion probabilities range from 0.05482 to 0.35545, with a mean of 0.16584.
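The probability summary can be reproduced without the survey package, because every unit in stratum h has inclusion probability n_h/N_h. A minimal base-R sketch under that assumption, using the stratum counts from the output:

```r
Nh <- c(NC = 1049, NE = 211, S = 1368, W = 413)  # stratum population sizes
nh <- 75                                         # sample size in every stratum

# Inclusion probability of each of the 300 sampled units
prob <- rep(nh / Nh, each = nh)

print(summary(prob))
```

With equal sample sizes per stratum, the mean is just the average of the four stratum probabilities, which is why it sits well above the median.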

(6) Estimate the mean of the target variable (acres92) and its standard error. The first parameter specifies the target variable and the second parameter the sampling design object. If the deff parameter is set to TRUE, the design effect of the sampling design is included in the output; the default value is FALSE.

> svymean(~acres92,dstrat,deff=TRUE)

Operation result:

> svymean(~acres92,dstrat,deff=TRUE)
          mean     SE   DEff
acres92 318856  34340 1.3562

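Result interpretation:
From the output, the estimated mean of acres92 is 318856 acres per county with a standard error of 34340. The design effect of 1.3562 is greater than 1, so for this variable the design has a larger variance than a simple random sample of the same size.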
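For a stratified simple random sample, the textbook estimator of the mean is the weighted sum of the stratum sample means, sum over h of W_h * ybar_h, and its estimated variance is the sum over h of W_h^2 * (1 - n_h/N_h) * s_h^2 / n_h. The base-R sketch below applies these formulas to a small hypothetical data set (the stratum names and values are made up and are not taken from agpop). It is meant to show the arithmetic, not to reproduce what svymean does internally.

```r
# Hypothetical toy example with two strata
Nh <- c(A = 100, B = 50)                        # stratum population sizes
y  <- list(A = c(10, 12, 14, 16),               # sampled values in stratum A
           B = c(30, 34, 38))                   # sampled values in stratum B

nh <- lengths(y)                # sample size per stratum
Wh <- Nh / sum(Nh)              # stratum weights
ybar_h <- sapply(y, mean)       # stratum sample means
s2_h   <- sapply(y, var)        # stratum sample variances

# Stratified estimate of the mean and its standard error
ybar_st <- sum(Wh * ybar_h)
se_st   <- sqrt(sum(Wh^2 * (1 - nh / Nh) * s2_h / nh))

# The same point estimate written with unit weights N_h / n_h
w <- rep(Nh / nh, times = nh)
weighted <- sum(w * unlist(y)) / sum(w)

print(c(mean = ybar_st, se = se_st, weighted = weighted))
```

The last two lines show why the pw variable is enough to obtain the point estimate, while the fpc variable is only needed for the standard error.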

(7) Estimate the total of the target variable (acres92) and its standard error. The first parameter specifies the target variable and the second parameter the sampling design object. If the deff parameter is set to TRUE, the design effect of the sampling design is included in the output; the default value is FALSE.

> svytotal(~acres92,dstrat,deff=TRUE)

Operation result:

> svytotal(~acres92,dstrat,deff=TRUE)
            total        SE   DEff
acres92 969642162 104429224 1.3562

Result interpretation:
From the output, the estimated total of acres92 is 969642162 acres with a standard error of 104429224.
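Because the weights add up to the known population size N = 3041, the estimated total should equal N times the estimated mean, and the same scaling should hold for the standard error. The sketch below checks this against the printed svymean and svytotal results; small differences are expected because the printed mean and SE are rounded.

```r
N <- 3041                       # population size

mean_hat  <- 318856             # rounded mean printed by svymean
se_mean   <- 34340              # rounded SE printed by svymean
total_hat <- 969642162          # total printed by svytotal
se_total  <- 104429224          # SE printed by svytotal

# Relative gaps between N * (mean, SE) and the printed total and SE
rel_diff_total <- abs(N * mean_hat - total_hat) / total_hat
rel_diff_se    <- abs(N * se_mean - se_total) / se_total

print(c(N * mean_hat, N * se_mean))
print(c(rel_diff_total, rel_diff_se))
```

The same scaling explains why both outputs report the identical design effect of 1.3562.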

Topics: R Language