#Draw a picture with Stata# 01: Functions and graph types

Posted by mverrier on Tue, 22 Feb 2022 05:44:42 +0100

As the saying goes, a picture is worth a thousand words. Drawing graphs is one of the most basic and important data analysis skills, so much so that the whole process has been given a more professional name: data visualization. As a powerful and flexible data analysis tool, Stata can produce a wide variety of graphs. This topic, [Stata drawing], aims to promote a broad and in-depth understanding of Stata's graphing capabilities. To this end, a series of articles will be published one after another, each striving to show both the structure and the details of drawing graphs.

Introduction

This article is the first lecture in the #Draw a picture with Stata# series. It gives a general introduction to Stata's graphing functions and graph types.

Functions. Stata's graphing functionality is provided mainly through its graph syntax and its Graph Editor. The graph syntax starts with the graph command and includes both graph commands, which draw the various kinds of graphs, and graph management commands, which save, redisplay, combine, or delete graphs after they have been drawn. The following table briefly lists the two types of commands.

graph commands                                 graph management commands
---------------------------------------------  -------------------------------------------------------------
graph twoway   scatterplots, line plots, etc.  graph save     save graph to disk
graph matrix   scatterplot matrices            graph use      redisplay graph stored on disk
graph bar      bar charts                      graph display  redisplay graph stored in memory
graph dot      dot charts                      graph combine  combine multiple graphs
graph box      box-and-whisker plots           graph replay   redisplay graphs stored in memory and on disk
graph pie      pie charts                      graph export   export .gph file to PostScript, etc.

Other graph commands: more commands to draw statistical graphs, covering distributional diagnostic plots; smoothing and densities; regression diagnostics; time series; vector autoregressive (VAR, SVAR, VECM) models; and longitudinal/panel data.

Other graph management commands: commands that print a graph; that deal with the graphs currently stored in memory; that describe available schemes and let you identify and set the default scheme; that list available styles; that set options for printing and exporting graphs; and that let you draw graphs without displaying them.

Note: the commands above are official Stata commands. Many practical graph commands are also written by users; these will be introduced in detail in subsequent articles of this series.
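The management commands in the table can be tried out in a few lines. The following is only a minimal sketch: it assumes the auto dataset that ships with Stata (loaded with sysuse), and the graph names g1, g2, both and the file names both.gph and both.png are arbitrary choices for illustration, written to the current working directory.

```stata
sysuse auto, clear

* Draw two graphs and keep them in memory under (arbitrary) names
twoway scatter mpg weight, name(g1, replace)
graph box mpg, over(foreign) name(g2, replace)

graph dir                                  // list graphs in memory and on disk
graph display g1                           // redisplay a graph stored in memory
graph combine g1 g2, name(both, replace)   // combine multiple graphs
graph save both "both.gph", replace        // save the combined graph to disk
graph use "both.gph"                       // redisplay the graph stored on disk
graph export "both.png", replace           // export the current graph
graph drop g1 g2                           // remove graphs from memory
```

Naming graphs with name() keeps several of them in memory at once, which is what makes graph display and graph combine usable without going through files.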

Graph types. As the graph commands in the table above show, Stata graphing comes down to a handful of common graph types. These can be further divided into descriptive graphs and inferential graphs according to what is being drawn. The former focus on showing the distribution of the data and the relationships within it, while the latter display the results of statistical analysis. The key difference between the two is whether the data being plotted come from a statistical model. This article introduces the former, that is, descriptive statistical graphs, which visualize the cleaned data or analysis results. They are an important part of empirical analysis and reflect the author's skill, taste, and thinking. Graphs based on inferential statistics will be introduced in detail alongside specific research methods. The following figure shows the contents of the "Graphics" menu in the Stata interface (Figure 1).

Drawing type based on descriptive statistics

The following figure shows Stata's drawing command structure and drawing type (Figure 2).

Draw with commands. To draw a graph in Stata, you can simply click through the "Graphics" menu shown above, which is very convenient. However, as you become more proficient and need more customized graphs, drawing with commands is not only more efficient but also steadily strengthens your practical skills. Because the set of graph commands and options is very large, it is worth continually collecting graph code from many sources as you study and work. At the same time, make good use of the Graph Editor to polish the local details of a graph; after all, nobody can remember the options of every graph command.

Stata graph code consists mainly of four parts: (1) graph commands; (2) options; (3) styles; (4) graph management commands. The first three are the basic elements for drawing a graph from existing data. Take the common graph twoway as an example. twoway is a family of plots, all of which fit on numeric y and x scales. The syntax is as follows:

[graph] twoway plot [if] [in] [,  twoway_options]

Here, "plot" stands for a specific plot type. The following figure shows all members of the twoway family (Figure 3); Figure 2 shows only some common ones. Square brackets "[]" mark the parts of the command that can be omitted. Although the options are optional, they are the core of mastering graph commands: choosing the appropriate plot type only ensures that the graph is "right"; it does not guarantee that the graph is "good". The focus of our study of Stata graphs therefore falls on how the graph is rendered.

Figure 4 lists the twoway options. With them we can refine the appearance of any graph drawn with twoway, for example by adding reference lines at specific x or y values (added_line_options). The options follow a regular pattern: they appear after the comma in the graph command, and any of them can be freely combined with the others as needed.

The above uses graph twoway as an example to illustrate the logic of drawing in Stata. As Figure 2 shows, many kinds of graphs can be drawn, but their commands share the same syntax structure. Once that structure is mastered and the data are in hand, we can visualize the data quickly and gracefully. Next, we demonstrate Stata graphs with a set of easily confused examples (bar charts, twoway bar plots, and histograms) to help understand Stata's graphing functions and graph types.
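To make the syntax concrete, here is a small sketch of a twoway command that uses a plot type, an if qualifier, and a few twoway options together. It assumes the auto dataset shipped with Stata; the reference-line positions (20 and 3000), the titles, and the graph name demo are arbitrary illustrations rather than recommended values.

```stata
sysuse auto, clear

* plot = scatter; [if] restricts the sample; everything after the comma is an option
twoway scatter mpg weight if foreign == 0,            ///
       yline(20) xline(3000)                          /// added_line_options: reference lines
       ytitle("Mileage (mpg)") xtitle("Weight (lbs.)") /// axis titles
       title("Mileage versus weight, domestic cars")  ///
       name(demo, replace)
```

Replacing scatter with another member of the twoway family (line, bar, and so on) leaves the rest of the command unchanged, which is the regularity the options are built around.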

Drawing examples

1. Bar charts

graph bar draws vertical bar charts and graph hbar draws horizontal ones.

In a vertical bar chart, the y axis shows a numerical variable and the x axis shows a categorical variable; in a horizontal bar chart the two are swapped.

*** Bar graph syntax example ***  
    graph bar (mean) numeric_var, over(cat_var) // (mean) is the statistic computed for numeric_var; it is the default, so it may be omitted
    graph hbar (mean) numeric_var, over(cat_var) // hbar draws horizontal bar charts

*** Options after the "," refine the look of the graph ***
    **group_options
      over(varname [, over_subopts]): specifies a categorical variable over which the yvars are to be repeated;
      nofill: specifies that missing subcategories be omitted;
      allcategories: specifies that all categories in the entire dataset be retained for the over() variables;
      
    **yvar_options
      graph bar y1 y2 y3, ascategory whatever_other_options //ascategory is a useful option
      graph bar y, over(group) asyvars whatever_other_options
      graph bar (mean) inc_male inc_female, over(region) percentage stack
      graph bar (mean) wage, over(sex) over(region) asyvars percentage stack
      ...
      
    **lookofbar_options
      bargap(#):specifies the gap to be left between yvar bars as a percentage-of-bar-width units, and the
                default is bargap(0)
      intensity(#):  specify the intensity of the color used to fill the inside of the bar
      ...

    **legending_options
      If more than one yvar is specified, a legend is produced.
      
    **axis_options
      axis_scale_options: specify how the numerical y axis is scaled and how it looks;
      axis_label_options: specify how the numerical y axis is to be labeled;
      ytitle(): overrides the default title for the numerical y axis;
      
    **title_and_other_options
      text(): adds text to a specified location on the graph;
      yline(): adds horizontal (bar) or vertical (hbar) lines at specified y values;
      aspect_option: allows you to control the relationship between the height and width of a graph's plot region
      std_options: Options for use with graph construction commands, which allow you to add titles, control the graph size, save the graph on disk, and much more;
      by(varlist, . . . ) : draws separate plots within one graph;
    
    **Suboptions for use with over( ) and yvaroptions( )
      relabel(# "text" . . . ) : specifies text to override the default category labeling;
      gap(#) and gap(*#): specify the gap between the bars in this over() group;gap(#) is specified in
                          percentage-of-bar-width units, so gap(67) means two-thirds the width of a bar. gap(*#) allows
                          modifying the default gap. gap(*1.2) would increase the gap by 20%, and gap(*.8) would
                          decrease the gap by 20%.
      sort(varname), sort(#), and sort((stat) varname): control how bars are ordered;
      

This article gives bar chart examples that combine the bar direction with different numbers of over() options. We start with a U.S. city temperature dataset of 956 observations to show how a bar chart is built up and how the various options are used.

(1) Vertical bar chart + 1 over()

use https://www.stata-press.com/data/r17/citytemp,clear
describe
		/*
		 Observations:           956                  City temperature data
		    Variables:             6                  3 Mar 2020 19:17
		----------------------------------------------------------------------------
		Variable      Storage   Display    Value
		    name         type    format    label      Variable label
		----------------------------------------------------------------------------
		division        int     %16.0g     division   Census division
		region          int     %13.0g     region     Census region
		heatdd          int     %8.0g                 Heating degree days
		cooldd          int     %8.0g                 Cooling degree days
		tempjan         float   %9.0g                 Average January temperature
		tempjuly        float   %9.0g                 Average July temperature
		----------------------------------------------------------------------------
		Sorted by: region  */
		
graph bar (mean) tempjuly tempjan, over(region) ///
			  bargap(-30) /// bargap(#) sets the gap between yvar bars as a percentage of bar width; the default bargap(0) makes bars touch, and negative values make them overlap
			  legend( label(1 "July") label(2 "January") ) ///
			  ytitle("Temperature (Fahrenheit)") ///
			  title("Average July and January temperatures") ///
			  subtitle("by regions of the United States") ///
			  note("Source: U.S. Census Bureau, U.S. Dept. of Commerce") ///
			  graphregion(fcolor(white))  plotregion(fcolor(white))

(2) Horizontal bar chart + 2 over()

use https://www.stata-press.com/data/r17/citytemp, clear
graph hbar (mean) tempjan, over(division) over(region) nofill /// nofill omits missing subcategories
		      ytitle("Degrees Fahrenheit") ///
			  title("Average January temperature") ///
			  subtitle("by region and division of the United States") ///
		      note("Source: U.S. Census Bureau, U.S. Dept. of Commerce") ///
		      graphregion(fcolor(white))  plotregion(fcolor(white))

(3) Vertical bar chart + 3 over()

use https://www.stata-press.com/data/r17/nlsw88, clear // load a new dataset
describe
		/*
		Observations:         2,246                  NLSW, 1988 extract
	       Variables:            17                  1 May 2020 22:52
		                                             (_dta has notes)
		------------------------------------------------------------------------------------------------------------------------------------------
		Variable      Storage   Display    Value
		    name         type    format    label      Variable label
		------------------------------------------------------------------------------------------------------------------------------------------
		idcode          int     %8.0g                 NLS ID
		age             byte    %8.0g                 Age in current year
		race            byte    %8.0g      racelbl    Race
		married         byte    %8.0g      marlbl     Married
		never_married   byte    %16.0g     nev_mar    Never married
		grade           byte    %8.0g                 Current grade completed
		collgrad        byte    %16.0g     gradlbl    College graduate
		south           byte    %9.0g      southlbl   Lives in the south
		smsa            byte    %9.0g      smsalbl    Lives in SMSA
		c_city          byte    %16.0g     ccitylbl   Lives in a central city
		industry        byte    %23.0g     indlbl     Industry
		occupation      byte    %22.0g     occlbl     Occupation
		union           byte    %8.0g      unionlbl   Union worker
		wage            float   %9.0g                 Hourly wage
		hours           byte    %8.0g                 Usual hours worked
		ttl_exp         float   %9.0g                 Total work experience (years)
		tenure          float   %9.0g                 Job tenure (years)
		------------------------------------------------------------------------------------------------------------------------------------------
		Sorted by: idcode  */
notes
		  /*_dta:
		  1.  1988 data, extracted from National Longitudinal of Young Woman who were ages 14-24 in 1968 (NLSW).
		  2.  This dataset is the result of extraction and processing by various people at various times.
		  3.  For more information on the NLS, see http://www.bls.gov/nls/.  */
		
graph bar (mean) wage, over(smsa) over(married) over(collgrad) /// order matters: the first over() is the innermost grouping, the last the outermost
			  title("Average Hourly Wage, 1988, Women Aged 34-46") ///
			  subtitle("by College Graduation, Marital Status, and SMSA residence") ///
			  note("Source: 1988 data from NLS, U.S. Dept. of Labor, Bureau of Labor Statistics") ///
			  graphregion(fcolor(white))  plotregion(fcolor(white))

(4) Horizontal bar chart + stack (amounts)

use https://www.stata-press.com/data/r17/educ99gdp, clear
generate total = private + public
graph hbar (asis) public private, ///
			  over(country, sort(total) descending) stack  ///
			  title( "Spending on tertiary education as % of GDP, 1999", span pos(11) ) ///
			  subtitle(" ") ///
			  legend(region(lcolor(white))) ///
			  note("Source: OECD", span) ///
			  graphregion(fcolor(white))  plotregion(fcolor(white))

(5) Horizontal bar chart + stack (percentages)

use https://www.stata-press.com/data/r17/educ99gdp, clear
generate frac = private/(private + public) 
graph hbar (asis) public private, ///
			  over(country, sort(frac) descending) stack percent ///
			  title("Public and private spending on tertiary education, 1999", span pos(11) ) ///
			  subtitle(" ") ///
			  legend(region(lcolor(white))) ///
			  note("Source: OECD", span) ///
			  graphregion(fcolor(white))  plotregion(fcolor(white))

2. Twoway bar plots

In a twoway bar plot, both x and y are numeric variables.

use https://www.stata-press.com/data/r17/sp500, clear
describe
	    /* Observations:           248                  S&P 500
		      Variables:             7                  22 Apr 2020 10:52
		                                              (_dta has notes)
		-----------------------------------------------------------------------------------------------------------------------
		Variable      Storage   Display    Value
		    name         type    format    label      Variable label
		-----------------------------------------------------------------------------------------------------------------------
		date            int     %td                   Date
		open            float   %9.0g                 Opening price
		high            float   %9.0g                 High price
		low             float   %9.0g                 Low price
		close           float   %9.0g                 Closing price
		volume          double  %12.0gc               Volume (thousands)
		change          float   %9.0g                 Closing price change
		-----------------------------------------------------------------------------------------------------------------------
		Sorted by: date */
list date close change in 1/5	    
		/*   +---------------------------------+
		     |      date     close      change |
		     |---------------------------------|
		  1. | 02jan2001   1283.27           . |
		  2. | 03jan2001   1347.56    64.29004 |
		  3. | 04jan2001   1333.34   -14.22009 |
		  4. | 05jan2001   1298.35   -34.98999 |
		  5. | 08jan2001   1295.86    -2.48999 |
		     +---------------------------------+ */
graph twoway bar change date in 1/50, ///
	    	  graphregion(fcolor(white))  plotregion(fcolor(white))

Its advantage is that it can be combined with other plot types in the twoway family.

use https://www.stata-press.com/data/r17/sp500, clear
graph twoway line close date, yaxis(1) || bar change date, yaxis(2) || in 1/50, ///
			  yscale(axis(1) r(1000 1400)) ylab(1200(100)1400, axis(1)) ///
			  ytick(1200(100)1400, axis(1) grid) ///
			  yscale(axis(2) r(-50 300)) ylab(-50(50)50, axis(2)) ///
			  ytick(-50(50)50, axis(2) grid) ///
			  legend(off) ///
			  xtitle("Date") ///
			  title("S&P 500") ///
			  subtitle("January - March 2001") ///
			  note("Source: Yahoo!Finance and Commodity Systems, Inc.", span) ///
			  yline(1150, axis(1) lstyle(foreground))  ///
			  graphregion(fcolor(white))  plotregion(fcolor(white))

3. Histograms

The histogram command draws a histogram of a variable (varname). Unless the discrete option is specified, varname is assumed to be a continuous variable. According to Beniger and Robyn (1978), although A. M. Guerry used histograms in his publication in 1833, the term "histogram" was first used by Karl Pearson in 1895.

(1) Histogram of continuous variables

* $figures is assumed to be a global macro, defined beforehand, that holds the output folder
use https://www.stata-press.com/data/r17/sp500, clear
histogram volume
graph save "$figures/histo_01", replace
histogram volume, fraction
graph save "$figures/histo_02", replace
graph combine "$figures/histo_01" "$figures/histo_02", row(1)
graph save "$figures/histo_0102", replace // graph save, not save: save would write the dataset
graph export "$figures/histo_0102.png", replace

Each bar of a histogram is called a bin, and by default the number of bins (k) is determined by the following rule:

k = \min\left(\sqrt{N},\ 10 \times \frac{\ln N}{\ln 10}\right)

where N is the number of nonmissing observations of the variable. How can we make better use of the statistical characteristics of a continuous variable? Starting from the baseline graph above, the following commands add standard deviation information to the graph. This is the more recommended way to draw a histogram for papers and research reports.

use "https://www.stata-press.com/data/r17/sp500", clear
summarize volume // summarize (abbreviated sum) reports summary statistics for the variable
		/*
	    Variable |        Obs        Mean    Std. dev.       Min        Max
		-------------+-----------------------------------------------------
	      volume |        248    12320.68    2585.929       4103    23308.3     */

return list //View the calculated statistics, which are saved in "scalars"
	    /*scalars:
                  r(N) =  248
              r(sum_w) =  248
               r(mean) =  12320.67661290323
                r(Var) =  6687027.906981193
                 r(sd) =  2585.928828676689
                r(min) =  4103
                r(max) =  23308.3
                r(sum) =  3055527.8 */

* Using the mean r(mean) and standard deviation r(sd), compute the values one or more standard deviations away from the mean
display r(mean)+r(sd) //14906.605
display r(mean)+2*r(sd) //17492.534
display r(mean)+3*r(sd) //20078.463
display r(mean)+4*r(sd) //22664.392
display r(mean)-r(sd) // 9734.7478
display r(mean)-2*r(sd) //7148.819

*Draw graphics
histogram volume, freq normal kdensity ///
				  xaxis(1 2) ///
				  ylabel(0(10)80, grid) ///
				  xlabel(12320.68 "mean" /// mean = 12320.68
				  9734.7478 "-1 s.d." /// 
				  14906.605 "+1 s.d." ///
				  7148.819 "-2 s.d." ///
				  17492.534 "+2 s.d." ///
				  20078.463 "+3 s.d." ///
				  22664.392 "+4 s.d.", axis(2) grid gmax) ///
				  xtitle("", axis(2)) ///
				  subtitle("S&P 500, January 2001 - December 2001") ///
				  note("Source: Yahoo! Finance and Commodity Systems, Inc.") ///
				  graphregion(fcolor(white))  plotregion(fcolor(white))
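The label positions above were typed in by hand from the display output. As a hedged alternative, the stored results can be kept in local macros and passed to xlabel() directly, so the labels follow the data if the sample changes. The macro names (m, sd, m1l, and so on) are arbitrary, and only the labels up to two standard deviations are shown to keep the sketch short.

```stata
use https://www.stata-press.com/data/r17/sp500, clear
quietly summarize volume

* Keep the stored results in locals before another command overwrites r()
local m   = r(mean)
local sd  = r(sd)
local m1l = `m' - `sd'
local m1h = `m' + `sd'
local m2l = `m' - 2*`sd'
local m2h = `m' + 2*`sd'

histogram volume, freq normal kdensity xaxis(1 2)                    ///
          xlabel(`m' "mean" `m1l' "-1 s.d." `m1h' "+1 s.d."          ///
                 `m2l' "-2 s.d." `m2h' "+2 s.d.", axis(2) grid gmax) ///
          xtitle("", axis(2))
```

Local macros only exist within one do-file run, so the summarize, the local definitions, and the histogram command need to be executed together.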

(2) Histogram of discrete variables

The discrete option treats the variable as discrete rather than continuous, even if the variable itself is continuous. Each unique value of the variable then gets its own bin, so there may be many bars. The height of each bar represents the density, frequency, percentage, or fraction for that value.

use https://www.stata-press.com/data/r17/auto, clear
histogram mpg //mpg would be treated as continuous and categorized into eight bins by the default number-of-bins calculation (here N=74)
graph save "$figures/histo_discrete01", replace
histogram mpg, discrete //Adding the discrete option makes a histogram with a bin for each of the 21 unique values
graph save "$figures/histo_discrete02", replace
histogram mpg, discrete freq addlabels ylabel(,grid) xlabel(10(10)40) xtick(10(5)40,grid)
graph save "$figures/histo_discrete03", replace
graph combine "$figures/histo_discrete01" "$figures/histo_discrete02" "$figures/histo_discrete03", row(1)
graph export "$figures/histo_discrete010203.png", replace
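The two numbers quoted in the comments above (eight bins and 21 unique values) can be checked with a short calculation. This sketch applies the default-bin rule k = min(sqrt(N), 10 ln(N)/ln(10)) and assumes the result is truncated to an integer, which is the reading that reproduces the eight bins reported for N = 74; the local names N and k are arbitrary.

```stata
use https://www.stata-press.com/data/r17/auto, clear

* Number of nonmissing observations of mpg
quietly count if !missing(mpg)
local N = r(N)

* Default number of bins, truncated to an integer (assumption, see above)
local k = floor(min(sqrt(`N'), 10*ln(`N')/ln(10)))
display "N = `N', default number of bins k = `k'"

* Number of distinct values, i.e. the number of bins under the discrete option
quietly tabulate mpg
display "distinct values of mpg = " r(r)
```

For small samples the square-root term is the binding one; the logarithmic term only takes over once N exceeds a few hundred observations.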

(3) Histogram using weight information

use https://www.stata-press.com/data/r17/voter, clear

describe    
		/*Observations:            15                  1992 U.S. presidential voters
		     Variables:             5                  3 Mar 2020 14:27
		                                               (_dta has notes)
		-----------------------------------------------------------------------------
		Variable      Storage   Display    Value
		    name         type    format    label      Variable label
		-----------------------------------------------------------------------------
		candidat        int     %8.0g      candidat   Candidate voted for, 1992
		inc             int     %8.0g      inc2       Family income
		frac            float   %9.0g                 
		pfrac           double  %10.0g                
		pop             double  %10.0g                
		-----------------------------------------------------------------------------
		Sorted by: inc  */
label list candidat
		/*candidat:
           2 Clinton
           3 Bush
           4 Perot */
histogram candidat [fweight=pop], discrete fraction by(inc, total) /// fweight=pop applies frequency weights
 				   barwidth(1) gap(40) xlabel(2 3 4, valuelabel) /// gap(#) places a gap between the bars by reducing bar width by #%
 				   graphregion(fcolor(white)) plotregion(fcolor(white))

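Note that the example above can also be drawn as a bar chart, but what is being plotted changes: histogram works from the weighted observations, whereas graph bar and twoway bar plot the values of frac stored in the dataset.

Comparing the three commands (histogram, graph bar, and twoway bar) on the same data makes their differences easier to understand.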

use https://www.stata-press.com/data/r17/voter, clear
graph bar frac, over(candidat) by(inc, total)
graph save "$figures/histogram_bar",replace
graph twoway bar frac candidat, by(inc, total) xlabel(2 3 4, valuelabel) yscale(r(0 100))
graph save "$figures/histogram_2waybar",replace
graph combine "$figures/histogram_bar" "$figures/histogram_2waybar", row(2)
graph save "$figures/histogram_bar & 2waybar",replace
graph export "$figures/histogram_bar & 2waybar.png",replace

4. Twoway histogram plots

twoway histogram is almost identical to the histogram command presented above. The difference is that histogram can superimpose a normal density or a kernel density estimate on the bars, which is its advantage. In practice, histogram is therefore recommended.

That concludes this article. The essence of drawing a graph lies in: (1) being clear about what kind of graph to draw with the available data (picturing the result in your mind or looking at other people's work can help); (2) selecting the appropriate graph command (for example, graph bar or twoway bar); (3) using the various options to make the graph more attractive and self-explanatory. Subsequent articles will break down the main graph types one by one and let the data speak through pictures!
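The difference can be seen by drawing the same variable both ways. This is a sketch under simple assumptions: it reuses the sp500 data from the earlier examples, the graph names h1 and h2 are arbitrary, and with twoway histogram the kernel density is added as a separate kdensity plot rather than through an option.

```stata
use https://www.stata-press.com/data/r17/sp500, clear

* histogram: the density curves are added through options
histogram volume, normal kdensity name(h1, replace)

* twoway histogram: a second twoway plot has to be overlaid by hand
twoway (histogram volume) (kdensity volume), name(h2, replace)

graph combine h1 h2, rows(1)
```

Both bar layers are on the density scale by default, so the overlaid curves line up with the bars in either version.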

References

  1. StataCorp. (2021). Stata Graphics Reference Manual [G], Release 17. College Station, TX: StataCorp LLC.
  2. Mitchell, M. (2012). A Visual Guide to Stata Graphics (3rd ed.). College Station, TX: Stata Press.
