Programming Advice

3872 views | 2015-2-15 13:20 | Personal category: STATA | System category: Paper exchange

数据装置策略汇总20150219.docx

My purpose in writing this paper was to make sure researchers (myself included) understood what each of the methods for estimating standard errors was actually doing. These pages are meant to help researchers use the correct techniques. Code which is easily available is more likely to be used. Since I program in Stata, most of the instructions below are for Stata. I have also included code in other languages (written by other generous academics) at the end of this page. Questions about that code should be directed to its authors, as I am not familiar with it. If you know how to do this in other languages, please let me know; I am happy to post links to the instructions. I have also posted a test data set (in text and in Stata format) along with the standard errors estimated by several different methods using this data. You can use these results to verify that your routines are producing the same results. In all of the instructions, the programming instructions are in bold and the variable names which the user must specify are in italics. I have also included a sample of the Stata program which I used to run the simulations (i.e. simulated the data sets and then estimated the coefficients and standard errors). Although I have posted these instructions, I unfortunately do not have time to respond to all programming questions.

Stata Programming Instructions

The standard command for running a regression in Stata is:

regress dependent_variable independent_variables, options

Clustered (Rogers) Standard Errors – One dimension

To obtain clustered (Rogers) standard errors (and OLS coefficients), use the command:

regress dependent_variable independent_variables, robust cluster(cluster_variable)

This produces White standard errors which are robust to within-cluster correlation (clustered or Rogers standard errors). If you wanted to cluster by year, then the cluster variable would be the year variable. If you wanted to cluster by industry and year, you would need to create a variable which had a unique value for each industry-year pair. These standard errors would allow observations in the same industry/year to be correlated (i.e. different firms), but would assume that observations in the same industry but different years are uncorrelated. To allow observations which share an industry or share a year to be correlated, you need to cluster by two dimensions (industry and year). These instructions follow.
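The industry-year case described above can be sketched on simulated data. This is only an illustration under stated assumptions, not code from the paper: the panel (50 firms, 10 years, 5 industries), the variable names, and the seed are all made up, and vce(cluster ...) is the currently documented spelling of the robust cluster( ) options.

```stata
* Hypothetical panel: 50 firms x 10 years, 5 industries (names are illustrative)
clear
set seed 12345
set obs 500
generate firm = ceil(_n/10)
bysort firm: generate year = 1990 + _n
generate industry = mod(firm, 5) + 1
generate x = rnormal()
generate y = 1 + x + rnormal()

* One identifier per industry-year pair, then cluster on it
egen ind_year = group(industry year)
regress y x, vce(cluster ind_year)
display "number of industry-year clusters: " e(N_clust)
```

Clustering on ind_year treats each industry-year cell as its own cluster, so two observations from the same industry in different years are still assumed to be uncorrelated.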

For most estimation commands such as logits and probits, the previous form of the command will also work. For example, to run a logit with clustered standard errors you would use the command:

logit dependent_variable independent_variables, robust cluster(cluster_variable)

Clustered Standard Errors – Two dimensions

The routines currently written into Stata allow you to cluster by only one variable (e.g. one dimension such as firm or time). Papers by Thompson (2006) and by Cameron, Gelbach and Miller (2006) suggest a way to account for multiple dimensions at the same time. This approach allows for correlations among different firms in the same year and different years in the same firm, for example. See their papers and mine for more details and caveats. I have written a Stata ado file to implement this estimation procedure. It runs a regression and calculates standard errors which account for two dimensions of within-cluster correlation. The variables which record the two dimensions (e.g. a firm identifier and a time identifier) are specified in the required options: fcluster( ) and tcluster( ). There are also versions of the Stata ado file that estimate logit (logit2.ado), probit (probit2.ado), or tobit (tobit2.ado) models with clustering on two dimensions. The format is similar to the cluster2.ado command.

cluster2 dependent_variable independent_variables, fcluster(cluster_variable_one) tcluster(cluster_variable_two)

If there are multiple observations per firm-year (e.g. loan data sets which have multiple loans per firm in a given year), then the method described in my paper needs to be modified. In this case, instead of subtracting off the White variance matrix, you need to subtract off the variance matrix clustered by firm-year (i.e. for correlation among observations with the same firm AND the same year; see Cameron, Gelbach, and Miller (2006) for details). The program has been modified to automatically check for this condition and use the correct third matrix. The program is also now compatible with the outreg procedure.
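The idea of adding the two one-way clustered variance matrices and subtracting a third can be sketched with built-in commands only. This is a rough illustration and not the cluster2.ado code: the simulated data, the variable names, and the matrix names are assumptions, and Stata's own finite-sample adjustments are left as they come, so the numbers need not match cluster2 exactly. The data have one observation per firm-year, so the White matrix is the one subtracted.

```stata
* Hypothetical panel: 50 firms x 10 years, one observation per firm-year
clear
set seed 12345
set obs 500
generate firm = ceil(_n/10)
bysort firm: generate year = _n
generate x = rnormal()
generate y = 1 + x + rnormal()

* Variance matrix clustered by firm
regress y x, vce(cluster firm)
matrix V_firm = e(V)
* Variance matrix clustered by year
regress y x, vce(cluster year)
matrix V_time = e(V)
* White (heteroskedasticity-robust) variance matrix
regress y x, vce(robust)
matrix V_white = e(V)

* Two-way clustered variance: firm + time - White
matrix V_two = V_firm + V_time - V_white
display "two-way clustered s.e. of x: " sqrt(V_two[1,1])
```

With several observations per firm-year, the third regression would instead cluster on a firm-year identifier, as the paragraph above explains.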

The code for estimating clustered standard errors in two dimensions using R is available here.

Fama-MacBeth Standard Errors

Stata does not contain a routine for estimating the coefficients and standard errors by Fama-MacBeth (that I know of), but I have written an ado file which you can download. The ado file fm.ado runs a cross-sectional regression for each year in the data set. The program allows you to specify a by variable for Fama-MacBeth. Thus, instead of running T cross-sectional regressions, you could run N time series regressions by specifying the firm identifier as the byfm( ) variable. If the option is not specified, it uses the time variable (as set by the tsset command) as the by variable. The program is also now compatible with the outreg procedure.

The form of the command is:

fm dependent_variable independent_variables, byfm(by_variable)

Prior to running the fm program, you need to use the tsset command. This tells Stata the name of the firm identifier and the time variable. The form of this command is:

tsset firm_identifier time_identifier

The program will accept the Stata in and if qualifiers, if you want to run the regression for only certain observations. Judson Caskey, who showed me how to use the tsset command in the FM program, has also modified the program. His version reports the number of positive or negative coefficients and the number which are significant (and positive or negative). Another version (xtfmb.ado) has been written by Daniel Hoechle. To install this ado file from within Stata, type net search xtfmb. A full description is in the help file.
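For intuition, the Fama-MacBeth calculation can be approximated with the built-in statsby prefix: run one cross-sectional regression per year, then take the mean of the yearly slopes and divide their standard deviation by the square root of the number of years. This is a sketch under assumptions (simulated data, made-up variable names) and is not the fm.ado code, which may differ in its details.

```stata
* Hypothetical panel: 50 firms x 10 years
clear
set seed 12345
set obs 500
generate firm = ceil(_n/10)
bysort firm: generate year = _n
generate x = rnormal()
generate y = 1 + x + rnormal()

* One cross-sectional regression per year; keep only the slope on x
statsby b_x = _b[x], by(year) clear: regress y x

* Fama-MacBeth estimate = mean of the T slopes; s.e. = sd / sqrt(T)
summarize b_x
display "Fama-MacBeth coefficient: " r(mean)
display "Fama-MacBeth std. error:  " r(sd)/sqrt(r(N))
```

Replacing by(year) with by(firm) would give the N time series regressions mentioned above.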

Newey-West for Panel Data Sets

The Stata command newey will estimate the coefficients of a regression using OLS and generate Newey-West standard errors. If you want to use this in a panel data set (so that only observations within a cluster may be correlated), you need to use the tsset command.

tsset firm_identifier time_identifier

newey dependent_variable independent_variables, lag(lag_length) force

Where firm_identifier is the variable which denotes each firm (e.g. cusip, permno, or gvkey) and time_identifier is the variable that identifies the time dimension, such as year. This specification will allow for observations on the same firm in different years to be correlated (i.e. a firm effect). If you want to allow for observations on different firms but in the same year to be correlated, you need to reverse the firm and time identifiers. If you are clustering on some other dimension besides firm (e.g. industry or country), you would use that variable instead. You can specify any lag length up to T-1, where T is the number of years per firm.
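A small worked example of the two commands above, on simulated data. The panel, the variable names, and the lag length of 2 are arbitrary assumptions chosen only to show the commands in sequence.

```stata
* Hypothetical panel: 50 firms x 10 years
clear
set seed 12345
set obs 500
generate firm = ceil(_n/10)
bysort firm: generate year = _n
generate x = rnormal()
generate y = 1 + x + rnormal()

* Declare the panel structure, then allow correlation within a firm up to 2 lags
tsset firm year
newey y x, lag(2) force
display "observations used: " e(N)
```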

Fixed Effects

Stata can automatically include a set of dummy variables for each value of one specified variable.

The form of the command is:

areg dependent_variable independent_variables, absorb(identifier_variable)

Where identifier_variable is a firm identifier (e.g. cusip, permno, or gvkey) if you want firm dummies or a time identifier (e.g. year) if you want year dummies. If you want to include both firm and time dummies, only one set can be included with the absorb option. The other must be included manually (e.g. by manually including a full set of time dummies among the independent variables, and then using the absorb option for the firm dummies).

To create a full set of dummy variables from an indexed variable such asyear you can use the following command:

tabulate index_variable, gen(dummy_variable)

This will create a set of dummy variables (e.g. dummy_variable1, dummy_variable2, etc.), each equal to one when index_variable takes on the corresponding value and zero otherwise (e.g. dummy_variable1 equals one if index_variable takes on its first value).

A more elegant way to do this is to use the xi command (as recommended by Prof Nandy). This allows you to include a set of dummy variables for any categorical variable (e.g. year or firm), including multiple categorical variables. To include both year and firm dummies, the command is:

xi: areg dependent_variable independent_variables i.year, absorb(firm_identifier)

where year is the categorical variable for year and firm_identifier is the categorical variable for firm. The coefficients on T-1 of the year variables will be reported; the coefficients on the firm dummy variables will not. To see the coefficients on both sets of dummy variables you would use the command:

xi: reg dependent_variable independent_variables i.year i.firm_identifier
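The two forms above should give the same slope on the regressor of interest; only the reporting of the firm dummies differs. A quick check on simulated data (the panel, variable names, and scalar names are assumptions made for this sketch):

```stata
* Hypothetical panel: 50 firms x 10 years
clear
set seed 12345
set obs 500
generate firm = ceil(_n/10)
bysort firm: generate year = _n
generate x = rnormal()
generate y = 1 + x + rnormal()

* Year dummies via xi, firm dummies absorbed (not reported)
xi: areg y x i.year, absorb(firm)
scalar b_areg = _b[x]

* Both sets of dummies estimated and reported
xi: regress y x i.year i.firm
scalar b_reg = _b[x]

display "difference in slopes: " b_areg - b_reg
```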

Generalized Least Squares

When the residuals are correlated within a cluster, not only are the OLS standard errors biased but the slope coefficients are not efficient. One method for taking advantage of the additional information in the residuals (and generating more efficient estimates) is to estimate a random effects model using a generalized least squares approach. I used the xtreg command to estimate the GLS results reported in the paper.

The form of the command is:

xtreg dependent_variable independent_variables, i(firm_identifier)

As with the regress command, standard errors which are robust to within-cluster correlation can be produced by including the option cluster(firm_identifier):

xtreg dependent_variable independent_variables, i(firm_identifier) cluster(firm_identifier)
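A worked example on simulated data with a firm effect in the residual. The data, variable names, and seed are assumptions. The sketch declares the panel with xtset rather than the i( ) option, spells the estimator out with re, and uses vce(cluster ...), the currently documented spelling of the cluster( ) option.

```stata
* Hypothetical panel: 50 firms x 10 years with a firm effect (gamma) in the residual
clear
set seed 12345
set obs 500
generate firm = ceil(_n/10)
bysort firm: generate year = _n
by firm: generate gamma = rnormal() if _n == 1
by firm: replace gamma = gamma[1]
generate x = rnormal()
generate y = 1 + x + gamma + rnormal()

* Random effects GLS with standard errors clustered by firm
xtset firm year
xtreg y x, re vce(cluster firm)
display "number of firms: " e(N_g)
```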

Bootstrapped Standard Errors

The Stata command bootstrap will allow you to estimate the standard errors using the bootstrap method. This will run the regression multiple times and use the variability in the slope coefficients as an estimate of their standard deviation (intuitively, as I did with my simulations).

The form of this command is:

bootstrap "regress dependent_variable independent_variables" _b, reps(number_of_repetitions)

Where number_of_repetitions samples will be drawn with replacement from the original sample. Each time the regression will be run and the slope coefficients will be saved, since _b is specified. Both the average slope and its standard deviation will be reported. As specified, the bootstrapped samples will be drawn a single observation at a time. If the observations within a cluster (year or firm) are correlated, then these bootstrapped standard errors will be biased. To account for the correlation within cluster it is necessary to draw clusters with replacement, as opposed to observations with replacement. To do this in Stata, you need to add the cluster option. In this case, the command is:

bootstrap "regress dependent_variable independent_variables" _b, reps(number_of_repetitions) cluster(cluster_variable)
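In recent Stata releases bootstrap is written as a prefix command, with the estimation command after a colon. The sketch below shows the cluster bootstrap in that form. The simulated data, variable names, seed, and the choice of 50 repetitions are assumptions made to keep the example quick; an actual application would normally use many more repetitions.

```stata
* Hypothetical panel: 50 firms x 10 years with a firm effect in the residual
clear
set seed 12345
set obs 500
generate firm = ceil(_n/10)
bysort firm: generate year = _n
by firm: generate gamma = rnormal() if _n == 1
by firm: replace gamma = gamma[1]
generate x = rnormal()
generate y = 1 + x + gamma + rnormal()

* Draw whole firms (clusters) with replacement, 50 times
bootstrap _b, reps(50) seed(12345) cluster(firm): regress y x
display "completed replications: " e(N_reps)
```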

SAS Programming Instructions

Although I did not work in SAS, Tanguy Brachet was kind enough to explain how to do some of the estimation in SAS. A brief description follows.

Clustered Standard Errors

The standard command for running an OLS regression in SAS and getting the clustered/Rogers standard errors is:

proc surveyreg data=mydata;
cluster cluster_variable;
model dependent_variable = independent_variables;
run;

This produces White standard errors which are robust to within-cluster correlation (Rogers or clustered standard errors), when cluster_variable is the variable by which you want to cluster. If you clustered by firm it could be cusip or gvkey. If you clustered by time it could be year. SAS allows you to specify multiple variables in the cluster statement (e.g. firm and year). However, this does not produce the standard errors clustered by two dimensions described in my paper.

Clustered Standard Errors – Two dimensions

SAS does not contain a routine to do this, but you can find SAS code for estimating standard errors clustered on two dimensions on this web site (Mark Ma).

Fixed Effects

If you want to include dummy variables for one dimension (time) and cluster by another dimension, you need to create the dummy variables. A simple way is as follows:

data new;
set old;
year1 = (year=1991);
year2 = (year=1992);
year3 = (year=1993);
year4 = (year=1994);
year5 = (year=1995);
run;

Alternative specifications can be found on Noah Stoffman’s pages. As SAS is not my traditional language, this code is provided just as information. I have used both the SAS and Stata code to verify that the results produced by both sets of instructions (SAS and Stata) are the same based on a test data set.

R Programming Instructions

The programs in R are written by Mahmood Arai. Questions can be directed to him.

Clustered Standard Errors – Two dimensions

Simulated Data Sets

Many of the results in the paper are based on simulating data sets with a specified dependence (firm and/or time effect). For those who are interested in seeing how this was done, or for researchers who want simulation results for different data structures, I have posted a stripped down version of the simulation program. This program simulates a data set with a firm effect and then estimates the coefficients using OLS and Fama-MacBeth. The program estimates OLS standard errors, standard errors clustered by firm, and Fama-MacBeth standard errors. The results are saved for each iteration, and the means and standard deviations are calculated and displayed. To run the program simulation.do, you need to type:

do simulation firm_effect_x firm_effect_r number_of_years

where firm_effect_x is the percent of the independent variable’s variance which is due to the firm effect [i.e. rho(x)], firm_effect_r is the percent of the residual’s variance which is due to the firm effect [i.e. rho(r)], and number_of_years is the number of time periods per firm in the data set. Rho(x) and rho(r) should be between 0 and 1.0. The data set will have 5,000 observations (although this can be changed), so the number of firms is 5,000/number_of_years. Other parameters can be changed by editing the program. This example is just meant to provide intuition of how I did the simulations. If you have questions about this page, you are welcome to e-mail me. I cannot promise an immediate response, but I will try to get back to you. Unfortunately, I can’t help you debug your Stata (or non-Stata) programs. However, by posting these instructions I hope to make it easier to use the methods discussed in my paper.
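To make the roles of rho(x) and rho(r) concrete, here is a single-draw sketch of a data set with a firm effect. It is not the posted simulation.do: the local macro names, the unit total variances of x and the residual, the true slope of 1, and the seed are all assumptions made for illustration, and only one sample is drawn rather than many iterations.

```stata
* Assumed parameters: share of variance due to the firm effect, and years per firm
clear
set seed 12345
local rho_x = 0.5
local rho_r = 0.5
local T = 10

* 5,000 observations, so 5,000/T firms
set obs 5000
generate firm = ceil(_n/`T')
bysort firm: generate year = _n

* Firm-level components: drawn once per firm, copied to all of its years
by firm: generate mu = rnormal() if _n == 1
by firm: replace mu = mu[1]
by firm: generate gamma = rnormal() if _n == 1
by firm: replace gamma = gamma[1]

* Each variable = firm component + idiosyncratic component, total variance 1
generate x = sqrt(`rho_x')*mu + sqrt(1 - `rho_x')*rnormal()
generate e = sqrt(`rho_r')*gamma + sqrt(1 - `rho_r')*rnormal()
generate y = x + e

* Compare OLS and firm-clustered standard errors on this one sample
regress y x
scalar se_ols = _se[x]
regress y x, vce(cluster firm)
scalar se_cl = _se[x]
display "OLS s.e.: " se_ols "   clustered s.e.: " se_cl
```

With a firm effect in both x and the residual, the clustered standard error should come out noticeably larger than the OLS one; setting either rho to zero should bring the two close together.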

To repost this article, please contact the original author for permission, and note that it comes from 杨金海's ScienceNet blog.
Link: https://wap.sciencenet.cn/blog-793574-868204.html
