# Regression Analyses and Adjustment

Study results must be adjusted for potential confounding factors. Three general approaches are proposed: matching, stratification, and regression analysis. Because group membership in an observational study is not randomized, the groups can differ in their profiles of known risk factors, which makes it difficult to determine which risk factors are responsible for a particular outcome. Matching and stratification are non-parametric ways to adjust for potential confounding variables, whereas regression models are parametric. Matching and stratification are used to "adjust out" confounding effects when those effects are not themselves of interest; regression models are used when the aim is to compare the effects of a collection of risk factors. Both kinds of technique are important in outcome analyses. In addition to controlling for potential confounding from markers and cofactors, regression models can be used to determine whether effect modifiers are present that accelerate or delay the cellular and biochemical changes involved in the progression of a lung disease. Effect modifiers are usually included in regression models as interaction terms.
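To illustrate how stratification can "adjust out" a confounder, the following minimal sketch compares a crude odds ratio with a Mantel-Haenszel pooled odds ratio. The Mantel-Haenszel estimator is one common stratified method and is named here only as an example; the protocol does not specify it. The `Stratum` class, the function names, and all counts are hypothetical and are not LTRC data.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """One 2x2 table within a single level of the confounder (hypothetical counts)."""
    exposed_cases: int
    exposed_noncases: int
    unexposed_cases: int
    unexposed_noncases: int


def crude_odds_ratio(strata):
    """Odds ratio after collapsing all strata into one table (no adjustment)."""
    a = sum(s.exposed_cases for s in strata)
    b = sum(s.exposed_noncases for s in strata)
    c = sum(s.unexposed_cases for s in strata)
    d = sum(s.unexposed_noncases for s in strata)
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel pooled odds ratio: sum(a*d/n) / sum(b*c/n) over strata."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        n = (s.exposed_cases + s.exposed_noncases
             + s.unexposed_cases + s.unexposed_noncases)
        numerator += s.exposed_cases * s.unexposed_noncases / n
        denominator += s.exposed_noncases * s.unexposed_cases / n
    return numerator / denominator


# Made-up counts: the risk factor and the outcome are both more common in
# the first stratum, but within each stratum the odds ratio is exactly 1.
strata = [
    Stratum(exposed_cases=40, exposed_noncases=40,
            unexposed_cases=10, unexposed_noncases=10),
    Stratum(exposed_cases=2, exposed_noncases=18,
            unexposed_cases=8, unexposed_noncases=72),
]

crude = crude_odds_ratio(strata)
adjusted = mantel_haenszel_odds_ratio(strata)
print(f"crude OR = {crude:.2f}, stratum-adjusted OR = {adjusted:.2f}")
```

In this constructed example the crude odds ratio is about 3.3 even though the risk factor has no effect within either stratum; the stratified estimate is 1.0, so the apparent association comes entirely from the confounder.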

Because there are no plans to follow participants after enrollment, disease progression profiles must be constructed from the individual data points collected. Regression methods will be used for this purpose. As part of the data collection effort, the LTRC investigators will be asked to determine (or estimate) the amount of time that has elapsed since the participant first developed the conditions that led to the disease. These estimates are expected to be crude, but it may be possible at least to classify the disease as being in its early, middle, or late stages.

If it is possible to estimate an "age" at which the disease process was initiated, the time since initiation of the disease is the difference between the participant's age at the time of the visit and the age at which the disease first developed. Such a time measure can be used as a blocking factor in regression analyses or as an actual regressor (independent variable). Identifying such a variable would allow the LTRC investigators to compare the cellular and biochemical progression of the different lung diseases in a meaningful way. Under this scenario, disease type by time interactions could be developed to compare and contrast the cellular and biochemical changes that take place in the different lung diseases being studied in the LTRC.
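As a rough illustration of a disease type by time interaction, the sketch below derives time since initiation from two ages and fits a straight line for a marker within each of two disease types. With two groups, the point estimate of the interaction coefficient in a fully interacted linear model equals the difference between the separately fitted slopes, which is what the last step computes. The record layout, disease labels, marker values, and function names are all hypothetical, and the marker values were constructed to lie exactly on a line.

```python
def time_since_initiation(age_at_visit, age_at_onset):
    """Years elapsed between estimated disease onset and the study visit."""
    return age_at_visit - age_at_onset


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical records: (disease type, age at visit, age at onset, marker level).
records = [
    ("disease_A", 60, 58, 14.0),
    ("disease_A", 65, 60, 20.0),
    ("disease_A", 70, 62, 26.0),
    ("disease_B", 55, 54, 10.5),
    ("disease_B", 63, 59, 12.0),
    ("disease_B", 72, 62, 15.0),
]

# Fit one line per disease type, using time since initiation as the regressor.
slopes = {}
for disease in ("disease_A", "disease_B"):
    rows = [r for r in records if r[0] == disease]
    times = [time_since_initiation(visit, onset) for _, visit, onset, _ in rows]
    markers = [marker for _, _, _, marker in rows]
    _, slopes[disease] = fit_line(times, markers)

# Difference in slopes = disease type by time interaction (two-group case).
interaction = slopes["disease_B"] - slopes["disease_A"]
print(slopes, interaction)
```

A non-zero interaction in such a model would suggest that the marker changes at different rates in the two diseases. A real analysis would also need standard errors and the confounder adjustments described earlier, which this sketch omits.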

If it is only possible to define early, middle, and late stages, it will be more difficult to compare the disease processes of the different lung diseases, because it will not be possible to determine whether "early," "middle," and "late" mean the same thing in each disease. In this case, the "timing" variable would either not make sense or would be difficult to interpret when included in interaction analyses. However, the ordinal nature of the stage determinations would still allow meaningful comparisons within a specific disease, so that the cellular and biochemical changes occurring within a lung disease type could be tracked.

The DCC will develop regression and analysis of variance models that will provide meaningful interpretations of the LTRC data. It is anticipated that standard analysis methods will be used for continuous outcome variables and that logistic regression will be used for categorical outcome variables. For linear regression, goodness of fit will be assessed by examining the residuals and by influence analysis; for logistic regression, the Hosmer-Lemeshow test (14) will be used.
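The Hosmer-Lemeshow statistic is conventionally computed from the fitted probabilities of a logistic model: subjects are sorted by predicted probability, split into groups, and observed event counts are compared with expected counts in each group. The following is a minimal sketch of that calculation, assuming the fitted probabilities are already available. The function name, the grouping rule (equal-sized groups by rank), and the toy data are assumptions made for illustration, not the DCC's procedure.

```python
def hosmer_lemeshow_statistic(outcomes, probabilities, groups=10):
    """Hosmer-Lemeshow chi-square statistic for binary outcomes (0/1).

    Subjects are sorted by predicted probability and cut into `groups`
    roughly equal-sized groups. The result is usually referred to a
    chi-square distribution with `groups - 2` degrees of freedom.
    """
    pairs = sorted(zip(probabilities, outcomes))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue  # more groups than subjects; skip empty groups
        size = len(chunk)
        observed = sum(y for _, y in chunk)   # observed events in the group
        expected = sum(p for p, _ in chunk)   # expected events in the group
        mean_p = expected / size
        # Undefined if every predicted probability in a group is 0 or 1.
        statistic += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
    return statistic


# Toy data: four subjects predicted at 0.25 and four at 0.75.
predicted = [0.25] * 4 + [0.75] * 4
well_calibrated = [0, 0, 0, 1, 0, 1, 1, 1]   # 1 and 3 events, as expected
poorly_calibrated = [1, 1, 0, 0, 0, 1, 1, 1]  # 2 events in the low-risk group

print(hosmer_lemeshow_statistic(well_calibrated, predicted, groups=2))
print(hosmer_lemeshow_statistic(poorly_calibrated, predicted, groups=2))
```

Larger values indicate a greater discrepancy between observed and predicted event counts. Two groups are used here only to keep the arithmetic easy to follow; ten groups (deciles of risk) is the usual choice.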