# Data Analysis

Data analyses will be carried out in the LTRC for two main purposes: to monitor CC performance, and to perform appropriate analyses of all study data, with particular emphasis on comparisons among participant groups. It is anticipated that recruitment and status reports will be generated monthly, and that monitoring reports providing more detailed information about data quality and the performance of the CCs and CLs will be generated quarterly and presented to the SC, the NHLBI, and the OSMB. Analyses of study-specific objectives will be performed on an "as needed" basis. This will include support, as directed by the NHLBI and the OSMB, for discrete and collaborative studies between the LTRC and other investigators.

Primary analyses for each study will focus on estimating group differences for the designated primary end point and developing statistical models to determine associations and relationships between dependent variables and risk factors. Particular analysis methods will depend on the type of study being performed and the type of end point and covariates being collected.

It will be essential to adjust study results for potential confounding factors. Three general approaches are proposed: matching, stratification, and regression analyses. In observational studies, group membership is not randomized, so the groups can differ in their profiles of known risk factors. This creates problems when trying to determine which of the risk factors are responsible for a particular outcome. Matching and stratification represent a non-parametric way to adjust for potential confounding variables, while regression models represent a parametric method. Matching and stratification are used to "adjust out" confounding effects when one is not interested in estimating those effects; regression models are used when one is interested in comparing the effects due to a collection of risk factors. Both the non-parametric and the parametric techniques are important when performing outcome analyses. Not only do regression models control for potential confounding from markers and cofactors, they can also be used to determine whether effect modifiers are present that accelerate or delay the cellular and biochemical changes involved in the progression of a lung disease. These effect modifiers are usually included in regression models in the form of interactions.
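As an illustration of how stratification can "adjust out" a confounder, the following minimal sketch compares a crude odds ratio with a Mantel-Haenszel stratified odds ratio. The Mantel-Haenszel estimator is one common choice and is an assumption here, not a method named in this plan; the function names and the 2x2 counts are hypothetical.

```python
from typing import List, Tuple

# Each stratum is a 2x2 table (a, b, c, d):
#   a = group 1 with outcome,  b = group 1 without outcome,
#   c = group 2 with outcome,  d = group 2 without outcome.
Stratum = Tuple[int, int, int, int]


def crude_odds_ratio(strata: List[Stratum]) -> float:
    """Odds ratio after pooling all strata, i.e. ignoring the confounder."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    d = sum(s[3] for s in strata)
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata: List[Stratum]) -> float:
    """Mantel-Haenszel summary odds ratio across strata of the confounder."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical counts: two strata of a confounder (for example, a risk
# factor that is unevenly distributed between the two groups).
EXAMPLE_STRATA: List[Stratum] = [
    (40, 10, 20, 5),
    (5, 20, 10, 40),
]

if __name__ == "__main__":
    print("crude OR:", crude_odds_ratio(EXAMPLE_STRATA))
    print("stratified OR:", mantel_haenszel_odds_ratio(EXAMPLE_STRATA))
```

In these made-up counts the within-stratum odds ratio is 1 in both strata, so the apparent crude association is produced entirely by the uneven distribution of the confounder between groups.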

Since there are no plans to follow participants after they are enrolled into the study, it will be necessary to construct disease progression profiles from the individual data points that have been collected. Regression methods will be used to accomplish this aim. As part of the data collection efforts, the LTRC investigators will be asked to determine (or estimate) the amount of time that has elapsed since the participant first developed the conditions that led to the disease. It is expected that these estimates will be crude, but it may be possible to at least postulate that the disease is in its early, middle, or late stages.

If it is possible to actually estimate an "age" at which the disease process was initiated, the time since initiation of the disease is the difference between age of the participant at the time of the visit and the age at which the disease was initially developed. A time measure such as this can be used as a blocking factor in regression analyses, or as an actual regressor (independent variable). The identification of such a variable would allow the LTRC investigators to compare the cellular and biochemical progression of the different lung diseases in a very meaningful way. Under this scenario, it would be possible to develop disease type by time interactions to compare and contrast the different cellular and biochemical changes that take place between the different lung diseases being studied in the LTRC.
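A minimal sketch of this idea, under stated assumptions: time since initiation is computed as age at visit minus age at onset, and the disease type by time interaction for two disease types is obtained as the difference between the per-disease slopes of a marker on time (which equals the interaction coefficient in a linear model with a disease indicator, time, and their product). The record layout, disease labels, and marker values are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

# Hypothetical record: (disease_type, age_at_visit, age_at_onset, marker_value)
Record = Tuple[str, float, float, float]


def time_since_onset(age_at_visit: float, age_at_onset: float) -> float:
    """Time since disease initiation, in the same units as the ages."""
    return age_at_visit - age_at_onset


def slope(xs: List[float], ys: List[float]) -> float:
    """Least-squares slope of ys on xs (simple linear regression)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def interaction_coefficient(records: List[Record],
                            reference: str, comparison: str) -> float:
    """Difference in marker-versus-time slopes: comparison minus reference."""
    def disease_slope(disease: str) -> float:
        rows = [r for r in records if r[0] == disease]
        times = [time_since_onset(r[1], r[2]) for r in rows]
        markers = [r[3] for r in rows]
        return slope(times, markers)

    return disease_slope(comparison) - disease_slope(reference)


# Made-up data: marker rises by 2.0 per year in disease "A", 0.5 in "B".
EXAMPLE_RECORDS: List[Record] = [
    ("A", 61.0, 60.0, 3.0), ("A", 63.0, 60.0, 7.0), ("A", 70.0, 65.0, 11.0),
    ("B", 52.0, 50.0, 4.0), ("B", 58.0, 54.0, 5.0), ("B", 66.0, 60.0, 6.0),
]

if __name__ == "__main__":
    print(interaction_coefficient(EXAMPLE_RECORDS, "A", "B"))
```

A nonzero difference in slopes indicates that the marker changes at different rates over time in the two diseases, which is what a disease type by time interaction term would capture.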

If it is only possible to create definitions for early, middle, and late stages, it will be more difficult to compare the disease processes of the different lung diseases, since it will not be possible to determine whether "early," "middle," and "late" mean the same thing in each disease. In this case, the "timing" variable would either not make sense or would be difficult to interpret when included in interaction analyses. However, the ordinal nature of the determinations would allow meaningful comparisons within a specific disease, and one could track the cellular and biochemical changes that occur within a lung disease type.

The DCC will develop regression and analysis of variance models that will provide meaningful interpretations of the LTRC data. It is anticipated that standard analysis methods will be used for continuous variables and that logistic regression will be used for categorical variables. For linear regression, goodness of fit will be assessed by examining the residuals and by influence analysis. For logistic regression, the Hosmer-Lemeshow test (14) will be used to assess goodness of fit.
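The Hosmer-Lemeshow statistic is a goodness-of-fit check for logistic regression that compares observed and expected event counts within groups of participants ordered by predicted probability. The sketch below computes only the statistic from already-fitted probabilities; the function name, the equal-size grouping rule, and the default of ten groups are illustrative assumptions. The result is conventionally referred to a chi-square distribution with (number of groups minus 2) degrees of freedom.

```python
from typing import List, Sequence


def hosmer_lemeshow_statistic(probs: Sequence[float],
                              outcomes: Sequence[int],
                              groups: int = 10) -> float:
    """Hosmer-Lemeshow chi-square statistic.

    probs    -- predicted event probabilities from a fitted logistic model
    outcomes -- observed outcomes coded 0/1, in the same order as probs
    groups   -- number of roughly equal-size groups of ordered predictions
    """
    # Order observations by predicted probability only (stable sort, so
    # ties keep their original order rather than being ordered by outcome).
    pairs: List = sorted(zip(probs, outcomes), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        n_g = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        p_bar = expected / n_g                # mean predicted probability
        denominator = n_g * p_bar * (1.0 - p_bar)
        if denominator > 0:
            statistic += (observed - expected) ** 2 / denominator
    return statistic


if __name__ == "__main__":
    # Made-up predictions: every participant has predicted probability 0.5.
    calibrated = hosmer_lemeshow_statistic([0.5] * 20, [0, 1] * 10, groups=2)
    miscalibrated = hosmer_lemeshow_statistic([0.5] * 20, [1] * 20, groups=2)
    print(calibrated, miscalibrated)
```

Large values of the statistic indicate that predicted probabilities disagree with observed event rates in at least some risk groups.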

Some missing data are anticipated. If data are missing at random, there will be a loss in efficiency of the proposed analyses, but bias will not be introduced into the study by not accounting for the missing data. We have assumed that outcome information will be available for 85% of LTRC participants. Per Appendix C, this is at least 1300 participants through 2010 and 2700 through 2014. A review of power curves and tables for sample sizes up to 1600 participants suggests that there will be sufficient power to address the study objectives, even when the sample of evaluable participants is lower than 1000.

However, if data are not missing at random, there could be a bias in some of the estimation and inference routines used in this study. We will use the multiple imputation procedure developed by Rubin to correct for this type of bias (15).
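Once several imputed data sets have been analyzed, the per-data-set results are combined using Rubin's rules. The sketch below shows only this pooling step, not the imputation model itself; the function name and the example numbers are hypothetical.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float],
               variances: Sequence[float]) -> Tuple[float, float]:
    """Combine results from m imputed data sets using Rubin's rules.

    estimates -- the point estimate from each imputed data set
    variances -- the squared standard error from each imputed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


if __name__ == "__main__":
    # Made-up results from three imputed data sets.
    print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The between-imputation term inflates the total variance to reflect the extra uncertainty caused by the missing values, so the pooled standard error is larger than the standard error from any single completed data set whenever the imputed results disagree.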

A preliminary power analysis has been performed for the LTRC. It has been found that the proposed __number of participants (N = approximately 1600) is sufficient to allow for small to moderate effects to be detected with adequate power (at least 80%) when testing at the alpha level of 0.05 (two-sided tests) and for larger effect sizes if subgroup analyses are to be performed.__
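To give a sense of scale for this statement, the following sketch approximates the power of a two-sided, two-group comparison of means using the normal approximation. It is not the power analysis performed for the LTRC: the equal split of 1600 participants into two groups of 800 and the standardized effect size of 0.2 are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size: float, n_per_group: int,
                     alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample z-test.

    effect_size -- standardized mean difference (difference / common SD)
    n_per_group -- number of participants in each of the two groups
    alpha       -- two-sided significance level
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * sqrt(n_per_group / 2.0)
    # Probability of rejecting in either tail under the alternative.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Assumed scenario: 1600 participants split evenly, small effect (0.2).
    print(round(two_sample_power(0.2, 800), 3))
```

Under these assumptions the approximate power is roughly 0.98, which is consistent with the claim that small effects are detectable at the full sample size; smaller subgroups would need correspondingly larger effects to reach 80% power.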

The LTRC offers a unique opportunity to incorporate variables depicting cellular morphology differences, lung morphology differences, lung tissue biochemical differences and genetic differences of different diseases to show how these variables impact the progression of the various lung diseases. The models presented here are general and robust enough to allow important research to be done in this area.