Data Management Plan Sample 3
GI SPORE (Gastroenterology/Colon Cancer – Specialized Program of Research Excellence, NCI): Informatics
The informatics requirements for the support of translational research are more complex than those for trials with purely clinical endpoints, because they involve data collection from many diverse sources and protocols (e.g. clinical data, laboratory processes and instruments, biospecimens, both raw and processed laboratory data), which must be integrated for analysis. Informatics involvement in study design and implementation is critical, and, in the course of this application, the Biostatistics and Informatics Core collaborated actively with the projects and other cores to ensure that all data collection, storage, integration and analysis needs will be met. The informatics plan for this SPORE balances the need for a central database with the cost of perturbing established information management systems within clinics and laboratories. GI SPORE Informatics serves as the communications hub for all studies, storing all data descriptions, information flow documentation, and study protocols for the SPORE projects, and facilitating the integration of all data in response to reporting and analysis needs, with careful consideration for specimen and data identification protocols and attention to interoperability of data sources. After extensive consultation with Dr. Brenner, Dr. Normolle, and all of the GI SPORE project leaders, the following principles emerged for the scope of service, serving the needs of the GI SPORE as a whole, while allowing flexibility when dealing with individual projects:
1. The core tracks all biospecimens collected in GI SPORE clinical trials and population studies from collection to storage to assay.
2. Core staff works with GI SPORE investigators to develop, collect and store metadata describing all laboratory experiments and outcomes in compliance with relevant NIH and NCI standards.
3. The core stores all data, clinical and laboratory, resulting from translational clinical trials and population studies in a GCP-compliant data management system.
4. The core collects and manages laboratory data for some but not necessarily all GI SPORE studies that are not clinical trials or population studies.
5. The core serves as the primary data portal for the biostatisticians.
6. To facilitate NIH data sharing requirements, the core serves as a repository for all data in presentations and publications.
The core tracks all biospecimens collected in GI SPORE clinical trials and population studies from collection to storage to assay. The BioInformatics Service Center at Dartmouth (BSC) has extensive experience designing the flow and tracking of biological specimens using barcoding and web-based sample scanning and will implement these procedures and technology as needed for the GI SPORE. For instance, the GLNE database system has tracked 1116 biospecimen shipments and logged 55,239 individual specimens. Despite the high volume of biospecimens, only twenty-three records have emerged as incomplete, and each incomplete record was reconciled by BSC staff and clinical centers. The BSC will use the GLNE system design and reusable software components to create the necessary biospecimen management infrastructure for this effort.
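As a minimal illustration of collection-to-storage-to-assay tracking, the Python sketch below records barcode scan events and flags specimens whose scan history skips a stage. It is not the GLNE implementation: the stage names, the barcode format, and the `SpecimenLog` class are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field

# Hypothetical lifecycle stages; the actual GLNE stages are not described here.
STAGES = ("collected", "stored", "assayed")


@dataclass
class SpecimenLog:
    """Records barcode scan events and reports specimens with gaps."""
    events: dict = field(default_factory=dict)  # barcode -> list of stages scanned

    def scan(self, barcode: str, stage: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.events.setdefault(barcode, []).append(stage)

    def incomplete(self) -> list:
        """Return barcodes whose scans skip an earlier stage.

        A specimen that has only been collected is still in progress, not
        incomplete; one that was assayed without a storage scan is flagged.
        """
        flagged = []
        for barcode, stages in self.events.items():
            seen = set(stages)
            latest = max(STAGES.index(s) for s in seen)
            expected = set(STAGES[: latest + 1])
            if seen != expected:
                flagged.append(barcode)
        return sorted(flagged)


log = SpecimenLog()
log.scan("GI-0001", "collected")
log.scan("GI-0001", "stored")
log.scan("GI-0001", "assayed")
log.scan("GI-0002", "collected")
log.scan("GI-0002", "assayed")  # storage scan missing
log.scan("GI-0003", "collected")  # still in progress
print(log.incomplete())
```

In this sketch only GI-0002 is reported, since it reached the assay stage without a storage scan. Flagged records of this kind are the ones that would be referred back to staff and clinical centers for reconciliation.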
Core staff works with GI SPORE investigators to develop, collect and store metadata describing all laboratory experiments and outcomes in compliance with relevant NIH and NCI standards. Compliance with NCI and NIH data and system standards facilitates cooperative data exchange among initiatives. The National Cancer Institute has launched several initiatives to facilitate cooperative development, standardization and sharing of data models and tools, as well as cooperative use of research data itself. The BSC, under Ms. Anton’s leadership, will develop the informatics systems for this GI SPORE with consideration not only for communication and data sharing among these GI SPORE projects, but also for the compatibility standards being developed in national initiatives like the EDRN, caBIG, and CFR. Ms. Anton, as a funded participant in the caBIG Population Sciences and Tissue Bank & Pathology Special Interest Groups, and a non-funded contributor to the Clinical Trial Management System Workspace, is actively engaged in the development of these networks and the emerging data element system and metadata standards. Ms. Anton will work with the GI SPORE and the caBIG developers to deposit all appropriate data elements in the caDSR (Cancer Data Standards Repository). Ms. Anton maintains a long-standing collaboration with Daniel Crichton, MS, at the NASA Jet Propulsion Laboratory (JPL). JPL, under Mr. Crichton’s leadership, has adapted a novel distributed data-sharing model, originally developed to integrate Martian exploration data, for application in the life sciences. Under this paradigm, study data is stored at its source location at all times, and an XML descriptive profile is created to map data to NCI standards, implement data transformations, and facilitate data integration.
Consequently, data sources are neither moved nor duplicated, yet an integrated result from any number of data sources may be returned to answer a query through one server and a single web interface. The application of such an interface to a translational research entity composed of independent laboratories, such as the GI SPORE, is obvious and immediate.
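The descriptive-profile idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the JPL software or its profile format: the XML element and attribute names, the field names, and the `load_profile` and `translate` helpers are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: maps one laboratory's local column names to shared
# element names and optional unit conversions. The schema is invented here.
PROFILE_XML = """
<profile source="lab_a">
  <map local="pt_age" standard="patient_age_years"/>
  <map local="wt_lb" standard="weight_kg" scale="0.45359237"/>
</profile>
"""


def load_profile(xml_text):
    """Parse a profile into (source_name, {local_name: (standard_name, scale)})."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for node in root.findall("map"):
        scale = float(node.get("scale", "1"))
        mapping[node.get("local")] = (node.get("standard"), scale)
    return root.get("source"), mapping


def translate(record, mapping):
    """Return a copy of a source record expressed in the shared vocabulary.

    Fields without a mapping are left out; the source record is not modified.
    """
    out = {}
    for local, value in record.items():
        if local in mapping:
            standard, scale = mapping[local]
            out[standard] = value * scale
    return out


source, mapping = load_profile(PROFILE_XML)
row = {"pt_age": 61, "wt_lb": 200, "internal_note_id": 7}
result = translate(row, mapping)
print(source, result)
```

The translation is applied when a query is answered, so the laboratory's own record keeps its local names and units and nothing is copied into a central store.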
The core stores all data, clinical and laboratory, resulting from translational clinical trials and population studies in a GCP-compliant data management system. The gold standard for systems built and maintained by the BSC for clinical data is the FDA “Guidance for Industry: Computerized Systems Used in Clinical Investigations” (http://www.fda.gov/cder/guidance/7359fnl.htm). Per this guideline, all systems will have, for instance, audit of data and metadata change, date/time stamps, restricted access to web interfaces and the data itself, and appropriate back-up and archive of data and systems. All servers used for GI SPORE systems will be part of the BSC local area network, which is protected by firewalls using stateful packet inspection, and protocol and application inspection. All computer systems are password protected. Network access to all computers and servers is limited by the use of usernames and passwords. System access to computer systems is audited. Virus protection is installed, maintained, and audited on all computers, and all computer systems are updated regularly with applicable operating system and security patches. Physical access to servers is restricted to authorized persons. Backups and off-site backup storage are maintained against the unlikely event of a hardware failure, a disaster, or a security breach. Restore procedures are tested regularly, as a quality control measure of the back-up procedures.
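One common way to obtain an audit of data changes with date/time stamps is a database trigger that writes every update to a separate log table. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the table and column names are hypothetical, and it does not describe the BSC's actual systems or database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lab_result (
    id INTEGER PRIMARY KEY,
    specimen TEXT NOT NULL,
    measurement REAL
);
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY,
    table_name TEXT NOT NULL,
    row_id INTEGER NOT NULL,
    old_value REAL,
    new_value REAL,
    changed_at TEXT NOT NULL DEFAULT (datetime('now'))
);
-- Every change to a measurement is copied to the audit log with a timestamp.
CREATE TRIGGER lab_result_audit AFTER UPDATE OF measurement ON lab_result
BEGIN
    INSERT INTO audit_log (table_name, row_id, old_value, new_value)
    VALUES ('lab_result', OLD.id, OLD.measurement, NEW.measurement);
END;
""")

conn.execute("INSERT INTO lab_result (specimen, measurement) VALUES ('GI-0001', 4.2)")
conn.execute("UPDATE lab_result SET measurement = 4.7 WHERE specimen = 'GI-0001'")
conn.commit()

audit_rows = conn.execute(
    "SELECT table_name, row_id, old_value, new_value, changed_at FROM audit_log"
).fetchall()
print(audit_rows)
```

Because the log is written by the database itself, the change is recorded regardless of which interface made it. A complete audit trail would also need to record who made the change, which this sketch omits.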
The core collects and manages laboratory data for some but not necessarily all GI SPORE studies that are not clinical trials or population studies. All of the laboratories collaborating in the GI SPORE are established, and direct management by the Core of data from pilot experiments, intermediate results, trial runs, etc., is impractical and unnecessary. As described above, a minimally invasive alternative is to characterize the metadata of key experimental results, and establish protocols for the presentation of those data to the core database, so that they may be accessed by the core statisticians for analysis in collaboration with the project leaders. However, some experiments may be sufficiently complex that the laboratories request external support for data management. The core can support, by means of existing, reusable code modules, any data management tasks the laboratories may require to achieve their objectives. The fundamental database system component which will be used to manage GI SPORE data is a tool developed by the BSC called eQuest. Functionally, the eQuest tool is a database system that manages the data elements and logic of interrelated variables within the system itself. From its database, eQuest automatically builds data entry interfaces for any web browser, removing the need to maintain many static web pages, and making interface changes trackable and simple to implement. Because study data reach the database in real time, and inter- and intra-variable operations are therefore possible in real time, data can be validated at the point of entry. The eQuest tool is currently in production in many active research studies, allowing researchers to monitor accrual rates and trends, critical end-point data, and adverse events without processing delays.
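The general technique of storing variable definitions as data and validating entries against them at submission time can be sketched as follows. This is not eQuest code: the dictionary layout, the variable names, and the `validate` function are hypothetical, and only per-variable rules are shown, not the inter-variable logic mentioned above.

```python
# Hypothetical data dictionary: each entry describes one study variable.
DATA_DICTIONARY = {
    "age_years": {"type": int, "min": 18, "max": 120, "required": True},
    "polyp_count": {"type": int, "min": 0, "max": 500, "required": True},
    "site": {"type": str, "choices": {"colon", "rectum"}, "required": False},
}


def validate(entry, dictionary=DATA_DICTIONARY):
    """Return a list of error messages for one submitted form; empty if clean."""
    errors = []
    for name, rule in dictionary.items():
        value = entry.get(name)
        if value is None:
            if rule.get("required"):
                errors.append(f"{name}: required")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: above maximum {rule['max']}")
        if "choices" in rule and value not in rule["choices"]:
            errors.append(f"{name}: not one of {sorted(rule['choices'])}")
    # Reject fields that the dictionary does not define.
    for name in entry:
        if name not in dictionary:
            errors.append(f"{name}: unknown variable")
    return errors


clean = validate({"age_years": 64, "polyp_count": 3, "site": "colon"})
rejected = validate({"age_years": 12, "site": "ileum"})
print(clean)
print(rejected)
```

Because the rules live in data rather than in page code, adding or changing a variable means editing one dictionary entry, and the same definitions could drive both the generated entry form and the checks applied to it.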
The core serves as the primary data portal for the biostatisticians. Data are accessible to the biostatisticians either by their specification of analysis sets and generation of data files by BSC staff, or by extraction of raw data via connection directly to the central database tables. The central database is created using Structured Query Language (SQL), a database programming language which allows flexibility in the choice of interface, so that web pages, graphical user interfaces (GUIs), and other software drivers (like SAS and Stata) may interact with the back-end database. By using the core database as a primary portal, the statisticians are relieved of data cleaning and file merging tasks, freeing them to work on data analysis, rather than data management.
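As an illustration of how an analysis set might be specified once as a query and delivered as a flat file, the sketch below joins a clinical table to a laboratory table and writes the result as CSV. The schema, the toy values, and the use of `sqlite3` are assumptions made for the example; the plan does not specify the database engine or table layout.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE participant (pid TEXT PRIMARY KEY, arm TEXT);
CREATE TABLE assay (pid TEXT, marker TEXT, level REAL);
INSERT INTO participant VALUES ('P1', 'treatment'), ('P2', 'placebo');
INSERT INTO assay VALUES ('P1', 'CEA', 3.1), ('P2', 'CEA', 5.4);
""")

# The join replaces a manual merge of separate clinical and laboratory files.
ANALYSIS_SET = """
SELECT p.pid, p.arm, a.marker, a.level
FROM participant AS p
JOIN assay AS a ON a.pid = p.pid
ORDER BY p.pid
"""

cursor = conn.execute(ANALYSIS_SET)
header = [col[0] for col in cursor.description]
rows = cursor.fetchall()

# Write to an in-memory buffer; a real export would write to a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

A statistical package could run the same query through its own database driver and skip the intermediate file.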
To facilitate NIH data sharing requirements, the core serves as a repository for all data in presentations and publications. Because the core serves as the data portal for the statisticians, who will support data analysis for publication for all the projects, the core will automatically have possession of all published experimental data, and associated metadata, produced by the GI SPORE. This centralized repository will facilitate secure, structured, authorized and reliable data sharing with the scientific community outside of the GI SPORE.
Facilities, Environment, and Resources
The central database for this project will be implemented on the robust technical infrastructure already in place at Dartmouth. The BioInformatics hardware, network, and security infrastructure includes development, test, and production environments for all of our system layers, including database and web. We maintain a bank of forty servers, two of which run VMware ESX 3.5, 12 of which run Ubuntu LTS 6.06 Linux, four of which run Windows 2000 Server and 22 of which run Windows 2003 Server. The primary networking OS is Microsoft Networking using Microsoft’s Active Directory on a TCP/IP network. All servers are stored within a dedicated, temperature-controlled server room in the BioInformatics suite of offices at the EverGreen Center in Lebanon, New Hampshire. BioInformatics has a local area network, which is protected by firewall, and accesses the world wide web through the Dartmouth College network. The Service Center includes 1.5 FTE dedicated to network, hardware and software infrastructure management.