mmCIF in Structural Bioinformatics - IUCr


mmCIF in Structural Bioinformatics
John Westbrook
Rutgers, The State University of New Jersey
www.wwpdb.org

Overview
§ Brief history of mmCIF development & implementation
§ How the wwPDB archives structure, experimental, and reference data
§ How mmCIF is helping to address current challenges in data archiving
§ Recent developments in wwPDB data deposition and delivery

PDBx/mmCIF Development Timeline
[Timeline figure, 1991 to 2012. Milestones: IUCr mmCIF Working Party; Core CIF V1; DDL 1; mmCIF V1; DDL 2; IUCr mmCIF Maintenance Group; mmCIF Extensions; PDB Exchange Dictionary; wwPDB "One Archive – One Dictionary"; wwPDB Common Deposition & Annotation System. Meeting locations: Rutgers, St. Louis, Glasgow, Seattle, Rutgers, CARB, Honolulu, Orlando, EBI.]

PDB Exchange Dictionary: Scientific Content
§ Coordinate and supporting primary data
§ Experimental descriptions for: X-ray/Neutron diffraction, NMR, Electron Microscopy, SAS & hybrid methods
§ Protein production
§ Molecular and chemical representation
§ Biological and functional annotation
§ Additional derivative data
  § Functional assemblies, validation details, coordinate frame transformations, secondary & tertiary structural features, nucleic acid structural features, …
http://mmcif.pdb.org/

PDB Exchange Dictionary: Metadata Content
§ Features of Data Items
  § Definitions and examples
  § Data types (primitives & regular expression patterns)
  § Boundary values
  § Controlled vocabularies
§ Simple organization
  § Tables and columns (categories)
  § Related data item sets (subcategories)
  § Chapters (category groups)
§ Associations
  § Referential integrity - parent-child relationships
  § Interdependencies/exclusivity
§ Methods
http://mmcif.pdb.org/
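The item-level metadata listed above (data types as regular expression patterns, boundary values, controlled vocabularies) is what makes dictionary-driven validation possible. The following is only a minimal sketch of that idea: `ITEM_DEFS` and `validate` are hypothetical names, the two definitions are hand-written for illustration, and the method vocabulary is a small illustrative subset. The real PDBx dictionary stores this metadata in its own dictionary language and is far richer.

```python
import re

# Hypothetical, hand-written item definitions for illustration only.
ITEM_DEFS = {
    # data type given as a regular expression, plus a lower boundary value
    "_exptl.crystals_number": {"pattern": r"[0-9]+", "min": 1},
    # controlled vocabulary (illustrative subset)
    "_exptl.method": {
        "enum": {"X-RAY DIFFRACTION", "SOLUTION NMR", "ELECTRON MICROSCOPY"}
    },
}


def validate(item, value):
    """Return a list of problems for one data item value (empty if none)."""
    rule = ITEM_DEFS.get(item)
    if rule is None:
        return [f"{item}: not defined in the dictionary"]
    problems = []
    if "pattern" in rule and not re.fullmatch(rule["pattern"], value):
        problems.append(f"{item}: {value!r} does not match the data type")
        return problems  # skip the boundary check for malformed values
    # Boundary values are only attached to numeric items in this sketch.
    if "min" in rule and float(value) < rule["min"]:
        problems.append(f"{item}: {value} is below the minimum {rule['min']}")
    if "enum" in rule and value not in rule["enum"]:
        problems.append(f"{item}: {value!r} is not in the controlled vocabulary")
    return problems


print(validate("_exptl.method", "X-RAY DIFFRACTION"))
print(validate("_exptl.crystals_number", "0"))
```

Keeping the rules in data rather than in code is the point of the design: the same checker can serve deposition, annotation, and format-conversion tools as the dictionary evolves.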

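The Central Role of the Data Dictionary
[Diagram: the PDB Exchange Dictionary (PDBx/mmCIF) at the center, connected to Validation Tools, Deposition Tools, Harvesting Tools, PDBx Format Tools, XML Schema, RDF/OWL, and SQL Schema. Output formats: PDB, mmCIF, XML, RDF.]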

Current Supported Archival Formats
§ PDB (ca. 1974)
§ PDBx/mmCIF (ca. 1997)
§ PDBML (ca. 2005)
§ RDF (ca. 2011)
[Figure: the protein structure format universe: PDB, PDBx/mmCIF, PDBML & RDF]
In managing the formats, PDBx is the master format.

PDB Format
[Example: REMARK 3 refinement records followed by ATOM records]
  DATA USED IN REFINEMENT.
    RESOLUTION RANGE HIGH (ANGSTROMS)
    RESOLUTION RANGE LOW (ANGSTROMS)
    DATA CUTOFF (SIGMA(F))
    COMPLETENESS FOR RANGE (%)
    NUMBER OF REFLECTIONS
  FIT TO DATA USED IN REFINEMENT.
    CROSS-VALIDATION METHOD
    FREE R VALUE TEST SET SELECTION
    R VALUE (WORKING TEST SET)
    R VALUE (WORKING SET)
    FREE R VALUE
    FREE R VALUE TEST SET SIZE (%)
    FREE R VALUE TEST SET …
  Values as extracted, in order: … 43316, NULL, NULL, NULL, 0.191, 0.221, NULL, 2189
§ Record-oriented with fixed column format
§ Metadata in semi-…
§ …y used and supported archival format

PDBx/mmCIF Format Example
§ Name – value pairs

    _exptl.entry_id          1XBB
    _exptl.method            'X-RAY DIFFRACTION'
    _exptl.crystals_number   1

§ Tables (loop_'s)

    loop_
    _database_PDB_rev.num
    _database_PDB_rev.date
    _database_PDB_rev.date_original
    _database_PDB_rev.mod_type
    _database_PDB_rev.replaces
    _database_PDB_rev.status
    1 2004-11-02 2004-08-30 0 1XBB ?
    2 2005-03-22 ?          1 1XBB ?
    3 2009-02-24 ?          1 1XBB ?

§ Simple syntax
§ Named data items
§ Data semantics defined in the PDBx data dictionary
§ Software support in most popular languages
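The two constructs on this slide, name-value pairs and loop_ tables, are simple enough that a toy reader fits in a few lines. The sketch below is not a conforming CIF parser: `parse_simple_cif` is a hypothetical helper, it leans on `shlex.split` for quoting (which only approximates CIF quoting rules), and it ignores multi-line text fields and values placed on the line after their name. The `data_1XBB` header is added here because CIF files begin with a data block name. For real files, one of the libraries listed under software support is the better choice.

```python
import shlex

SAMPLE = """
data_1XBB
_exptl.entry_id          1XBB
_exptl.method            'X-RAY DIFFRACTION'
_exptl.crystals_number   1
loop_
_database_PDB_rev.num
_database_PDB_rev.date
_database_PDB_rev.date_original
_database_PDB_rev.mod_type
_database_PDB_rev.replaces
_database_PDB_rev.status
1 2004-11-02 2004-08-30 0 1XBB ?
2 2005-03-22 ?          1 1XBB ?
3 2009-02-24 ?          1 1XBB ?
"""


def parse_simple_cif(text):
    """Parse name-value pairs and loop_ tables into {category: [row dicts]}."""
    tables = {}
    loop_names = None  # None when we are not inside a loop_ block
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("data_"):
            loop_names = None
            continue
        if line == "loop_":
            loop_names = []
            continue
        tokens = shlex.split(line)
        if line.startswith("_"):
            if loop_names is not None and len(tokens) == 1:
                loop_names.append(tokens[0])  # a loop_ column header
            else:
                name, value = tokens  # a single name-value pair
                category, column = name[1:].split(".", 1)
                tables.setdefault(category, [{}])[0][column] = value
                loop_names = None
        elif loop_names:
            # a data row: pair each token with its column name
            category = loop_names[0][1:].split(".", 1)[0]
            columns = [n.split(".", 1)[1] for n in loop_names]
            tables.setdefault(category, []).append(dict(zip(columns, tokens)))
    return tables


tables = parse_simple_cif(SAMPLE)
print(tables["exptl"][0]["method"])
print(len(tables["database_PDB_rev"]))
```

Both constructs end up in the same shape (a category holding rows of named columns), which reflects how the dictionary treats every category as a table.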

PDBML Example

    <PDBx:entity_polyCategory>
      <PDBx:entity_poly entity_id="1">
        <PDBx:type>polypeptide(L)</PDBx:type>
        <PDBx:nstd_linkage>no</PDBx:nstd_linkage>
        <PDBx:nstd_monomer>no</PDBx:nstd_monomer>
        <PDBx:pdbx_seq_one_letter_code>…GTKLEIK</PDBx:pdbx_seq_one_letter_code>
        <PDBx:pdbx_seq_one_letter_code_can>…HFWSTPRTFGGGTKLEIK</PDBx:pdbx_seq_one_letter_code_can>
      </PDBx:entity_poly>
    </PDBx:entity_polyCategory>

§ Three flavors of XML files:
  § fully marked-up files
  § files without atom records
  § files with a more space efficient encoding of atom records
§ Follows naming and semantics of the PDBx data dictionary

RDF Example

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xml" href="http://pdbj.org/rdf-supp/pdbj-rdf.xsl" ?>
    <rdf:RDF xmlns:PDBo="http://pdbj.org/schema/pdbx-v40.owl#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <PDBo:PDBID rdf:about="http://pdbj.org/pdb/1GOF">
        <rdfs:label>1GOF</rdfs:label>
      </PDBo:PDBID>
      <rdf:Description rdf:about="http://pdbj.org/rdf/1GOF/entity/1">
        <PDBo:entity.formula_weight>68579.250</PDBo:entity.formula_weight>
        <PDBo:entity.id>1</PDBo:entity.id>
        <PDBo:entity.pdbx_description>GALACTOSE OXIDASE</PDBo:entity.pdbx_description>
        <PDBo:entity.pdbx_ec>1.1.3.9</PDBo:entity.pdbx_ec>
        <PDBo:entity.pdbx_number_of_molecules>1</PDBo:entity.pdbx_number_of_molecules>
        <PDBo:entity.src_method>man</PDBo:entity.src_method>
        <PDBo:entity.type>polymer</PDBo:entity.type>
        <PDBo:link_to_enzyme rdf:resource="http://purl.uniprot.org/enzyme/1.1.3.9"/>
        <PDBo:of_datablock rdf:resource="http://pdbj.org/rdf/1GOF"/>
        <PDBo:referenced_by_entity_keywords rdf:resource="http://pdbj.org/rdf/1GOF/entity_keywords/1"/>
        <PDBo:referenced_by_entity_poly rdf:resource="http://pdbj.org/rdf/1GOF/entity_poly/1"/>
        <PDBo:referenced_by_entity_src_gen rdf:resource="http://pdbj.org/rdf/1GOF/entity_src_gen/1"/>
        <PDBo:referenced_by_struct_asym rdf:resource="http://pdbj.org/rdf/1GOF/struct_asym/A"/>
        <PDBo:referenced_by_struct_ref rdf:resource="http://pdbj.org/rdf/1GOF/struct_ref/1"/>
        <rdf:type rdf:resource="http://pdbj.org/schema/pdbx-v40.owl#entity"/>
      </rdf:Description>
    </rdf:RDF>

§ Entry point for semantic web and reasoning systems
§ … with URL identifiers
  § http://pdbj.org/rdf/<pdbID>/<categoryName>/<pkey1>,…
  § For example, http://pdbj.org/rdf/1GOF/entity/1
§ Follows naming and semantics of the PDBx data dictionary
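The identifier pattern on this slide (PDB ID, then category name, then primary key) can be sketched as a small URL builder. This is an illustration only: `rdf_uri` is a hypothetical helper, and joining several key values with commas is an assumption read off the trailing comma in the slide's template, not a documented rule.

```python
from urllib.parse import quote

BASE = "http://pdbj.org/rdf"  # base URL as shown on the slide


def rdf_uri(pdb_id, category=None, *keys):
    """Build an identifier of the form BASE/pdbID/categoryName/pkey1,..."""
    parts = [BASE, pdb_id.upper()]
    if category is not None:
        parts.append(category)
        if keys:
            # Assumption: multiple primary-key values are comma-separated.
            parts.append(",".join(quote(str(k), safe="") for k in keys))
    return "/".join(parts)


print(rdf_uri("1gof", "entity", 1))
print(rdf_uri("1gof", "struct_asym", "A"))
```

Because the category names come straight from the PDBx dictionary, the same function covers every category without a lookup table.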

Chemical Reference Data: Chemical Component Dictionary
§ Library of all polymer and non-polymer chemical components in PDB
§ 18,000 chemical component definitions
§ 400 additional definitions of amino acid protonation variants
§ 700 new components released this year
§ 1700 component definitions updated this year
§ Complementary to the CCP4 monomer library

Chemical Reference Data: Example

    loop_
    _chem_comp_atom.comp_id
    _chem_comp_atom.atom_id
    _chem_comp_atom.alt_atom_id
    _chem_comp_atom.type_symbol
    _chem_comp_atom.charge
    _chem_comp_atom.pdbx_align
    _chem_comp_atom.pdbx_aromatic_flag
    _chem_comp_atom.pdbx_leaving_atom_flag
    _chem_comp_atom.pdbx_stereo_config
    _chem_comp_atom.model_Cartn_x
    _chem_comp_atom.model_Cartn_y
    _chem_comp_atom.model_Cartn_z
    _chem_comp_atom.pdbx_model_Cartn_x_ideal
    _chem_comp_atom.pdbx_model_Cartn_y_ideal
    _chem_comp_atom.pdbx_model_Cartn_z_ideal
    _chem_comp_atom.pdbx_ordinal
    HYP N    N   N 0 1 N N N -3.366 16.585 44.188 … … … 1
    HYP CA   CA  C 0 1 N N S -2.955 15.768 43.044 … … … 2
    HYP C    C   C 0 1 N N N -1.447 15.609 43.030 … … … 3
    HYP O    O   O 0 1 N N N -0.722 16.484 43.503 … … … 4
    HYP CB   CB  C 0 1 N N N -3.408 16.578 41.829 … … … 5
    HYP CG   CG  C 0 1 N N R -4.437 17.482 42.330 … … … 6
    HYP CD   CD  C 0 1 N N N -4.068 17.803 43.753 … … … 7
    HYP OD1  OD  O 0 1 N N N -5.693 16.815 42.294 … … … 8
    HYP OXT  OXT O 0 1 N Y N -0.976 14.502 42.469 … … … 9
    HYP H    H   H 0 1 N Y N -3.980 16.047 44.765 … … … 10
    HYP HA   HA  H 0 1 N N N -3.385 14.756 43.068 … … … 11
    HYP HB2  1HB H 0 1 N N N -2.567 17.141 41.398 … … … 12
    HYP HB3  2HB H 0 1 N N N -3.790 15.930 41.026 … … … 13
    HYP HG   HG  H 0 1 N N N -4.508 18.399 41.726 … … … 14
    HYP HD22 1HD H 0 0 N N N -4.956 18.005 44.370 … … … 15
    HYP HD23 2HD H 0 0 N N N -3.457 18.713 43.848 … … … 16
    HYP HD1  HOD H 0 1 N N N -5.999 16.666 43.181 … … … 17
    HYP HXT  HXT H 0 1 N N N -0.027 14.511 42.499 … … … 18
    #

Callouts: atom names; stereochemistry & aromaticity; model coordinates; ideal coordinates.

Biologically Interesting Reference Molecule Dictionary (BIRD)
§ Contains 630 chemical definitions for peptide inhibitors and antibiotics
§ Unifies the representation of small polymers and single molecules with substantially polymeric chemical structure
§ Provides structural and functional annotations
§ Designed to facilitate both sequence and detailed chemical structure searches
[Figure labels: Target, Hit]

Keeping Pace with Structural Biology
§ The most enduring and widely used archival PDB format is not keeping pace with new science and technology.
§ Efforts to work around PDB format limitations are increasingly problematic.
[Diagram: PDB and mmCIF/PDBx. Reversible translation no longer practical.]

Challenges of Molecular Size
§ PDB column format limitations
  § 1-character for polymer chain labels
  § 5-characters for atom serial numbers
  § 3-characters for monomers and ligand identifiers
  § 4-characters for atom names
  § F8.3 for model coordinates
§ Implications – 1VOQ
  § Maximum of 62 chains (upper and lower case!)
  § Maximum of 99,999 atoms
  § Requires splitting structures across multiple entries (5 ribosomes in ASU stored in 10 PDB entries!)
  § Map and experimental validation are difficult for split entries
  § Cannot use standard monomer & ligand nomenclatures (e.g. carbohydrates & protonation variants)
  § Cannot use conventional atom names in large ligands
  § Limits molecular dimension (9999.999 Angstroms)
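The limits above follow directly from the field widths, which a short calculation makes concrete. The sketch below is illustrative only: `pdb_format_problems` is a hypothetical helper, and it checks just three of the limits (chain labels, atom serial numbers, and the F8.3 coordinate field).

```python
import string

# One character per chain label: upper case, lower case, and digits.
CHAIN_IDS = string.ascii_uppercase + string.ascii_lowercase + string.digits
MAX_ATOM_SERIAL = 99_999  # five-character serial number field
COORD_WIDTH = 8           # F8.3: eight characters, three decimal places


def pdb_format_problems(n_chains, n_atoms, coords=()):
    """List the reasons a model cannot be written as a single PDB format file."""
    problems = []
    if n_chains > len(CHAIN_IDS):
        problems.append(f"{n_chains} chains exceed the {len(CHAIN_IDS)} available labels")
    if n_atoms > MAX_ATOM_SERIAL:
        problems.append(f"{n_atoms} atoms exceed serial number {MAX_ATOM_SERIAL}")
    for value in coords:
        # A value overflows F8.3 when its formatted text needs more than 8 characters.
        if len(f"{value:8.3f}") > COORD_WIDTH:
            problems.append(f"coordinate {value} does not fit in F8.3")
            break
    return problems


print(len(CHAIN_IDS))
# The HIV-1 capsid entry cited later in the talk: 1356 chains, about 2M atoms.
for problem in pdb_format_problems(1356, 2_000_000):
    print(problem)
```

Note that the F8.3 field is asymmetric: the sign takes one character, so negative coordinates overflow at -1000.000 while positive ones fit up to 9999.999.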

Representing Evolving Content
§ PDB Record format limitations
  § The small number of named records are not extensible (e.g. ATOM, CONECT, SEQRES, …).
  § Text REMARKS with ad hoc formats hold all other meta-data.
§ Implications
  § No bond orders are specified for ligands
  § ATOM records tailored for traditional X-ray methods and inflexible for newer methods (e.g. TLS groups)
  § Sequence and coordinate residue correspondence is ambiguous and there is no support for heterogeneity.
  § Growing diversity and complexity of REMARK records has become unmanageable for both deposition and archiving
  § No standard and extensible way to represent and document meta-data

Key Format Goals for PDB
§ Represent all PDB model structure, supporting experimental data, and metadata
§ Provide a working format for data exchange between the laboratory and the archive
§ Support the entire structural biology pipeline: model building, refinement, visualization, validation, analysis, simulation, prediction, …
[Diagram labels: New Format in the …; …osetta; Chimera; Round Trip; Current PDB Format; RDF; New Format; PDBML]

Finding a Simple Format Alternative
§ 2010 – started process of defining new format, consulting many software developers
§ 2011 – Developers Workshop - agreement to adopt PDBx (mmCIF) as the new format and to phase out the old PDB format
  § Commitments from CCP4, Phenix and Global Phasing (i.e., 85% of all PDB depositions)
  § Agreement on managing development between these software providers and wwPDB
  § Established PDBx Deposition Working Group
§ 2013 - Working Group recommendations and implementations in CCP4 and Phenix.

PDBx/mmCIF Deposition Working Group
PDBx Deposition Working Group, Refinement Developers Workshop 2011 - EBI
§ In 2011, charged with finding a "round trip" single format that can handle complex data not supported by the PDB file format
§ Consensus reached on using dictionary-driven PDBx format
§ Implementations delivered in January 2013
[Diagram labels: PDBx Format in the Lab; Structure Determination Pipeline; Round …; …x Format in wwPDB ftp Archive]

Working Group Recommendations
Announced 22-May-2013
Format extensions for large structures:
§ Atom serial numbers (1 to the number of atoms)
§ Chain identifiers up to 4 characters
§ Cartesian coordinates with field widths as required and 3 decimal places
§ B-factors and occupancies with 3 decimal places precision.
§ Implement extensions as required so as to maintain backward compatibility

Transitional Home for Large Structures
Large single entries are now stored separately on the wwPDB ftp site, and PDB internally produces divided/split PDB format files.
ftp://ftp.wwpdb.org/pub/pdb/data/large large structures/XML/
HIV-1 Capsid 3J3Q – 1356 chains, 2M atoms, 25 PDB format entries
[Figure labels: 3J3Y, 3J3Q]

Providing Format Compatibility
§ Adopt a PDB friendly mmCIF/PDBx style
  § All records on a single text line
  § Columns presented in standard column order.
  § Tabular presentation with leading record names (e.g. ATOM, CELL, REFINE)
  § Method independent features in left-most columns (e.g. identifiers & coordinates)
  § Method specific features in the right-most columns (e.g. ADPs, NMR order/disorder parameters)
§ Continue to support PDB nomenclature semantics (e.g. PDB style chains, residue numbering, and insertion codes)
§ Large entries will be internally converted to divided/split PDB format files.

    loop_
    _atom_site.group_PDB
    _atom_site.id
    _atom_site.auth_atom_id
    _atom_site.type_symbol
    _atom_site.auth_comp_id
    _atom_site.auth_asym_id
    _atom_site.auth_seq_id
    _atom_site.Cartn_x
    _atom_site.Cartn_y
    _atom_site.Cartn_z
    _atom_site.pdbx_PDB_model_num
    _atom_site.occupancy
    _atom_site.pdbx_auth_alt_id
    _atom_site.B_iso_or_equiv
    ATOM 1  N  N GLN A 39 …
    ATOM 2  CA C GLN A 39 …
    ATOM 3  C  C GLN A 39 …
    ATOM 4  O  O GLN A 39 …
    ATOM 5  CB C GLN A 39 …
    ATOM 6  N  N VAL A 40 …
    ATOM 7  CA C VAL A 40 …
    ATOM 8  C  C VAL A 40 …
    ATOM 9  O  O VAL A 40 …
    ATOM 10 CB C VAL A 40 …
    ATOM 11 N  N ALA …
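The PDB-friendly style described above (one record per line, a fixed column order, field widths as required, three decimal places) can be sketched as a small writer for the atom_site columns shown. This is an illustration under stated assumptions: `format_atom_site` is a hypothetical helper, the coordinate values in the usage lines are made up, and values that would need CIF quoting (for example atom names containing a prime) are not handled.

```python
# Column order taken from the atom_site example above.
COLUMNS = [
    "group_PDB", "id", "auth_atom_id", "type_symbol", "auth_comp_id",
    "auth_asym_id", "auth_seq_id", "Cartn_x", "Cartn_y", "Cartn_z",
    "pdbx_PDB_model_num", "occupancy", "pdbx_auth_alt_id", "B_iso_or_equiv",
]
# Coordinates, occupancies, and B-factors are written with 3 decimal places.
FLOAT_COLUMNS = {"Cartn_x", "Cartn_y", "Cartn_z", "occupancy", "B_iso_or_equiv"}


def format_atom_site(atoms):
    """Write atom dicts as a loop_ with one aligned record per text line."""
    if not atoms:
        raise ValueError("need at least one atom")
    cells = []
    for atom in atoms:
        row = []
        for col in COLUMNS:
            value = atom.get(col, "?")  # "?" marks an unknown value in CIF
            if col in FLOAT_COLUMNS and value != "?":
                value = f"{float(value):.3f}"
            row.append(str(value))
        cells.append(row)
    # Field widths grow as required instead of being fixed in advance.
    widths = [max(len(row[i]) for row in cells) for i in range(len(COLUMNS))]
    lines = ["loop_"] + [f"_atom_site.{col}" for col in COLUMNS]
    for row in cells:
        lines.append(" ".join(v.ljust(w) for v, w in zip(row, widths)).rstrip())
    return "\n".join(lines)


atom = {
    "group_PDB": "ATOM", "id": 1, "auth_atom_id": "N", "type_symbol": "N",
    "auth_comp_id": "GLN", "auth_asym_id": "A", "auth_seq_id": 39,
    "Cartn_x": 1.0, "Cartn_y": -2.5, "Cartn_z": 12.3456,  # made-up coordinates
    "pdbx_PDB_model_num": 1, "occupancy": 1, "B_iso_or_equiv": 55.53,
}
print(format_atom_site([atom]))
```

Because the widths are computed from the data, a coordinate above 9999.999 or a four-character chain identifier simply widens its column rather than breaking the record.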

New wwPDB Deposition & Annotation System
mmCIF/PDBx
End-to-end support for PDBx/mmCIF

PDBx/mmCIF Software Support
§ Phenix and Refmac – produce native PDBx files for deposition
§ MMDB - macromolecular object library in CCP4
§ iotbx.cif/ucif - CCTBX C++/Python IO library with dictionary validation
§ CCIF – CCP4 C library with FORTRAN support and dictionary validation
§ CBFLib - ANSI-C library for CIF & imgCIF files
§ mmLIB - Python toolkit supporting CIF & mmCIF
§ BioPython - Python toolkit for computational biology
§ PyCifRW - Python CIF/mmCIF parsing tools
§ BioJava - Java mmCIF IO package
§ STAR::Parser – Perl mmCIF parser and molecular object library
§ RCSBTools - C++/Python parsing and dictionary validation tools plus many other supporting format conversion and data management applications
§ Visualization - Chimera, Jmol, OpenRasMol
PDB actively working with community developers to help fill in missing functionalities. Two workshops scheduled in Fall 2013.

wwPDB
NSF, NIGMS, DOE, NLM, NCI, NINDS, NIDDK
EMBL-EBI, Wellcome Trust, BBSRC, NIGMS, EU
NLM
NBDC-JST
