Research Data Overview


Research Data Overview
'A step by step guide through the research data lifecycle, data set creation, big data vs long-tail, metadata, data centres/data repositories'
Sarah enAIRE/LIBER Workshop
28 May 2013, Ghent, Belgium
* and a lot of others, including, but not limited to: the NERC data citation and publication project team, the PREPARDE project team and the CEDA team
VO Sandpit, November 2009

Who are we and why do we care about data?
The UK’s Natural Environment Research Council (NERC) funds six data centres which between them have responsibility for the long-term management of NERC's environmental data holdings.
We deal with a variety of environmental measurements, along with the results of model simulations in:
- Atmospheric science
- Earth sciences
- Earth observation
- Marine Science
- Polar Science
- Terrestrial & freshwater science, Hydrology and Bioinformatics

The Scientific Method
A key part of the scientific method is that it should be reproducible – other people doing the same experiments in the same way should get the same results.
Unfortunately observational data is not reproducible (unless you have a time machine!)
The way data is organised and archived is crucial to the reproducibility of science and our ability to test conclusions.
This is often the only part of the process that anyone other than the originating scientist sees. We want to change ntific-method.php

The research data lifecycle
Researchers are used to creating, processing and analysing data.
Data repositories generally have the job of preserving and giving access to data.
Third parties, or even the original researchers, will reuse the data.
Diagram labels: Analysing data, Preserving data, Giving access to data
See http://data-archive.ac.uk/createmanage/life-cycle for more detail

What is a Dataset?
DataCite’s iles/Business Models Principles v1.0.pdf):
Dataset: "Recorded information, regardless of the form or medium on which it may be recorded including writings, films, sound recordings, pictorial reproductions, drawings, designs, or other graphic representations, procedural manuals, forms, diagrams, work flow, charts, equipment descriptions, data files, data processing or computer programs (software), statistical records, and other research data." (from the U.S. National Institutes of Health (NIH) Grants Policy Statement via DataCite's Best Practice Guide for Data Citation).
In my opinion a dataset is something that is:
- The result of a defined process
- Scientifically meaningful
- Well-defined (i.e. clear definition of what is in the dataset and what isn’t)

Creating a dataset is hard work!
"Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com

But sometimes other people don’t get it.
"Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com

Creating data: a radio propagation dataset
The problem: rain and cloud mess up your satellite radio signal. How can we fix this?
Italsat F1: Owned and operated by Italian Space Agency (ASI). Launched January 1991, ended operational life January 2001.

The receive cabin at Sparsholt in Hampshire
Inside the receive cabin – the instruments my data came from

Creating/processing data
One day’s worth of raw data from one of the receivers
My job was to take this...
...turn it into this.

Analysing data
...a process which involved 4 major steps, 4 different computer programmes, and 16 intermediate files for each day of measurements.
Each month of preprocessed data represented somewhere between a couple of days and a week's worth of effort.
It was a job where attention to detail was important, and you really had to know what you were looking at from a scientific perspective.
...with the final result being this.

Preserving data (the wrong way!)
Part of the Italsat data archive – on CDs on a shelf in my office

What the processed dataset looks like on disk
What the raw data files looked like.
(I do have some Word documents somewhere which describe what all this is)

Example documentation
Note the software filenames in the documentation.
I still have the IDL files on disk somewhere, but I’d be very surprised if they’re still compatible with the current version of IDL

Documentation can sometimes produce mixed feelings
"Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com

What it all came down to:
Composite image from Flickr user bnilsen and Matt Stempeck (NOI), shared under Creative Commons license
And I wasn’t even preserving my data properly!

As for giving access to the data
I did share, but there were a lot of non-disclosure agreements (I am not a lawyer!)
And I didn’t feel like I got the credit for it.
(The first publication based on the data wasn’t written by me, and I didn’t even get my name in the acknowledgements.)

Good news: the data is all on the BADC now

Another example: How is my scarf like a dataset?
- The raw material it’s made from doesn’t contain information
- But the act of knitting encodes information into the scarf
- The scarf is the result of a well defined process (knitting) and has a particular method used to create it
- I need to be able to describe it
- I need to be able to find it
- I need to store it properly so it doesn't get lost, or corrupted (i.e. eaten by moths or shredded by mice)
- I might need to recreate it so I need to keep information about it
- I put a lot of time and effort into making it, so I’m very attached to it!

Just like not all scarves are the same, not all datasets are 3251690074/
http://www.flickr.com/photos/maco squed/8084145976/
If in doubt, ask the 282305884/
028/

Metadata
It is generally agreed that we need methods to:
- define and document datasets of importance.
- augment and/or annotate data
- amalgamate, reprocess and reuse data
To do this, we need metadata – data about data
For example:
Longitude and latitude are metadata about the planet.
- They are artificial
- They allow us to communicate about places on a sphere
- They were principally designed by those who needed to navigate the oceans, which are lacking in visible features!
Metadata can often act as a surrogate for the real thing, in this case the planet.
http://www.kcoyle.net/meta purpose.html

Metadata for my scarf
Dataset views and suggested uses
Descriptive: “teal blue”, “scarf”
Dimensions: 200cm long, 20cm wide
Location: “Around my neck”/“Hanging on the door of my wardrobe”
Identifier: KOI (knitted object identifier)
Information needed to recreate it:
- The raw material: King Cole Haze Glitter DK, colourway 124 - Ocean, with dye lot 67233
- Needle size: 4mm
- Algorithm used to create it: 18 stitch feather and fan stitch with 2 stitch garter stitch border at the edges
- Number of stitches cast on: 54
- Tension (how tightly I knit in this pattern): 28 rows and 27 stitches for a 10cm by 10cm square
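The scarf metadata above can also be written down as a structured, machine-readable record. The following is a minimal sketch in Python, not anything taken from the slides: the class name `ScarfMetadata`, its field names and the helper functions are all hypothetical, and only the values come from the slide. The width check assumes the stated tension applies evenly across the whole row.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ScarfMetadata:
    """Hypothetical record: field names are illustrative, values are from the slide."""
    description: List[str]
    length_cm: int
    width_cm: int
    location: str
    yarn: str
    colourway: str
    dye_lot: str
    needle_size_mm: float
    pattern: str
    stitches_cast_on: int
    tension_rows_per_10cm: int
    tension_stitches_per_10cm: int
    # The slide names a "KOI" (knitted object identifier) but gives no value.
    identifier: Optional[str] = None


scarf = ScarfMetadata(
    description=["teal blue", "scarf"],
    length_cm=200,
    width_cm=20,
    location="Hanging on the door of my wardrobe",
    yarn="King Cole Haze Glitter DK",
    colourway="124 - Ocean",
    dye_lot="67233",
    needle_size_mm=4.0,
    pattern="18 stitch feather and fan stitch with 2 stitch garter stitch border at the edges",
    stitches_cast_on=54,
    tension_rows_per_10cm=28,
    tension_stitches_per_10cm=27,
)


def to_json(record: ScarfMetadata) -> str:
    """Serialise the record so it can be stored alongside the object it describes."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def from_json(text: str) -> ScarfMetadata:
    """Rebuild the record from its JSON form."""
    return ScarfMetadata(**json.loads(text))


def expected_width_cm(record: ScarfMetadata) -> float:
    """Width implied by the cast-on count and the stitch tension (assumed uniform)."""
    return record.stitches_cast_on / record.tension_stitches_per_10cm * 10


restored = from_json(to_json(scarf))
print(restored == scarf)
print(expected_width_cm(scarf), "cm expected vs", scarf.width_cm, "cm recorded")
```

One reason to record the "information needed to recreate it" is that it lets you cross-check the descriptive metadata: here the cast-on count and tension imply the same 20cm width that the dimensions state.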

Metadata for Discovery, Documentation, Definition
Lawrence et al 2009, doi:10.1098/rsta.2008.0237

MOLES: Metadata Objects for Linking Environmental Sciences
es/V3.4/MODEL/Diagrams/MOLES3.4Summary.png

What do data centres do?
Data Curation Lifecycle Model
The Digital Curation Centre’s Curation Lifecycle Model provides a graphical, high-level overview of the stages required for successful curation and preservation of data from initial conceptualisation or receipt through the iterative ion-lifecycle-model

Data repository workflows
- Workflows are very varied! No one size fits all method
- Can have multiple workflows in the same data centre, depending on interactions with external sources (“Engaged submitter” / “Data dumper” / “Third party requester”)

Why should I bother putting my data into a repository?
"Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com

It’s ok, I’ll just do regular backups
Phaistos Disk, 1700BC
These documents have been preserved for thousands of years!
But they’ve both been translated many times, with different meanings each time.
Data Preservation is not enough, we need Active Curation to preserve Information


Example Big Data: CMIP5
CMIP5: Fifth Coupled Model Intercomparison Project
- Global community activity under the World Meteorological Organisation (WMO) via the World Climate Research Programme (WCRP)
- Aim:
  – to address outstanding scientific questions that arose as part of the 4th Assessment Report process,
  – improve understanding of climate, and
  – to provide estimates of future climate change that will be useful to those considering its possible consequences.
Take home points here:
Many distinct experiments, with very different characteristics, which influence the configuration of the models (what they can do, and how they should be interpreted).

FAR: 1990, SAR: 1995, TAR: 2001, AR4: 2007, AR5: 2013

CMIP5 numbers!
Simulations:
- 90,000 years
- 60 experiments
- 20 modelling centres (from around the world) using
- 30 major(*) model configurations
- 2 million output “atomic” datasets
- 10's of petabytes of output
- 2 petabytes of CMIP5 requested output
- 1 petabyte of CMIP5 “replicated” output
Which are replicated at a number of sites (including ours)
Of the replicants:
- 220 TB decadal
- 540 TB long term
- 220 TB atmosphere-only
- 80 TB of 3hourly data
- 215 TB of ocean 3d monthly data
- 250 TB for the cloud feedbacks
- 10 TB of land-biochemistry (from the long term experiments alone)
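As a quick piece of arithmetic on the replicated volumes listed above, the sketch below simply adds them up. The dictionary keys are shortened labels of my own, and the conversion of 1 PB = 1000 TB is an assumption (the slide does not say whether decimal or binary units are meant).

```python
# Replicated CMIP5 output by category, in TB, as listed on the slide.
replicated_tb = {
    "decadal": 220,
    "long term": 540,
    "atmosphere-only": 220,
    "3-hourly data": 80,
    "ocean 3D monthly data": 215,
    "cloud feedbacks": 250,
    "land-biochemistry (long term experiments only)": 10,
}

TB_PER_PB = 1000  # assumption: decimal units

total_tb = sum(replicated_tb.values())
total_pb = total_tb / TB_PER_PB

print(f"Sum of listed categories: {total_tb} TB ({total_pb:.3f} PB)")
for name, size in sorted(replicated_tb.items(), key=lambda item: -item[1]):
    print(f"{name:50s} {size:4d} TB  {100 * size / total_tb:5.1f}% of the listed total")
```

The listed categories sum to 1535 TB, which is more than the headline figure of 1 petabyte of replicated output. The slide does not explain the difference; it may be that the categories overlap or that the headline number is rounded, so the percentages should be read as shares of the listed total only.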

Handling the CMIP5 data
Major international collaboration!
Funded by EU FP7 projects (IS-ENES, Metafor) and US (ESG) and other national sources (e.g. NERC for the UK)
http://esgf-index1.ceda.ac.uk/esgf-web-fe/

Summary of the CMIP5 example
The Climate problem needs:
– Major physical e-infrastructure (networks, supercomputers)
– Comprehensive information architectures covering the whole information lifecycle, including annotation (particularly of quality) and hard work populating these information objects, particularly with provenance detail.
– Sophisticated tools to produce and consume the data and information objects
– State of the art access control techniques
Major distributed systems are social challenges as much as technical challenges.
CMIP5 is Big Data, with lots of different participants and lots of different technologies. It also has a community willing to work together to standardise and automate data and metadata production and curation.

http://www.flickr.com/photos/zlatko/5975700417/
Big Data:
- Industrialised and standardised data and metadata production
- Large groups of people involved
- Methods for attribution and credit for data creation established
Long Tail Data:
- Bespoke data and metadata creation methods
- Small groups/lone researchers
- No generally accepted methods for attribution and credit for data creation

Future role of the library
Domain specific repositories can:
- Pick and choose what data to keep
- Ask for (and get) more detailed metadata
- Provide specific tools and services (visualisations, server-side processing, ...)
- Deal with Big Data!
Libraries will need to:
- Pick up and manage/archive the long-tail data where there isn’t a domain repository
- Have generalised, widely applicable systems that can cope with subjects from astronomy to zoology
- Be prepared to cope with anything!

Don’t Panic!
There’s a lot of information out there about managing data.
Some of it won’t suit what you’re trying to do, but some will.
Learn from others’ experiences, good and bad!
Good luck!

Summary and maybe conclusions?
- Data is important, and becoming more so for a far wider range of the population
- Conclusions and knowledge are only as good as the data they’re based on
- Science is supposed to be reproducible and verifiable
- It’s up to us as scientists to care for the data we’ve got and ensure that the story of what we did to the data is transparent
  - So we can use the data again
  - And so people will trust our results
- It’s not an easy job – but someone’s got to do it!

Thanks!
Any questions?
sarah.callaghan@stfc.ac.uk
@sorcha ni
http://citingbytes.blogspot.co.uk/
Image credit: Borepatch you-dont-know-that-hurts.html
