Data Science Tools & Methods Workshop

  • July 11–12, 2018
  • Big Ten Conference Center, Rosemont, IL

The National Data Service and the Midwest Big Data Hub are sponsoring the Data Science Tools & Methods Workshop, July 11–12, 2018, at the Big Ten Conference Center (rooms 203 and 205) in Rosemont, IL.

  • Identify cross-disciplinary data science tools and methods, as well as the application of data management and analytics tools, to address research challenges.
  • Expand knowledge and access to data tools, methods, and services.
  • Seed/foster cross-disciplinary collaborations around cyberinfrastructure R+D and domain applications.
  • Demos will be focused on the use of tools.
8:00 am
9:00 am
Opening Session: Welcome and Context — Melissa Cragin and Christine Kirkpatrick
9:15 am
Panel: Tools and Platforms for Data Science
Moderator: Rajeev Bukralia, Minnesota State University, Mankato

Panel members:
10:30 am
11:00 am
Panel: Graph Analytics
Moderator: Christine Kirkpatrick, National Data Service

Panel members:
Lunch/Birds of a Feather
1:00 pm
Lightning Talks; Pilot and Synergistic Project Updates; Demos

Panel members: Demos/Projects:
  • Data Tools, Methods and Services in Action
  • Lightning Round of Projects Employing Tools and Methods
  • Brief Q&A
2:00 pm
HPC, Machine Learning, and Cyberinfrastructure: Challenges at the Intersection (Machine Learning, Deep Learning)
Aaron Saxton, NCSA/University of Illinois at Urbana-Champaign
3:00 pm
3:15 pm
Panel: Connecting with the Libraries: Engaging across the Data Ecosystem
Moderator: James Myers, University of Michigan

Panel members:
4:30 pm
Report-outs and themes that emerged during the day
5:00 pm
8:00 am
Breakfast / Registration / Networking
9:00 am
Welcome back / Check-in and questions from the previous day — Melissa Cragin
9:15 am
Panel: Enabling Discovery through Data Stewardship
Moderator: Christine Kirkpatrick, National Data Service

Panel members:
10:30 am
10:45 am
Panel: Looking Down (Earth Science Image Data)
Kenton McHenry, NCSA/University of Illinois at Urbana-Champaign

Panel members:
Next Step Planning (summing up of assignments if applicable, such as forming interest or working groups, collaboration forming) — moderator

Closing / MBDH Events Preview — Melissa Cragin
12:30 pm
Adjourn (box lunches available)

Registration has ended for this workshop.

Big Ten Conference Center

The workshop will be held at the:

Big Ten Conference Center
5440 Park Place, Rosemont, IL 60018
Rooms 203 and 205

Parking can be found in the MB Financial Park Garage. Validation is available at the security desk to reduce the one-day price of parking to $7.50. Please have your parking ticket with you in order to validate at the security desk.

The recommended airport is Chicago O'Hare International (ORD).

The recommended hotels for the workshop are:

Crowne Plaza Chicago O'Hare
5440 North River Road
Rosemont, IL 60018

Aloft Chicago O'Hare
9700 Balmoral Avenue
Rosemont, IL 60018

Shuttle buses are available every half hour to and from O'Hare airport.

Sam Batzli, University of Wisconsin-Madison

After a stint developing databases and maps for the Army Corps of Engineers, Sam Batzli began working with GIS and satellite imagery. While at Michigan State University in the early 2000s, he began to explore web-mapping, and his interest in data visualization led him to the Space Science and Engineering Center at the University of Wisconsin-Madison, where he currently manages RealEarth, a web mapping platform and set of mobile apps for the visualization of near real-time weather satellite imagery and related data. Sam holds a PhD in geography from the University of Illinois at Urbana-Champaign.

Brad Bebee, Amazon Web Services

Brad Bebee is the Principal Product Manager for Amazon Neptune, where he leads product management for AWS's newest and fully managed graph database service and works closely with customers and developers to help them build graph-enabled solutions. Prior to joining AWS, he was the CEO of Blazegraph, where he focused on leveraging products for high performance graph databases and analytics into business and mission areas and was an active open source contributor on the Blazegraph platform. He is a subject matter expert in graph and knowledge representation with experience ranging from the precursors of DARPA's DAML program to more recent work with large-scale data analytics using the Hadoop ecosystem, Accumulo, and related technologies. He has extensive experience in architecture and software modeling methodologies, where he has led and collaborated upon multiple publications receiving recognition for his research. In 2006, he was selected as a participant in the National Academy of Engineering's U.S. Frontiers of Engineering Symposium. Over the course of his career, Brad has served as a CEO, CTO, CFO, managed operating divisions, and performed advanced technology development for commercial and public-sector customers. He holds a B.S. in Computer Science from the University of Maryland at College Park.

Ben Blaiszik, University of Chicago

Ben Blaiszik is a co-lead for the Materials Data Facility (MDF) effort, a NIST Materials Genome Initiative project, to build and deploy data services to make materials science data more easily discoverable, combinable, and re-usable regardless of data size and location. MDF has published >20 TB of curated data, and has made available via its search services ca. 250 TB of materials data. Ben co-leads the Argonne Data and Learning Hub for Science (DLHub) project that is working to collect and serve machine learning models from the scientific community and simplify the training of new models using leadership scale computing resources. He received his Ph.D. in Theoretical and Applied Mechanics, working in the field of polymer nanocomposites, from the University of Illinois at Urbana-Champaign.

Rajeev Bukralia, Minnesota State University, Mankato

Dr. Rajeev Bukralia is an assistant professor in the Computer Information Science department at Minnesota State University, Mankato. He previously served as the director of data science outreach and associate provost for Information Services/CIO at UW-Green Bay, and Dean of Educational Outreach and Libraries at Black Hills State University. Rajeev's research focuses on data analytics, big data, machine learning, data governance, and IT strategy; he has a particular interest in machine learning and AI applications in cancer research and diagnostics. He earned his doctoral and master's degrees in information systems from Dakota State University.

Melissa Cragin, Midwest Big Data Hub

Melissa Cragin is Executive Director of the Midwest Big Data Hub, and based at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign (UI). Prior to joining NCSA, Melissa was Staff Associate in the Office of the Assistant Director, Directorate of Biological Sciences at the National Science Foundation (NSF), following two years there as an AAAS Science & Technology Policy Fellow. At NSF, she guided development of data policy, and accelerated community engagement on research data management and public access. Melissa also has an affiliate appointment as Assistant Professor in the School of Information Sciences at UI, where she previously led the Data Curation Education Program.

Gustavo Durand, Harvard University

Gustavo Durand is the Technical Lead and Architect of Dataverse, an open source data repository platform developed by IQSS at Harvard University. Gustavo has worked on Dataverse in many roles since its inception and, in his current role, leads the architecture and technical design of the platform, reviews code from internal and external developers, assists the Dataverse Community contributors, and works closely with the Development Project Manager.

Zachary Flamig, University of Chicago

Zachary Flamig is a postdoctoral scholar at the University of Chicago Center for Data Intensive Science. Zac has a PhD in Meteorology with a background in flash flooding and remote sensing. Zac is interested in using software built for biomedical data commons to build environmental data commons that can be used to help make better predictions on the impact of extreme events. He works with the PlanX development team on building the Gen3 Data Commons Platform software for use with the Open Commons Consortium (OCC) Environmental Data Commons. Zac also leads the scientific effort for the OCC's participation in the NOAA Big Data Project.

Ian Foster, University of Chicago

Ian Foster, Senior Fellow, is Director of the Computation Institute, a joint institute of the University of Chicago and Argonne National Laboratory. He is also an Argonne Senior Scientist and Distinguished Fellow and the Arthur Holly Compton Distinguished Service Professor of Computer Science. In his research, he seeks to develop tools and techniques that allow people to use high-performance computing technologies to do qualitatively new things. This involves investigations of parallel and distributed languages, algorithms, and communication; and also focused work on applications. He is particularly interested in using high-performance networking to incorporate remote computer and information resources into local computational environments.

Eddie Fuller, West Virginia University

Edgar Fuller has served as Professor and Chair of Mathematics at West Virginia University since 2008. He joined West Virginia University in 2002 after a postdoctoral appointment at Duke University and was promoted to professor in 2014. He currently works on applications of graph theory, topology and geometry to the study of complex networks, especially those arising as social networks. More recently he has been applying these same tools to the study of networks of neurons in mammalian sensory networks. He currently is the Principal Investigator for a National Science Foundation award to study the role of anxiety and personality in the performance of students in mathematics courses and has served as PI for two US Department of Education funded Math and Science Partnerships in the state of West Virginia. He has most recently served as an American Association for the Advancement of Science (AAAS) Science and Technology Policy Fellow in the Department of Homeland Security from 2016 to 2018. He received his Ph.D. from the University of Georgia in the area of differential geometry and topology.

Sandra Gesing, University of Notre Dame

Sandra Gesing is a research assistant professor at the Department of Computer Science and Engineering and a computational scientist at the Center for Research Computing at the University of Notre Dame. She is a Co-PI of the IMLS-funded project "Tools and Services to Improve Preservation and Re-use of Research Data & Software" with colleagues in the Hesburgh Libraries at Notre Dame. She is also heavily involved with the Science Gateways Community Institute and a Co-PI for the conceptualization of a US Research Software Sustainability Institute. Prior to the position at Notre Dame, she was a research associate in the Data-Intensive Research Group at the University of Edinburgh, UK, in the area of data-intensive workflows and in the Applied Bioinformatics Group at the University of Tübingen, Germany, in the area of science gateways and grid computing. She received her PhD in computer science from the University of Tübingen, Germany.

Lisa Johnston, University of Minnesota

Lisa R. Johnston is the Director of the Data Repository for the University of Minnesota. Since 2016, Johnston has served as principal investigator of the Data Curation Network, a collaboration between 8 academic institutions and the Dryad Data Repository to share expert staff to curate multidisciplinary datasets ingested across partner repositories. Johnston's most recent publications include the two-volume book Curating Research Data: Vol 1 Practical Strategies for Your Digital Repository and Vol 2: A Handbook of Current Practice (ed. Johnston, 2017) and Data Information Literacy: Librarians, Data, and the Education of a New Generation of Researchers (ed. Carlson and Johnston, 2015). She has a Master of Library Science and a Bachelor of Science in Astrophysics, both from Indiana University.

Christine Kirkpatrick, National Data Service

Christine Kirkpatrick is the division director for IT Systems & Services at the San Diego Supercomputer Center (SDSC) at the University of California, San Diego, and the Executive Director of the NCSA-based National Data Service (NDS), a U.S. initiative focused on how scientists and researchers across all disciplines can find, reuse, and publish data. Her research interests include research data frameworks, as well as services and interoperability challenges of creating a seamless "datanet" ecosystem. Christine has been with the University of California for over 20 years; prior to this she worked in industry for several software companies. She holds a Master's from UC San Diego in Architecture-based Enterprise Systems Engineering.

Praveen Kumar, University of Illinois at Urbana-Champaign

Praveen Kumar holds a B.Tech. (Indian Institute of Technology, Bombay, India 1987), M.S. (Iowa State University 1989), and Ph.D. (University of Minnesota 1993), all in civil engineering, and has been on the University of Illinois faculty since 1995. He is also an Affiliate Faculty in the Department of Atmospheric Science. His research focus is on complex hydrologic systems bridging across theory, modeling, and informatics. He presently serves as the Director of the NSF-funded Critical Zone Observatory for Intensively Managed Landscapes, which is part of a national and international network. He has been an Associate of the Center for Advanced Studies, and a two-time Fellow of the National Center for Supercomputing Applications. He is an AGU Fellow and a recipient of the Mahatma Gandhi Pravasi Samman (Non-Resident Honor) Award 2017 given by the NRI Welfare Society of India. He has also received the Xerox Award for Research, and Engineering Council Award for Excellence in Advising. From 2002 to 2008, he served as a founding Board member for CUAHSI, a consortium of over 110 universities for the advancement of hydrologic science. From 2009 to 2013 he served as the Editor-in-Chief of Water Resources Research, the leading journal in the field with about 500 published articles per year. Prior to that he also served as the Editor of Geophysical Research Letters, a leading journal for interdisciplinary research.

Kenton McHenry, NCSA/University of Illinois at Urbana-Champaign

Kenton McHenry is a Senior Research Scientist at the National Center for Supercomputing Applications (NCSA) at the University of Illinois, where he serves as Deputy Director for the Scientific Software and Applications division and co-leads the Innovative Software and Data Analysis group (ISDA); he also holds an Adjunct Assistant Professor position in the Department of Computer Science. Kenton has applied his experience in computer vision, AI, and machine learning towards research and development in software cyberinfrastructure for digital preservation, auto-curation, and the providing of access to contents in large unstructured digital collections (e.g. image collections). Kenton serves as PI/Co-PI on a number of awards from a variety of agencies/organizations, including NSF, NIH, NEH, and private sector partners. Kenton currently serves as the Project Director and PI of NSF CIF21 DIBBs - Brown Dog, where his team works on means of making data agnostic to the file formats in which they are stored and providing general purpose, easy-to-use tools to access uncurated collections by the automatic extraction of metadata and signatures from raw file contents. He received a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign.

Leslie McIntosh, Research Data Alliance/U.S.

Dr. Leslie McIntosh is the inaugural Executive Director for the Research Data Alliance – US (RDA-US), where she works to support global data sharing and interoperability by bolstering the work within the US and the North American region. Leveraging the knowledge of RDA, her research and work have included incorporating reproducible research into the research lifecycle through: (1) Developing the Reproducibility Assessment Tool (RepeAT), identifying elements needed for scientific studies to be reproducible; (2) Incorporating data citation into an evolving Electronic Health Record (EHR) data repository; and (3) Providing methods of data discoverability using freely available tools (e.g., R, Shiny) to visualize information. Dr. McIntosh served as the Director of the CBMI at Washington University School of Medicine, and the Siteman Cancer Center Biomedical Informatics Core. She jointly founded the St. Louis Machine Learning and Data Science MeetUp (now with over 1500 members), and has developed and taught Introductory and Advanced courses in Health Data and Information Management. She holds a Master's in Public Health and a PhD in epidemiology.

Mark Musen, Stanford University

Dr. Musen is Professor of Biomedical Informatics and of Biomedical Data Science at Stanford University, where he is Director of the Stanford Center for Biomedical Informatics Research. Dr. Musen conducts research related to open science, metadata for enhanced annotation of scientific data sets, intelligent systems, reusable ontologies, and biomedical decision support. His group developed Protégé, the world's most widely used technology for building and managing terminologies and ontologies. He is principal investigator of the National Center for Biomedical Ontology, one of the original National Centers for Biomedical Computing created by the U.S. National Institutes of Health (NIH). He is principal investigator of the Center for Expanded Data Annotation and Retrieval (CEDAR). CEDAR is a center of excellence supported by the NIH Big Data to Knowledge Initiative, with the goal of developing new technology to ease the authoring and management of biomedical experimental metadata. Dr. Musen directs the World Health Organization Collaborating Center for Classification, Terminology, and Standards at Stanford University, which has developed much of the information infrastructure for the authoring and management of the 11th edition of the International Classification of Diseases (ICD-11).

James D. Myers, University of Michigan

Dr. Jim Myers has more than two decades of experience in the development and deployment of Cyberinfrastructure for research, education, and industrial application and has participated in the planning and execution of multiple large community cyberinfrastructure projects for NSF, ONR, and DOE. Dr. Myers was a co-PI on the SEAD DataNet project and a contributor to the W3C Provenance standard and is an active contributor to the Clowder and Dataverse open source communities through multiple projects. He received a B.A. in Physics from Cornell University, and his Ph.D. in Chemistry from the University of California at Berkeley.

Bin Peng, University of Illinois at Urbana-Champaign

Dr. Bin Peng is an NCSA postdoctoral fellow at the University of Illinois at Urbana-Champaign (UI) with expertise in earth system modeling, hydrological modeling, crop modeling, remote sensing, and data assimilation. He is currently working on the Blue Waters supercomputer to develop the most advanced crop model in the framework of the Community Earth System Model (CESM). He is also interested in the application of satellite big data in agriculture, hydrology, and climate sciences. Before joining NCSA at UI, Bin worked on hydrological modeling and satellite data assimilation. He received his Ph.D. from the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences in Beijing, China.

Jonathan Petters, Virginia Tech

As Data Management Consultant and Curation Services Coordinator, Jonathan Petters provides research data management planning, training, and curation support to researchers across Virginia Polytechnic Institute and State University through the University Libraries. This position includes a supervisory role in development of the Libraries' research data repository and geospatial metadata catalog. He currently participates in Research Data Alliance groups investigating data fitness and actionable data management plans. Previously Jon served in a similar data management role at Johns Hopkins University with the university's Data Management Services group. Petters was also an American Association for the Advancement of Science (AAAS) Science and Technology Policy Fellow in the U.S. Department of Energy's Office of Science, where he investigated data management policies and needs within the physical sciences.

Eunice Santos, Illinois Institute of Technology

Eunice E. Santos is the Ron Hochsprung Endowed Chair and Professor at the Illinois Institute of Technology, and Department Chair of Computer Science. Her research interests include computational social systems and social computing, parallel and distributed processing, and cybersecurity. She is a recipient of the IEEE Computer Society Technical Achievement Award (for pioneering work in Computational Social Network Systems), and many other awards. She was the Founding co-Editor-in-Chief of the IEEE Transactions on Computational Social Systems. Eunice is a Fellow of AAAS. She received her PhD in Computer Science from the University of California, Berkeley.

Aaron Saxton, NCSA/University of Illinois at Urbana-Champaign

Aaron Saxton is a Data Scientist who works in the Blue Waters project office at the National Center for Supercomputing Applications (NCSA). His current interest is in machine learning, data, and migrating popular data/ML techniques to HPC environments, and his work includes both industry and academic ventures. Most recently he was a data scientist and founding member of the agricultural data company Agrible Inc. Prior to that, Aaron worked at Neustar Inc, University of Kentucky, and SAIC. Aaron graduated from the University of Kentucky with a PhD in Mathematics.

Juliane Schneider, Harvard Catalyst

In a 20-year career specializing in metadata, ontologies and discovery, Juliane Schneider has worked in start-ups, an insurance library on Wall Street, the NYU medical center, EBSCO publishing, and at UC San Diego in the Research Data Curation Program. She spent six years at Countway Library as the Metadata Librarian, and has now returned to Harvard as the Lead Data Curator for Harvard Catalyst. Juliane is involved with Metadata 2020, and is a certified Software Carpentry instructor.

Shelley Stall, American Geophysical Union

Shelley Stall is the Director of Data Programs at the American Geophysical Union. She is currently leading an international coalition to enable FAIR data in the earth, space, and environmental sciences funded by the Laura and John Arnold Foundation. Leading scholarly publishers, scientific repositories, infrastructure, and communities are working together to move data out of the supplementary information and into a trusted, community-accepted repository where it is well-documented, discoverable, and cite-able.

Seth Juarez, Microsoft Research

Seth Juarez is a Cloud Developer Advocate at Microsoft in Redmond, WA, focusing on Artificial Intelligence, Machine Learning, and Quantum Computing. He is interested in Artificial Intelligence (specifically in the realm of Machine Learning) and Quantum Computing. Seth has a master's degree in Computer Science from the University of Utah.

James Wilgenbusch, University of Minnesota, Twin Cities

James (Jim) Wilgenbusch is the Senior Associate Director of the Minnesota Supercomputing Institute at the University of Minnesota, Twin Cities. Jim helps define MSI High Performance Computing (HPC) and cyberinfrastructure research agendas and oversees the daily operations of the institute. Prior to MSI, Jim was a Senior Research Associate in the Department of Scientific Computing and the founding director of Florida State University's Research Computing Center. While at FSU, Jim co-founded the Sunshine State Education and Research Computing Alliance (SSERCA) to bring together Florida's geographically distributed academic organizations and high-end compute and data storage resources in order to better support statewide research and create regional synergies. In addition to his management responsibilities, Jim maintains funded research activities in the study and implementation of models and search algorithms used in phylogenetic inference. He has a PhD in Biology from George Mason University in Fairfax, VA.

Craig Willis, NCSA/University of Illinois at Urbana-Champaign

Craig Willis is the Technical Coordinator for the National Data Service and Senior Research Programmer at the National Center for Supercomputing Applications (NCSA). He is involved in the development of the NDS Labs Workbench and Whole Tale platforms.

Ilya Zaslavsky, University of California San Diego

Ilya Zaslavsky is Director of the Spatial Information Systems Lab at the San Diego Supercomputer Center, University of California San Diego. His work focuses on distributed information management systems, spatial and temporal data integration, and online systems for data discovery and analysis. He has been leading design and technical development in several large cyberinfrastructure projects supported by the U.S. National Science Foundation, including EarthCube CINERGI/Data Discovery Hub and CUAHSI Hydrologic Information System. Ilya received his Ph.D. from the University of Washington, and earlier a Ph.D. equivalent from the Russian Academy of Sciences.

Guangyu Zhao, University of Illinois at Urbana-Champaign

Guangyu Zhao is a senior research scientist in the department of atmospheric sciences at the University of Illinois at Urbana-Champaign, and is actively involved in the research of satellite remote sensing, cloud climatology, and clouds' role in the Earth's climate system. He has extensive experience in working with spacecraft remote sensing data, and since 2000, has been collaborating with the team members of the NASA MISR mission in developing, implementing and validating the cloud detection and classification algorithms used for standard processing. He was an invited member of the user working group for the NASA Atmospheric Sciences Data Center between 2012 and 2017. He is currently the technical lead of the NASA ACCESS project: ACCESS to Terra Data Fusion Products. Guangyu completed his M.S. and PhD at Illinois.

  • Hassan Abdallah
  • Saumya Agrawal
  • Brian Balderston
  • Sam Batzli
  • Brad Bebee
  • Gavin Biffar
  • Ben Blaiszik
  • Monica Boyer
  • Kevin Brandt
  • Nate Britton
  • Rajeev Bukralia
  • Jessie Chin
  • Chieh-Li Chin
  • Kevin Coakley
  • Melissa Cragin
  • Gustavo Durand
  • Christopher Fasano
  • Zac Flamig
  • Ian Foster
  • Jonathon Gaff
  • Ben Galewsky
  • Sandra Gesing
  • Peter Groves
  • Ann Impens
  • Nihali Jain
  • Mohammad Yusri Jamaluddin
  • Lisa Johnston
  • Seth Juarez
  • Anthony Juehne
  • Christine Kirkpatrick
  • Praveen Kumar
  • Nanzhu Liu
  • Teresa Manuel
  • Al McAuley
  • Garrett McComas
  • Kenton McHenry
  • Leslie McIntosh
  • Mark Musen
  • James Myers
  • Blair Parker
  • Bin Peng
  • Trevor Petersen
  • Jonathan Petters
  • Chris Reichwein
  • Eunice Santos
  • Aaron Saxton
  • Juliane Schneider
  • Resham Sharma
  • Shelley Stall
  • Christopher Szul
  • Kurt Tuohy
  • Xiaojin (Eric) Wang
  • Alainna White
  • James Wilgenbusch
  • Craig Willis
  • Khalil Yazdi
  • Ilya Zaslavsky
  • Qian Zhang
  • Guang Zhao