XSEDE: Python Tools for Data Science



Python has become a very popular programming language and software ecosystem for work in Data Science, integrating support for data access, data processing, modeling, machine learning, and visualization. In this webinar, we will describe some of the key Python packages that have been developed to support that work, and highlight some of their capabilities. This webinar will also serve as an introduction and overview of topics addressed in two Cornell Virtual Workshop tutorials, available at https://cvw.cac.cornell.edu/pydatasci1 and https://cvw.cac.cornell.edu/pydatasci2.

See https://portal.xsede.org/course-calendar/-/training-user/class/2467/session/4161 for more information and registration


Register via the XSEDE Portal:

If you do not currently have an XSEDE Portal account, you will need to create one:


Should you have any problems with that process, please contact help@xsede.org and they will provide assistance.


XSEDE HPC Summer Boot Camp



XSEDE, along with the Pittsburgh Supercomputing Center, is pleased to present a Hybrid Computing workshop.

This 4-day event will cover MPI, OpenMP, and GPU programming using OpenACC and accelerators.

This workshop will be remote-to-desktop only due to the COVID-19 pandemic. Once registration has filled, no more students will be added due to our current capacity limits.

The schedule can be found here:  https://www.psc.edu/resources/training/xsede-hpc-workshop-june-8-11-2021-summer-boot-camp/


Register via the XSEDE Portal:


If you do not currently have an XSEDE Portal account, you will need to create one:


Should you have any problems with that process, please contact help@xsede.org and they will provide assistance.


Please address any questions to Tom Maiden at tmaiden@psc.edu.

GIS Fundamentals – Spatial Database, PostGIS


PostGIS, built on top of PostgreSQL, is the most powerful open-source relational database for managing spatial data. In this workshop we will cover the basic concepts of spatial databases, learn about setting up PostGIS, and understand how PostGIS can help us efficiently manage large volumes of vector data spread over multiple tables and geometries. We will also touch upon topics such as spatial indexing and the capabilities of PostGIS for other 2-D GIS data models, such as the network and raster data models.
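As a rough illustration of why spatial indexing matters, here is a sketch of a uniform grid index in Python. A bounding-box query visits only the grid cells that overlap the box instead of scanning every point; PostGIS uses the more sophisticated R-tree-based GiST index, but the pruning idea is the same. The points and cell size below are made up:

```python
# Conceptual sketch of a spatial index: bucket points into grid cells so a
# bounding-box query only inspects the cells the box overlaps.
from collections import defaultdict

CELL = 10.0  # grid cell size, in the same units as the coordinates

def build_index(points):
    """Bucket each (x, y) point by the grid cell that contains it."""
    index = defaultdict(list)
    for x, y in points:
        index[(int(x // CELL), int(y // CELL))].append((x, y))
    return index

def query_bbox(index, xmin, ymin, xmax, ymax):
    """Return points inside the box, visiting only the overlapping cells."""
    hits = []
    for cx in range(int(xmin // CELL), int(xmax // CELL) + 1):
        for cy in range(int(ymin // CELL), int(ymax // CELL) + 1):
            for x, y in index.get((cx, cy), []):
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append((x, y))
    return hits

points = [(1, 1), (12, 5), (25, 30), (11, 9), (99, 99)]
idx = build_index(points)
print(query_bbox(idx, 10, 0, 20, 10))  # -> [(12, 5), (11, 9)]
```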

GIS Fundamentals – V (Spatial Database – PostGIS)


This is the fifth workshop in a series of workshops we are offering this semester on the fundamentals of GIS. Each workshop covers one or two key elements of GIS and is somewhat self-contained. The focus is on conceptual details that can provide sufficient preparation for applications, but we will also touch upon the technical aspects.

In this workshop we will cover the basic concepts of spatial databases and learn about setting up and using PostGIS, an open source spatial database built on top of PostgreSQL, along with R for vector data analysis. We will also touch upon topics such as spatial indexing, query processing, and the capabilities of PostGIS for other data models such as the network and raster data models. This is a hands-on workshop and the instructor will use a Mac machine. If you intend to use a Windows or Linux machine please get in touch with the instructor before the workshop at manishve@umich.edu.

CoreLogic property data

The University of Michigan library system has licensed a large data set containing real estate transactions, deeds, and property tax records for the United States.  The data were collected by the commercial vendor CoreLogic, and our license allows UM researchers to use the data for research purposes.  These data are of potential interest to researchers in many fields, as they capture spatial and temporal real estate market conditions, taxing practices, and the physical states of millions of residential structures in the US.
In this workshop, participants will learn to create geographical subsets of the data, integrate them seamlessly into their workflows, and see examples of research questions where the data can be useful. Participants should know Python and R.
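A geographical subset of transaction-style records can be sketched in a few lines of Python. The field names (`fips_code`, `sale_amount`) and the rows below are purely illustrative, not the actual CoreLogic schema; the idea is simply to filter records by the state prefix of their county FIPS code:

```python
# Hypothetical sketch of geographic subsetting by county FIPS code.
# Field names and data are illustrative, not the real CoreLogic schema.
import csv
import io

raw = io.StringIO(
    "fips_code,sale_amount,year\n"
    "26161,250000,2019\n"   # Washtenaw County, MI
    "26163,180000,2020\n"   # Wayne County, MI
    "06037,900000,2019\n"   # Los Angeles County, CA
)

# Keep only records for Michigan counties (state FIPS prefix "26")
michigan = [row for row in csv.DictReader(raw)
            if row["fips_code"].startswith("26")]

print(len(michigan))                        # -> 2
print([r["fips_code"] for r in michigan])   # -> ['26161', '26163']
```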

Balzano wins NSF CAREER award for research on machine learning and big data involving physical, biological and social phenomena

By | General Interest, Happenings, News, Research

Prof. Laura Balzano received an NSF CAREER award to support research that aims to improve the use of machine learning in big data problems involving elaborate physical, biological, and social phenomena. The project, called “Robust, Interpretable, and Efficient Unsupervised Learning with K-set Clustering,” is expected to have broad applicability in data science.

Modern machine learning techniques aim to design models and algorithms that allow computers to learn efficiently from vast amounts of previously unexplored data, says Balzano. Typically the data is broken down in one of two ways. Dimensionality reduction uses an algorithm to distill high-dimensional data down to the low-dimensional structure that is most relevant to the problem being solved. Clustering, on the other hand, attempts to group pieces of data into meaningful clusters of information.
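The clustering approach can be sketched with a tiny k-means implementation (Lloyd's algorithm) on made-up one-dimensional data; real analyses would use optimized library implementations, and Balzano's project concerns far more general settings than this toy:

```python
# Toy k-means clustering on 1-D data: alternate between assigning each
# point to its nearest center and moving each center to its group's mean.

def kmeans_1d(data, centers, iterations=20):
    """Lloyd's algorithm for k-means on a list of numbers."""
    for _ in range(iterations):
        groups = {c: [] for c in centers}
        for x in data:
            nearest = min(centers, key=lambda c: abs(x - c))
            groups[nearest].append(x)
        # Recenter each group at its mean, dropping empty groups
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

data = [1.0, 2.0, 3.0, 9.0, 10.0, 11.0]   # two obvious clusters
print(kmeans_1d(data, centers=[0.0, 5.0]))  # -> [2.0, 10.0]
```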

However, explains Balzano, “as increasingly higher-dimensional data are collected about progressively more elaborate physical, biological, and social phenomena, algorithms that aim at both dimensionality reduction and clustering are often highly applicable, yet hard to find.”

Balzano plans to develop techniques that combine the two key approaches used in machine learning to decipher data, while being applicable to data that is considered “messy.” Messy data has missing elements, may be somewhat corrupted, or is filled with heterogeneous information – in other words, it describes most data sets in today’s world.

Balzano is an affiliated faculty member of both the Michigan Institute for Data Science (MIDAS) and the Michigan Institute for Computational Discovery and Engineering (MICDE). She is part of a MIDAS-supported research team working on single-cell genomic data analysis.

Read more about the NSF CAREER award…

U-M partners with Cavium on Big Data computing platform

By | Feature, General Interest, Happenings, HPC, News

A new partnership between the University of Michigan and Cavium Inc., a San Jose-based provider of semiconductor products, will create a powerful new Big Data computing cluster available to all U-M researchers.

The $3.5 million ThunderX computing cluster will enable U-M researchers to, for example, process massive amounts of data generated by remote sensors in distributed manufacturing environments, or by test fleets of automated and connected vehicles.

The cluster will run the Hortonworks Data Platform, providing Spark, Hadoop MapReduce, and other tools for large-scale data processing.
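The MapReduce model those tools implement can be sketched in a few lines of plain Python. A map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group; Hadoop and Spark distribute these same steps across a cluster. The documents below are made up:

```python
# Word count in the MapReduce style, single-machine sketch.
from collections import defaultdict

documents = ["big data big cluster", "data platform"]

# Map: emit (word, 1) for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)  # -> {'big': 2, 'data': 2, 'cluster': 1, 'platform': 1}
```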

“U-M scientists are conducting groundbreaking research in Big Data already, in areas like connected and automated transportation, learning analytics, precision medicine and social science. This partnership with Cavium will accelerate the pace of data-driven research and open up new avenues of inquiry,” said Eric Michielssen, U-M associate vice president for advanced research computing and the Louise Ganiard Johnson Professor of Engineering in the Department of Electrical Engineering and Computer Science.

“I know from experience that U-M researchers are capable of amazing discoveries. Cavium is honored to help break new ground in Big Data research at one of the top universities in the world,” said Cavium founder and CEO Syed Ali, who received a master of science in electrical engineering from U-M in 1981.

Cavium Inc. is a leading provider of semiconductor products that enable secure and intelligent processing for enterprise, data center, wired and wireless networking. The new U-M system will use dual socket servers powered by Cavium’s ThunderX ARMv8-A workload optimized processors.

The ThunderX product family is Cavium’s 64-bit ARMv8-A server processor for next generation Data Center and Cloud applications, and features high performance custom cores, single and dual socket configurations, high memory bandwidth and large memory capacity.

Alec Gallimore, the Robert J. Vlasic Dean of Engineering at U-M, said the Cavium partnership represents a milestone in the development of the College of Engineering and the university.

“It is clear that the ability to rapidly gain insights into vast amounts of data is key to the next wave of engineering and science breakthroughs. Without a doubt, the Cavium platform will allow our faculty and researchers to harness the power of Big Data, both in the classroom and in their research,” said Gallimore, who is also the Richard F. and Eleanor A. Towner Professor, an Arthur F. Thurnau Professor, and a professor both of aerospace engineering and of applied physics.

Along with applications in fields like manufacturing and transportation, the platform will enable researchers in the social, health and information sciences to more easily mine large, structured and unstructured datasets. This will eventually allow, for example, researchers to discover correlations between health outcomes and disease outbreaks with information derived from socioeconomic, geospatial and environmental data streams.

U-M and Cavium chose to run the cluster on Hortonworks Data Platform, which is based on open source Apache Hadoop. The ThunderX cluster will deliver high-performance computing services for Hadoop analytics and, ultimately, a total of three petabytes of storage space.

“Hortonworks is excited to be a part of forward-leading research at the University of Michigan exploring low-powered, high-performance computing,” said Nadeem Asghar, vice president and global head of technical alliances at Hortonworks. “We see this as a great opportunity to further expand the platform and segment enablement for Hortonworks and the ARM community.”

Workshop co-chaired by MIDAS co-director Prof. Hero releases proceedings on inference in big data

By | Al Hero, Educational, General Interest, Research

The National Academies Committee on Applied and Theoretical Statistics has released proceedings from its June 2016 workshop titled “Refining the Concept of Scientific Inference When Working with Big Data,” co-chaired by Alfred Hero, MIDAS co-director and the John H. Holland Distinguished University Professor of Electrical Engineering and Computer Science.

The report can be downloaded from the National Academies website.

The workshop explored four key issues in scientific inference:

  • Inference about causal discoveries driven by large observational data
  • Inference about discoveries from data on large networks
  • Inference about discoveries based on integration of diverse datasets
  • Inference when regularization is used to simplify fitting of high-dimensional models
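The last item, regularization, can be seen in a one-variable sketch: ridge regression adds a penalty to the least-squares objective, which shrinks the fitted coefficient toward zero and stabilizes high-dimensional fits. The `ridge_slope` helper and data below are purely illustrative:

```python
# One-variable ridge regression, no intercept: minimizing
# sum((y - w*x)^2) + lam * w^2 gives the closed form below.

def ridge_slope(x, y, lam):
    """Closed-form ridge solution for y ~ w * x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]                  # exact relationship y = 2x
print(ridge_slope(x, y, lam=0.0))    # -> 2.0 (ordinary least squares)
print(ridge_slope(x, y, lam=14.0))   # -> 1.0 (penalty shrinks the estimate)
```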

The workshop brought together statisticians, data scientists and domain researchers from different biomedical disciplines in order to identify new methodological developments that hold significant promise, and to highlight potential research areas for the future. It was partially funded by the National Institutes of Health Big Data to Knowledge Program, and the National Science Foundation Division of Mathematical Sciences.

Big Data: Improving the Scope, Quality and Accessibility of Financial Data


The Office of Financial Research and the University of Michigan will host a joint conference, “Big Data: Improving the Scope, Quality, and Accessibility of Financial Data” in Ann Arbor, Michigan.  The conference will bring together a wide range of scholars, regulators, policymakers, and practitioners to explore how Big Data can be used to enhance financial stability and address other challenges in financial markets.