Friday, September 30, 2016

asyncpg

"A database interface library designed specifically for PostgreSQL and Python/asyncio. asyncpg is an efficient, clean implementation of PostgreSQL server binary protocol for use with Python's asyncio framework."

https://github.com/MagicStack/asyncpg

https://magic.io/blog/asyncpg-1m-rows-from-postgres-to-python/

https://magicstack.github.io/asyncpg/current/
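
A minimal sketch of the async connect/query flow described above (the connection parameters, table, and column names are placeholders, not taken from the project):

import asyncio
import asyncpg

async def main():
    # Placeholder connection parameters; adjust for your server.
    conn = await asyncpg.connect(user='postgres', database='testdb')
    # Parameterized query using PostgreSQL-style $1 placeholders.
    rows = await conn.fetch('SELECT id, name FROM users WHERE id > $1', 100)
    for row in rows:
        print(row['id'], row['name'])
    await conn.close()

asyncio.get_event_loop().run_until_complete(main())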

Thursday, September 29, 2016

Devito

"Devito is a new tool for performing optimised Finite Difference (FD) computation from high-level symbolic problem definitions. Devito performs automated code generation and Just-In-time (JIT) compilation based on symbolic equations defined in SymPy to create and execute highly optimised Finite Difference kernels on multiple computer platforms."

https://github.com/opesci/devito

Devito: automated fast finite difference computation - http://hgpu.org/?p=16561
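
To give a flavour of the workflow, here is a minimal sketch written against the style of Devito's current Python API; the grid shape, symbol names and the trivial update equation are illustrative assumptions rather than the project's own examples, and the API may differ between versions:

from devito import Grid, TimeFunction, Eq, Operator

# Hypothetical 2-D grid and a trivial explicit update; a real finite
# difference kernel would encode a discretized PDE here instead.
grid = Grid(shape=(100, 100))
u = TimeFunction(name='u', grid=grid)
update = Eq(u.forward, u + 1)

op = Operator([update])   # symbolic equation -> generated C code + JIT compilation
op.apply(time=10)         # run the generated kernel for 10 time steps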

OgmaNeo

"Efforts at understanding the computational processes in the brain have met with limited success, despite their importance and potential uses in building intelligent machines. We propose a simple new model which draws on recent findings in Neuroscience and the Applied Mathematics of interacting Dynamical Systems. The Feynman Machine is a Universal Computer for Dynamical Systems, analogous to the Turing Machine for symbolic computing, but with several important differences. We demonstrate that networks and hierarchies of simple interacting Dynamical Systems, each adaptively learning to forecast its evolution, are capable of automatically building sensorimotor models of the external and internal world. We identify such networks in mammalian neocortex, and show how existing theories of cortical computation combine with our model to explain the power and flexibility of mammalian intelligence. These findings lead directly to new architectures for machine intelligence. A suite of software implementations has been built based on these principles, and applied to a number of spatiotemporal learning tasks."

https://arxiv.org/abs/1609.03971v1

https://github.com/ogmacorp

Wednesday, September 28, 2016

dask.distributed

"Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate sized clusters.

Distributed serves to complement the existing PyData analysis stack. In particular it meets the following needs:
  • Low latency: Each task suffers about 1ms of overhead. A small computation and network roundtrip can complete in less than 10ms.
  • Peer-to-peer data sharing: Workers communicate with each other to share data. This removes central bottlenecks for data transfer.
  • Complex Scheduling: Supports complex workflows (not just map/filter/reduce) which are necessary for sophisticated algorithms used in nd-arrays, machine learning, image processing, and statistics.
  • Pure Python: Built in Python using well-known technologies. This eases installation, improves efficiency (for Python users), and simplifies debugging.
  • Data Locality: Scheduling algorithms cleverly execute computations where data lives. This minimizes network traffic and improves efficiency.
  • Familiar APIs: Compatible with the concurrent.futures API in the Python standard library. Compatible with the dask API for parallel algorithms.
  • Easy Setup: As a Pure Python package distributed is pip installable and easy to set up on your own cluster.
Dask.distributed is a centrally managed, distributed, dynamic task scheduler. The central dask-scheduler process coordinates the actions of several dask-worker processes spread across multiple machines and the concurrent requests of several clients.

The scheduler is asynchronous and event driven, simultaneously responding to requests for computation from multiple clients and tracking the progress of multiple workers. The event-driven and asynchronous nature makes it flexible to concurrently handle a variety of workloads coming from multiple users at the same time while also handling a fluid worker population with failures and additions. Workers communicate amongst each other for bulk data transfer over TCP.

Internally the scheduler tracks all work as a constantly changing directed acyclic graph of tasks. A task is a Python function operating on Python objects, which can be the results of other tasks. This graph of tasks grows as users submit more computations, fills out as workers complete tasks, and shrinks as users leave or become disinterested in previous results.

Users interact by connecting a local Python session to the scheduler and submitting work, either by individual calls to the simple interface client.submit(function, *args, **kwargs) or by using the large data collections and parallel algorithms of the parent dask library. The collections in the dask library like dask.array and dask.dataframe provide easy access to sophisticated algorithms and familiar APIs like NumPy and Pandas, while the simple client.submit interface provides users with custom control when they want to break out of canned “big data” abstractions and submit fully custom workloads."

http://distributed.readthedocs.io/en/latest/

https://matthewrocklin.com/blog//work/2016/09/22/cluster-deployments
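
A small sketch of the client.submit / futures workflow quoted above (the scheduler address and the toy functions are placeholders):

from dask.distributed import Client

def square(x):
    return x ** 2

# Point the client at a running dask-scheduler; the address is a placeholder.
client = Client('tcp://scheduler-host:8786')

futures = client.map(square, range(10))   # submit many small tasks
total = client.submit(sum, futures)       # tasks may depend on other tasks
print(total.result())                     # gather the result locally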

teleport

"Gravitational Teleport is a modern SSH server for remotely accessing clusters of Linux servers via SSH or HTTPS. It is intended to be used instead of sshd.  Teleport enables teams to easily adopt the best SSH practices like:
  • No need to distribute keys: Teleport uses certificate-based access with automatic expiration time.
  • Enforcement of 2nd factor authentication.
  • Cluster introspection: every Teleport node becomes a part of a cluster and is visible on the Web UI.
  • Record and replay SSH sessions for knowledge sharing and auditing purposes.
  • Collaboratively troubleshoot issues through session sharing.
  • Connect to clusters located behind firewalls without direct Internet access via SSH bastions.
  • Ability to integrate SSH credentials with your organization identities via OAuth (Google Apps, Github).
Teleport is built on top of the high-quality Golang SSH implementation and it is fully compatible with OpenSSH."

https://github.com/gravitational/teleport

https://github.com/gravitational/teleconsole

Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen

"This post presents performance benchmarks for general matrix-matrix multiplication between Mir GLAS, OpenBLAS, Eigen, and two closed source BLAS implementations from Intel and Apple.

OpenBLAS is the default BLAS implementation for most numeric and scientific projects, for example the Julia Programming Language and NumPy. The OpenBLAS Haswell computation kernels were written in assembler.

Mir is an LLVM-Accelerated Generic Numerical Library for Science and Machine Learning. It requires LDC (LLVM D Compiler) for compilation. Mir GLAS (Generic Linear Algebra Subprograms) has a single generic kernel for all CPU targets, all floating point types, and all complex types. It is written completely in D, without any assembler blocks. In addition, Mir GLAS Level 3 kernels are not unrolled and produce tiny binary code, so they put less pressure on the instruction cache in large applications."

http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/glas-gemm-benchmark.html

https://github.com/libmir/mir

Tuesday, September 20, 2016

ssd2gpu

"A kernel module to support SSD-to-GPU direct DMA.

NVMe-Strom is a Linux kernel module which provides SSD-to-GPU direct DMA. It allows one to (1) map a particular GPU device memory region onto the PCI BAR memory area, and (2) launch P2P DMA from the source file blocks to the mapped GPU device memory without intermediation by the main memory.

Requirements
  • NVIDIA Tesla or Quadro GPU
  • NVMe SSD
  • Red Hat Enterprise Linux 7.x, or compatible kernel
  • Ext4 or XFS filesystem on the raw block device (no RAID should be constructed on the device)"

https://github.com/kaigai/ssd2gpu

http://kaigai.hatenablog.com/entry/2016/09/08/003556

Thursday, September 15, 2016

PSyclone

"PSyclone is a code generation and optimisation environment for the GungHo PSy layer.

GungHo proposes to separate model code into 3 layers: the algorithm layer, the PSy layer and the kernel layer. This approach is called PSyKAl (for PSy, Kernel, ALgorithm). The idea behind PSyKAl is to separate science code, which should be invariant to the available computational resources, from code optimisations, which are often machine specific. The hope is that this separation will lead to understandable and maintainable code whilst providing performance portability, as code can be optimised for the required architecture(s).

The Algorithm layer implements a code's algorithm at a relatively high level in terms of logically global fields, control structures and calls to kernel routines.

The Kernel layer implements the underlying science. Kernels operate on a subset of a field, typically a column or set of columns. The Kernel operates on raw (Fortran) arrays; this is primarily for performance reasons.

The PSy layer sits in between the Algorithm layer and the Kernel layer. Its functional responsibilities are to:
  • map from the global view of the algorithm layer to the field-subset view of the kernel layer by iterating over the appropriate space (typically mesh cells in a finite element implementation).
  • map between the high level global field view of data at the algorithm layer and the low level local (fortran) array view of data at the kernel layer.
  • provide any additional required arguments to the Kernel layer, such as dofmaps and quadrature values.
  • add appropriate halo calls, reduction variables and global sums to ensure correct operation of the parallel code. 

The PSy layer is also where any single-node performance optimisations take place, such as OpenMP parallelisation for many-core architectures or OpenACC parallelisation for GPUs. Please note that the inter-node, distributed-memory partitioning (typically MPI with domain decomposition) is taken care of separately.

PSyclone is a tool that generates PSy layer code. This is achieved by parsing the algorithm layer code to determine the order in which kernels are called and parsing metadata about the kernels themselves. In addition to generating correct PSy code, PSyclone offers a set of optimising transformations which can be used to optimise the performance of the PSy layer."

https://puma.nerc.ac.uk/trac/GungHo/wiki/PSyclone

https://puma.nerc.ac.uk/trac/GungHo

https://github.com/stfc/PSyclone

Unique code-generating software makes weather and climate forecasting easier - https://www.scientific-computing.com/news/unique-code-generating-software-makes-weather-and-climate-forecasting-easier

PiBakery

"The key feature of PiBakery is the ability to create a customised version of Raspbian that you write directly to your Raspberry Pi. This works by creating a set of scripts that run when the Raspberry Pi has been powered on, meaning that your Pi can automatically perform setup tasks, and you don't need to configure anything.

The scripts are created using a block based interface that is very similar to Scratch. If you've used Scratch before, you already know how to use PiBakery. Simply drag and drop the different tasks that you want your Raspberry Pi to perform, and they'll be turned into scripts and written to your SD card. As soon as the Pi boots up, the scripts will be run."

http://www.pibakery.org/

F-Droid

"F-Droid is an installable catalogue of FOSS (Free and Open Source Software) applications for the Android platform. The client makes it easy to browse, install, and keep track of updates on your device."

https://f-droid.org/

monetdb

"When your database grows into millions of records spread over many tables and business intelligence/ science becomes the prevalent application domain, a column-store database management system is called for. Unlike traditional row-stores, such as MySQL and PostgreSQL, a column-store provides a modern and scalable solution without calling for substantial hardware investments.

MonetDB has pioneered column-store solutions for high-performance data warehouses for business intelligence and eScience since 1993. It achieves its goal by innovations at all layers of a DBMS, e.g. a storage model based on vertical fragmentation, a modern CPU-tuned query execution architecture, automatic and adaptive indices, run-time query optimization, and a modular software architecture. It is based on the SQL 2003 standard with full support for foreign keys, joins, views, triggers, and stored procedures. It is fully ACID compliant and supports a rich spectrum of programming interfaces (JDBC, ODBC, PHP, Python, RoR, C/C++, Perl).

MonetDB is the focus of database research pushing the technology envelope in many areas. Its three-level software stack, comprising an SQL front-end, tactical optimizers, and a columnar abstract-machine kernel, provides a flexible environment that can be customized in many different ways. A rich collection of linked-in libraries provides functionality for temporal data types, math routines, strings, and URLs. In-depth information on the technical innovations in the design and implementation of MonetDB can be found in our science library."

https://www.monetdb.org/Home
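
Since Python is listed among the supported interfaces, a minimal sketch using the pymonetdb DB-API driver might look like this (connection settings and the query are placeholders):

import pymonetdb

# MonetDB ships with a monetdb/monetdb default account; adjust as needed.
conn = pymonetdb.connect(username='monetdb', password='monetdb',
                         hostname='localhost', database='demo')
cur = conn.cursor()
cur.execute('SELECT name FROM sys.tables LIMIT 5')
print(cur.fetchall())
conn.close()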

htsql

"HTSQL is designed for data analysts and other accidental programmers who have complex business inquiries to solve and need a productive tool to write and share database queries.

HTSQL is a complete query language featuring automated linking, aggregation, projections, filters, macros, a compositional syntax, and a full set of data types & functions.

HTSQL is a web service that accepts queries as URLs, returning results formatted as HTML, JSON, CSV or XML. With HTSQL, databases can be accessed, secured, cached, and integrated using standard web technologies.

HTSQL requests are translated to efficient SQL queries. HTSQL supports different SQL dialects including SQLite, PostgreSQL, MySQL, Oracle, and Microsoft SQL Server."

http://htsql.org/
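
Because HTSQL exposes queries as URLs, any HTTP client can consume it; here is a minimal sketch with the requests library, where the server address, table and filter are hypothetical:

import requests

# Hypothetical HTSQL endpoint: fetch rows of a 'school' table as JSON.
url = "http://localhost:8080/school?campus='north'"
resp = requests.get(url, headers={'Accept': 'application/json'})
resp.raise_for_status()
print(resp.json())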

BlazingDB

"BlazingDB is an extremely fast SQL database able to handle petabyte scale.  BlazingDB heavily uses specialized, massively parallel co-processors, specifically graphics processors (GPUs).  Blazing is a data science platform, that enables our users to run very large processes and jobs through Python, R, and SQL on super-charged GPU servers."

http://blazingdb.com/

Tuesday, September 13, 2016

FPGA Bibliography

Survey of Domain-Specific Languages for FPGA Computing - http://hgpu.org/?p=16212

"High-performance FPGA programming has typically been the exclusive domain of a small band of specialized hardware developers. They are capable of reasoning about implementation concerns at the register-transfer level (RTL) which is analogous to assembly-level programming in software. Sometimes these developers are required to push further down to manage even lower levels of abstraction closer to physical aspects of the design such as detailed layout to meet critical design constraints. In contrast, software programmers have long since moved away from textual assembly-level programming towards relying on graphical integrated development environments (IDEs), highlevel compilers, smart static analysis tools and runtime systems that optimize, manage and assist the program development tasks. Domain-specific languages (DSLs) can bridge this productivity gap by providing higher levels of abstraction in environments close to the domain of application expert. DSLs carefully limit the set of programming constructs to minimize programmer mistakes while also enabling a rich set of domain-specific optimizations and program transformations. With a large number of DSLs to choose from, an inexperienced FPGA user may be confused about how to select an appropriate one for the intended domain. In this paper, we review a combination of legacy and state-ofthe-art DSLs available for FPGA development and provide a taxonomy and classification to guide selection and correct use of the framework."

A Survey of FPGA Based Neural Network Accelerator - https://hgpu.org/?p=17900

Recent research on neural networks has shown great advantages in computer vision over traditional algorithms based on handcrafted features and models. Neural networks are now widely adopted in areas like image, speech and video recognition. But the great computation and storage complexity of neural network based algorithms poses great difficulty for their application. CPU platforms are hard-pressed to offer enough computation capacity. GPU platforms are the first choice for neural network processing because of their high computation capacity and easy-to-use development frameworks. On the other hand, FPGA based neural network accelerators are becoming a research topic, because specifically designed hardware is the next possible solution to surpass GPUs in speed and energy efficiency. Various FPGA based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on neural network accelerators based on FPGAs and summarize the main techniques used. An investigation from software to hardware, from circuit level to system level, is carried out to complete the analysis of FPGA based neural network accelerator design and serve as a guide to future work.

Bibliography of GPU Overviews

A Survey on Parallel Computing and its Applications in Data-Parallel Problems Using GPU Architectures - https://www.cambridge.org/core/journals/communications-in-computational-physics/article/survey-on-parallel-computing-and-its-applications-in-dataparallel-problems-using-gpu-architectures/879D964A36478175DEED99FB00C8D811

Parallel computing has become an important subject in the field of computer science and has proven to be critical when researching high performance solutions. The evolution of computer architectures (multi-core and many-core) towards a higher number of cores can only confirm that parallelism is the method of choice for speeding up an algorithm. In the last decade, the graphics processing unit, or GPU, has gained an important place in the field of high performance computing (HPC) because of its low cost and massive parallel processing power. Super-computing has become, for the first time, available to anyone at the price of a desktop computer. In this paper, we survey the concept of parallel computing and especially GPU computing. Achieving efficient parallel algorithms for the GPU is not a trivial task, there are several technical restrictions that must be satisfied in order to achieve the expected performance. Some of these limitations are consequences of the underlying architecture of the GPU and the theoretical models behind it. Our goal is to present a set of theoretical and technical concepts that are often required to understand the GPU and its massive parallelism model. In particular, we show how this new technology can help the field of computational physics, especially when the problem is data-parallel. We present four examples of computational physics problems; n-body, collision detection, Potts model and cellular automata simulations. These examples well represent the kind of problems that are suitable for GPU computing. By understanding the GPU architecture and its massive parallelism programming model, one can overcome many of the technical limitations found along the way, design better GPU-based algorithms for computational physics problems and achieve speedups that can reach up to two orders of magnitude when compared to sequential implementations.


Scientific Computing Using Consumer Video-Gaming Hardware Devices - http://hgpu.org/?p=16277

"Commodity video-gaming hardware (consoles, graphics cards, tablets, etc.) performance has been advancing at a rapid pace owing to strong consumer demand and stiff market competition. Gaming hardware devices are currently amongst the most powerful and cost-effective computational technologies available in quantity. In this article, we evaluate a sample of current generation video-gaming hardware devices for scientific computing and compare their performance with specialized supercomputing general purpose graphics processing units (GPGPUs). We use the OpenCL SHOC benchmark suite, which is a measure of the performance of compute hardware on various different scientific application kernels, and also a popular public distributed computing application, Einstein@Home in the field of gravitational physics for the purposes of this evaluation."

GPU-accelerated algorithms for many-particle continuous-time quantum walks - http://www.sciencedirect.com/science/article/pii/S0010465517300668

"On the other hand, the evolution of computer architectures towards multicore processors even in stand-alone workstations enabled important cuts of the execution time by introducing the possibility of running multiple threads in parallel and spreading the workload among cores. This possibility was boosted up by the general purpose parallel computing architectures of modern graphic cards (GPGPUs). In the latter, hundreds or thousands of computational cores in the same single chip are able to process simultaneously a very large number of data. It should also be noted that an impressive computational power is present not only in dedicated GPUs for high-performance computing, but also in commodity graphic cards, which make modern workstations suitable for numerical analyses. In order to exploit such a huge computational power, algorithms must be first redesigned and adapted to the SIMT (Single Instruction Multiple Thread) and SIMD (Single Instruction Multiple Data) paradigms and translated then into programming languages with hardware-specific subsets of instructions. Among them, one of the most diffuse is CUDA-C, a C extension for the Compute Unified Device Architecture (CUDA) that represents the core component of NVIDIA GPUs. As a matter of fact, the use of GPUs for scientific analysis, which dates back to mid and late 2000s [31]; [32]; [33]; [34] ;  [35], dramatically boosted with a two-digit yearly increasing rate since 2010. Just looking at the computational physics realm, several GPU-specific algorithms have been proposed in the last three years, e.g., for stochastic differential equations [36], molecular dynamics simulations [37] ;  [38], fluid dynamics [39] ;  [40], Metropolis Monte Carlo [41] simulations, quantum Monte Carlo simulations [42], and free-energy calculations [43]."

THOR

"We have designed and developed, from scratch, a global circulation model named THOR that solves the three-dimensional non-hydrostatic Euler equations. Our general approach lifts the commonly used assumptions of a shallow atmosphere and hydrostatic equilibrium. We solve the "pole problem" (where converging meridians on a sphere lead to increasingly smaller time steps near the poles) by implementing an icosahedral grid. Irregularities in the grid, which lead to grid imprinting, are smoothed using the "spring dynamics" technique. We validate our implementation of spring dynamics by examining calculations of the divergence and gradient of test functions. To prevent the computational time step from being bottlenecked by having to resolve sound waves, we implement a split-explicit method together with a horizontally explicit and vertically implicit integration. We validate our global circulation model by reproducing the Earth and also the hot Jupiter-like benchmark tests. THOR was designed to run on Graphics Processing Units (GPUs), which allows for physics modules (radiative transfer, clouds, chemistry) to be added in the future, and is part of the open-source Exoclimes Simulation Platform."


THOR: A New and Flexible Global Circulation Model to Explore Planetary Atmospheres - http://hgpu.org/?p=16280

Exoclimes Simulation Platform - http://www.exoclime.net/

The Exoclimes Simulation Platform (ESP) was born from a necessity to move beyond Earth-centric approaches to understanding atmospheres. Our dream and vision is to provide the exoplanet community with an open-source, freely-available, ultra-fast and cutting-edge set of simulational tools for studying exoplanetary atmospheres. The ESP harnesses the power of GPUs (graphic processing units), found in most Macs nowadays, to produce speed-ups at the order-of-magnitude level. These speed-ups are invested in building intuition and studying how atmospheric dynamics, chemistry and radiation interact in various ways.

HELIOS - GPU-Accelerated Radiative Transfer Code For Exoplanetary Atmospheres - https://github.com/exoclime/HELIOS

VULCAN - Atmospheric Chemistry - https://github.com/exoclime/VULCAN

Monday, September 12, 2016

Seaboard

"Seaboards are single ‘dashboard’ visualizations of the real time and forecast ocean data currently provided by SOCIB, from different coastal and ocean monitoring locations around the Balearic Islands. A specific set of Seaboards has been designed for the tourist sector and these are now installed in several collaborating hotels, providing useful…



GI-cat

"GI-cat features caching and mediation capabilities and can act as a broker towards disparate catalog and access services: by implementing metadata harmonization and protocol adaptation, it is able to transform query results to a uniform and consistent interface. GI-cat is based on a service-oriented framework of modular components and can be customized and tailored to support different deployment scenarios.

GI-cat can access a multiplicity of catalog services, as well as inventory and access services, to discover, and possibly access, heterogeneous ESS resources. Specific components implement mediation services for interfacing heterogeneous service providers which expose multiple standard specifications; these are called Accessors. These mediating components map the heterogeneous providers' metadata models into a uniform data model which implements ISO 19115, based on the official ISO 19139 schemas and their extensions. Accessors also implement the query protocol mapping; they translate query requests expressed according to the interface protocols exposed by GI-cat into the multiple query dialects spoken by the resource service providers. Currently, a number of well-accepted catalog and inventory services are supported, including several OGC Web Services (e.g. WCS, WMS), THREDDS Data Server, SeaDataNet Common Data Index, and GBIF. A list of test endpoints is available here.

The supported sources are:

http://essi-lab.eu/do/view/GIcat

http://essi-lab.eu/do/view/GIcat/GIcatDocumentation

How to Configure GI-cat for the First Time - https://www.youtube.com/watch?v=28biJHTQSrM

http://bcube.geodab.eu/bcube-broker/

GI-go GeoBrowser - http://essi-lab.eu/do/view/GIgo/WebHome

http://www.earthcube.org/workspace/bcube/brokering-accessor-hack-thon


AODN Open Geospatial Portal

"The AODN open geospatial portal is a Grails application for discovering, subsetting, and downloading geospatial data.  The application is a stateless front end to other servers: GeoNetwork metadata catalog, GeoServer data server (WMS and WFS), ncWMS web map server, and GoGoDuck netCDF subsetting and aggregation service."

https://github.com/aodn/aodn-portal

RAMADDA

"RAMADDA is a content repository and publishing platform with a focus on science data."

https://sourceforge.net/projects/ramadda/

RAMADDA on Docker - https://github.com/Unidata/ramadda-docker

Docker Unidata/RAMADDA - https://hub.docker.com/r/unidata/ramadda/

https://github.com/ScottWales/ramadda

https://github.com/Unidata/tomcat-docker

Siphon

"Siphon is a collection of Python utilities for downloading data from Unidata data technologies. Siphon’s current functionality focuses on access to data hosted on a THREDDS Data Server."

http://siphon.readthedocs.io/en/latest/

https://github.com/Unidata/siphon
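
A small sketch of the THREDDS access pattern described above (the catalog URL is a placeholder for a real TDS catalog):

from siphon.catalog import TDSCatalog

# Hypothetical THREDDS catalog; point at a real catalog.xml instead.
cat = TDSCatalog('http://thredds.example.edu/thredds/catalog.xml')
print(list(cat.datasets))              # dataset names in the catalog
ds = list(cat.datasets.values())[0]
print(ds.access_urls)                  # e.g. OPENDAP / HTTPServer endpoints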

Stetl

"Stetl, Streaming ETL, is an open source (GNU GPL) toolkit for the transformation (ETL) of geospatial data. Stetl is based on existing ETL tools like GDAL/OGR and XSLT. Stetl processing is driven from a configuration (.ini) file. Stetl is written in Python and in particular suited for processing GML.

Stetl basically glues together existing parsing and transformation tools like GDAL/OGR (ogr2ogr) and XSLT. By using native tools like libxml2 and libxslt (via Python lxml) Stetl is speed-optimized.

The core concepts of Stetl remain pretty simple: an input resource like a file or a database table is mapped to an output resource (also a file, a database, etc) via one or more filters. The input, filters and output are connected in a pipeline called a processing chain or Chain. This is a bit similar to a current in electrical engineering: an input flows through several filters, that each modify the current."

http://www.stetl.org/en/latest/

https://github.com/geopython/stetl

UV-CDAT

"UV-CDAT is a powerful and complete front-end to a rich set of visual-data exploration and analysis capabilities well suited for climate-data analysis problems.

UV-CDAT builds on the following key technologies:
  1. The Climate Data Analysis Tools (CDAT) framework developed at LLNL for the analysis, visualization, and management of large-scale distributed climate data;
  2. ParaView: an open-source, multi-platform, parallel-capable visualization tool with recently added capabilities to better support specific needs of the climate-science community;
  3. VisTrails, an open-source scientific workflow and provenance management system that supports data exploration and visualization;
  4. VisIt: an open-source, parallel-capable, visual-data exploration and analysis tool that is capable of running on a diverse set of platforms, ranging from laptops to the Department of Energy's largest supercomputers.
These combined tools, along with others such as the R open-source statistical analysis and plotting software and custom packages (e.g. vtDV3D), form UV-CDAT and provide a synergistic approach to climate modeling, allowing researchers to advance scientific visualization of large-scale climate data sets. The UV-CDAT framework couples powerful software infrastructures through two primary means:
  1. Tightly coupled integration of the CDAT Core with the VTK/ParaView infrastructure to provide high-performance, parallel-streaming data analysis and visualization of massive climate-data sets (other tightly coupled tools include VCS, VisTrails, DV3D, and ESMF/ESMP);
  2. Loosely coupled integration to provide the flexibility of using tools quickly in the infrastructure such as ViSUS, VisIt, R, and MatLab for data analysis and visualization as well as to apply customized data analysis applications within an integrated environment.
Within both paradigms, UV-CDAT will provide data-provenance capture and mechanisms to support data analysis via the VisTrails infrastructure."


https://github.com/UV-CDAT/uvcdat/wiki

https://uvcdat.llnl.gov/index.html


Installation

conda create -n uvcdat -c uvcdat uvcdat hdf5=1.8.16 pyqt=4.11.3

Enter the environment via:

source activate uvcdat

Exit the environment via:

source deactivate uvcdat
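
Once the environment is active, a minimal cdms2/vcs session looks roughly like the sketch below; the file name, variable name, and time selection are hypothetical:

import cdms2
import vcs

# Hypothetical NetCDF file containing a 'tas' (surface air temperature) variable.
f = cdms2.open('tas_monthly.nc')
tas = f('tas', time=slice(0, 1))   # read the first time step
canvas = vcs.init()
canvas.plot(tas)                   # quick-look plot
f.close()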

GHCNpy

"The demand for weather, water, and climate information has been high, with an expectation of long, serially complete observational records in order to assess historical and current events in the Earth's system. While assessments have been championed through monthly and annual State of the Climate Reports produced at the National Centers for Environmental Information (NCEI, formerly NCDC), there is a demand for near-real time information that will address the needs of the atmospheric science community. The Global Historical Climatology Network – Daily data set (GHCN-D) provides a strong foundation of the Earth's climate on the daily scale, and is the official archive of daily data in the United States. The data set is updated nightly, with new data ingested with a lag of approximately one day. The data set adheres to a strict set of quality assurance, and lays the foundation for other products, including the 1981-2010 US Normals.

While a very popular data set, GHCN-Daily is only available in ASCII text or comma separated files, and very little visualization is provided to the end user. It makes sense then to build a suite of algorithms that will not only take advantage of its spatial and temporal completeness, but also help end users analyze this data in a simple, efficient manner. To that end, a Python package has been developed called GHCNPy to address these needs. Open sourced, GHCNPy uses basic packages such as Numpy, Scipy, and matplotlib to perform a variety of tasks. Routines include converting the data to CF compliant netCDF files, time series analysis, and visualization of data, from the station to global scale."

https://github.com/jjrennie/GHCNpy

https://ams.confex.com/ams/96Annual/webprogram/Paper283618.html

PyFerret

"In simplest terms, PyFerret is Ferret encapsulated in Python.

PyFerret is a Python module wrapping Ferret.  The pyferret module provides Python functions so Python users can easily take advantage of Ferret's abilities to retrieve, manipulate, visualize, and save data.  There are also functions to move data between Python and the Ferret engine.  Python scripts can also be used as Ferret external functions.

But PyFerret can also be used as a transparent replacement for the traditional Ferret executable.  A simple script starts Python and enters the pyferret module, giving the traditional Ferret interface.  This script also supports all of Ferret's command-line options.

Inside the PyFerret wrapper is a complete, but enhanced, Ferret engine.  One very noticeable enhancement is improved graphics, which can be saved in common image formats.  (Sorry, no more GKS metafiles.)  Also, PyFerret comes packaged with many new statistical and shapefile functions which are, in fact, Python scripts making use of third-party Python modules."

http://ferret.pmel.noaa.gov/Ferret/documentation/pyferret/

https://github.com/NOAA-PMEL/PyFerret

Installation

A PyFerret environment can be installed using conda.

conda create -n FERRET -c conda-forge pyferret --yes
 
Enter the environment via:
 
source activate FERRET
 
Exit the environment via:
 
source deactivate FERRET
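
From inside that environment, the Python-module side of PyFerret can be exercised roughly as follows; the dataset is Ferret's bundled coads_climatology example, and the exact calls should be treated as a sketch rather than a verified recipe:

import pyferret

pyferret.start(journal=False)          # initialize the embedded Ferret engine
pyferret.run('use coads_climatology')  # open a sample dataset shipped with Ferret
pyferret.run('shade/l=1 sst')          # shade the first time level of SST
pyferret.run('frame/file=sst.png')     # save the plot to an image file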
 


Suricata

"Suricata is a high performance Network IDS, IPS and Network Security Monitoring engine. Open Source and owned by a community run non-profit foundation, the Open Information Security Foundation (OISF). Suricata is developed by the OISF and its supporting vendors."

https://suricata-ids.org/

How to Install Suricata on a Linux Box in 5 Minutes - https://danielmiessler.com/blog/how-to-install-suricata-on-a-linux-box-in-5-minutes/

Docker

"Docker is an open-source project that automates the deployment of Linux applications inside software containers.  Docker containers wrap up a piece of software in a complete filesystem that contains everything it needs to run: code, runtime, system tools, system libraries – anything you can install on a server. This guarantees that it will always run the same, regardless of the environment it is running in."

https://www.docker.com/

https://github.com/docker/docker

https://en.wikipedia.org/wiki/Docker_(software)

https://github.com/bcicen/awesome-docker

Articles

Whales on a Plane: Deploying Software to NSF/NCAR Research Aircraft W/ Docker - https://sea.ucar.edu/event/whales-plane-deploying-software-nsf-ncar-research-aircraft-w-docker

Container Computing for Scientific Workflows - https://github.com/NERSC/2016-11-14-sc16-Container-Tutorial

Competition

Moving from Docker to rkt - https://medium.com/@adriaandejonge/moving-from-docker-to-rkt-310dc9aec938

Docker Compose

A tool for defining and running multi-container Docker applications.

https://docs.docker.com/compose/

FUN IMAGES

einsteintoolkit - https://github.com/eschnett/einsteintoolkit-docker

National Land Cover Database (NLCD)


http://www.mrlc.gov/nlcd2011.php


Completion of the 2011 National Land Cover Database - http://www.asprs.org/a/publications/pers/2015journals/PERS_May_2015/HTML/files/assets/basic-html/index.html#345

Distributed Oceanographic Match-Up Service (DOMS)

"The Distributed Oceanographic Match-up Service (DOMS) is a web-accessible service tool that will reconcile satellite and in situ datasets in support of NASA’s Earth Science mission. The service will provide a mechanism for users to input a series of geospatial references for satellite observations (e.g., footprint location, date, and time) and receive the in-situ observations that are “matched” to the satellite data within a selectable temporal and spatial domain. The inverse of inputting in-situ geospatial data (e.g., positions of moorings, floats, or ships) and returning corresponding satellite observations will also be supported. The DOMS prototype will include several characteristic in-situ and satellite observation datasets. For the in-situ data, the focus will be surface marine observations from the International Comprehensive Ocean-Atmosphere Data Set (ICOADS), the Shipboard Automated Meteorological and Oceanographic System Initiative (SAMOS), and the Salinity Processes in the Upper Ocean Regional Study (SPURS). Satellite products will include JPL ASCAT winds, Aquarius orbital/swath dataset, MODIS SST, and the high-resolution gridded MUR-SST product. Importantly, although DOMS will be established with these selected datasets, it will be readily extendable to other in situ and satellite collections, which could support additional science disciplines."

https://mdc.coaps.fsu.edu/doms

https://sea.ucar.edu/event/building-distributed-oceanography-match-service-doms-pair-field-observation-and-satellite-data

GTX1070 Linux installation and mining clue

"As you might have read i got 18 GTX1070 and posted some benchmark information earlier (http://forum.ethereum.org/discussion/comment/42663). @vaulter asked to some some details (@vaulter perhaps you can add your GTX1080 findings, settings?) In this topic i want to place the more technical notes how you can get this working. The short summery: Yes i managed to get 6x GTX1070 running at 218.11MH/s with heavy tuning / overclocking, but no idea how this would hold long term. Currently i keep them at 192,88MH/s (x3 rigs) which seem to be the 'safe overclocking defaults' to me. Who knows how stuff progresses with updates from @Genoil and if its running under a windows driver stable and fast. Safe to say with the lesser power consumption AND more MH/s then a card like R9 390X this GTX1070 with its price is a very nice card to have (especially if you run apps that only run good on Nvidia cards)

It took me quite a while to get it working and this document only contains 5% of my notes and stuff, its the minimum to get you started and you will need to do some tuning on your own to max your card out. Some stuff isn't as good as i like yet (e.g. headless VNC access without the use of a monitor) but it works and more importantly its stable. Thanks go out to @Genoil for his clue and his work on ethminer. This document is not entirely meant as a walk-through as some knowledge on mining, linux overclocking and common sense is still required ... So here goes."

https://forum.ethereum.org/discussion/7780/gtx1070-linux-installation-and-mining-clue-goodbye-amd-welcome-nvidia-for-miners

NVIDIA GeForce GTX 1070 On Linux: Testing With OpenGL, OpenCL, CUDA & Vulkan

14 June 2016

"NVIDIA sent over a GeForce GTX 1070 and I've been putting it through its paces under Linux with a variety of OpenGL, OpenCL, and Vulkan benchmarks along with CUDA and deep learning benchmarks. Here's the first look at the GeForce GTX 1070 performance under Ubuntu Linux.

In conjunction with the NVIDIA 367 proprietary driver on Linux, the GeForce GTX 1070 ran into no difficulties running under Ubuntu Linux throughout my initial testing. As noted in my GTX 1080 Linux review, there wasn't any overclocking support available when enabling the CoolBits options and this is a similar limitation with the GTX 1070 (yesterday NVIDIA did release a new 367 Linux driver that I have yet to test but its official change-log at least didn't make note of any overclocking additions).

Since my earlier GTX 1080 review, nothing has changed with regards to the open-source driver support. I have yet to see any experimental patches published for at least kernel mode-setting in Nouveau while any accelerated support for Pascal will not happen until NVIDIA is able to release the signed firmware binary images for usage by the Nouveau driver. I haven't received any word from NVIDIA Corp yet when that Pascal firmware availability is expected, but at least the proprietary driver support is in good shape.

...

 Well, that's all the initial data I have to share on the GeForce GTX 1070 after hammering it under Linux the past 24 hours. The GeForce GTX 1070 is a very nice upgrade over the GeForce GTX 900 series and especially if you are still using a Kepler graphics card or later. In many of our Linux benchmarks, the GeForce GTX 1070 was around 60% faster than the GTX 970! The GTX 1070 was commonly beating the GTX 980 Ti and GTX TITAN X while the GeForce GTX 1080 still delivers the maximum performance possible for a desktop graphics card at this time. The GTX 1070 (and GTX 1080) aren't only stunning for their raw performance but the power efficiency is also a significant push forward. Particularly when the GeForce GTX 1070 AIB cards begin appearing in the coming weeks at $399, the GeForce GTX 1070 should be a very nice option for Linux gamers looking to get the maximum performance for 1440p or 4K gaming. It will be fun to see later this month how the Radeon RX 480 compares, but considering the state of the Radeon Linux drivers, chances are you'll want to stick to the green side for the best Linux gaming experience unless you are a devout user of open-source drivers."

http://www.phoronix.com/scan.php?page=article&item=nvidia-gtx-1070&num=1

http://www.nvidia.com/download/driverresults.aspx/104284/en-us

Sunday, September 11, 2016

pymp

"This package brings OpenMP-like functionality to Python. It takes the good qualities of OpenMP such as minimal code changes and high efficiency and combines them with the Python Zen of code clarity and ease-of-use."

MetaMorph

"A library framework designed to (automatically) extract as much computational capability as possible from HPC systems. Its design centers around three core principles: abstraction, interoperability, and adaptivity.

We realize MetaMorph as a layered library of libraries. Each tier implements one of the core principles of abstraction, interoperability, and adaptivity. The top-level user APIs and platform-specific back-ends exist as separate shared library objects, with interfaces designated in shared header files. Primarily, this encapsulation supports custom tuning of back-ends to a specific device or class of devices. In addition, it allows back-ends to be separately used, distributed, compiled, or even completely rewritten, without interference with the other components.

The core API, library infrastructure and communication interface are written in standard C for portability and performance. Individual accelerator back-ends are generated in C with OpenMP and optional SIMD extensions (for CPU and Intel MIC), CUDA C/C++ (NVIDIA GPUs), and C++ with OpenCL (AMD GPUs/APUs and other devices). In addition, a wrapper around the top-level API is written in polymorphic Fortran 2003 to simplify interoperability with Fortran applications prevalent in some fields of scientific computing."

http://synergy.cs.vt.edu/

https://github.com/vtsynergy/MetaMorph

MetaMorph: A Library Framework for Interoperable Kernels on Multi- and Many-core Clusters - http://hgpu.org/?p=16446

Software Heritage

"Our ambition is to collect, preserve, and share all software that is publicly available in source code form. On this foundation, a wealth of applications can be built, ranging from cultural heritage to industry and research.

Software is an essential part of our lives. Given that any software component may turn out to be essential in the future, we do not make distinctions and collect all software that is publicly available in source code form.

We recognize that there is significant value in selecting among all this software some collections of particular interest, and we will encourage the construction of curated archives on top of Software Heritage.

We keep track of the origin of software we archive and store its full development history: this precious meta-information will be carefully harvested and structured for future use."

https://www.softwareheritage.org/

Gaalop

"Gaalop (Geometic Algebra Algorithms Optimizer) is a software to optimize geometric algebra files.

Algorithms can be developed by using the freely available CLUCalc software by Christian Perwass. Gaalop optimizes the algorithm and produces C++ (AMP), OpenCL, CUDA, CLUCalc or LaTeX output (other output-formats will follow)."

http://www.gaalop.de/

https://github.com/CallForSanity/Gaalop

Gaalop – High Performance Parallel Computing based on Conformal Geometric Algebra - http://www.gaalop.de/wp-content/uploads/Gaalop-High-PerformanceComputing-based-onConformal-Geometric-Algebra.pdf

Geometric Algebra Enhanced Precompiler for C++, OpenCL and Mathematica’s OpenCLLink - http://hgpu.org/?p=12044

CUDArray

"CUDArray is a CUDA-accelerated subset of the NumPy library. The goal of CUDArray is to combine the easy of development from the NumPy with the computational power of Nvidia GPUs in a lightweight and extensible framework.

The goal of CUDArray is to combine the ease of development from NumPy with the computational power of Nvidia GPUs in a lightweight and extensible framework. Since the motivation behind CUDArray is to facilitate neural network programming, CUDArray extends NumPy with a neural network submodule. This module has both a CPU and a GPU back-end to allow for experiments without requiring a GPU."

https://github.com/andersbll/cudarray

CUDArray: CUDA-based NumPy - http://hgpu.org/?p=13077
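
Given that CUDArray mirrors a subset of NumPy, usage presumably looks something like the sketch below; ca.array and ca.dot are assumptions based on the NumPy-like interface described above and have not been verified against the package:

import numpy as np
import cudarray as ca               # assumed import name

# Assumed NumPy-like entry points: ca.array() to move data to the GPU
# and ca.dot() for matrix multiplication on the device.
a = ca.array(np.random.rand(512, 512))
b = ca.array(np.random.rand(512, 512))
c = ca.dot(a, b)
print(c.shape)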

PENCIL

"PENCIL is rigorously-defined subset of GNU C99-enriched with additional language constructs-that enables compilers to exploit parallelism and produce highly optimized code when targeting accelerators. PENCIL aims to serve both as a portable implementation language for libraries, and as a target language for DSL compilers. We implemented a PENCIL-to-OpenCL backend using a state-of-the-art polyhedral compiler. The polyhedral compiler, extended to handle data-dependent control flow and non-affine array accesses, generates optimized OpenCL code. To demonstrate the potential and performance portability of PENCIL and the PENCIL-to-OpenCL compiler, we consider a number of image processing kernels, a set of benchmarks from the Rodinia and SHOC suites, and DSL embedding scenarios for linear algebra (BLAS) and signal processing radar applications (SpearDE), and present experimental results for four GPU platforms: AMD Radeon HD 5670 and R9 285, NVIDIA GTX 470, and ARM Mali-T604."

https://github.com/Meinersbur/pencilcc

PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming - http://hgpu.org/?p=14578

Velociraptor

"Velociraptor is a compiler toolkit designed for array-based languages that is not tied to a single source language. You can use it to build compilers and other tools for array-based languages. Velociraptor takes as input a high-level IR called Velociraptor IR (VRIR for short) which is described below. Velociraptor provides multiple components that can be used to build compilers. The first component libVRIR is an analysis and transformation infrastructure that can be used by both static and dynamic tools. The second component, VRdino, is a dynamic compiler that generates LLVM for CPUs, and OpenCL for GPUs from VRIR.

VRIR is a typed, AST based IR used by Velociraptor. VRIR is designed to be easy to generate from array-based languages. It has built-in operators for many high-level array operations (like matrix multiplication), flexible array indexing schemes, and support for various array layouts. VRIR also includes high-level constructs for parallelism and GPU acceleration. VRIR has a textual representation inspired from Lisp S-expressions.

VRdino is the dynamic backend for Velociraptor. It generates LLVM for CPUs and OpenCL for GPUs, making it portable across many different systems. It performs some interesting optimizations, such as runtime code specialization of regions identified using region detection analysis. VRdino is accompanied by its intelligent runtime system, that is designed for maximizing throughput in hybrid systems by doing operations in parallel and by minimizing redundant data transfers. VRdino depends upon libVRIR, LLVM, OpenCL, BLAS and RaijinCL."


A Toolkit for Building Dynamic Compilers for Array-Based Languages Targeting CPUs and GPUs - http://hgpu.org/?p=14628

MAGMA

"Matrix Algebra on GPU and Multicore Architectures (MAGMA) is a collection of next generation linear algebra (LA) libraries for heterogeneous architectures. The MAGMA package supports interfaces for current LA packages and standards, e.g., LAPACK and BLAS, to allow computational scientists to easily port any LA-reliant software components to heterogeneous architectures. MAGMA allows applications to fully exploit the power of current heterogeneous systems of multi/many-core CPUs and multi-GPUs/coprocessors to deliver the fastest possible time to accurate solution within given energy constraints.

MAGMA 1.6 features top performance and high accuracy LAPACK compliant routines for multicore CPUs enhanced with NVIDIA GPUs and includes more than 400 routines, covering one-sided dense matrix factorizations and solvers, two-sided factorizations and eigen/singular-value problem solvers, as well as a subset of highly optimized BLAS for GPUs. In 2014, the MAGMA Sparse and MAGMA Batched packages were added with the MAGMA 1.6 release, providing support for sparse iterative and batched linear algebra on a set of small matrices in parallel, respectively. MAGMA provides multiple precision arithmetic support (S/D/C/Z, including mixed-precision). Most of the algorithms are hybrid, using both multicore CPUs and GPUs, but starting with the 1.6 release, GPU-specific algorithms were added. MAGMA also supports AMD GPUs (clMAGMA 1.3) and Intel Xeon Phi coprocessors (MAGMA MIC 1.3)."


MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing - http://icl.eecs.utk.edu/magma/

Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems - http://superfri.org/superfri/article/view/90

Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers - https://hgpu.org/?p=17870

MetaFork

"MetaFork is a compilation framework for concurrency platforms targeting hardware acceleration technologies. As of today, it consists of a multithreaded language, also called MetaFork, and software tools for performing automatic program translations between CilkPlus, OpenMP and MetaFork."


MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators and Its Application to the Generation of Parametric CUDA Kernels - http://hgpu.org/?p=14708

BLASX

"A highly optimized multi-GPU level-3 BLAS. We adopt the concepts of algorithms-by-tiles treating a matrix tile as the basic data unit and operations on tiles as the basic task. Tasks are guided with a dynamic asynchronous runtime, which is cache and locality aware. The communication cost under BLASX becomes trivial as it perfectly overlaps communication and computation across multiple streams during asynchronous task progression. It also takes the current tile cache scheme one step further by proposing an innovative 2-level hierarchical tile cache, taking advantage of inter-GPU P2P communication. As a result, linear speedup is observable with BLASX under multi-GPU configurations; and the extensive benchmarks demonstrate that BLASX consistently outperforms the related leading industrial and academic projects such as cuBLAS-XT, SuperMatrix, MAGMA and PaRSEC."

https://github.com/linnanwang/BLASX

BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing - http://hgpu.org/?p=14743

Devito

"Devito is a new tool for performing optimised Finite Difference (FD) computation from high-level symbolic problem definitions. Devito performs automated code generation and Just-In-time (JIT) compilation based on symbolic equations defined in SymPy to create and execute highly optimised Finite Difference kernels on multiple computer platforms."


Devito: automated fast finite difference computation - http://arxiv.org/abs/1608.08658

OmpSs

"OmpSs is an effort to integrate features from the StarSs programming model developed by BSC into a single programming model. In particular, our objective is to extend OpenMP with new directives to support asynchronous parallelism and heterogeneity (devices like GPUs). However, it can also be understood as new directives extending other accelerator based APIs like CUDA or OpenCL. Our OmpSs environment is built on top of our Mercurium compiler and Nanos++ runtime system."

https://pm.bsc.es/ompss

A Survey: Runtime Software Systems for High Performance Computing - http://superfri.org/superfri/article/view/126

"Nanos++ is a runtime designed to serve as runtime support in parallel environments. It is mainly used to support  OmpSs, a extension to OpenMP developed at BSC. It also has modules to support  OpenMP and  Chapel.

Nanos++ provides services to support task parallelism using synchronizations based on data dependencies. Data parallelism is also supported by means of services mapped on top of its task support. Tasks are implemented as user-level threads when possible (currently x86, x86-64, ia64, ppc32 and ppc64 are supported).

Nanos++ also provides support for maintaining coherence across different address spaces (such as with GPUs or cluster nodes). It provides software directory and cache modules to this end."

https://pm.bsc.es/nanox

 "Mercurium is a source-to-source compilation infrastructure aimed at fast prototyping. Current supported languages are C, C++ and Fortran. Mercurium is mainly used in Nanos environment to implement OpenMP but since it is quite extensible it has been used to implement other programming models or compiler transformations, examples include Cell Superscalar, Software Transactional Memory, Distributed Shared Memory or the ACOTES project, just to name a few.

Extending Mercurium is achieved using a plugin architecture, where plugins represent several phases of the compiler. These plugins are written in C++ and dynamically loaded by the compiler according to the chosen configuration. Code transformations can be implemented in terms of source code (there is no need to modify or know the internal syntactic representation of the compiler)."

https://pm.bsc.es/mcxx

Extending OmpSs for OpenCL kernel co-execution in heterogeneous systems - https://hgpu.org/?p=17897

SPIR

"SPIR (Standard Portable Intermediate Representation) was initially developed for use by OpenCL and SPIR versions 1.2 and 2.0 were based on LLVM. SPIR has now evolved into a true cross-API standard that is fully defined by Khronos with native support for shader and kernel features – called SPIR-V.

SPIR-V is the first open standard, cross-API intermediate language for natively representing parallel compute and graphics and is incorporated as part of the core specification of both OpenCL 2.1 and OpenCL 2.2 and the new Vulkan graphics and compute API.

SPIR-V exposes the machine model for OpenCL 1.2, 2.0, 2.1, 2.2 and Vulkan - including full flow control, and graphics and parallel constructs not supported in LLVM. SPIR-V also supports OpenCL C and OpenCL C++ kernel languages as well as the GLSL shader language for Vulkan (under development).

SPIR-V 1.1, launched in parallel with OpenCL 2.2, now supports all the kernel language features of OpenCL C++ in OpenCL 2.2, including initializer and finalizer function execution modes to support constructors and destructors. SPIR-V 1.1 also enhances the expressiveness of kernel programs by supporting named barriers, subgroup execution, and program scope pipes."



Python


Python API and tools for manipulating and optimizing SPIR-V.

ViennaCL

"ViennaCL is a free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP (including switches at runtime).

A Python wrapper named PyViennaCL is also available."

http://viennacl.sourceforge.net/

http://viennacl.sourceforge.net/157.html

OpenDwarfs

"The OpenDwarfs project provides a benchmark suite consisting of different computation/communication idioms, i.e., dwarfs, for state-of-art multicore and GPU systems. The first instantiation of the OpenDwarfs has been realized in OpenCL.

The proliferation of heterogeneous computing platforms presents the parallel computing community with new challenges. One such challenge entails evaluating the efficacy of such parallel architectures and identifying the architectural innovations that ultimately benefit applications. To address this challenge, we need benchmarks that capture the execution patterns (i.e., dwarfs or motifs) of applications, both present and future, in order to guide future hardware design. Furthermore, we desire a common programming model for the benchmarks that facilitates code portability across a wide variety of different processors (e.g., CPU, APU, GPU, FPGA, DSP) and computing environments (e.g., embedded, mobile, desktop, server). As such, we present the latest release of OpenDwarfs, a benchmark suite that currently realizes the Berkeley dwarfs in OpenCL, a vendor-agnostic and open-standard computing language for parallel computing. Using OpenDwarfs, we characterize a diverse set of modern fixed and reconfigurable parallel platforms: multi-core CPUs, discrete and integrated GPUs, Intel Xeon Phi co-processor, as well as a FPGA. We describe the computation and communication patterns exposed by a representative set of dwarfs, obtain relevant profiling data and execution information, and draw conclusions that highlight the complex interplay between dwarfs’ patterns and the underlying hardware architecture of modern parallel platforms."


http://view.eecs.berkeley.edu/wiki/Dwarf_Mine

http://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Rodinia:Accelerating_Compute-Intensive_Applications_with_Accelerators

OpenDwarfs: Characterization of Dwarf-Based Benchmarks on Fixed and Reconfigurable Architectures - http://hgpu.org/?p=15143

Towards Enhancing Performance, Programmability, and Portability in Heterogeneous Computing - https://hgpu.org/?p=17224

"Our proposed approach is based on OpenCL implementations of the Berkeley dwarfs. We use our benchmark suite (OpenDwarfs) in characterizing performance of state-of-the-art parallel architectures, and as the main component of a methodology (Telescoping Architectures) for identifying trends in future heterogeneous architectures. Furthermore, we employ OpenDwarfs in a multi-faceted study on the gaps between the three P’s in the context of the modern heterogeneous computing landscape. Our case-study spans a variety of compilers, languages, optimizations, and target architectures, including the CPU, GPU, MIC, and FPGA. Based on our insights, and extending aspects of prior research (e.g., in compilers, programming languages, and auto-tuning), we propose the introduction of grid-based data structures as the basis of programming frameworks and present a prototype unified framework (GLAF) that encompasses a novel visual programming environment with code generation, auto-parallelization, and auto-tuning capabilities. Our results, which span scientific domains, indicate that our holistic approach constitutes a viable alternative towards enhancing the three P’s and further democratizing heterogeneous, parallel computing for non-programming-savvy audiences, and especially domain scientists."

HPAT

"Julia-based framework for big data analytics on clusters that is both easy to use and extremely fast; it is orders of magnitude faster than alternatives like Apache Spark.

HPAT automatically parallelizes analytics tasks written in Julia, generates efficient MPI/C++ code, and uses existing high performance libraries such as HDF5 and Intel® Data Analytics Acceleration Library (Intel® DAAL). HPAT is based on ParallelAccelerator and CompilerTools packages."

ParallelAccelerator

"ParallelAccelerator package is a compiler framework that aggressively optimizes compute-intensive Julia programs on top of the Julia compiler.

Under the hood, ParallelAccelerator is essentially a compiler – itself implemented in Julia – that intercepts the usual Julia JIT compilation process for @acc-annotated functions. It compiles @acc-annotated code to C++ OpenMP code, which can then be compiled to a native library by an external C++ compiler such as GCC or ICC."

https://github.com/IntelLabs/ParallelAccelerator.jl

http://parallelacceleratorjl.readthedocs.io/en/latest/

http://julialang.org/blog/2016/03/parallelaccelerator

https://github.com/IntelLabs/HPAT.jl

Saturday, September 10, 2016

PyCUDA

"PyCUDA lets you access Nvidia‘s CUDA parallel computation API from Python.  Several wrappers of the CUDA API already exist-so what's so special about PyCUDA?
  • Object cleanup tied to lifetime of objects. This idiom, often called RAII in C++, makes it much easier to write correct, leak- and crash-free code. PyCUDA knows about dependencies, too, so (for example) it won't detach from a context before all memory allocated in it is also freed.
  • Convenience. Abstractions like pycuda.driver.SourceModule and pycuda.gpuarray.GPUArray make CUDA programming even more convenient than with Nvidia's C-based runtime.
  • Completeness. PyCUDA puts the full power of CUDA's driver API at your disposal, if you wish. It also includes code for interoperability with OpenGL.
  • Automatic Error Checking. All CUDA errors are automatically translated into Python exceptions.
  • Speed. PyCUDA's base layer is written in C++, so all the niceties above are virtually free.
  • Helpful Documentation and a Wiki.
Relatedly, like-minded computing goodness for OpenCL is provided by PyCUDA's sister project PyOpenCL."

https://mathema.tician.de/software/pycuda/

https://documen.tician.de/pycuda/

https://github.com/inducer/pycuda
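
A minimal sketch of the idioms listed above (SourceModule JIT compilation, GPUArray convenience, automatic error checking), adapted from the patterns in the PyCUDA tutorial; the doublify kernel and the 4x4 array are illustrative only:

import numpy as np
import pycuda.autoinit                      # initializes CUDA and creates a context on import
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

# JIT-compile a small CUDA kernel from source; CUDA errors surface as Python exceptions.
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y * 4;
    a[idx] *= 2;
}
""")

a = np.random.randn(4, 4).astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)            # driver-API allocation, released with the object (RAII)
cuda.memcpy_htod(a_gpu, a)
mod.get_function("doublify")(a_gpu, block=(4, 4, 1))

result = np.empty_like(a)
cuda.memcpy_dtoh(result, a_gpu)
assert np.allclose(result, 2 * a)

# The GPUArray abstraction expresses the same computation with far less ceremony.
assert np.allclose((2 * gpuarray.to_gpu(a)).get(), 2 * a)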

PyOpenCL

"PyOpenCL lets you access the OpenCL parallel computation API from Python.  It tries to offer computing goodness in the spirit of its sister project PyCUDA:
  • Object cleanup tied to lifetime of objects. This idiom, often called RAII in C++, makes it much easier to write correct, leak- and crash-free code.
  • Completeness. PyOpenCL puts the full power of OpenCL's API at your disposal, if you wish. Every obscure get_info() query and all CL calls are accessible.
  • Automatic Error Checking. All CL errors are automatically translated into Python exceptions.
  • Speed. PyOpenCL's base layer is written in C++, so all the niceties above are virtually free.
  • Helpful and complete Documentation as well as a Wiki.
  • Liberal license. PyOpenCL is open-source under the MIT license and free for commercial, academic, and private use.
  • Broad support. PyOpenCL was tested and works with Apple's, AMD's, and Nvidia's CL implementations.
To use PyOpenCL, you just need numpy and an OpenCL implementation. (See this howto for how to get one.)"

https://mathema.tician.de/software/pyopencl/

https://documen.tician.de/pyopencl/

https://mathema.tician.de/software/

https://github.com/inducer/pyopencl-feedstock
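
A minimal vector-addition sketch in the same spirit, modeled on the standard PyOpenCL demo; the kernel source and array sizes here are illustrative:

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()              # pick a platform/device (or set PYOPENCL_CTX)
queue = cl.CommandQueue(ctx)

a = cl_array.to_device(queue, np.random.rand(50000).astype(np.float32))
b = cl_array.to_device(queue, np.random.rand(50000).astype(np.float32))

# Build an OpenCL kernel from source; CL errors are raised as Python exceptions.
prg = cl.Program(ctx, """
__kernel void add(__global const float *a, __global const float *b, __global float *out)
{
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
""").build()

out = cl_array.empty_like(a)
prg.add(queue, a.shape, None, a.data, b.data, out.data)
assert np.allclose(out.get(), a.get() + b.get())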

Loo.py

"A code generator for array-based code on CPUs and GPUs.

Loopy lets you easily generate the tedious, complicated code that is necessary to get good performance out of GPUs and multi-core CPUs.

Loopy’s core idea is that a computation should be described simply and then transformed into a version that gets high performance. This transformation takes place under user control, from within Python.

It can capture the following types of optimizations:
  • Vector and multi-core parallelism in the OpenCL/CUDA model
  • Data layout transformations (structure of arrays to array of structures)
  • Loop unrolling
  • Loop tiling with efficient handling of boundary cases
  • Prefetching/copy optimizations
  • Instruction level parallelism
  • and many more
Loopy targets array-type computations, such as the following:
  • dense linear algebra,
  • convolutions,
  • n-body interactions,
  • PDE solvers, such as finite element, finite difference, and Fast-Multipole-type computations
It is not (and does not want to be) a general-purpose programming language."

https://github.com/inducer/loopy

https://mathema.tician.de/software/loopy/

https://documen.tician.de/loopy/

http://hgpu.org/?p=15782
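
A short sketch of the describe-then-transform workflow, closely following the Loopy tutorial; the doubling kernel and the 128-wide work-group split are arbitrary illustrative choices:

import numpy as np
import pyopencl as cl
import pyopencl.array
import loopy as lp

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Describe the computation: a polyhedral loop domain plus one instruction.
knl = lp.make_kernel(
    "{ [i]: 0 <= i < n }",
    "out[i] = 2 * a[i]")

# Transform it for the target: map i onto OpenCL work-groups and work-items.
knl = lp.split_iname(knl, "i", 128, outer_tag="g.0", inner_tag="l.0")

a = cl.array.to_device(queue, np.arange(1024, dtype=np.float32))
evt, (out,) = knl(queue, a=a)               # code generation, compilation, and execution
assert np.allclose(out.get(), 2 * a.get())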

clMathLibraries

clFFT

"clFFT is a software library containing FFT functions written in OpenCL. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and heterogeneous programming."

https://github.com/clMathLibraries/clFFT

clSPARSE

"An OpenCL™ library implementing Sparse linear algebra routines."

https://github.com/clMathLibraries/clSPARSE

http://hgpu.org/?p=15942

clBLAS

"A software library containing BLAS functions written in OpenCL."

https://github.com/clMathLibraries/clBLAS

clRNG

"An OpenCL based software library containing random number generation functions."

https://github.com/clMathLibraries/clRNG


gprm

"The Glasgow Parallel Reduction Machine, GPRM, is a task-based parallel programming framework.

In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks, their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks, rather than threads. We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, only by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called "Global Sharing", which improves performance in multiprogramming situations."

https://github.com/wimvanderbauwhede/gprm

http://hgpu.org/?p=16054

tccg

"The Tensor Transpose Compiler (TTC) generates high-performance parallel and vectorized C++ code for multidimensional tensor transpositions.

We present "GEMM-like Tensor-Tensor multiplication" (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned "macro-kernel". This macro-kernel operates on suitably prepared ("packed") sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open source "Tensor Contraction Code Generator" (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: While GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.4×. More precisely, GETT achieves speedups of up to 1.41× over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions."
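
This is not TCCG itself, just a NumPy sketch of the Transpose-Transpose-GEMM-Transpose (TTGT) approach that the abstract compares GETT against: permute and flatten the operands so the contraction becomes a single GEMM, then fold the result back. The index names and sizes are made up for illustration.

import numpy as np

# Contraction C[a,b,i,j] = sum_k A[i,k,a] * B[k,j,b]
A = np.random.rand(4, 6, 5)    # axes (i, k, a)
B = np.random.rand(6, 3, 7)    # axes (k, j, b)

# TTGT: transpose so the free indices are grouped and k lies on the GEMM axis.
A_mat = A.transpose(2, 0, 1).reshape(5 * 4, 6)        # rows (a, i), cols k
B_mat = B.transpose(0, 2, 1).reshape(6, 7 * 3)        # rows k, cols (b, j)
C_mat = A_mat @ B_mat                                 # one large matrix-matrix product
C = C_mat.reshape(5, 4, 7, 3).transpose(0, 2, 1, 3)   # fold back to axes (a, b, i, j)

assert np.allclose(C, np.einsum('ika,kjb->abij', A, B))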

https://github.com/HPAC/tccg

https://github.com/HPAC/TTC

https://arxiv.org/abs/1607.00145

https://arxiv.org/abs/1607.01249

https://arxiv.org/abs/1603.02297