Python statistics and analytics

I came across two links on Python for statistics this morning, which prompted me to share them along with the libraries I know of for doing analytics in Python.

First of all, the ones that inspired the post:

Statistical analysis made easy in Python with SciPy and pandas DataFrames. This post is a neat summary of how to work concisely with multiple tables of sampling results using tools that are readily available.  No more reading in CSV files line by line, doing your own value conversions, and then writing statistical routines that you later have to optimize.  This and the library mentioned in the comments will get you much further. Both incorporate and extend NumPy and SciPy, so you can write your own optimized routines using components from these libraries.
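To give a flavor of that workflow, here is a minimal sketch rather than the linked post's own code. It assumes two small, made-up tables (sites and samples) held in memory as CSV text; the column names and the `load_and_compare` helper are hypothetical. It reads both tables with pandas, joins them, summarizes by group, and runs a Welch t-test from SciPy.

```python
import io

import pandas as pd
from scipy import stats

# Hypothetical data: one table describing sites, one holding sampling results.
SITES_CSV = """site_id,habitat
1,forest
2,forest
3,meadow
4,meadow
"""

SAMPLES_CSV = """site_id,value
1,10.0
1,12.0
2,11.0
2,13.0
3,20.0
3,22.0
4,21.0
4,23.0
"""


def load_and_compare(sites_csv, samples_csv):
    """Join two tables, summarize by habitat, and compare the two groups."""
    # read_csv handles parsing and type conversion for us.
    sites = pd.read_csv(io.StringIO(sites_csv))
    samples = pd.read_csv(io.StringIO(samples_csv))

    # Attach the habitat label to every sample row.
    merged = samples.merge(sites, on="site_id", how="inner")

    # Per-habitat count, mean, and sample standard deviation.
    summary = merged.groupby("habitat")["value"].agg(["count", "mean", "std"])

    forest = merged.loc[merged["habitat"] == "forest", "value"]
    meadow = merged.loc[merged["habitat"] == "meadow", "value"]

    # Welch's t-test (does not assume equal variances).
    t_stat, p_value = stats.ttest_ind(forest, meadow, equal_var=False)
    return summary, t_stat, p_value


summary, t_stat, p_value = load_and_compare(SITES_CSV, SAMPLES_CSV)
print(summary)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

With real data you would pass file paths to `pd.read_csv` instead of in-memory strings; the rest of the sketch would stay the same.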

Finally, since this is a spatial blog, I want to point out Luc Anselin and ASU’s excellent routines for spatial analysis, PySAL.  These routines provide clustering, smoothing, computational geometry, spatial weighting, econometrics, and dynamics functions. The computational geometry routines are of a different class than those in Shapely, but that is not to dismiss Shapely, which is an excellent library for handling the kinds of transformations and spatial joins common in everyday analysis.

As for what Geoanalytics uses in its core functionality, we have:

  • Shapely – computational geometry.
  • Numpy – numerical computation over rasters and irregular grids, paletting.
  • Scipy – interpolation, image processing.
  • Numexpr – high speed bulk arithmetic on arrays.
  • Pandas – paired with OGR, an array interface to feature collections.
  • GEOS – topology and transformation library inspired by the JTS.
  • GDAL/OGR – file format conversions, spatial indexing.
  • Spatialite3 – Spatial indexing.
  • PostGIS – GIS in Postgres, used by the GeoDjango ORM layer.
  • GeoDjango – Geographic extensions to Django.
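
As an illustration of the NumPy and SciPy entries above (numerical work over irregular grids, plus interpolation), the following sketch resamples a handful of scattered points onto a regular raster using `scipy.interpolate.griddata`. It is not Geoanalytics code; the sample points, the planar test surface, and the grid size are all assumptions chosen so the result is easy to check.

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical irregularly placed sample locations (x, y).
points = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.5, 0.5],
])

# Made-up measurements lying on the plane z = x + 2y.
values = points[:, 0] + 2.0 * points[:, 1]

# Regular 5x5 grid strictly inside the sampled area.
grid_x, grid_y = np.meshgrid(
    np.linspace(0.1, 0.9, 5),
    np.linspace(0.1, 0.9, 5),
)

# Piecewise-linear interpolation over a triangulation of the sample points.
raster = griddata(points, values, (grid_x, grid_y), method="linear")

print(raster.shape)
print(np.round(raster, 2))
```

Because the test surface is a plane, linear interpolation reproduces it up to floating-point error, which makes this a handy sanity check before switching to real measurements. Grid cells outside the convex hull of the samples would come back as NaN with `method="linear"`.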

Scientists, however, write analytics in a wide range of languages and frameworks, including Fortran, GRASS, C, and Java.  Because of this, Geoanalytics is decoupled from the underlying analytics platform, giving scientists the independence they need to incorporate their existing models and analytics.  Geoanalytics instead focuses on providing source data, publishing results in common, easily accessible formats, executing computations on marshalled computational resources, and providing the framework for new geographic analytics software written in Python.