# Purpose: Project description for 2013 UCI Calit2 SURF-IT program
# (Summer Undergraduate Research Fellowships in Information Technologies)
# Submitted to Stu Ross 20130228

(20130520: Notes on future SURF-IT ads:
1. Make first affiliation CS not ESS to get more CS applicants
2. Format ads so Windows line-breaking won't create widows/orphans
3. Prerequisites: Good programming skills at the level of third year computer science. Proficiency with C and a scripting language.)

Optimized Storage Shapes for Multi-dimensional Gridded Datasets

PI: Prof. Charlie Zender
Departments of Earth System Science and Computer Science

Description:
Many, if not most, geophysical datasets, such as climate simulations, are stored in a self-describing data format called netCDF. Data access speeds vary by factors of thousands, and depend primarily on how well the storage layout matches the hyperslab request. This project will improve understanding and parameterization of the optimal layout (i.e., the "chunking") to maximize fast access and minimize slow access to netCDF datasets.

Our netCDF Operators (NCO) are a widely used, open-source toolkit for manipulating and analyzing (statistics, trends, comparison with observations) netCDF data. NCO supports a range of chunking policies, but has no heuristic for guiding the user on optimal chunking. The student will first conduct sensitivity tests to benchmark access times for common hyperslab requests. Then the student will construct and implement new, optimal chunking policies.

The first few weeks would be devoted to literature review and to scripting benchmark tests to assess the dependence of wallclock time on data layout. The next few weeks would be analysis and hypothesis testing of generic chunking policies motivated by the benchmarking results. The last few weeks would be implementation and analysis of optimized chunking policies in NCO.

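To make the layout/request mismatch concrete, here is a minimal sketch (not part of NCO) that counts how many chunks a hyperslab touches and how many elements must be read per element requested. It assumes whole chunks are the unit of I/O and ignores caching, compression, and seek costs. The (time, lat, lon) = (365, 180, 360) grid, the three chunk shapes, and all function names are hypothetical, chosen only for illustration.

```python
from math import prod  # Python 3.8+


def chunks_touched(start, count, chunk_shape):
    """Number of chunks intersected by a hyperslab.

    start, count: per-dimension origin and extent of the request (elements).
    chunk_shape:  per-dimension chunk lengths (elements).
    """
    per_dim = []
    for s, c, k in zip(start, count, chunk_shape):
        first = s // k            # index of first chunk hit along this dimension
        last = (s + c - 1) // k   # index of last chunk hit along this dimension
        per_dim.append(last - first + 1)
    return prod(per_dim)


def read_amplification(start, count, chunk_shape):
    """Elements read (whole chunks) divided by elements actually requested."""
    elements_read = chunks_touched(start, count, chunk_shape) * prod(chunk_shape)
    return elements_read / prod(count)


# Hypothetical requests on a (time, lat, lon) = (365, 180, 360) variable.
requests = {
    "one global map": ((0, 0, 0), (1, 180, 360)),
    "one point time series": ((0, 90, 180), (365, 1, 1)),
}

# Hypothetical chunk shapes: map-oriented, series-oriented, and a compromise.
layouts = {
    "map chunks": (1, 180, 360),
    "series chunks": (365, 1, 1),
    "balanced chunks": (73, 36, 72),
}

for req_name, (start, count) in requests.items():
    for lay_name, shape in layouts.items():
        n = chunks_touched(start, count, shape)
        amp = read_amplification(start, count, shape)
        print(f"{req_name:22s} {lay_name:16s} chunks={n:6d} amplification={amp:g}")
```

Under these assumptions, each request is served by a single chunk when the layout matches it and by tens of thousands of chunks when it does not, while the compromise shape sits in between for both. Measured wallclock times would still need the benchmarking described above, since this count is only a proxy for I/O cost.
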
Prerequisites: Proficiency with C and multi-dimensional data

Outcomes: Skills and understanding of scientific data analysis, benchmarking, and interpretation of results.

Recommended Web sites and publications:
1. Chunking Data: Why it Matters
http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
2. Efficient Organization of Large Multidimensional Arrays
http://cs.brown.edu/courses/cs227/archives/2008/Papers/FileSystems/sarawagi94efficient.pdf
3. Optimal Chunking of Large Multidimensional Arrays for Data Warehousing
http://www.escholarship.org/uc/item/35201092
4. netCDF Operators
http://nco.sf.net
5. Zender, C. S., and H. J. Mangalam (2007), Scaling Properties of Common Statistical Operators for Gridded Datasets, Int. J. High Perform. Comput. Appl., 21(4), 485-498, doi:10.1177/1094342007083802.
6. Zender, C. S. (2008), Analysis of Self-describing Gridded Geoscience Data with netCDF Operators (NCO), Environ. Modell. Softw., 23(10), 1338-1342, doi:10.1016/j.envsoft.2008.03.004.

************************************************************************
NB: Following project not "researchy" enough, maybe next year
************************************************************************

Next Generation Parser for Structured Data Analysis

PI: Prof. Charlie Zender
Departments of Earth System Science and Computer Science

Description:
Many, if not most, geophysical datasets, such as climate simulations, are stored in a self-describing data format called netCDF. Our netCDF Operators (NCO) are a widely used, open-source toolkit for manipulating and analyzing (statistics, trends, comparison with observations) netCDF data. This project will utilize ANTLR (ANother Tool for Language Recognition) to generate the NCO language parser in C++. Our goals are two-fold:
1. To create an efficient, extensible parser for structured data analysis.
2. To enhance parallelism in geophysical data analysis involving structured data with storage constraints.

Prerequisites: Familiarity with C/C++ and data

Outcomes: Skills and understanding of scientific language construction, data analysis, open source software development and climate change.

Recommended Web sites and publications:
1. ANother Tool for Language Recognition
http://www.antlr.org
2. netCDF Operators
http://nco.sf.net
3. Zender, C. S., and H. J. Mangalam (2007), Scaling Properties of Common Statistical Operators for Gridded Datasets, Int. J. High Perform. Comput. Appl., 21(4), 485-498, doi:10.1177/1094342007083802.
4. Zender, C. S. (2008), Analysis of Self-describing Gridded Geoscience Data with netCDF Operators (NCO), Environ. Modell. Softw., 23(10), 1338-1342, doi:10.1016/j.envsoft.2008.03.004.