# Purpose: Project description for 2014 UCI Calit2 Surf-IT program # (Summer Undergraduate Research Fellowships in Information Technologies) # 20140416: Submitted to Brittany Gray cat > ${HOME}/sdn_ugr_03_adv.txt << 'EOF' Optimized Storage Shapes for Multi-dimensional Gridded Datasets PI: Prof. Charlie Zender Departments of Computer Science and of Earth System Science Description: Many if not most geophysical datasets such as climate simulations are stored in a self-describing data format called netCDF. Data access speeds vary by factors of thousands, and depends primarily upon how well their storage layout matches the hyperslab request. This project will improve understanding and parameterization of the optimal layout (i.e., the "chunking") to maximize fast access and minimize slow access to netCDF datasets. Our netCDF Operators (NCO) are a widely used, opensource toolkit for manipulating and analyzing (statistics, trends, comparison with observations) netCDF data. NCO supports a range of chunking policies, yet has no heuristic for guiding the user on optimal chunking. The student will first conduct sensitivity tests to benchmark access times for common hyperslab requests. Then the student will construct and implement new, optimal chunking policies. The goal is for the student to systematically "discover" the optimal chunking algorithm for a given multi-dimensional array shape. The first few weeks would be devoted to literature review and to scripting benchmark tests to assess the dependence of wallclock time on data layout. The next few weeks would be analysis and hypothesis testing of generic chunking policies motivated by the benchmarking results. The last few weeks would be implementation and analysis of optimized chunking policies in NCO. Prerequisites: Good programming skills at the level of third year computer science. Proficiency with C and a scripting language. Outcomes: Skills and understanding of scientific data analysis, benchmarking, interpretation of results. Recommended Web sites and publications: 1. Chunking Data: Why it Matters http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters 2. Efficient Organization of Large Multidimensional Arrays http://cs.brown.edu/courses/cs227/archives/2008/Papers/FileSystems/sarawagi94efficient.pdf 3. Optimal Chunking of Large Multidimensional Arrays for Data Warehousing http://www.escholarship.org/uc/item/35201092 4. netCDF Operators http://nco.sf.net 5. Zender, C. S., and H. J. Mangalam (2007), Scaling Properties of Common Statistical Operators for Gridded Datasets, Int. J. High Perform. Comput. Appl., 21(4), 485-498, doi:10.1177/1094342007083802. 6. Zender, C. S. (2008), Analysis of Self-describing Gridded Geoscience Data with netCDF Operators (NCO), Environ. Modell. Softw., 23(10), 1338-1342, doi:10.1016/j.envsoft.2008.03.004. EOF scp ${HOME}/sdn_ugr_03_adv.txt dust.ess.uci.edu:/var/www/html/hire URL: http://dust.ess.uci.edu/hire/sdn_ugr_03_adv.txt ************************************************************************ NB: Following project not "researchy" enough, maybe next year ************************************************************************ cat > ${HOME}/sdn_ugr_02_adv.txt << 'EOF' Next Generation Parser for Structured Data Analysis PI: Prof. Charlie Zender Departments of Earth System Science and Computer Science Description: Many if not most geophysical datasets such as climate simulations are stored in a self-describing data format called netCDF. Our netCDF Operators (NCO) are a widely used, opensource toolkit for manipulating and analyzing (statistics, trends, comparison with observations) netCDF data. This project will utilize ANTLR (ANother Tool for Language Recognition) to generate the NCO language parser in C++. Our goals are two-fold: 1. To create and efficient, extensible parser for structured data analysis. 2. To enhance parallelism in geophysical data analysis involving structured data with storage constraints. Prerequisites: Familiarity with C/C++ and data Outcomes: Skills and understanding of scientific language construction, data analysis, open source software development and climate change. Recommended Web sites and publications: 1. ANother Tool for Language Recognition http://www.antlr.org 2. netCDF Operators http://nco.sf.net 3. Zender, C. S., and H. J. Mangalam (2007), Scaling Properties of Common Statistical Operators for Gridded Datasets, Int. J. High Perform. Comput. Appl., 21(4), 485-498, doi:10.1177/1094342007083802. 4. Zender, C. S. (2008), Analysis of Self-describing Gridded Geoscience Data with netCDF Operators (NCO), Environ. Modell. Softw., 23(10), 1338-1342, doi:10.1016/j.envsoft.2008.03.004. EOF scp ${HOME}/sdn_ugr_02_adv.txt dust.ess.uci.edu:/var/www/html/hire URL: http://dust.ess.uci.edu/hire/sdn_ugr_02_adv.txt