Current: Data Science and Systems

Program Requirements

All EECS MEng students should expect to complete four (4) technical courses within the EECS department at the graduate level, the Fung Institute's engineering leadership curriculum, as well as a capstone project that will be hosted by the EECS department. You must select a project from the list here.

2017-2018 Capstone Projects

For the Master of Engineering capstone projects in Electrical Engineering and Computer Science (EECS), our department believes that students will have a significantly better experience if an EECS professor follows each project closely throughout the academic year. To ensure this, we have asked the faculty in each area for which the Master of Engineering is offered in our department to formulate one or more project ideas from which incoming students will choose.

Project 1

Title - Search Engine for Software (advisor Prof. Marti Hearst)

Description - The web has transformed many aspects of how software is written and maintained. For myriad tasks, free open source packages are available. However, the open sharing model has been so successful that developers are now inundated with choices when deciding which package to use for a task. Web search engines do not support the task of searching for and comparing multiple packages well. One aspect of a software package that is especially difficult to determine from web search is the degree to which the software community supports it, via tutorials, answers to questions on discussion forums and Q&A sites, and keeping code up to date. It’s also difficult to determine when one package has gone out of favor and a new one is in ascendance. This project will create a crawler, indexer, ranker, and search user interface that displays up-to-date and comprehensive information about software packages, including the social systems surrounding them. This requires indexing and displaying up-to-date information and excellent user interface design. Multiple teams can work on different aspects of this project.

Project 2

Title - Generating Explanations of Code Snippets (advisor Prof. Marti Hearst)

Description - This project extends the Tutorons project (see http://tutorons.com).

Tutorons automatically detect and explain code embedded in tutorial pages. They make context-relevant descriptions of the purpose and syntax of code available from a tooltip with a single click. They generate on-demand natural language explanations, usage examples, and visualizations to explain confusing code that a programmer has found online.
This project will extend the Tutorons project by addressing the following questions: 
  • Generating explainers: Can we teach a machine to explain code by giving it example explanations?
  • Evaluation: How can we design genuinely helpful, information-rich explanations for searchers?
  • Explaining larger programming languages: Do the same patterns work?
  • How can we engage Tutorons in the larger programming ecosystem?

Project 3

Title - Text Analytics for Congressional Records (advisor Prof. Laurent El Ghaoui)

Description - This project seeks to provide real-time summaries of congressional records in order to increase the transparency of government. The technologies involved include natural language processing and large-scale machine learning.

Project 4

Title - Representation Learning for Text (advisor Prof. Laurent El Ghaoui)

Description - Representation learning aims at automatically learning the important features needed to represent text collections in numerical form for text analytics processing. The project will explore the state of the art and apply the ideas to several data sets, including, but not limited to, Twitter data, a set of 22,000 letters from the 19th century, congressional records, financial news, and flight safety reports.

Project 5

Title - Optimization of a Wind Farm (advisor Prof. Laurent El Ghaoui)

Description - These teams will study the optimization of a wind farm for better efficiency. It is estimated that with better design and control, production costs can be improved substantially, by about 2-4%. The first step will be to build a model for estimating production as a function of weather parameters, with an approach that builds a "surrogate" (approximate) model on top of a complex 3-D computational fluid dynamics engine. The second step is to investigate the use of the surrogate model to optimize a number of wind turbine parameters. This work will be done in collaboration with a large electricity company, in the context of a real-life system.

Project 6

Title - Intelligent Collaborative Radio Networks (advisors Prof. Anant Sahai & Prof. John Wawrzynek)

Description - The next generation of radio systems is going to be agile, intelligent, self-configuring, and collaborative. This project is in conjunction with the DARPA Spectrum Challenge, and we will have multiple teams working on different aspects of a new software-defined radio system featuring collaborative intelligence. Team members from different backgrounds are welcome, ranging from FPGA-targeted digital design to networking, signal processing, human/computer interaction, machine learning, and game theory.

Project 7

Title - Capturing Data Science Lineage (advisor Prof. Joe Hellerstein)

Description - In this project, students will harvest "data exhaust" (behavioral metadata) from data science workflows, and record that metadata in the Ground context service. Goals include data provenance attestation and experimental reproducibility. Students will work with data science and analysis tools like Spark, Jupyter Notebooks, Trifacta, or Tableau.

Project 8

Title - Benchmarking for De Novo Genome Assembly (advisor Prof. Tom Courtade)

Description - Long-read assembly technologies are paving the way for de novo assembly of reference-quality genomes. Several assemblers have been released over the past year, but their respective advantages and disadvantages are largely unquantified. This project will develop a benchmarking dataset on which available assemblers will be evaluated, thus allowing for apples-to-apples comparisons. No prior knowledge of DNA assembly is required, but programming experience and a willingness to work with real datasets and beta tools are needed.

Project 9

Title - Automatic Data Exploration (advisor Prof. Dawn Song)

Description - The explosive growth of data generation and collection has created exciting new opportunities for data analysis and exploration. However, existing information retrieval and analysis tools are only effective when one has at least a general idea of what counts as "interesting". The goal of this project is to apply deep learning and reinforcement learning techniques to automatically guide the user towards interesting and previously unexplored aspects of the data. The framework created by this project will receive one or more datasets and, through an interactive process, will guide the user towards new and relevant insights. Another key element of the framework is its ability to perform transfer learning, that is, applying insights learned from one dataset to another. Over the course of this project we will use deep learning tools such as TensorFlow, data structuring and analysis tools such as Pandas, and various visualization tools.

Project 10

Title - Large-Scale Video Retrieval (advisor Prof. Gerald Friedland)

Description - Students will work in teams to create different search engine prototypes using the Multimedia Commons infrastructure, which is built around 100M images and 1M videos on Amazon EC2. They will build and document systems that can be used as middleware for integration with other data sources.

Project 11

Title - Data Science for Analysis of Large Scale Mobility Patterns (advisor Prof. Alexandre Bayen)

Description - The explosion of smartphones has led the majority of motorists to drive "under the influence of apps". The proliferation of navigational apps and services designed to help drivers avoid traffic congestion has shifted the types of routes people take. Increased traffic in residential areas has provoked community backlash against the apps and the companies that provide them. Cities are now pushing back against these apps and trying to counter the new types of traffic jams they cause. This project will use machine learning and data analytics to model these phenomena and propose new solutions, in the context of connected and automated vehicles.

Project 12

Title - Scaling up Deep Learning and Reinforcement Learning (advisor Prof. John Canny)

Description - This project is about scaling deep learning and reinforcement learning on clusters of computers. Current-generation distributed learning systems use shared parameter stores, which are often a bottleneck when optimizing complex models. They also have poor error tolerance and a limited repertoire of distributed updates. Next-generation systems use a "shared nothing" design, which optimizes throughput, provides cheap error tolerance, and supports a much more general set of distributed updates. This design opens the door to richer and more efficient distributed optimization strategies, including Monte Carlo and tree search. This project is designed for a team of four students to extend our current work on shared-nothing distributed ML, focusing on error tolerance and distributed Monte Carlo. We also hope to establish new performance benchmarks for a number of deep learning and reinforcement learning problems. The project includes regular collaboration with one or more industry partners.

Project 13

Title - Interactive Machine Learning and Visual Explanations (advisor Prof. John Canny)

Description - Machine Learning (ML) is the method of first resort for many challenges in computing and data science. But there is often a gap between users' conceptual models of ML system behavior and reality. This is particularly acute for deep neural models (DL), whose structure often has no relation to actual or perceived structure in the problem domain. This project is for a team of four students to extend our work on *interactive machine learning tools* and their application in ML and DL. Interactive modeling allows users to manipulate models in real time, in particular to see the effects of hyperparameter choices on models as they are being trained. This exploration helps users gain an understanding of the optimization process, and allows them to better align model performance with their needs. It also supports "visual explanations" where, under appropriate conditions, output behaviors can be attributed to patterns of activity in the network. So far, visual explanations have been applied to convolutional networks working on images. In the future we would like to extend the approach to more general networks.

Technical Courses

At least three of your four technical courses should be chosen from the list below. The remaining technical courses should be chosen from your own or another MEng area of concentration within the EECS Department.

Fall 2017

Spring 2018 (updated as of 10/20/17)

Note: The courses listed here are not guaranteed to be offered, and the course schedule may change without notice. Refer to the UC Berkeley Course Schedule for further enrollment information.