~ Project ~

Below are the recommended topics and some related resources/papers.

1. Scheduling and Resource Allocation in Big Data Systems

R. Grandl et al., "Altruistic Scheduling in Multi-Resource Clusters", OSDI 2016.
R. Grandl et al., "Packing and Dependency-Aware Scheduling for Data-Parallel Clusters", OSDI 2016.
I. Gog et al., "Firmament: Fast, Centralized Cluster Scheduling at Scale", OSDI 2016.
J. Jiang et al., "Symbiosis: Network-Aware Task Scheduling in Data-Parallel Frameworks", Infocom 2016.
P. Delgado et al., "Job-Aware Scheduling in Eagle: Divide and Stick to Your Probes", ACM SoCC 2016.
Y. Yang et al., "TR-Spark: Transient Computing for Big Data Analytics", ACM SoCC 2016.

2. Wide-area/Geo-distributed Big Data Analytics

K. Kloudas et al., "Pixida: Optimizing Data Parallel Jobs in Wide-Area Data Analytics", VLDB 2015.
Vulimiri et al., "Global Analytics in the Face of Bandwidth and Regulatory Constraints", NSDI, 2015.
Vulimiri et al., "WANalytics: Analytics for a geo-distributed data-intensive world", CIDR, 2015.
Pu et al., "Low-Latency Analytics of Geo-Distributed Data in the Wide Area", SIGCOMM, 2015.
Viswanathan et al., "Clarinet: WAN-Aware Optimization for Analytics Queries", OSDI, 2016.

3. Distributed Systems for Deep Learning

Chilimbi et al., "Project Adam: Building an Efficient and Scalable Deep Learning Training System", OSDI, 2014.
Martin Abadi et al., "TensorFlow: A System for Large-Scale Machine Learning", OSDI 2016.
Tim Hunter, "TensorFrames -- Google TensorFlow on Apache Spark", Spark Meetup, June 2016.
CaffeOnSpark: http://yahoohadoop.tumblr.com/post/139916563586/caffeonspark-open-sourced-for-distributed-deep.
Distributed (Deep) Machine Learning Community (DMLC), https://github.com/dmlc
MXNet: Flexible and Efficient Library for Deep Learning, https://github.com/dmlc/mxnet
Deeplearning4J.org, "Running Deep Learning on Distributed GPUs with Spark," http://deeplearning4j.org/spark-gpus
Alex Chen et al., "Distributed Neural Networks with GPUs in the AWS Cloud", http://techblog.netflix.com/2014/02/distributed-neural-networks-with-gpus.html.

4. Distributed Machine Learning Platforms

F. Niu et al., "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," NIPS 2011.
J. Dean et al., "Large scale distributed deep networks", NIPS 2012.
Li et al., "Scaling Distributed Machine Learning with the Parameter Server", OSDI, 2014.
Li et al., "Communication Efficient Distributed Machine Learning with the Parameter Server", NIPS 2014.
Eric P. Xing et al., "Petuum: A new platform for Distributed Machine Learning on Big Data", IEEE Transactions on Big Data, 2015.
T. Chen et al., "MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems", NIPS Workshop on Machine Learning Systems (LearnSys), 2015.
MXNet: Flexible and Efficient Library for Deep Learning, https://github.com/dmlc/mxnet
J. E. Gonzalez et al., "Asynchronous Complex Analytics in a Distributed Dataflow Architecture", arXiv preprint arXiv:1510.07092 (2015).
X. Pan et al., "Cyclades: Conflict-free Asynchronous Machine Learning", NIPS 2016.

5. Stateful Dataflow

D. Murray, "Incremental, iterative data processing with timely dataflow", Communications of the ACM, 2016.
Murray et al., "Naiad: A Timely Dataflow System", SOSP, 2013.
P. Pietzuch et al., "Stateful Distributed Dataflow Graphs",
R. C. Fernandez et al., "Making state explicit for imperative big data processing". In USENIX ATC, 2014.
R. C. Fernandez et al., "Integrating scale out and fault tolerance in stream processing using operator state management". ACM SIGMOD 2013.

6. Stream Analytics

Lin et al., "StreamScope: Continuous Reliable Distributed Processing of Big Data Streams", NSDI, 2016.
Kulkarni et al., "Twitter Heron: Stream Processing at Scale", SIGMOD, 2015.
Toshniwal et al., "Storm @Twitter", SIGMOD, 2014.
Rabkin et al., "Aggregation and Degradation in JetStream: Streaming analytics in the wide area", NSDI, 2014.
T. Condie et al., "MapReduce Online", NSDI 2010.
A. Alexandrov et al., "The Stratosphere platform for Big Data Analytics", VLDB 2014.
R. C. Fernandez et al., "Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management". In SIGMOD, 2013.

7. SQL-based Big Data Systems

A. Floratou et al., "SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures", VLDB 2014.
Kornacker et al., "Impala: A Modern, Open-Source SQL Engine for Hadoop", CIDR 2015.
Huai et al., "Major technical advancements in Apache Hive", SIGMOD, 2014.
Armbrust et al., "Spark SQL: Relational Data Processing in Spark", SIGMOD, 2015.
L. Chang, "Presto: Interacting with petabytes of data at Facebook", blog post, 2013.
W. Alkowaileet et al., "Large-scale Complex Analytics on Semi-structured Datasets using AsterixDB and Spark", VLDB 2016.
A. Alexandrov et al., "Emma in Action: Declarative Dataflows for Scalable Data Analysis", SIGMOD 2016.

8. Systems for Big Graph Analytics

Carlos H.C. Teixeira et al., "Arabesque: A system for distributed graph mining", SOSP 2015.
Amitabha Roy et al., "Chaos: Scale-out Graph Processing from Secondary Storage", SOSP 2015.
Aneesh Sharma et al., "GraphJet: Real-Time Content Recommendations at Twitter", VLDB 2016.
D. Yan et al., "Big Graph Analytics Systems", SIGMOD 2016.
D. Yan et al., "A General-Purpose Query-Centric Framework for Querying Big Graphs", VLDB 2016.

9. Approximate Query Processing

S. Agarwal et al., "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data", EuroSys, 2013.
S. Agarwal et al., "Knowing when you're wrong: building fast and reliable approximate query processing systems," SIGMOD 2014.
R. Agarwal et al., "Succinct: Enabling Queries on Compressed Data", NSDI, 2015.
G. Ananthanarayanan et al., "GRASS: trimming stragglers in approximation analytics", NSDI 2014.

10. Monitoring and Diagnosis in Data-Center-Scale Computing

Mace et al., "Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems", SOSP, 2015.
M. Moshref et al., "Trumpet: Timely and Precise Triggers in Data Centers", SIGCOMM 2016.
R. Sambasivan et al., "Principled workflow-centric tracing of distributed systems", ACM SoCC 2016.
M. Leich, "Runtime Analysis of Distributed Data Processing Programs", VLDB 2014.
E. Coppa et al., "On Data Skewness, Stragglers and MapReduce Progress Indicators", ACM SoCC 2015.

11. Matrix Computations on Distributed Clusters

R. B. Zadeh et al., "Matrix Computations and Optimization in Apache Spark", KDD 2016.
A. Elgohary et al., "Compressed Linear Algebra for Large-scale Machine Learning", VLDB 2016.
M. Li et al., "Cuckoo Linear Algebra", KDD 2015.

12. Management for Data-Center Networks

A. Singh et al., "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network", SIGCOMM 2015.
R. Govindan et al., "Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure", SIGCOMM 2016.
Y. W. Sung et al., "Robotron: Top-down Network Management at Facebook Scale", SIGCOMM 2016.
M. Chow et al., "The Mystery Machine: End-to-end performance analysis of large-scale Internet services", OSDI 2014.

13. Traffic Flow Scheduling for Data Center Networks

M. Chowdhury, I. Stoica, "Managing data transfers in computer clusters with Orchestra", SIGCOMM 2011.
M. Chowdhury, I. Stoica, "Efficient coflow scheduling with Varys", SIGCOMM 2014.
F.R. Dogar et al., "Decentralized Task-Aware Scheduling for Data Center Networks", SIGCOMM 2014.
M. Chowdhury, I. Stoica, "Efficient Coflow Scheduling Without Prior Knowledge", SIGCOMM 2015.
Y. Zhao et al., "RAPIER: Integrating Routing and Scheduling for Coflow-aware Data Center Networks", Infocom 2016.
Y. Li et al., "Efficient online coflow routing and scheduling", ACM Mobihoc 2016.
H. Zhang et al., "CODA: Toward Automatically Identifying and Scheduling COflows in the DArk", ACM SIGCOMM 2016.
L. Chen et al., "Scheduling Mix-flows in Commodity Datacenters with Karuna", ACM SIGCOMM 2016.
P. Wang et al., "Expeditus: Congestion-Aware Load Balancing in Clos Data Center Networks", ACM SoCC 2016.

14. Performance Prediction for Large-scale Analytics

S. Venkataraman et al., "Ernest: Efficient Performance Prediction for Large-scale Advanced Analytics", NSDI 2016.
K. Ousterhout, "Re-architecting Spark for Performance Understandability", Spark Summit 2016 talk.
K. Ousterhout et al., "Making Sense of Performance in Data Analytics Frameworks", NSDI 2015.
D. Crankshaw et al., "The missing piece in complex analytics: low latency, scalable model management and serving with Velox", CIDR 2015.
N. J. Yadwadkar et al., "Wrangler: Predictable and Faster Jobs using Fewer Resources", ACM SoCC 2014.
E. Sparks et al., "Automating model search for large-scale machine learning", ACM SoCC 2015.
K. Rajan et al., "PerfOrator: eloquent performance models for Resource Optimization", ACM SoCC 2016.
N. J. Yadwadkar et al., "Katz: Faster Jobs in Distributed Data Processing using Multi-Task Learning", SDM 2015.

15. VM/Cloud Resource Management/Scheduling

W. Lang et al., "Not for the Timid: On the impact of Aggressive Over-booking in the Cloud", VLDB 2016.
C. Fuerst et al., "Kraken: Online and Elastic Resource Reservations for Multi-tenant Datacenters", Infocom 2016.
Z. Han et al., "Dynamic Virtual Machine Management via Approximate Markov Decision Process", Infocom 2016.
J. Mace et al., "2DFQ: Two-dimensional Fair Queueing for Multi-Tenant Cloud Services", SIGCOMM 2016.
J. Ghaderi, "Randomized Algorithms for Scheduling VMs in the Cloud", Infocom 2016.

16. Analyzing Dynamic/Time-evolving Graphs in Large Computing Clusters

I. Stoica, "Time-evolving Graph Processing on Commodity Clusters", Spark Summit 2016.
A. P. Iyer et al., "Time-evolving graph processing at scale", Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, ACM, 2016.
Z. Y. Dong, "A Framework for Computing on Large Dynamic Graphs", arXiv preprint arXiv:1512.01668 (2015).

17. Accelerator/GPU Spark Integration

Di Wu et al., "Deploying Accelerators at DataCenter Scale using Spark", Spark Summit 2016.
Y. Hu et al., "GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale", Spark Summit 2016.
Y. T. Chen et al., "Apache Spark Meets FPGAs: A case study for Next Generation DNA Sequencing Acceleration", HotCloud 2016.
M. Huang et al., "Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale", ACM SoCC 2016.

18. Systems for In-memory Big Data Management and Processing

H. Zhang et al., "In-memory big data management and processing: A survey", IEEE Transactions on Knowledge and Data Engineering 27.7 (2015): 1920-1948.

19. Machine Learning API/Toolkits for Large-scale Clusters

Kraska et al., "MLbase: A Distributed Machine-learning System", CIDR, 2013.
Sparks et al., "MLI: An API for Distributed Machine Learning", ICDM, 2013.
M. Boehm et al., "SystemML: Declarative Machine Learning on Spark", VLDB 2016.
Microsoft Distributed Machine Learning Toolkit (DMTK) http://www.dmtk.io/index.html.