Archana Ganapathi's research while affiliated with Splunk and other places

Publications (26)

Conference Paper
Data quality issues pose a significant barrier to operationalizing big data. They pertain to the meaning of the data, the consistency of that meaning, the human interpretation of results, and the contexts in which the results are used. Data quality issues arise after organizations have moved past clear-cut technical solutions to early bottlenecks i...
Conference Paper
Data exploration is largely manual and labor intensive. Although there are various tools and statistical techniques that can be applied to data sets, there is little help to identify what questions to ask of a data set, let alone what domain knowledge is useful in answering the questions. In this paper, we study user queries against production data...
Conference Paper
Organizations are constantly overwhelmed with challenges of big data management resulting from data floods produced by various computer systems and components. Mining this data deluge can produce invaluable insights for the business and for operations. However, the organization must first tackle issues of efficient data storage and retrieval. In t...
Article
Some of the significant advancements and challenges faced in log analysis are discussed. The content and format of logs can vary widely from one system to another and among components within a system. The simplest and most common use for a debug log is to grep for a specific message. Even so, it is often difficult to figure out what to search for...
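As a minimal illustration of that grep-style search (a hypothetical sketch: the log file name and error pattern below are assumptions, not from the paper):

```python
import re

def grep_log(path, pattern):
    """Yield (line number, line) for log lines matching a pattern."""
    matcher = re.compile(pattern)
    with open(path) as log:
        for lineno, line in enumerate(log, 1):
            if matcher.search(line):
                yield lineno, line.rstrip()

# Hypothetical usage: find timeout errors in an application's debug log.
for lineno, line in grep_log("app.log", r"ERROR .*timeout"):
    print(f"{lineno}: {line}")
```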
Article
Full-text available
Web Analytics has become a critical component of many business decisions. With an ever-growing number of transactions happening through web interfaces, the ability to understand and introspect web site activity is critical. In this paper, we describe the importance and intricacies of summarization for analytics and report generation on web log...
Conference Paper
Full-text available
MapReduce systems face enormous challenges due to increasing growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. In this paper, we buil...
Article
Full-text available
Modern data systems comprise heterogeneous and distributed components, making them difficult to manage piece-wise, let alone as a whole. Furthermore, the scale, complexity, and growth rate of these systems render heuristic and rule-based system management approaches insufficient. In response to these challenges, statistics-based techniques f...
Conference Paper
Most modern systems generate abundant and diverse log data. With dwindling storage costs, there are fewer reasons to summarize or discard data. However, the lack of tools to efficiently store and cross-correlate heterogeneous datasets makes it tedious to mine the data for analytic insights. In this paper, we present Splunk, a semi-structured time s...
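As a toy sketch of the general idea of a time-bucketed index over semi-structured events (an illustration of the concept only, not Splunk's actual data structures or API):

```python
from collections import defaultdict

class EventIndex:
    """Toy time-series index: events are bucketed by hour, and key=value
    fields are extracted so heterogeneous events can be cross-correlated."""

    def __init__(self):
        self.buckets = defaultdict(list)  # hour number -> [event, ...]

    def add(self, timestamp, raw):
        # Keep the raw event intact; pull out key=value pairs for filtering.
        fields = dict(kv.split("=", 1) for kv in raw.split() if "=" in kv)
        self.buckets[int(timestamp) // 3600].append(
            {"ts": timestamp, "raw": raw, **fields})

    def search(self, start, end, **criteria):
        # Only buckets overlapping [start, end] are scanned.
        for hour in range(int(start) // 3600, int(end) // 3600 + 1):
            for event in self.buckets[hour]:
                if start <= event["ts"] <= end and all(
                        event.get(k) == v for k, v in criteria.items()):
                    yield event

idx = EventIndex()
idx.add(3600.0, "status=500 host=web01 msg=timeout")
print(list(idx.search(0, 7200, status="500")))
```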
Conference Paper
Full-text available
Compression enables us to shift resource bottlenecks between IO and CPU. In modern datacenters, where energy efficiency is a growing concern, the benefits of using compression have not been completely exploited. As MapReduce represents a common computation framework for Internet datacenters, we develop a decision algorithm that helps MapReduce user...
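The paper's decision algorithm is not reproduced here; as a hedged sketch, the underlying trade-off can be framed as comparing the I/O time saved against the CPU time spent (the rates, ratio, and break-even rule below are illustrative assumptions):

```python
def should_compress(data_bytes, io_bandwidth, cpu_compress_rate, ratio):
    """Compress if the I/O time saved exceeds the CPU time spent.

    io_bandwidth      -- bytes/sec the disk or network can move
    cpu_compress_rate -- bytes/sec the CPU can compress
    ratio             -- compressed size / original size (e.g. 0.4)
    """
    io_time_saved = data_bytes * (1 - ratio) / io_bandwidth
    cpu_time_spent = data_bytes / cpu_compress_rate
    return io_time_saved > cpu_time_spent

# Illustrative numbers: 1 GB map output, 100 MB/s disk, 400 MB/s codec, 40% ratio.
print(should_compress(1e9, 100e6, 400e6, 0.4))  # True: the I/O-bound job benefits
```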
Conference Paper
As data warehousing technology gains a ubiquitous presence in business today, companies are becoming increasingly reliant upon the information contained in their data warehouses to inform their operational decisions. This information, known as business intelligence (BI), traditionally has taken the form of nightly or monthly reports and batched ana...
Conference Paper
Full-text available
A recent trend for data-intensive computations is to use pay-as-you-go execution environments that scale transparently to the user. However, providers of such environments must tackle the challenge of configuring their system to provide maximal performance while minimizing the cost of resources used. In this paper, we use statistical models to pred...
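As a hedged sketch of statistical performance prediction in this spirit (not the authors' actual model, features, or training data), a simple regression over workload features might look like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: per-job features (input bytes, map tasks,
# reduce tasks) paired with observed runtimes in seconds.
X = np.array([[1e9, 100, 10], [5e9, 500, 50], [2e9, 200, 20], [8e9, 800, 64]])
y = np.array([120.0, 540.0, 230.0, 900.0])

model = LinearRegression().fit(X, y)

# Predict the runtime of an unseen job before submitting it.
print(f"predicted runtime: {model.predict(np.array([[4e9, 400, 32]]))[0]:.0f}s")
```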
Conference Paper
Full-text available
One of the most challenging aspects of managing a very large data warehouse is identifying how queries will behave before they start executing. Yet knowing their performance characteristics - their runtimes and resource usage - can solve two important problems. First, every database vendor struggles with managing unexpectedly long-running queries....
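A hedged illustration of the first problem, flagging queries likely to run long before execution; the plan features, labels, and classifier below are assumptions for illustration, not the paper's method:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical optimizer-plan features: (estimated rows, join count, has_sort).
X = [[1e3, 1, 0], [1e8, 5, 1], [5e4, 2, 0], [2e9, 8, 1], [1e5, 3, 1]]
y = [0, 1, 0, 1, 0]  # 1 = long-running (hours), 0 = short (seconds to minutes)

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([[5e8, 6, 1]]))  # likely flagged as long-running
```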
Conference Paper
Full-text available
Multicore architectures have become so complex and diverse that there is no obvious path to achieving good performance. Hundreds of code transformations, compiler flags, architectural features and optimization parameters result in a search space that can take many machine-months to explore exhaustively. Inspired by successes in the systems com...
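The scale of such a search space can be illustrated with a toy random search over configurations; the flag names, value ranges, and benchmark stub below are hypothetical:

```python
import itertools
import random

FLAGS = {
    "unroll":    [1, 2, 4, 8],
    "tile":      [16, 32, 64],
    "vectorize": [False, True],
    "prefetch":  [False, True],
}

def benchmark(config):
    # Stand-in for compiling and timing a kernel built with this config.
    return random.uniform(1.0, 10.0)  # lower is better (seconds)

# This toy space has only 4*3*2*2 = 48 points, but real spaces hold millions;
# sampling trades exhaustive coverage for tractable search time.
candidates = random.sample(list(itertools.product(*FLAGS.values())), 20)
best = min((dict(zip(FLAGS, vals)) for vals in candidates), key=benchmark)
print("best config found:", best)
```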
Article
We explore how to manage database workloads that contain a mixture of OLTP-like queries that run for milliseconds as well as business intelligence queries and maintenance tasks that last for hours. As data warehouses grow in size to petabytes and complex analytic queries play a greater role in day-to-day business operations, factor...
Conference Paper
PC users have started viewing crashes as a fact of life rather than a problem. To improve operating system dependability, systems designers and programmers must analyze and understand failure data. In this paper, we analyze Windows XP kernel crash data collected from a population of volunteers who contribute to the Berkeley Open Infrastructure for...
Conference Paper
Reliability is a rapidly growing concern in the contemporary personal computer (PC) industry, both for computer users and for product developers. To improve dependability, systems designers and programmers must consider failure and usage data for operating systems as well as applications. In this paper, we discuss our experience with crash and usage...
Conference Paper
Full-text available
Software configuration problems are a major source of failures in computer systems. In this paper, we present a new framework for categorizing configuration problems. We apply this categorization to Windows registry-related problems obtained from various internal as well as external sources. Although infrequent, registry-related problems are diffic...
Conference Paper
We describe the architecture, operational practices, and failure characteristics of three very large-scale Internet services. Our research on architecture and operational practices took the form of interviews with architects and operations staff at those (and several other) services. Our research on component and service failure took the form of ex...
Article
We present a preliminary exploration of application crash events in Win32 systems by automatic clustering. The crash events, collected from workstations in the UC Berkeley electrical engineering and computer science department, identify the crashed application and DLL, and the time of the crash event. We use these identifying features to augment th...
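A hedged sketch of clustering crash events on their identifying features; the records, encoding, and choice of k-means below are illustrative, not the paper's data or algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical crash records: (application, faulting DLL, hour of day).
crashes = [("iexplore", "mshtml.dll", 9), ("iexplore", "mshtml.dll", 14),
           ("winword",  "mso.dll",   10), ("iexplore", "shdocvw.dll", 9),
           ("winword",  "mso.dll",   16)]

# One-hot encode the categorical identifiers by hand; scale the hour to [0, 1].
apps = sorted({c[0] for c in crashes})
dlls = sorted({c[1] for c in crashes})
features = np.array([[float(app == a) for a in apps] +
                     [float(dll == d) for d in dlls] +
                     [hour / 24.0]
                     for app, dll, hour in crashes])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)
print(labels)  # crashes sharing an app/DLL pattern land in the same cluster
```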
Article
Failure data analysis is an essential component of computer systems research. Unfortunately, academic researchers are often hindered from making progress on their research agenda due to the lack of publicly-available real-world failure data. To address the gap between data abundance in industry and data scarcity in academia, we propose several tool...
Article
Workload characterization and generation are essential tools to assist in building and maintaining web services. We discuss a framework that allows us to take advantage of trace data, process it using Machine Learning algorithms, and generate workload that produces specific effects on the target system. We performed clustering analysis to character...
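A hedged sketch of the generation half of such a framework, drawing synthetic requests from a cluster summary; every parameter below is invented for illustration:

```python
import random

# Hypothetical output of the characterization step: each cluster records a
# weight, an operation mix, and a mean inter-arrival gap in seconds.
clusters = [
    {"weight": 0.7, "ops": {"read": 0.9, "write": 0.1}, "mean_gap": 0.05},
    {"weight": 0.3, "ops": {"read": 0.2, "write": 0.8}, "mean_gap": 0.50},
]

def generate(n):
    """Yield n synthetic (delay, operation) pairs drawn from the cluster model."""
    weights = [c["weight"] for c in clusters]
    for _ in range(n):
        c = random.choices(clusters, weights=weights)[0]
        op = random.choices(list(c["ops"]), weights=list(c["ops"].values()))[0]
        yield random.expovariate(1 / c["mean_gap"]), op

for delay, op in generate(5):
    print(f"wait {delay:.3f}s, then issue {op}")
```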
Article
…predicament. Given the diverse need to improve availability in Internet Services (Oppenheimer & Patterson), it is valuable to develop a language that helps users specify dependability requirements as well as failure models of system components including hardware, software, networking and human operators. The long-term goal would be to develop compi...
Article
Computer systems research is awash in data, but we do not reap the fullest possible benefit. A variety of data issues, such as proprietary data sets, privacy concerns, uncertain data provenance, priority questions, incompatible formats, and simply lost data make it difficult to build on experiments performed by previous systems researchers. None o...
Article
Reliability is a rapidly growing concern in the contemporary personal computer (PC) industry, both for computer users and for product developers. To improve dependability, systems designers and programmers must consider failure and usage data for operating systems as well as applications. In this paper, we analyze crash data from Windows ma...

Citations

... Techniques have been proposed to identify and resolve data quality issues across multiple data sources, such as missing or inconsistent values [4]. Data quality issues pose a significant barrier to operationalizing big data and can lead to uncertainty and disruptions if not appropriately addressed [5]. ...
... Meanwhile, research communities in Visual Analytics and Human-Computer Interaction have investigated new techniques and systems that can help end-users build ML models [18,48,54]. Such approaches are largely based on our theoretical understanding of how people explore information and gain insights using interactive systems, including the sense-making loop [42], information foraging [41], and characterizations of tasks and queries in Exploratory Visual Analysis [4,12,49]. Understanding how users employ AutoML systems in exploratory model building (constructing models, understanding the resulting models, and recognizing the implications of their model choices) is an important topic in visual analytics, because the way a user builds an ML model is highly exploratory and can be directly supported by visual interfaces [16,31]. Gaining a deeper understanding of how users think with AutoML systems, and where things may fail, is crucial for determining how we can better support their ML model building process with visual analytic solutions. ...
... In fact, although real-world applications usually prefer row stores for transactions given the relational model, they may desire another specialized data format for analytical queries. For instance, complementary to column store, databases with novel data models are increasingly popular for analytical workloads, e.g., graph databases [15,33], spatial databases [76,173], and time series databases [24,140]. To absorb the advancements of these new databases, a well-customized HTAP system demands a re-design of data synchronization approaches, delta buffer, meta-data, execution engine, and query optimizer. ...
... Schema-on-read (Deutsch, 2013), (Mendelevitch, 2013), (Jacobsohn and Delurey, 2014) is an agile approach to data storage and retrieval that defers investments in data organization until production queries need to be run by working with data directly in native form. Schema-on-read functions have been implemented in a wide range of analytical systems including Hadoop (Hadoop Team, 2015), (Schau, 2015), Splunk (Bitincka et al., 2012), Apache Spark (Spark Team, 2015), Apache Flink (Markl, 2014), and even relational databases (Liu and Gawlick, 2015). It is also possible to use machine learning tools to extract schemas from source data (Yeh et al., 2013). ...
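Schema-on-read can be illustrated with a toy example in which raw lines are stored untouched and fields are extracted only when a query runs; the log format and field names below are assumptions:

```python
import re

# Raw events are stored as-is; no schema is imposed at write time.
raw_events = [
    "2015-03-01T10:00:02 status=200 path=/index bytes=512",
    "2015-03-01T10:00:05 status=404 path=/missing bytes=87",
]

def read_with_schema(lines, pattern):
    """Apply a 'schema' at read time by extracting named regex groups."""
    rx = re.compile(pattern)
    return [m.groupdict() for line in lines if (m := rx.search(line))]

# The schema lives in the query, not in the storage layer.
rows = read_with_schema(raw_events,
                        r"status=(?P<status>\d+) .*bytes=(?P<bytes>\d+)")
print(rows)  # [{'status': '200', 'bytes': '512'}, {'status': '404', 'bytes': '87'}]
```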
... New tools and systems are now possible that would help researchers share tools and results and validate previous results. Suggestions such as Dataforge [25], paralleling Sourceforge, and the use of cloud computing [26] are very good examples of approaches that would help researchers in experimental algorithmics and systems research. ...
... A large number of approaches have been proposed to isolate critical kernel code from less critical kernel extensions, as existing extension mechanisms were found to threaten system stability in the case of misbehaving extensions [44], [45], [46], [47], [48], [49]. Similar to privilege separation, most work in this field has focused on how to establish isolation between the kernel and its extensions [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], but little work considers the problem of identifying what to isolate for achieving improved fault tolerance at an acceptable degree of performance degradation. ...
... They also discussed various traits of web analytics and its application areas (Burby and Brown, 2007). Ganapathi and Zhang showed in their paper the art of data summarization and its applications in managing systems effectively (Ganapathi & Zhang, 2011). Rapoza examined web analytics from new perspectives and discussed its innovative features along with its pros and cons (Rapoza, 2010). ...
... This may cause additional space overhead, which increases infrastructure requirements and costs. Therefore, managing these mixed workloads (OLTP and OLAP) in data management systems is a big challenge for current disk-based DBMSs [5]. ...
... Equipment failure prediction can be primarily classified into prediction methods based on mechanism models and data-driven prediction methods. Wang et al. [11] proposed a fault detection method for cloud-computing systems based on adaptive monitoring. They monitored various attributes of the system, conducted a correlation analysis to establish the correlations between measures, and selected key measures for system monitoring. ...
... Compression techniques in Hadoop shift computation load from I/O handlers to the processor. Compression improves both running time and network usage when I/O or network traffic is the bottleneck [6]. A set of configuration parameters in Hadoop controls compression on a per-job basis. ...
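For example, a Hadoop streaming job can enable map-output compression through per-job -D properties (the two property names are standard Hadoop 2.x keys; the jar path, mapper, reducer, and input/output paths are placeholders):

```python
import subprocess

# Per-job compression settings passed as -D properties; everything except
# the two mapreduce.* keys is a placeholder for illustration.
cmd = [
    "hadoop", "jar", "hadoop-streaming.jar",
    "-D", "mapreduce.map.output.compress=true",
    "-D", "mapreduce.map.output.compress.codec="
          "org.apache.hadoop.io.compress.SnappyCodec",
    "-input", "/logs/raw",
    "-output", "/logs/summarized",
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
]
subprocess.run(cmd, check=True)
```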