Archana Ganapathi's research while affiliated with Splunk and other places

Publications (26)

Conference Paper
Data quality issues pose a significant barrier to operationalizing big data. They pertain to the meaning of the data, the consistency of that meaning, the human interpretation of results, and the contexts in which the results are used. Data quality issues arise after organizations have moved past clear-cut technical solutions to early bottlenecks i...
Conference Paper
Data exploration is largely manual and labor intensive. Although there are various tools and statistical techniques that can be applied to data sets, there is little help to identify what questions to ask of a data set, let alone what domain knowledge is useful in answering the questions. In this paper, we study user queries against production data...
Conference Paper
Organizations are constantly overwhelmed with challenges of big data management resulting from data floods produced by various computer systems and components. Mining this data deluge can produce invaluable insights for the business and for operations. However, the organization must first tackle issues of efficient data storage and retrieval. In t...
Article
Some of the significant advancements and challenges faced in log analysis are discussed. The content and format of logs can vary widely from one system to another and among components within a system. The simplest and most common use for a debug log is to grep for a specific message. Even so, it is often difficult to figure out what to search for...
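As a minimal illustration of that grep-style search (a hypothetical sketch: the log file name and error pattern below are assumptions, not from the paper):

```python
import re

def grep_log(path, pattern):
    """Yield (line number, line) for log lines matching a pattern."""
    matcher = re.compile(pattern)
    with open(path) as log:
        for lineno, line in enumerate(log, 1):
            if matcher.search(line):
                yield lineno, line.rstrip()

# Hypothetical usage: find timeout errors in an application's debug log.
for lineno, line in grep_log("app.log", r"ERROR .*timeout"):
    print(f"{lineno}: {line}")
```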
Article
Full-text available
Web Analytics has become a critical component of many business decisions. With an ever-growing number of transactions happening through web interfaces, the ability to understand and introspect web site activity is critical. In this paper, we describe the importance and intricacies of summarization for analytics and report generation on web log...
Conference Paper
Full-text available
MapReduce systems face enormous challenges due to increasing growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. In this paper, we buil...
Article
Full-text available
Modern data systems comprise heterogeneous and distributed components, making them difficult to manage piece-wise, let alone as a whole. Furthermore, the scale, complexity, and growth rate of these systems render heuristic and rule-based system management approaches insufficient. In response to these challenges, statistics-based techniques f...
Conference Paper
Most modern systems generate abundant and diverse log data. With dwindling storage costs, there are fewer reasons to summarize or discard data. However, the lack of tools to efficiently store and cross-correlate heterogeneous datasets makes it tedious to mine the data for analytic insights. In this paper, we present Splunk, a semi-structured time s...
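As a toy sketch of the general idea of a time-bucketed index over semi-structured events (an illustration of the concept only, not Splunk's actual data structures or API):

```python
from collections import defaultdict

class EventIndex:
    """Toy time-series index: events are bucketed by hour, and key=value
    fields are extracted so heterogeneous events can be cross-correlated."""

    def __init__(self):
        self.buckets = defaultdict(list)  # hour number -> [event, ...]

    def add(self, timestamp, raw):
        # Keep the raw event intact; pull out key=value pairs for filtering.
        fields = dict(kv.split("=", 1) for kv in raw.split() if "=" in kv)
        self.buckets[int(timestamp) // 3600].append(
            {"ts": timestamp, "raw": raw, **fields})

    def search(self, start, end, **criteria):
        # Only buckets overlapping [start, end] are scanned.
        for hour in range(int(start) // 3600, int(end) // 3600 + 1):
            for event in self.buckets[hour]:
                if start <= event["ts"] <= end and all(
                        event.get(k) == v for k, v in criteria.items()):
                    yield event

idx = EventIndex()
idx.add(3600.0, "status=500 host=web01 msg=timeout")
print(list(idx.search(0, 7200, status="500")))
```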
Conference Paper
Full-text available
Compression enables us to shift resource bottlenecks between IO and CPU. In modern datacenters, where energy efficiency is a growing concern, the benefits of using compression have not been completely exploited. As MapReduce represents a common computation framework for Internet datacenters, we develop a decision algorithm that helps MapReduce user...
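The paper's decision algorithm is not reproduced here; as a hedged sketch, the underlying trade-off can be framed as comparing the I/O time saved against the CPU time spent (the rates, ratio, and break-even rule below are illustrative assumptions):

```python
def should_compress(data_bytes, io_bandwidth, cpu_compress_rate, ratio):
    """Compress if the I/O time saved exceeds the CPU time spent.

    io_bandwidth      -- bytes/sec the disk or network can move
    cpu_compress_rate -- bytes/sec the CPU can compress
    ratio             -- compressed size / original size (e.g. 0.4)
    """
    io_time_saved = data_bytes * (1 - ratio) / io_bandwidth
    cpu_time_spent = data_bytes / cpu_compress_rate
    return io_time_saved > cpu_time_spent

# Illustrative numbers: 1 GB map output, 100 MB/s disk, 400 MB/s codec, 40% ratio.
print(should_compress(1e9, 100e6, 400e6, 0.4))  # True: the I/O-bound job benefits
```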
Conference Paper
As data warehousing technology gains a ubiquitous presence in business today, companies are becoming increasingly reliant upon the information contained in their data warehouses to inform their operational decisions. This information, known as business intelligence (BI), traditionally has taken the form of nightly or monthly reports and batched ana...
Conference Paper
Full-text available
A recent trend for data-intensive computations is to use pay-as-you-go execution environments that scale transparently to the user. However, providers of such environments must tackle the challenge of configuring their system to provide maximal performance while minimizing the cost of resources used. In this paper, we use statistical models to pred...
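As a hedged sketch of statistical performance prediction in this spirit (not the authors' actual model, features, or training data), a simple regression over workload features might look like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: per-job features (input bytes, map tasks,
# reduce tasks) paired with observed runtimes in seconds.
X = np.array([[1e9, 100, 10], [5e9, 500, 50], [2e9, 200, 20], [8e9, 800, 64]])
y = np.array([120.0, 540.0, 230.0, 900.0])

model = LinearRegression().fit(X, y)

# Predict the runtime of an unseen job before submitting it.
print(f"predicted runtime: {model.predict(np.array([[4e9, 400, 32]]))[0]:.0f}s")
```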
Conference Paper
Full-text available
One of the most challenging aspects of managing a very large data warehouse is identifying how queries will behave before they start executing. Yet knowing their performance characteristics - their runtimes and resource usage - can solve two important problems. First, every database vendor struggles with managing unexpectedly long-running queries....
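A hedged illustration of the first problem, flagging queries likely to run long before execution; the plan features, labels, and classifier below are assumptions for illustration, not the paper's method:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical optimizer-plan features: (estimated rows, join count, has_sort).
X = [[1e3, 1, 0], [1e8, 5, 1], [5e4, 2, 0], [2e9, 8, 1], [1e5, 3, 1]]
y = [0, 1, 0, 1, 0]  # 1 = long-running (hours), 0 = short (seconds to minutes)

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([[5e8, 6, 1]]))  # likely flagged as long-running
```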
Conference Paper
Full-text available
Multicore architectures have become so complex and diverse that there is no obvious path to achieving good performance. Hundreds of code transformations, compiler flags, architectural features and optimization parameters result in a search space that can take many machine-months to explore exhaustively. Inspired by successes in the systems com...
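The scale of such a search space can be illustrated with a toy random search over configurations; the flag names, value ranges, and benchmark stub below are hypothetical:

```python
import itertools
import random

FLAGS = {
    "unroll":    [1, 2, 4, 8],
    "tile":      [16, 32, 64],
    "vectorize": [False, True],
    "prefetch":  [False, True],
}

def benchmark(config):
    # Stand-in for compiling and timing a kernel built with this config.
    return random.uniform(1.0, 10.0)  # lower is better (seconds)

# This toy space has only 4*3*2*2 = 48 points, but real spaces hold millions;
# sampling trades exhaustive coverage for tractable search time.
candidates = random.sample(list(itertools.product(*FLAGS.values())), 20)
best = min((dict(zip(FLAGS, vals)) for vals in candidates), key=benchmark)
print("best config found:", best)
```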
Article
We explore how to manage database workloads that contain a mixture of OLTP-like queries that run for milliseconds as well as business intelligence queries and maintenance tasks that last for hours. As data warehouses grow in size to petabytes and complex analytic queries play a greater role in day-to-day business operations, factor...
Conference Paper
PC users have started viewing crashes as a fact of life rather than a problem. To improve operating system dependability, systems designers and programmers must analyze and understand failure data. In this paper, we analyze Windows XP kernel crash data collected from a population of volunteers who contribute to the Berkeley Open Infrastructure for...
Conference Paper
Reliability is a rapidly growing concern in the contemporary personal computer (PC) industry, both for computer users and for product developers. To improve dependability, systems designers and programmers must consider failure and usage data for operating systems as well as applications. In this paper, we discuss our experience with crash and usage...
Conference Paper
Full-text available
Software configuration problems are a major source of failures in computer systems. In this paper, we present a new framework for categorizing configuration problems. We apply this categorization to Windows registry-related problems obtained from various internal as well as external sources. Although infrequent, registry-related problems are diffic...
Conference Paper
We describe the architecture, operational practices, and failure characteristics of three very large-scale Internet services. Our research on architecture and operational practices took the form of interviews with architects and operations staff at those (and several other) services. Our research on component and service failure took the form of ex...
Article
We present a preliminary exploration of application crash events in Win32 systems by automatic clustering. The crash events, collected from workstations in the UC Berkeley electrical engineering and computer science department, identify the crashed application and DLL, and the time of the crash event. We use these identifying features to augment th...
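A hedged sketch of clustering crash events on their identifying features; the records, encoding, and choice of k-means below are illustrative, not the paper's data or algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical crash records: (application, faulting DLL, hour of day).
crashes = [("iexplore", "mshtml.dll", 9), ("iexplore", "mshtml.dll", 14),
           ("winword",  "mso.dll",   10), ("iexplore", "shdocvw.dll", 9),
           ("winword",  "mso.dll",   16)]

# One-hot encode the categorical identifiers by hand; scale the hour to [0, 1].
apps = sorted({c[0] for c in crashes})
dlls = sorted({c[1] for c in crashes})
features = np.array([[float(app == a) for a in apps] +
                     [float(dll == d) for d in dlls] +
                     [hour / 24.0]
                     for app, dll, hour in crashes])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)
print(labels)  # crashes sharing an app/DLL pattern land in the same cluster
```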
Article
Failure data analysis is an essential component of computer systems research. Unfortunately, academic researchers are often hindered from making progress on their research agenda due to the lack of publicly-available real-world failure data. To address the gap between data abundance in industry and data scarcity in academia, we propose several tool...
Article
Workload characterization and generation are essential tools to assist in building and maintaining web services. We discuss a framework that allows us to take advantage of trace data, process it using Machine Learning algorithms, and generate workload that produces specific effects on the target system. We performed clustering analysis to character...
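A hedged sketch of the generation half of such a framework, drawing synthetic requests from a cluster summary; every parameter below is invented for illustration:

```python
import random

# Hypothetical output of the characterization step: each cluster records a
# weight, an operation mix, and a mean inter-arrival gap in seconds.
clusters = [
    {"weight": 0.7, "ops": {"read": 0.9, "write": 0.1}, "mean_gap": 0.05},
    {"weight": 0.3, "ops": {"read": 0.2, "write": 0.8}, "mean_gap": 0.50},
]

def generate(n):
    """Yield n synthetic (delay, operation) pairs drawn from the cluster model."""
    weights = [c["weight"] for c in clusters]
    for _ in range(n):
        c = random.choices(clusters, weights=weights)[0]
        op = random.choices(list(c["ops"]), weights=list(c["ops"].values()))[0]
        yield random.expovariate(1 / c["mean_gap"]), op

for delay, op in generate(5):
    print(f"wait {delay:.3f}s, then issue {op}")
```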
Article
…predicament. Given the diverse need to improve availability in Internet Services (Oppenheimer & Patterson), it is valuable to develop a language that helps users specify dependability requirements as well as failure models of system components including hardware, software, networking and human operators. The long-term goal would be to develop compi...
Article
Computer systems research is awash in data, but we do not reap the fullest possible benefit. A variety of data issues, such as proprietary data sets, privacy concerns, uncertain data provenance, priority questions, incompatible formats, and simply lost data make it difficult to build on experiments performed by previous systems researchers. None o...
Article
Reliability is a rapidly growing concern in the contemporary personal computer (PC) industry, both for computer users and for product developers. To improve dependability, systems designers and programmers must consider failure and usage data for operating systems as well as applications. In this paper, we analyze crash data from Windows ma...

Citations

... Techniques have been proposed to identify and resolve data quality issues across multiple data sources, such as missing or inconsistent values [4]. Data quality issues pose a significant barrier to operationalizing big data and can lead to uncertainty and disruptions if not appropriately addressed [5]. ...
... Meanwhile, research communities in Visual Analytics and Human-Computer Interaction have investigated new techniques and systems that can help end-users build ML models [18,48,54]. Such approaches are largely based on our theoretical understanding of how people explore information and gain insights using interactive systems, including the sense-making loop [42], information foraging [41], and characterizations of tasks and queries in Exploratory Visual Analysis [4,12,49]. Understanding how users employ AutoML systems in exploratory model building (constructing models, understanding the resulting models, and recognizing the implications of their model choices) is an important topic in visual analytics, because the way a user builds an ML model is highly exploratory and can be directly supported by visual interfaces [16,31]. Gaining a deeper understanding of how users think with AutoML systems, and where things may fail, is crucial for determining how we can better support their ML model building process with visual analytic solutions. ...
... In fact, although real-world applications usually prefer row stores for transactions given the relational model, they may desire another specialized data format for analytical queries. For instance, complementary to column store, databases with novel data models are increasingly popular for analytical workloads, e.g., graph databases [15,33], spatial databases [76,173], and time series databases [24,140]. To absorb the advancements of these new databases, a well-customized HTAP system demands a re-design of data synchronization approaches, delta buffer, meta-data, execution engine, and query optimizer. ...
... Schema-on-read (Deutsch, 2013), (Mendelevitch, 2013), (Jacobsohn and Delurey, 2014) is an agile approach to data storage and retrieval that defers investments in data organization until production queries need to be run by working with data directly in native form. Schema-on-read functions have been implemented in a wide range of analytical systems including Hadoop (Hadoop Team, 2015), (Schau, 2015), Splunk (Bitincka et al., 2012), Apache Spark (Spark Team, 2015), Apache Flink (Markl, 2014), and even relational databases (Liu and Gawlick, 2015). It is also possible to use machine learning tools to extract schemas from source data (Yeh et al., 2013). ...
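Schema-on-read can be illustrated with a toy example in which raw lines are stored untouched and fields are extracted only when a query runs; the log format and field names below are assumptions:

```python
import re

# Raw events are stored as-is; no schema is imposed at write time.
raw_events = [
    "2015-03-01T10:00:02 status=200 path=/index bytes=512",
    "2015-03-01T10:00:05 status=404 path=/missing bytes=87",
]

def read_with_schema(lines, pattern):
    """Apply a 'schema' at read time by extracting named regex groups."""
    rx = re.compile(pattern)
    return [m.groupdict() for line in lines if (m := rx.search(line))]

# The schema lives in the query, not in the storage layer.
rows = read_with_schema(raw_events,
                        r"status=(?P<status>\d+) .*bytes=(?P<bytes>\d+)")
print(rows)  # [{'status': '200', 'bytes': '512'}, {'status': '404', 'bytes': '87'}]
```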
... New tools and systems are now possible that would help researchers share tools and results and validate previous results. Suggestions such as Dataforge [25], paralleling Sourceforge, and the use of cloud computing [26] are very good examples of approaches that would help researchers in experimental algorithmics and systems research. ...
... A large number of approaches have been proposed to isolate critical kernel code from less critical kernel extensions, as existing extension mechanisms were found to threaten system stability in the case of misbehaving extensions [44], [45], [46], [47], [48], [49]. Similar to privilege separation, most work in this field has focused on how to establish isolation between the kernel and its extensions [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], but little work considers the problem of identifying what to isolate for achieving improved fault tolerance at an acceptable degree of performance degradation. ...
... They also discussed various traits of web analytics and its application areas (Burby and Brown, 2007). Ganapathi and Zhang showed in their paper the art of data summarization and its applications in managing systems effectively (Ganapathi & Zhang, 2011). Rapoza examined web analytics from new perspectives and discussed its innovative features along with its pros and cons (Rapoza, 2010). ...
... This may cause additional space overhead, which increases infrastructure requirements and costs. Therefore, managing these mixed workloads (OLTP and OLAP) in data management systems is a big challenge for current disk-based DBMSs [5]. ...
... Equipment failure prediction can be primarily classified into prediction methods based on mechanism models and data-driven prediction methods. Wang et al. [11] proposed a fault detection method for cloud-computing systems based on adaptive monitoring. They monitored various attributes of the system, conducted a correlation analysis to establish the correlations between measures, and selected key measures for system monitoring. ...
... Compression techniques in Hadoop shift computation load from I/O handlers to the processor. Compression improves both running time and network usage when I/O or network traffic is the bottleneck [6]. A set of configuration parameters in Hadoop controls compression on a per-job basis. ...
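For example, a Hadoop streaming job can enable map-output compression through per-job -D properties (the two property names are standard Hadoop 2.x keys; the jar path, mapper, reducer, and input/output paths are placeholders):

```python
import subprocess

# Per-job compression settings passed as -D properties; everything except
# the two mapreduce.* keys is a placeholder for illustration.
cmd = [
    "hadoop", "jar", "hadoop-streaming.jar",
    "-D", "mapreduce.map.output.compress=true",
    "-D", "mapreduce.map.output.compress.codec="
          "org.apache.hadoop.io.compress.SnappyCodec",
    "-input", "/logs/raw",
    "-output", "/logs/summarized",
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
]
subprocess.run(cmd, check=True)
```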