-
Notifications
You must be signed in to change notification settings - Fork 231
/
CHANGES.txt
363 lines (286 loc) · 17.9 KB
/
CHANGES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
1.5.1 - 2016-07-29
1. Optimized InMemoryDirectoryListCache to use a TreeMap internally instead
of a HashMap, since most of its usage is "listing by prefix"; aside
from some benefits for normal list-supplementation, glob listings
with a large number of subdirectories should be orders of magnitude
faster.
1.5.0 - 2016-07-15
1. Changed the single-argument constructor
GoogleCloudStorageFileSystem(GoogleCloudStorage) to inherit the inner
GoogleCloudStorageOptions from the passed-in GoogleCloudStorage rather
than simply falling back to default GoogleCloudStorageFileSystemOptions.
2. Changed RetryHttpInitializer to treat HTTP code 429 as a retriable
error with exponential backoff instead of erroring out.
3. Added a new output stream type which can be used by setting:
fs.gs.outputstream.type=SYNCABLE_COMPOSITE
With the SYNCABLE_COMPOSITE type, output streams obtained with a call
to FileSystem.create(Path) will have (limited) support for Hadoop's
Syncable interface, where calling hsync() will commit the data written
so far into persistent storage without closing the output stream; once
a writer calls hsync(), readers will be able to discover and read the
contents of the file written up to the last successful hsync() call
immediately. This feature is implemented using "composite objects" in
Google Cloud Storage, and thus has a current hard limit of 1023 calls
to hsync() over the lifetime of the stream after which subsequent hsync()
calls will throw a CompositeLimitExceededException, but will still
allow writing additional data and closing the stream without losing data
despite a failed hsync().
4. Added several optimizations and options controlling the behavior of
the GoogleHadoopFSInputStream:
-Removed internal prepopulated ByteBuffer inside GoogleHadoopFSInputStream
in favor of non-blocking buffering in the lower layer channel; this
improves performance for independent small reads without noticeable
impact on large reads. Old behavior can be specified by setting
fs.gs.inputstream.internalbuffer.enable=true (default=false)
-Added option to save an extra metadata GET on first read of a channel
if objects are known not to use the 'Content-Encoding' header. To
use this optimization set:
fs.gs.inputstream.support.content.encoding.enable=false (default=true)
-Added option to save an extra metadata GET on FileSystem.open(Path) if
objects are known to exist already, or the caller is resilient to delayed
errors for nonexistent objects which occur at read time, rather than
immediately on open(). To use this optimization set:
fs.gs.inputstream.fast.fail.on.not.found.enable=false (default=true)
-Added support for "in-place" small seeks by defining a byte limit where
forward seeks smaller than the limit will be done by reading/discarding
bytes rather than opening a brand-new channel. This is set to 8MB by
default, which is approximately the point where the cost of opening a
new channel is equal to the cost of reading/discarding the bytes in-place.
Disable by setting the value to 0, or adjust to other thresholds:
fs.gs.inputstream.inplace.seek.limit=0 (default=8388608)
1.4.5 - 2016-03-24
1. Add support for paths that cannot be parsed by Java's URI.create method.
This support is off by default, but can be enabled by setting Hadoop
configuration key fs.gs.path.encoding to the string uri-path. The
current behavior, and default value of fs.gs.path.encoding, is 'legacy'.
The new path encoding scheme will become the default in a future release.
2. VerificationAttributes are now exposed in GoogleCloudStorageItemInfo. The
current support is limited to reading these attributes from what was
computed by GCS server side. A future release will add support for
specifying VerificationAttributes in CreateOptions when creating new
objects.
1.4.4 - 2016-02-02
1. Add support for JSON keyfiles via a new configuration key:
google.cloud.auth.service.account.json.keyfile. This key should
point to a file that is available locally to jobs to clusters.
1.4.3 - 2015-11-12
1. Minor bug fixes and enhancements.
1.4.2 - 2015-09-15
1. Added checking in GoogleCloudStorageImpl.createEmptyObject(s) to handle
rateLimitExceeded (429) errors by fetching the fresh underlying info
and ignoring the error if the object already exists with the intended
metadata and size. This fixes an issue which mostly affects Spark:
https://github.com/GoogleCloudPlatform/bigdata-interop/issues/10
2. Added logging in GoogleCloudStorageReadChannel for high-level retries.
3. Added support for configuring the permissions reported to the Hadoop
FileSystem layer; the permissions are still fixed per FileSystem instance
and aren't actually enforced, but can now be set with:
fs.gs.reported.permissions [default = "700"]
This allows working around some clients like Hive-related daemons and tools
which pre-emptively check for certain assumptions about permissions.
1.4.1 - 2015-07-09
1. Switched from the custom SeekableReadableByteChannel to
Java 7's java.nio.channels.SeekableByteChannel.
2. Removed the configurable but default-constrained 250GB upload limit;
uploads can now exceed 250GB without needing to modify config settings.
3. Added helper classes related to GCS retries.
4. Added workaround support for read retries on objects with content-encoding
set to gzip; such content encoding isn't generally correct to use since
it means filesystem reported bytes will not match actual read bytes, but
for cases which accept byte mismatches, the read channel can now manually
seek to where it left off on retry rather than having a GZIPInputStream
throw an exception for a malformed partial stream.
5. Added an option for enabling "direct uploads" in
GoogleCloudStorageWriteChannel which is not directly used by the Hadoop
layer, but can be used by clients which directly access the lower
GoogleCloudStorage layer.
6. Added CreateBucketOptions to the GoogleCloudStorage interface so that
clients using the low-level GoogleCloudStorage directly can create buckets
with different locations and storage classes.
7. Fixed https://github.com/GoogleCloudPlatform/bigdata-interop/issues/5 where
stale cache entries caused stuck phantom directories if the directories
were deleted using non-Hadoop-based GCS clients.
8. Fixed a bug which prevented the Apache HTTP transport from working with
Hadoop 2 when no proxy was set.
9. Misc updates in library dependencies; google.api.version
(com.google.http-client, com.google.api-client) updated from 1.19.0 to
1.20.0, google-api-services-storage from v1-rev16-1.19.0 to
v1-rev35-1.20.0, and google-api-services-bigquery from v2-rev171-1.19.0
to v2-rev217-1.20.0, and Guava from 17.0 to 18.0.
1.4.0 - 2015-05-27
1. The new inferImplicitDirectories option to GoogleCloudStorage tells
it to infer the existence of a directory (such as foo) when that
directory node does not exist in GCS but there are GCS files
that start with that path (such as as foo/bar). This allows
the GCS connector to be used on read-only filesystems where
those intermediate directory nodes can not be created by the
connector. The value of this option can be controlled by the
Hadoop boolean config option "fs.gs.implicit.dir.infer.enable".
The default value is true.
2. Increased Hadoop dependency version to 2.6.0.
3. Fixed a bug introduced in 1.3.2 where, during marker file creation,
file info was not properly updated between attempts. This lead
to backoff-retry-exhaustion with 412-preconditon-not-met errors.
4. Added support for changing the HttpTransport implementation to use,
via fs.gs.http.transport.type = [APACHE | JAVA_NET]
5. Added support for setting a proxy of the form "host:port" via
fs.gs.proxy.address, which works for both APACHE and JAVA_NET
HttpTransport options.
6. All logging converted to use slf4j instead of the previous
org.apache.commons.logging.Log; removed the LogUtil wrapper which
previously wrapped org.apache.commons.logging.Log.
7. Automatic retries for premature end-of-stream errors; the previous
behavior was to throw an unrecoverable exception in such cases.
8. Made close() idempotent for GoogleCloudStorageReadChannel
9. Added a low-level method for setting Content-Type metadata in the
GoogleCloudStorage interface.
10.Increased default DirectoryListCache TTL to 4 hours, wired out TTL
settings as top-level config params:
fs.gs.metadata.cache.max.age.entry.ms
fs.gs.metadata.cache.max.age.info.ms
1.3.3 - 2015-02-26
1. When performing a retry in GoogleCloudStorageReadChannel, attempts to
close() the underlying channel are now performed explicitly instead of
waiting for performLazySeek() to do it, so that SSLException can be
caught and ignored; broken SSL sockets cannot be closed normally,
and are responsible for already cleaning up on error.
2. Added explicit check of currentPosition == size when -1 is read from
underlying stream in GoogleCloudStorageReadChannel, in case the
stream fails to identify an error case and prematurely reaches
end-of-stream.
1.3.2 - 2015-01-22
1. In the create file path, marker file creation is now configurable. By
default, marker files will not be created. The default is most suitable
for MapReduce applications. Setting fs.gs.create.marker.files.enable to
true in core-site.xml will re-enable marker files. The use of marker files
should be considered for applications that depend on early failing when
two concurrent writes attempt to write to the same file. Note that file
overwrite semantics are preserved with or without marker files, but
failures will occur sooner with marker files present.
1.3.1 - 2014-12-16
1. Fixed a rare NullPointerException in FileSystemBackedDirectoryListCache
which can occur if a directory being listed is purged from the cache
between a call to "exists()" and "listFiles()".
2. Fixed a bug in GoogleHadoopFileSystemCacheCleaner where cache-cleaner
fails to clean any contents when a bucket is non-empty but expired.
3. Fixed a bug in FileSystemBackedDirectoryListCache which caused garbage
collection to require several passes for large directory hierarchies;
now we can successfully garbage-collect an entire expired tree in a
single pass, and cache files are also processed in-place without having
to create a complete in-memory list.
4. Updated handling of new file creation, file copying, and file deletion
so that all object modification requests sent to GCS contain preconditions
that should prevent race-conditions in the face of retried operations.
1.3.0 - 2014-10-17
1. Directory timestamp updating can now be controlled via user-settable
properties "fs.gs.parent.timestamp.update.enable",
"fs.gs.parent.timestamp.update.substrings.excludes". and
"fs.gs.parent.timestamp.update.substrings.includes" in core-site.xml. By
default, timestamp updating is enabled for the YARN done and intermediate
done directories and excluded for everything else. Strings listed in
includes take precedence over excludes.
2. Directory timestamp updating will now occur on a background thread inside
GoogleCloudStorageFileSystem.
3. Attempting to acquire an OAuth access token will be now be retried when
using .p12 or installed application (JWT) credentials if there is a
recoverable error such as an HTTP 5XX response code or an IOException.
4. Added FileSystemBackedDirectoryListCache, extracting a common interface
for it to share with the (InMemory)DirectoryListCache; instead of using
an in-memory HashMap to enforce only same-process list consistency, the
FileSystemBacked version mirrors GCS objects as empty files on a local
FileSystem, which may itself be an NFS mount for cluster-wide or even
potentially cross-cluster consistency groups. This allows a cluster to
be configured with a "consistent view", making it safe to use GCS as the
DEFAULT_FS for arbitrary multi-stage or even multi-platform workloads.
This is now enabled by default for machine-wide consistency, but it is
strongly recommended to configure clusters with an NFS directory for
cluster-wide strong consistency. Relevant configuration settings:
fs.gs.metadata.cache.enable [default: true]
fs.gs.metadata.cache.type [IN_MEMORY (default) | FILESYSTEM_BACKED]
fs.gs.metadata.cache.directory [default: /tmp/gcs_connector_metadata_cache]
5. Optimized seeks in GoogleHadoopFSDataInputStream which fit within
the pre-fetched memory buffer by simply repositioning the buffer in-place
instead of delegating to the underlying channel at all.
6. Fixed a performance-hindering bug in globStatus where "foo/bar/*" would
flat-list "foo/bar" instead of "foo/bar/"; causing the "candidate matches"
to include things like "foo/bar1" and "foo/bar1/baz", even though the
results themselves would be correct due to filtering out the proper glob
client-side in the end.
7. The versions of java API clients were updated to 1.19 derived versions.
1.2.9 - 2014-09-18
1. When directory contents are updated e.g., files or directories are added,
removed, or renamed the GCS connector will now attempt to update a
metadata property on the parent directory with a modification time. The
modification time recorded will be used as the modification time in
subsequent FileSystem#getStatus(...), FileSystem#listStatus and
FileSystem#globStatus(...) calls and is the time as reported by
the system clock of the system that made the modification.
1.2.8 - 2014-08-07
1. Changed the manner in which the GCS connector jar is built to A) reduce
included dependencies to only those parts which are used and B) repackaged
dependencies whose versions conflict with those bundled with Hadoop.
2. Deprecated fs.gs.system.bucket config.
1.2.7 - 2014-06-23
1. Fixed a bug where certain globs incorrectly reported the parent directory
being not found (and thus erroring out) in Hadoop 2.2.0 due to an
interaction with the fs.gs.glob.flatlist.enable feature; doesn't affect
Hadoop 1.2.1 or 2.4.0.
1.2.6 - 2014-06-05
1. When running in hadoop 0.23+ (hadoop 2+), listStatus will now throw a
FileNotFoundException when a non-existent path is passed in.
2. The GCS connector now uses the v1 JSON API when accessing Google
Cloud Storage.
3. The GoogleHadoopFileSystem now treats the parent of the root path as if
it is the root path. This behavior mimics the POSIX behavior of "/.."
being the same as "/".
4. When creating a new file, a zero-length marker file will be created
before the FSDataOutputStream is returned in create(). This allows for
early detection of overwrite conflicts that may occur and prevents
certain race conditions that could be encountered when relying on
a single exists() check.
5. The dependencies on cglib and asm were removed from the GCS connector
and the classes for these are no longer included in the JAR.
1.2.5 - 2014-05-08
1. Fixed a bug where fs.gs.auth.client.file was unconditionally being
overwritten by a default value.
2. Enabled direct upload for directory creation to save one round-trip call.
3. Added wiring for GoogleHadoopFileSystem.close() to call through to close()
its underlying helper classes as well.
4. Added a new batch mode for creating directories in parallel which requires
manually parallelizing in the client. Speeds up nested directory creation
and repairing large numbers of implicit directories in listStatus.
5. Eliminated redundant API calls in listStatus, speeding it up by ~half.
6. Fixed a bug where globStatus didn't correctly handle globs containing '?'.
7. Implemented new version of globStatus which initially performs a flat
listing before performing the recursive glob logic in-memory to
dramatically speed up globs with lots of directories; the new behavior is
default, but can disabled by setting fs.gs.glob.flatlist.enable = false.
1.2.4 - 2014-04-09
1. The value of fs.gs.io.buffersize.write is now rounded up to 8MB if set to
a lower value, otherwise the backend will error out on unaligned chunks.
2. Misc refactoring to enable reuse of the resumable upload classes in other
libraries.
1.2.3 - 2014-03-21
1. Fixed a bug where renaming a directory could cause the file contents to get
shuffled between files when the fully-qualified file paths have different
lengths. Does not apply to renames on files directly, such as when using
a glob expression inside a flat directory.
2. Changed the behavior of batch request API calls such that they are retried
on failure in the same manner as non-batch requests.
3. Eliminated an unnecessary dependency on com/google/protobuf which could
cause version-incompatibility issues with Apache Shark.
1.2.2 - 2014-02-12
1. Fixed a bug where filenames with '+' were unreadable due to premature
URL-decoding.
2. Modified a check to allow fs.gs.io.buffersize.write to be a non-multiple
of 8MB, just printing out a warning instead of check-failing.
3. Added some debug-level logging of exceptions before throwing in cases
where Hadoop tends to swallows the exception along with its useful info.
1.2.1 - 2014-01-23
1. Added CHANGES.txt for release notes.
2. Fixed a bug where accidental URI decoding make it impossible to use
pre-escaped filenames, e.g. foo%3Abar. This is necessary for Pig to work.
3. Fixed a bug where an IOException was thrown when trying to read a
zero-byte file. Necessary for Pig to work.
1.2.0 - 2014-01-14
1. Preview release of GCS connector.