apache - Hadoop DistributedCache caching files without absolute path?
I am in the process of migrating to YARN, and it seems the behavior of the DistributedCache has changed.
Previously, I would add files to the cache as follows:
for (String file : args) {
    Path path = new Path(cache_root, file);
    URI uri = new URI(path.toUri().toString());
    DistributedCache.addCacheFile(uri, conf);
}
The path would typically look like
/some/path/to/my/file.txt
which pre-exists on HDFS and would end up in the DistributedCache as
/$DISTRO_CACHE/some/path/to/my/file.txt
I could then symlink it into the current working directory and use it via DistributedCache.getLocalCacheFiles().
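Roughly, the mapper side looked along these lines (an illustrative sketch only; the suffix match is just one way to pick out the right localized copy, and it assumes DistributedCache from org.apache.hadoop.filecache and Path from org.apache.hadoop.fs):

// Inside the mapper's setup(), which already declares IOException.
// The localized copies kept their directory structure, so a file could be
// matched by its original path suffix.
Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
for (Path local : localFiles) {
    if (local.toString().endsWith("/some/path/to/my/file.txt")) {
        // open 'local' here and read the side data
    }
}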
With YARN, it seems the file instead ends up in the cache as:
/$DISTRO_CACHE/file.txt
i.e. the "path" part of the file URI gets dropped and only the filename remains.
How does that work with different absolute paths that end in the same filename? Consider the following case:
distributedcache.addcachefile("some/path/to/file.txt", conf); distributedcache.addcachefile("some/other/path/to/file.txt", conf);
Arguably one could use fragments:
distributedcache.addcachefile("some/path/to/file.txt#file1", conf); distributedcache.addcachefile("some/other/path/to/file.txt#file2", conf);
But this seems unnecessarily hard to manage. Imagine the scenario where those paths are command-line arguments: I would somehow need to detect that the two filenames, although coming from different absolute paths, clash in the DistributedCache, re-map them to fragments, and then propagate those fragment names to the rest of the program.
Is there an easier way to manage this?
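(The best workaround I can see along those lines is something like the sketch below, which invents collision-free fragment names and records the mapping in the Configuration so the mapper can look files up by their original path; the "cachefile" prefix and the "cache.fragment." config keys are made up for illustration:)

// Driver side: give each file a unique symlink name and remember which
// original path it maps to.
int i = 0;
for (String file : args) {
    String fragment = "cachefile" + i++;                  // made-up naming scheme
    Path path = new Path(cache_root, file);
    URI uri = new URI(path.toUri().toString() + "#" + fragment);
    DistributedCache.addCacheFile(uri, conf);
    conf.set("cache.fragment." + file, fragment);         // made-up config key
}

// Mapper side: recover the symlink name for a given original path and open it.
String fragment = context.getConfiguration().get("cache.fragment." + "some/path/to/file.txt");
File localCopy = new File(fragment);                      // symlink in the working directory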
Try adding the files via the Job instead.
It comes down to how you configure the job and then how you access the files in the mapper.
When you're setting up the job, you're going to do something like:
job.addCacheFile(new Path("cache/file1.txt").toUri());
job.addCacheFile(new Path("cache/file2.txt").toUri());
Then in your mapper code, the URIs are stored in an array and can be accessed like so:
URI file1Uri = context.getCacheFiles()[0];
URI file2Uri = context.getCacheFiles()[1];
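To actually read one of them, a sketch of a setup() override (assuming the files were added without fragments, so each one is symlinked into the task's working directory under its plain filename):

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    URI file1Uri = context.getCacheFiles()[0];
    // The localized copy is symlinked into the working directory under its filename.
    String linkName = new Path(file1Uri.getPath()).getName();
    BufferedReader reader = new BufferedReader(new FileReader(linkName));
    String line;
    while ((line = reader.readLine()) != null) {
        // ... parse the side data ...
    }
    reader.close();
}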
Hope that helps.