.. upathlib documentation master file, created by
   sphinx-quickstart on Fri Nov 25 22:11:50 2022.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. testsetup:: *

   from upathlib import LocalUpath

.. testcleanup::

   LocalUpath('/tmp/abc').rmrf()

========
upathlib
========

(Generated on |today| for upathlib version |version|.)

.. automodule:: upathlib
   :no-members:
   :no-undoc-members:
   :no-special-members:

To install, do one of the following::

   $ pip3 install upathlib
   $ pip3 install upathlib[gcs]

Quickstart
==========

Let's carve out a space in the local file system and poke around.

>>> from upathlib import LocalUpath
>>> p = LocalUpath('/tmp/abc')

This creates a :class:`~upathlib.LocalUpath` object ``p`` that points to the
location ``'/tmp/abc'``. This may be an existing file or directory, or may be
nonexistent. We know this is a temporary location; to be sure we have a clean
playground, let's wipe out anything and everything:

>>> p.rmrf()
0

Think ``rm -rf /tmp/abc``. It does just that. The returned ``0`` means zero
files were deleted.

Now let's create a file and write something to it:

>>> (p / 'x.txt').write_text('first')

This creates the file ``/tmp/abc/x.txt`` with the content ``'first'``.
Note that the directory ``'/tmp/abc'`` did not exist before the call.
We did not need to "create the parent directory" first. In fact, ``upathlib``
does not provide a way to do that. In ``upathlib``, a "directory" is a
"virtual" thing that is embodied by a group of files. For example, if there
exist ::

    /tmp/abc/x.txt
    /tmp/abc/d/y.data

we say there are directories ``'/tmp/abc'`` and ``'/tmp/abc/d'``, but we do not
create these "directories" by themselves. The directories exist simply because
such files exist.
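In an object store this "virtual directory" view is literal: a directory "exists" exactly when some stored key lies under it. The rule can be sketched in a few lines of plain Python (``dir_exists`` is a hypothetical helper for illustration, not part of ``upathlib``):

```python
def dir_exists(keys, path):
    # A "directory" exists iff at least one stored key lies strictly under it.
    prefix = path.rstrip('/') + '/'
    return any(k.startswith(prefix) for k in keys)

# The two files from the example above:
keys = ['/tmp/abc/x.txt', '/tmp/abc/d/y.data']

dir_exists(keys, '/tmp/abc/d')   # True: y.data lies under it
dir_exists(keys, '/tmp/abc/e')   # False: no file under this prefix
```

Nothing is ever created or deleted for the directories themselves; they appear and disappear as a side effect of the files.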
Let's actually create these files:

>>> (p / 'x.txt').write_text('second', overwrite=True)
>>> (p / 'd' / 'y.data').write_bytes(b'0101')

Now let's look into this directory:

>>> p.is_dir()
True
>>> (p / 'd').is_dir()
True
>>> (p / 'x.txt').is_dir()
False
>>> (p / 'x.txt').is_file()
True

We can navigate in the directory. For example,

>>> for v in sorted(p.iterdir()):  # the sort merely makes the result stable
...     print(v)
/tmp/abc/d
/tmp/abc/x.txt

This is only the first level, i.e. the "direct children". We can also use
"recursive iterdir" to get all files under the directory, descending into
subdirectories recursively:

>>> for v in sorted(p.riterdir()):  # the sort merely makes the result stable
...     print(v)
/tmp/abc/d/y.data
/tmp/abc/x.txt

This time only *files* are listed. Subdirectories do not show up because,
after all, they are *not real* in the ``upathlib`` concept.

We can just as easily read a file:

>>> (p / 'x.txt').read_text()
'second'

Several common file formats are provided out of the box, including text,
bytes, json, and pickle, as well as versions compressed by zlib and Zstandard.
Let's do some JSON:

>>> pp = p / 'e/f/g/data.json'
>>> pp.write_json({'name': 'John', 'age': 38})

We know the JSON file is also a text file, so we can treat it as such:

>>> pp.read_text()
'{"name": "John", "age": 38}'

But usually we prefer to get back the Python object directly:

>>> v = pp.read_json()
>>> v
{'name': 'John', 'age': 38}
>>> type(v)
<class 'dict'>

We can go "down" the directory tree using ``/``. Conversely, we can go "up"
using :meth:`~upathlib.Upath.parent`:

>>> pp.path
PosixPath('/tmp/abc/e/f/g/data.json')
>>> pp.parent
LocalUpath('/tmp/abc/e/f/g')
>>> pp.parent.parent
LocalUpath('/tmp/abc/e/f')
>>> pp.parent.parent.is_dir()
True
>>> pp.parent.parent.is_file()
False

or the terminal-lovers' ``..``:

>>> pp
LocalUpath('/tmp/abc/e/f/g/data.json')
>>> pp / '..'
LocalUpath('/tmp/abc/e/f/g')
>>> pp / '..' / '..'
LocalUpath('/tmp/abc/e/f')

Under the hood, ``/`` delegates to a call to :meth:`~upathlib.Upath.joinpath`:

>>> pp.joinpath('../../o/p/q')
LocalUpath('/tmp/abc/e/f/o/p/q')

Let's see again what we have:

>>> sorted(p.riterdir())
[LocalUpath('/tmp/abc/d/y.data'), LocalUpath('/tmp/abc/e/f/g/data.json'), LocalUpath('/tmp/abc/x.txt')]

and to get rid of them all:

>>> p.rmrf()
3

A nice thing about ``upathlib`` is its "unified" nature across local and
cloud storage. Suppose we have set up the environment to use Google Cloud
Storage; then we could have started this exercise with

>>> from upathlib import GcsBlobUpath
>>> p = GcsBlobUpath('gs://my-bucket/tmp/abc')

Everything after that would work unchanged. (The printouts would differ in a
few places, e.g. :class:`LocalUpath` would be replaced by
:class:`GcsBlobUpath`.)

Upath
=====

.. automodule:: upathlib._upath
   :no-members:
   :no-undoc-members:
   :no-special-members:

.. autoclass:: upathlib.Upath

.. autoclass:: upathlib.FileInfo

LocalUpath
==========

.. automodule:: upathlib._local
   :no-members:
   :no-undoc-members:
   :no-special-members:

.. autoclass:: upathlib.LocalUpath

BlobUpath
=========

.. automodule:: upathlib._blob

GcsBlobUpath
============

.. autoclass:: upathlib.GcsBlobUpath

Serializers
===========

.. automodule:: upathlib.serializer

Using upathlib to implement a "multiplexer"
===========================================

:class:`~upathlib.multiplexer.Multiplexer` is a utility for distributing data
elements to multiple concurrent or distributed workers. Its implementation
relies on the "locking" capability of :class:`~upathlib.Upath`.

Suppose we perform some brute-force search on a cluster of machines; there
are 1000 grids, and the algorithm takes on one grid at a time. The grid is a
"hyper-parameter", or "control parameter", that takes 1000 possible values,
and we want to distribute these 1000 values to the workers. This is the kind
of use case targeted by ``Multiplexer``.
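The guarantee ``Multiplexer`` provides is that each value is claimed by exactly one worker, no matter how many workers pull concurrently. That coordination pattern can be sketched in-process with a plain lock. (This is an illustrative stand-in using ``threading``; ``upathlib`` takes the lock in the storage itself, which is why it also works across processes and machines.)

```python
import threading

def make_puller(items):
    """Return a pull() function that hands out each item exactly once."""
    lock = threading.Lock()
    state = {'next': 0}

    def pull():
        with lock:  # claim the next index atomically
            i = state['next']
            if i >= len(items):
                return None  # all items have been handed out
            state['next'] = i + 1
        return items[i]

    return pull

pull = make_puller(list(range(20)))
seen = []
seen_lock = threading.Lock()

def worker():
    while (x := pull()) is not None:
        with seen_lock:
            seen.append(x)  # "process" the claimed item

workers = [threading.Thread(target=worker) for _ in range(5)]
for w in workers:
    w.start()
for w in workers:
    w.join()
# Across all five workers, every value was processed exactly once.
```

Because the claim happens under the lock, two workers can never receive the same value; ``Multiplexer`` gives the same guarantee via a storage-level lock.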
Let's show its usage using local data and multiprocessing. (For real work, we
would use cloud storage and a cluster of machines.)

First, create a ``Multiplexer`` to hold the values to be distributed:

>>> from upathlib import LocalUpath
>>> from upathlib.multiplexer import Multiplexer
>>> p = LocalUpath('/tmp/abc/mux')
>>> p.rmrf()
0
>>> hyper = Multiplexer.new(range(20), p)
>>> len(hyper)
20

Next, design an interesting worker function:

>>> import multiprocessing, random, time
>>>
>>> def worker(mux_id):
...     for x in Multiplexer(mux_id):
...         time.sleep(random.uniform(0.1, 0.2))  # doing a lot of things
...         print(x, 'done in', multiprocessing.current_process().name)

Back in the main process,

>>> mux_id = hyper.create_read_session()
>>> tasks = [multiprocessing.Process(target=worker, args=(mux_id,)) for _ in range(5)]
>>> for t in tasks:
...     t.start()
>>>
2 done in Process-13
0 done in Process-11
1 done in Process-12
4 done in Process-15
3 done in Process-14
6 done in Process-11
7 done in Process-12
8 done in Process-15
5 done in Process-13
9 done in Process-14
12 done in Process-15
13 done in Process-13
11 done in Process-12
10 done in Process-11
14 done in Process-14
15 done in Process-15
18 done in Process-11
16 done in Process-13
17 done in Process-12
19 done in Process-14
>>>
>>> for t in tasks:
...     t.join()
>>> hyper.done(mux_id)
True
>>> hyper.destroy()
>>>

.. autoclass:: upathlib.multiplexer.Multiplexer

Indices
=======

* :ref:`genindex`
* :ref:`modindex`