
Friday, February 11, 2022

python snippets

basics

Outside of Banas' comprehensive knowledge videos, this is the best method for comprehensive learning I've seen.

3 day methodology (7:38) PyCon Canada, 2013. Daniel Moniz went from Java to Python.

__name__ is a special variable (8:42)
def main():
    pass

if __name__ == "__main__":
    main()

Saturday, September 25, 2021

csv, sms, postgres, dbeaver, python project?

I began this post 10 years ago only thinking about a LAPP or LAMP. Even that was too wide a scope. This post only covers importing a CSV into a stand-alone PostgreSQL instance. We can then take the data for additional processing or practice or whatever. Or for Business Intelligence (BI), since every database is an opportunity for BI meta-analysis. There are several pre-made tools for this, but they may not address our focus. They're still worth a look.

dash and plotly introduction  (29:20) Charming Data, 2020. A bit outdated, but the fundamental concepts still helpful.
PBX - on-site or cloud  (35:26) Louis Rossmann, 2016. Cited mostly as a source; 17:45 breaks it down schematically.
PBX - true overhead costs  (11:49) Rich Technology Center, 2020. Average vid, but tells hard facts. Asterisk server ($180) discussed.

file/record plan issues

We can save psql and sql scripts as *.psql and *.sql files respectively. If we use, eg DBeaver, it will generate SQL scripts, which we can subsequently transform, with some work, into PSQL scripts. We may wish to learn more Math also, though this can be difficult on one's own.

Back to our project. At the lowest levels, we need a system which

  • provides document storage. Folders receive a number relevant to their path, and we save documents in them without changing the file names. Of course this has to account for updates such as new files being added or files being retired/deleted.
  • queries either on meta information (dates), or upon document content (word search)
  • reports on the database itself; its size, category structures, number of files pointed to, and so on. This is good BI app territory.
  • configurable back-up or cloud hosting
  • if there is office access, possibly a GUI to delimit employees' queries. A GUI requires browser-friendly desktop development (PyGObject - GTK, or Qt), maybe even Colab.

Python has become flexible and advanced enough to establish client-side connections with a database, which used to require PHP.

Grafana apparently works well with time series, but could also perhaps let us display counts of minutes or numbers of texts tied to various numbers. Pandas and even R have some visualization options.
Dash is another API similar to Grafana. We can use one or the other, or GraphQL.
Fluentd can help us if we need to manage several relevant logs around this database project.
Logs are somewhat specific, but Prometheus monitors selected system traits - memory use, HDD access. It can also do this over HTTP in a distributed system, as described here. These kinds of time series integrate well with Grafana.


No particular logo, but SMS-related: being able to run some kind of program against the database to find mean, mode, and standard deviation, even in a table. For example, a report that shows the average number of texts, word length, top 3 recipients and their tallies, etc. Easily run queries against a number and report to screen. Use a browser to access the DB.

SMS

A possible project for document management may be to establish SMS message storage, query, and retrieval. The data are row entries from a CSV, rather than disparate files with file names. A reasonable test case.

  • pacman postgresql dbeaver python. By using dbeaver, we don't have to install PHP and Pgadmin. Saves an extra server.
  • activate the postgresql server
  • import the CSV into Gnumeric and note column names and data types. This can even be codified with UML
  • populate the database with correct column names and formats. Also check the data file with a text editor to verify comma counts are accurate, etc
  • Python script (see the sketch after this list) or manual DBeaver import of the CSV file(s).
  • rclone backup of data files to Cloud
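
For the scripted route, here is a minimal psycopg2 sketch of the CSV import. The database, user, table name (sms_backup), column names, and file path are placeholder assumptions for illustration; they need to match whatever schema we actually settle on below.

import psycopg2

# placeholder connection to a local, passwordless cluster
conn = psycopg2.connect(dbname="test01", user="foo", host="localhost")
cur = conn.cursor()

# bulk-load the cleaned CSV into an existing table (columns assumed)
with open("/home/foo/backup02.csv") as f:
    next(f)  # skip the CSV header row
    cur.copy_expert(
        "COPY sms_backup (name, phone, content, date) FROM STDIN WITH (FORMAT csv)", f)

conn.commit()
cur.close()
conn.close()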

PostgreSQL - install and import CSV  (11:19) Railsware Product Academy, 2020. Also PgAdmin install and walkthrough.

1 & 2. postgresql install/activate (1.5 hrs)

Link: Derek Banas' 3.75 hr PostgreSQL leviathan

See my earlier post on the steps for this. I rated the process at 2 hrs at the time. However, with the steps open in a separate page, 1.5 hrs seems reasonable.

At this point, we should have a PostgreSQL server (cluster) operating on a stand-alone system out of our home directory, with our own username, eg "foo". We can CREATE or DROP databases as needed as we attempt to hit upon a workable configuration.

$ postgres -D /home/foo/postgres/data
$ psql -U foo postgres
postgres=#: create database example;
postgres=#: drop database example;
OR
postgres=#: drop database if exists example;
postgres=#: exit

It would be nice to have a separate cluster for each import attempt, so I don't have to carefully name databases, but I don't want to run "initdb" a bunch of times and create a different data directory for each cluster, which is required. So I'm leaving everything under "postgres" cluster and when I get the final set of databases I like, I'll do a script (eg. *.psql) or some UML so I can re-install the solution after deleting the test project.

3. csv import

Most things on a computer can be done 100 different ways, so I started with the simplest -- psql -- and moved on to other ways. This is all covered below. But my first step was to pick a CSV of backed-up SMS's for the process, and clean it.

3a. csv cleaning

I first selected "backup.csv" with 7 columns and 456 rows. I simply opened it with Gnumeric and had a look. I noted advertising in the first and last rows and stripped these rows in Geany. This left me a 7x454 table, with the first row being the column titles. What's interesting here is that some of the texts had hard returns in their message contents, so there were 581 lines. I therefore made a second version, 7x10, with no returns in the message contents, "backup02.csv", for the simplest processing.

3b. data types

The efforts below taught me that we need to understand several basic data types. There were a lot of failures until I did. I found this video and this post helpful. Matt was only making a table for a two-column price list, but it gives us a start.

And here is the more complex scenario Raghav skillfully addresses with multiple tables.

3c. monolithic import attempt

Without considering keys or any normalization, let's try to just bring in an entire CSV, along the lines of Matt's import above. Of note, the times were in the format, "2021-03-07 21:40:25", apparently in the timezone where the call occurred.

$ psql -U foo postgres
postgres=#: create database test01;
postgres=#: \c test01
test01=#: create table try01 ("between" VARCHAR, name VARCHAR(30), phone VARCHAR(20), content VARCHAR, date TIMESTAMP, mms_subject VARCHAR(10), mms_link VARCHAR(20));
test01=#: select * from try01;
test01=#: COPY try01 FROM '/home/foo/backup01.csv' DELIMITER ',' CSV HEADER;

This accurately brought in and formatted the data; however, it also brought in a first row that was all dashes, and I don't really need the 1st, 6th, and 7th columns. I looked at the PostgreSQL documentation for COPY. The columns looked easiest to fix, so I created a smaller table without them.

test01=#: create table try01 (name VARCHAR(30), phone VARCHAR(20), content VARCHAR, date TIMESTAMP);
test01=#: COPY try01 FROM '/home/foo/backup01.csv' DELIMITER ',' CSV HEADER;
4. dbeaver tweaks (1 hr)

The most important thing in DBeaver, after we've connected to our server, is to remember to R-click on the server, go into settings, and select "Show all Databases". If you forget this step, you will go insane. I didn't know about that step and... just do it. The other thing is the helpful "Test Connection" button down in the lower-left corner.

Some columns have dates, times, and numbers, and do we want to use the telephone number as the primary key? Once we have a working concept, we'll want to UML it.

dbeaver - some basics  (16:23) QAFox, 2020. Begins about 10:00. Windoze install but still useful.

python

As noted above, however, we are now interested in letting Python do some of the scripting, so that we don't need two languages. To do this, we install PL/Python on the PostgreSQL side. Other options are available for other languages too -- for example, if we want to run statistical "R" programs against a PostgreSQL database, we'd install PL/R. Or we can write out PL/pgSQL commands and put them into Python or R scripts if we wish.

On the Python side, we obtain modules from PyPI, such as Psycopg2 (also available in the pacman repos in Arch). With PL/Python in the database, and Psycopg2 modules in our Python programs, our connections and commands to PostgreSQL become much simpler (see the sketch after the list below). And of course, one can still incorporate as much or as little PHP as they wish, and we'd still use judicious amounts of JavaScript in our Apache-served HTML pages. To summarize, in a LAMP we might want:

server-side

  1. PostgreSQL - open source database.
  2. Python - general purpose language, but with modules such as Psycopg2, we can talk to the database and also serve dynamic webpages in HTML, replacing PHP, or even create some GUI application which does this from our laptop.
  3. Apache - still necessary, since this is what interacts with the client. Python or PHP feeds the dynamic content into the HTML, but Apache feeds these pages to the client (browser).
  4. PHP - still available, if we want to use it.
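
As a hedged sketch of that psycopg2 piece (not the only way to do it): assuming the test01 database and try01 table from section 3c above, and a local passwordless connection, a small "top recipients" report could look like this.

import psycopg2

# connect to the stand-alone cluster described earlier (names assumed)
conn = psycopg2.connect(dbname="test01", user="foo", host="localhost")
cur = conn.cursor()

# count of texts and average message length for the top 3 phone numbers
cur.execute("""
    SELECT phone, COUNT(*) AS texts, AVG(LENGTH(content)) AS avg_len
    FROM try01
    GROUP BY phone
    ORDER BY texts DESC
    LIMIT 3;
""")
for phone, texts, avg_len in cur.fetchall():
    print(f"{phone}: {texts} texts, average length {avg_len:.1f}")

cur.close()
conn.close()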

Tuesday, August 11, 2020

google cloud services initialization (cloud, colab, sites, dns, ai)

Some of Google's web services have their own names but are tied together with GCP (Google Cloud Platform) and/or some specific GCP API. GCP is at the top, but a person can enter through lesser services and be unaware of the larger picture. Again, GCP is at the top, but then come related but semi-independent services: Colab, Sites, AI. In turn, each of these might rely on just a GCP API, or be under another service. For example, Colab is tied into GCP, but a person can get started in it through Drive, without knowing its larger role. When a person's trying to budget, it's a wide landscape to understand exactly what they are being charged for, and under which service.

Static Google Cloud site (9:51) Stuffed Box, 2019. Starting at the top and getting a simple static (no database) website rolling. One must already have purchased a DNS name.
Hosting on Cloud via a Colab project (30:32) Chris Titus Tech, 2019. This is a bit dated, so prices have come up, but it shows how it's done.
Hosting a Pre-packed Moodle stack (9:51) Arpit Argawal, 2020. Shows the value of a notebook VM in Colab

Google's iPython front-end Colab takes the Jupyter concept one step further, placing configuration and computing on a web platform. Customers don't have to configure an environment on their laptop, everything runs in the Google-sphere, and there are several APIs (and TensorFlow) that Google makes available.

During 2020, the documentation was a little sparse, so I made a post here, but now there are more vids and it's easier to see how we might have different notebooks running on different servers, across various Google products. This could also include something where we want to run a stack, eg for a Moodle. If all this seems arcane, don't forget we can host traditionally through Google Domains. What's going to be interesting is how blockchain changes the database concept in something like Moodle. Currently, blockchain is mostly for smart contracts and DApps.

Colab

Notebooks are created, run, and saved via the Drive menu, or go directly to colab.research.google.com. Users don't need a Google Cloud account to use Colab. The easiest way to access Colab is to connect it to one's Drive account, where it will save files anyway. Open Drive, click on the "+" sign to create a new file and go down to "More". Connect Colab and, from then on, Colab notebooks can be created and accessed straight from Drive.

There's a lot you can do with a basic Colab account, if you have a good enough internet connection to keep up with it. The Pro version is another conversation. I often prefer to strengthen Colab projects by adding interactivity with Cloud.

GUI Creation in Google Colab (21:31) AI Science, 2020. Basics for opening a project and getting it operational.
Blender in Colab (15:28) Micro Singularity, 2020. Inadvertently explains an immense amount about Colab, Python, and Blender.

Colab and Google Cloud

Suppose one writes Python for Colab that needs to call a Google API at some point. Or suppose a person wants to run a notebook on a VM that they customized? These are the two added strengths of adding Cloud: 1) make a VM (website) to craft a browser project, 2) add Google API calls. Google Cloud requires a credit card.

Cloud and Colab can be run separately, but fusing them is good in some cases. Gaining an understanding of the process lets users know when to rely on Colab, on Google Cloud, or on both interdependently.

Colab vs. Google Cloud (9:51) Arpit Argawal, 2020. Shows the value of a notebook VM in Colab
Hosting on Cloud via a Colab project (30:32) Chris Titus Tech, 2019. This is a bit dated, so prices have come up, but it shows how it's done.

Note the Google Cloud Platform homepage above. The menu on the left is richer than the one in the Colab screenshot higher above. We run the risk of being charged for some of these features, so Google displays estimated charges before we submit our requests to use Google APIs.

API credentials

We might want to make API calls to Cloud's API's. Say that a Colab notebook requires a Google API call, say to send some text for translation to another language. The user switches to their Cloud account and selects the Google API for translation. Google gives them an estimate of what calls to that API will cost. The user accepts the estimate, and then Google provides the API JSON credentials, which are then pasted into their Colab notebook. When the Colab notebook runs, it can then make the calls to the Google API. Protect such credentials because we don't want others to use them against our credit card.
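
As a rough sketch of that flow (an illustration, not a definitive recipe): assuming the Cloud Translation API has been enabled, the client library installed in the notebook (pip install google-cloud-translate), and the JSON key uploaded, a Colab cell might look like this. The key's path and file name are hypothetical.

import os
from google.cloud import translate_v2 as translate

# point the client at the JSON credentials obtained from Cloud
# (path and file name are hypothetical placeholders)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/my-project-key.json"

client = translate.Client()
result = client.translate("Protect these credentials.", target_language="es")
print(result["translatedText"])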

Cloud account VM details

In the case of running notebooks, if you update something, did it update on your machine or on Google's? It's more clear on Google Cloud.

API dependencies

When a person first opens a Colab notebook, it's on a Google server, and the path for the installed Python is typically /usr/local/lib/python/[version]. So I start writing code cells, and importing and updating API dependencies. Google will update all the dependencies on the virtual machine it creates for your project on the server. ALLEGEDLY.

Suppose I want to use google-cloud-texttospeech. Then the way to update its dependencies (supposedly):

% pip install --upgrade google-cloud-texttospeech

Users can observe all the file updates necessary for the API, unless they add the "--quiet" flag to suppress them. However, even after this process is undertaken, when the API itself is called, there can be dependency version problems between Python and iPython.

Note that in the case above the code exits with a "ContextualVersionConflict" listing a 1.16 version detected in the given folder. (BTW, this folder is on the virtual machine, not one's home system). Yet the initial upgrade command AND a subsequent "freeze" command show the installed version as 1.22. How can we possibly clear this since Google has told itself that 1.22 is installed, but the API detects version 1.16? Why are they looking in different folders? Where are they?

problem: restart the runtime

Python imports, iPython does not (page) Stack Overflow, 2013. Notes this is a problem with sys.path.

You'd think of course that there's a problem with sys.path, and to obviate *that* problem, I now explicitly import sys and make sure of the path in the first two commands...

import sys
sys.path.append('/usr/local/lib/python3.6/dist-packages/')

... in spite of the fact these are probably being fully accomplished by Colab. No, the real problem, undocumented anywhere I could find, is that one simply has to restart the runtime after updating the files. Apparently, if the runtime is not restarted, the newer version stamp is not reread into the API.
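
Restarting the runtime can be done from the Runtime menu; a commonly cited trick for doing it from a code cell is to kill the Python process and let Colab reconnect on its own. A convenience hack, not an official API:

import os

# force the Colab runtime to restart (equivalent to Runtime > Restart runtime);
# the notebook reconnects by itself afterward
os.kill(os.getpid(), 9)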

what persists and where is it?

basic checklist

  • login to colab.
  • select the desired project.
  • run the update cell
  • restart the runtime
  • proceed to subsequent cells
  • gather any rendered output files from the folder icon to the left (/content folder). Alternatively, one can hard code it into the script so that they transfer to one's system:
    from google.colab import files
    files.download("data.csv")
    There might also be a way to have these sent to Google Drive instead.
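
One way to do the Drive transfer, sketched under the assumption that the classic drive.mount() flow is acceptable (it prompts for authorization the first time it runs):

import shutil
from google.colab import drive

# mount Drive inside the Colab VM, then copy the rendered output there
drive.mount('/content/drive')
shutil.copy('data.csv', '/content/drive/My Drive/data.csv')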

Thursday, May 21, 2020

path - environment variable

NB: This post is related to a prior one this month regarding 3rd party package managers.


There are two types of environment variables: global (system) level and local (user) level. Paths are just another environment variable. The tricky part: even though paths are set in terminal environment files (see ~/.bashrc or ~/.profile), applications launched in an X session are presumed to come via the terminal, and so use terminal variables to interact with the kernel. Users can also do this directly by exporting variables to the kernel. If we want to see them all

$ printenv

Environment Variables (7:55) Maloco, 2017. Goes over each, notes that they are kept in the

Times when paths are extremely important...
  • installing applications without using the package manager
  • making some change in an application's libraries because updates killed the link to its dependencies

package manager case (LaTeX)

I get a lot of use out of LaTeX and occasionally add special stylesheets or other packages. Accordingly, I install TexLive (4GB) directly into a home directory folder (typically "latex") and oversee it separately from Arch's package manager. Oversight is with tlmgr. During install, options appear to change the install path to a local directory if wanted. Once completed, the install will give the following message to update PATHs...

Accordingly...
$ nano .bashrc
export PATH=/home/foo/latex/texlive/2020/bin/x86_64-linux/:$PATH
export INFOPATH=/home/foo/latex/texlive/2020/texmf-dist/doc/info:$INFOPATH
export MANPATH=/home/foo/latex/texlive/2020/texmf-dist/doc/man:$MANPATH
Then one can also add pdf-latex into their completion on Geany or whatever.

package manager case (Python)

Before going further, Colab is the easy Python coding solution through one's browser, and is generously hosted by Google. But we still might have occasion to code Python while offline, on our own system. In 2020, that's probably...
# pacman -S python-pipenv
... still how did we get here? It used to be complicated. Our goal was to cleanly install Python in a user directory, so that it didn't contaminate our distribution if we used pip to add any non-Arch Python modules. We wanted a solution that:
  • allowed interaction with all Arch installed apps
  • allowed updating or enhancements using pip, but without contaminating the Arch install.
  • allowed libs whether inside our user directory or the larger Arch distro.
This was not an easy prospect, as succinctly described here. The thread overall advises us that we should use our distribution as much as we can and limit pip use to our "user setup" (local directory) and virtual environments. Anaconda is probably the most well-known virtual environment/sandbox for Python, but the newer option of Pipenv (# pacman -S python-pipenv) looks even better.

Pipenv (20:48) Corey Schafer, 2018. Schafer now prefers this VE over Anaconda. Use the Arch repository for pipenv updates, but pip inside the VE.
Anaconda use (20:48) Corey Schafer, 2017. This fellow uses both Anaconda and pip.

Saturday, May 2, 2020

non-standard application managers -- wine, python, git, latex (pacman conflicts)

In most Linux distributions, there's a package manager ("PM") application we use to update our OS and any installed applications. The Arch PM is pacman, and full updates require an internet connection for its use. It's a relatively simple process:

# pacman -Syu

The problems begin with applications that have built-in package managers specific to that application. There are only a few, but they are often called upon. For example:

  1. LaTeX package manager = tlmgr (TeXLive)
  2. Git package manager = complicated unless at user level. If at user level, install git with pacman and run it as a user, it will just make a directory of source. Otherwise, git apparently downloads "clones" (versions) into /opt, but then generally you want to chmod the clone to foo:foo and move it to a ~/ directory for building, then pacman to install it.
  3. Python package manager = pip
  4. Wine package manager = complicated. Each DLL installed in Wine attempts to update itself and will prompt to do so. None should be authorized. Also, building a version of Wine that is custom-configured for the MSoft app is critical, but the underlying Arch version of wine tends to wipe these out any time it is called.

Using any of these application-specific updaters within Arch leads to failures during the next "pacman". Don't use them at all except when knowing the workarounds (see below). Essentially, there are zero problems for users who install and update the Arch Linux OS and applications using only pacman. Again, the most reliable complete update command:

# pacman -Syu

...or for application removal...

# pacman -Rs

There's also one workaround in Arch that doesn't f*ck up the index: if I download source into some directory and run...

$ makepkg -si

... it will make the package, then prompt me for the password ("-i" flag) and perform a pacman -U [package] at the end of the make. This # pacman -U properly updates the pacman index. As far as I know, # pacman -U is the only way to install outside packages without farking an install. This also means we can reliably use # pacman -Rs to uninstall packages put in through this method.

1. LaTeX - tlmgr 5Gb

2022 edit: I now use pacman, until I work on a thesis or something; then I can do a -Rsn uninstall and do it as described here.

Don't install any version of LaTeX with pacman. It doesn't have enough templates.

Install Tex-Live (LaTeX) directly into a user subdirectory (eg. "/home/foo/latex") and update it through the TexLive PM (tlmgr), but only at the user level, so that no admin level changes, or files used by pacman are affected.

$ cd /home/foo/latex/bin/x86_64-linux
$ tlmgr update --self
$ tlmgr update --all

UNLESS it's the next year and you haven't updated. This is called a "cross-release" update. I had to download that year's update-tlmgr.sh and run it. Described further down in this post and here.

2. Git - pacman + manually configure

This took me the longest to learn, as this site agrees. Configuring for upstreaming -- if using as VCS -- is time consuming, but it's different if downloading some source. Update the git app with pacman, update the sources using git commands. Here's the Arch page.

  1. install the foundational Git application using # pacman -S git.

3. Python - pip 2Gb

Nearly every program uses Python, so we want to be sure not to break our Python installation.

The solution required three independent pieces, all three of which I install. I select one of them depending on what the application requires/expects. This solution takes up several gigabytes.

  • initialize a Colab account in Google. See my post here, because there are a few steps to it. Colab handles Jupyter-type operations online. I do this development online (and save .PY files to Drive). I used to run a Jupyter environment on the desktop, and this was the heaviest pip user, disrupting pacman almost entirely. IMO, Colab is practically a godsend from Google.
  • when an application requires it, a pacman install of Python. Other applications installed using pacman may also call for additional Python modules which they add to the basic Python, and/or they may also require an older version of Python to be installed.
  • using pacman, install a virtual environment (python-pipenv) for any pip-installed modules.
For Python, we can use pip, and for LaTeX we can use tlmgr, to upgrade or add features to each app. The overall OS also has a package manager. For Arch, this is pacman. We can use pacman to upgrade or add packages to the Arch OS, including adding Arch versions of Python and LaTeX. The problem is, any actions accomplished with pip or tlmgr are not recorded in the pacman application index. Discrepancies between pacman's index and pip and tlmgr operations lead to significant application and OS problems. Prevention requires decisions during installation.

problems

The Python case is more complicated to *solve* than the LaTeX situation, because a variety of applications depend on Python, unlike LaTeX. However, both situations equally *cause* pacman update failures for the same indexing reason, so both the Python and the LaTeX cases must be solved.

Python

Many applications rely upon Python. More technically, Python is a dependency for many applications.

Imagine some application with a Python dependency, say youtube-dl. Imagine that both Python and youtube-dl were installed by pacman. They both work fine. Now suppose the user decides to add a Python module using pip instead of pacman. Everything might be fine with some applications that depend on Python, but youtube-dl is an example of an application sensitive to Python changes. The youtube-dl app looks for versions of Python installed by Arch, but will instead detect versions updated by pip. This discrepancy in turn leads youtube-dl to spawn errors, and errors in turn lead to the application exiting. How can this be solved?

$ youtube-dl https://www.youtube.com/watch?v=QLpz7PtiP2k
Traceback (most recent call last):
  File "/usr/bin/youtube-dl", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3259, in <module>
    def _initialize_master_working_set():
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3242, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3271, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 584, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3.8/site-packages/pkg_resources/__init__.py", line 901, in require
    needed = self.resolve(parse_requirements(requirements))

More drastically, this can happen with an entire OS. Say the user keeps his OS up to date with pacman, and decides to install Python using pacman. Later, he upgrades Python using Python's pip manager. Still later, the user attempts to update the entire OS, which is a pacman action. As part of its upgrade procedure, pacman verifies the integrity of all installed applications against its index. Since the changes made to Python via pip were not registered into pacman's version index, pacman cannot resolve the pip installed modules against the pacman index. These discrepancies cause the pacman update to exit with errors. No update will be accomplished. How can this be solved?

LaTeX

If you have any problems with LaTeX updates, even the tug.org website recommends just replacing the install with the latest edition. However, there is one workaround if things aren't more than a year or two old.

$ tlmgr update --self
tlmgr: Local TeX Live (2020) is older than remote repository (2021).
Cross release updates are only supported with
update-tlmgr-latest(.sh/.exe) -- --upgrade
See https://tug.org/texlive/upgrade.html for details.

For the problem above, download update-tlmgr-latest.sh, make it executable (chmod +x or file manager), and put it in /home/foo/latex or wherever your latex user-level top directory is at. Go into that directory and:

$ sh update-tlmgr-latest.sh -- --upgrade
$ tlmgr update --self
$ tlmgr update --all

Wine

Wine also can pull in specific packages when we try to make wine bottles. However, any time there's an OS update, pacman will write a new wine directory in ~/.wine. After it's overwritten, Wine will use that vanilla version instead of the wine bottle we built for the app. How can this be solved?

solutions

In the case of an app like youtube-dl, it's easy to uninstall the entire app (# pacman -Rns) and reinstall. Or to play with uninstalling its pieces until one gets a hit.

The levels of involvement vary, but the strategy is the same for both Python and for LaTeX: install these independently of Arch from the start, in /home/user directory, and then update any path statements necessary for the Arch installed apps to find the program. Once these are in at the user level, there's no problem updating them with pip and tlmgr, because the updates are strictly within one's home directory, not the larger Arch installation. Accordingly, one can add features at will, which is often desirable when running the latest Spyder or what have you. For LaTeX, it's even less of a hassle, because there are no dependencies: one just needs to update $PATH statements following install. One can create an install in either case using, eg. $ mkdir /home/foo/latex and similarly with Python, though perhaps using a new folder for any version changes.


In the case of the OS update failure, one has no pleasant choices: uninstall then reinstall all of Python (an immense time drain), or update the OS using an override (# pacman -SyuI), which then leaves a potentially unstable Python install for the apps that depend on it. This is true with LaTeX also, but nearly zero apps depend on LaTeX.

install notes

I've decided to use the 3rd party updaters to both install and update (Python - pip, LaTeX - tlmgr for TexLive). There are a couple of considerations.
  • Install in user-level directory
  • update path statements afterward, so applications can find the installation. Path statements can be complex, because there are various path statements to consider.

prevention

Sometimes one doesn't have a choice but to install packages via pip, but then what happens downstream with a pacman upgrade is ugly. It will look like this, and there's a couple of workarounds.
  • uninstall the pip upgrades, upgrade the system via pacman, then go back to pip and install whatever necessary pip packages.
  • add the --overwrite=* flag to the standard "Syu". This causes pacman to simply overwrite any python files that seem inaccurate. This places pacman in the boss position over pip, but it also farks-up the pip version of the installation which, downstream, is a b*tch.

Monday, April 20, 2020

PiP screencast pt 1

contents

  • capture - organizing the screen, opening PiP, capture commands
  • speed changes - slow motion, speed ramps
  • cuts
  • scripting
  • precision cuts
  • audio and sync
  • other effects - fade, text, saturation
  • subtitles/captions

NB: Try to make all cuts at I-frame keyframes, if possible.


Links: 1) PiP Pt II 2) capture commands 3) settings

Editing video in Linux becomes a mental health issue after a decade or more of teeth grinding with Linux GUI video editors. There are basically two backends: ffmpeg and MLT. After a lost 10 years, some users like me resign themselves to command line editing with ffmpeg and melt (the MLT CLI editor).

This post deconstructs a simple PiP screencast, perhaps 6 minutes long. A small project like this exposes nearly all the Linux editing problems which appear in a production length film. This is the additional irony of Linux video editing -- having to become practically an expert just to do the simplest things; all or nothing.

At least five steps are involved, even for a 3.5 minute video.

  1. get the content together and laid out, an impromptu storyboard. What order do I want to provide information?
  2. verify the video inputs work
  3. present and screencapture - ffplay, ffmpeg CLI
  4. cut clips w/out render - ffmpeg CLI
  5. assemble clips with transitions - ffmpeg CLI

capturing the raw video

The command-line PiP video setup requires 3 terminals to be open, 1) for the PiP, 2) for the document cam, 3) for the screen capture. Each terminal has a command. 1) ffplay, 2) ffplay, 3) ffmpeg.

1. ffplay :: PiP (always on top)

The inset window of the host narrating is a PiP that should always be on top. Open a terminal and get this running first. The source is typically the built in webcam, trained on one's face.
$ ffplay -i /dev/video0 -alwaysontop -video_size 320x240

The window always seems to open at 640x480, but it can then be resized down to 160x120 and moved anywhere on the desktop. And then to dress it up with more brightness, some color saturation, and a mirror flip...

$ ffplay -i /dev/video0 -vf eq=brightness=0.09:saturation=1.3,hflip -alwaysontop -video_size 320x240

2. ffplay :: document cam

I start this second, and make it nearly full sized, so I can use it interchangeably with any footage of the web browser.
$ ffplay -i /dev/video2 -video_size 640x480

3. ffmpeg :: screen and sound capture

Get your screen size with xrandr, eg 1366x768, then eliminate the bottom 30 pixels (20 on some systems) to omit the toolbar. If the toolbar isn't captured, it can still be used during recording to switch windows. Syntax: put the 3 flags in this order:

-video_size 1366x738 -f x11grab -i :0
...else you'll probably get only a small left corner picture or errors. Then come all your typical bitrate and framerate commands
$ ffmpeg -video_size 1366x738 -f x11grab -i :0 -r 30 output.mp4

This will encode a cleanly discernable screen at a cost of about 5M every 10 minutes. The native encoding is h264. If a person wanted to instead be "old-skool" with MPEG2 (codec:v mpeg2video), the price for the same quality is about 36 times larger: about 180M for the same 10 minutes. For MPEG2, we set a bitrate around 3M per second (b:v 3M), to capture similarly to h264 at 90K.

Stopping the screen capture is CTRL-C. However: A) be certain CTRL-C is entered only once. The hard part is, it doesn't indicate any change for over a minute so a person is tempted to CTRL-C a second time. Don't do that (else untrunc). Click the mouse on the blinking terminal cursor to be sure the terminal is focused, and then CTRL-C one time. It could be a minute or two and the file size will continue to increase, but wait. B) Before closing the terminal, be certain ffmpeg has exited.

If you CTRL-C twice, or you close the terminal before ffmpeg exits, you're gonna get the dreaded "missing moov atom" error. 1) install untrunc, 2) make another file about as long as the first but which exits normally, and 3) run untrunc against it.

Explicitly setting the screencast bitrate (eg, b:v 1M b:a 192k) typically spawns fatal errors, so I only set the frame rate.

Adding sound...well you're stuck with PulseAudio if you installed Zoom, so just add -f pulse -ac 2 -i default...I've never been able to capture sound in a Zoom meeting however.

$ ffmpeg -video_size 1366x738 -f x11grab -i :0 -r 30 -f pulse -ac 2 -i default output.mp4

manage sound sources

If a person has a Zoom going and attempts to record it locally, without benefit of the Zoom app, they typically only hear sound from their own microphone. Users must switch to the sound source of Zoom itself to capture the conversation. This is the same with any VOIP, of course. This can create problems -- a person needs to make a choice.

Other people will say that old school audio will be 200mV (0.002), p-p (peak-to-peak). Unless all these signals are changed to digital, gain needs to be set differently. One first needs to know the names of the devices. Note that this strange video tells more about computer mic input than I've seen anywhere else.

basic edits, separation, and render

Link: Cuts on keyframes :: immense amounts of information on cut and keyframe syntax


Ffmpeg can make non-destructive, non-rerendered cuts, but they may not occur on an I-frame (esp. keyframe) unless seek syntax and additional flags are used. I first run $ ffprobe foo.mp4 or $ ffmpeg -i foo.mp4 on the source file to get its bitrate, frame rate, audio sampling rate, etc. Typical source video might be 310Kb h264(high), with 128 kb/s, stereo, 48000 Hz aac audio. Time permitting, one might also want to obtain the video's I-frame (keyframe) timestamps, and send them to a text file to reference during editing...

$ ffprobe -loglevel error -skip_frame nokey -select_streams v:0 -show_entries frame=pkt_pts_time -of csv=print_section=0 foo.mp4 >fooframesinfo.txt 2>&1
  • no recoding, save tail, delete leading 20 seconds. this method places seeking before the input and it will go to the closest keyframe to 20 seconds.
    $ ffmpeg -ss 0:20 -i foo.mp4 -c copy output.mp4
  • no recoding, save beginning, delete tailing 20 seconds. In this case, seeking comes after the input. Suppose the example video is 4 minutes duration, but I want it to be 3:40 duration.
    $ ffmpeg -i foo.mp4 -t 3:40 -c copy output.mp4
    Do not forget "-c copy" or it will render. Obviously, some circumstances require this level of precision, and a person has little choice but to render.
    $ ffmpeg -i foo.mp4 -t 3:40 -strict 2 output.mp4
    This gives cleaner transitions.
  • save an interior 25 second clip, beginning 3:00 minutes into a source video
    $ ffmpeg -ss 3:00 -i foo.mp4 -t 25 -c copy output.mp4
...split-out audio and video
$ ffmpeg -i foo.mp4 -vn -ar 44100 -ac 2 sound.wav
$ ffmpeg -i foo.mp4 -c copy -an video.mp4
...recombine (requires render) with mp3 for sound, raised slightly above neutral "300", for transcoding loss
$ ffmpeg -i video.mp4 -i sound.wav -acodec libmp3lame -ar 44100 -ab 192k -ac 2 -vol 330 -vcodec copy recombined.mp4

precision cuts (+1 render)

Ffmpeg doesn't allow cutting by frame number. If you set a time without recoding, it will rough-cut to a number of seconds and a decimal. This works poorly for transitions. So what you'll have to do is recode it and enforce strict time limits, then work out the time from the number of frames. You can always bring the clip into Blender to see the exact number of frames. Even though Blender is backended with Python and ffmpeg, it somehow counts frames a la MLT.

other effects (+1 render)

Try to keep the number of renders as low as possible, since each is lossy.

fade in/out

...2 second fade-in. It's covered directly here, however, it requires the "fade" and "afade" filters which don't come standardly compiled in Arch, AND, it must re-render the video for this.
$ ffmpeg -i foo.mp4 -vf "fade=type=in:duration=2" -c:a copy output.mp4

For the fade-out, the location must be given in seconds; most recommend using ffprobe, then just start the fade 2 seconds before the point where you want the video to end. This video was 7:07.95, or 427.95 seconds. Here it is embedded with some other filters I was color balancing and de-interlacing with.

$ ffmpeg -i foo.mp4 -max_muxing_queue_size 999 -vf "fade=type=out:st=426:d=2,bwdif=1,colorbalance=rs=-0.1,colorbalance=bm=-0.1" -an foofinal.mp4

text labeling +1 render

A thorough video (2017, 18:35) exists on the process. Essentially a filter and a text file, but font files must be specified. If you install a font manager like gnome-tweaks, the virus called PulseAudio must be installed, so it's better to get a list of fonts from the command line
$ fc-list
...and from this pick the font you want in your video. The filter flag will include it.
-vf "[in]drawtext=fontfile=/usr/share/fonts/cantarell/Cantarell-Regular.otf:fontsize=40:fontcolor=white:x=100:y=100:enable='between(t,10,35)':text='this is cantarell'[out]"
... which you will want to drop into the regular command
$ ffmpeg -i foo.mp4 -vf "[stuff from above]" -c:v copy -c:a copy output.mp4

...however this cannot be done because streamcopying cannot be accomplished after a filter has been added -- the video must be re-encoded. Accordingly, you'll need to drop it into something like...

$ ffmpeg -i foo.mp4 -vf "[stuff from above]" output.mp4

Ffmpeg will copy most of the settings, but I do often specify the bit rate, since ffmpeg occasionally doubles it unnecessarily. This would just be "q:v "(variable), or "b:v "(constant). It's possible to also run multiple filters; put a comma between each filter statement.

$ ffmpeg -i foo.mp4 -vf "filter1","filter2" -c:a copy output.mp4

saturation

This great video (1:08), 2020, describes color saturation.

$ ffmpeg -i foo.mp4 -vf "eq=saturation=1.5" -c:a copy output.mp4

speed changes

1. slow entire, or either end of clip (+1 render)

The same video shows slow motion.

$ ffmpeg -i foo.mp4 -filter:v "setpts=2.0*PTS" -c:a copy output.mp4
OR
$ ffmpeg -i foo.mp4 -vf "setpts=2.0*PTS" output.mp4

Sometimes the bitrate is too low on recode. Eg, ffmpeg is likely to choose around 2,000Kb if the user doesn't specify a bitrate. Yet if there's water in the video, it will likely appear jerky below a 5,000Kb bitrate...

$ ffmpeg -i foo.mp4 -vf "setpts=2.0*PTS" -b 5M output.mp4

2. slowing a portion inside a clip (+2 render)

Complicated. If we want to slow a 2 second portion of a 3 minute normal-speed clip, but those two seconds are not at either end of the clip, then ffmpeg must slice-out the portion, slow the portion (+1 render), then concatenate the pieces again (+1 render). Also, since the single clip temporarily becomes more than one clip, a filter statement with a labeling scheme is required. It's covered here. It can be covered in a single command, but it's a big one.

Suppose we slow-mo a section from 10 through 12 seconds in this clip. The slow down adds a few seconds to the output video.

$ ffmpeg -i foo.mp4 -filter_complex "[0:v]trim=0:10,setpts=PTS-STARTPTS[v1];[0:v]trim=10:12,setpts=2*(PTS-STARTPTS)[v2];[0:v]trim=12,setpts=PTS-STARTPTS[v3];[v1][v2][v3] concat=n=3:v=1" output.mp4

supporting documents

Because of the large number of command flags and commands necessary for even a short edit, we can benefit from making a text file holding all the commands for the edit, or all the text we are going to add to the screen, or the script for TTS we are going to add, and a list of sounds, etc. With these three documents we end up sort of storyboarding our text. Finally, we might want to automate the edit with a Python file that runs through all of our commands and calls to TTS and labels.

basic concatenation txt

Without filters, file lists (~17:00 into the video) are the way to do this with jump cuts.

python automation

Python ffmpeg scripts are a large topic requiring a separate post; just a few notes here. A relatively basic video (2015, 2:48) describes Python basics inside text editors. The IDE discussion can be lengthy also, and one might want to watch this one (2020, 14:06), although if you want to avoid running a server (typically Anaconda), you might want to run a simpler IDE (Eric, IDLE, PyCharm), or even avoid IDEs altogether (2019, 6:50). Automating ffmpeg commands with Python doesn't require Jupyter since the operations just occur on one's desktop OS, not inside a browser.
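
As a minimal sketch of that automation (the file names and the two edit steps are placeholders lifted from the examples above), a Python script can simply shell out to ffmpeg with subprocess and stop if any step fails:

import subprocess

# ordered list of edit steps; each entry is one complete ffmpeg invocation
commands = [
    ["ffmpeg", "-y", "-ss", "0:20", "-i", "foo.mp4", "-c", "copy", "clip01.mp4"],
    ["ffmpeg", "-y", "-i", "clip01.mp4", "-vf", "eq=saturation=1.5", "-c:a", "copy", "clip02.mp4"],
]

for cmd in commands:
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # raises CalledProcessError if ffmpeg fails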

considerations

We want to have a small screen of us talking over a larger document or some such, and not just during recording
  • we want the small screen PiP to always be on top :: use -alwaysontop flag
  • we'd like to be able to move it
  • we'd like to make it smaller than 320x240
link: ffplay :: more settings

small screen

$ ffplay -f video4linux2 -i /dev/video0 -video_size 320x240
OR, to keep it always on top
$ ffplay -i /dev/video0 -alwaysontop -video_size 320x240

commands

The CLI commands run long. This is because ffmpeg defaults run high. Without limitations inside the commands, ffmpeg pulls 60fps, h264(high), at something like 127K bitrate. Insanely huge files. For a screencast, we're just fine with
  • 30fps
  • h264(medium)
  • 1K bitrate
flag and note
  • b:v 4Kb - if movement in the PiP is too much, up this
  • f x11grab - must be followed immediately with a second option "i", eg, "desktop"; this will also bring the h264 codec
  • framerate 30 - some would drop it to 25, but I keep with YouTube customs even when making these things. Production level would be 60fps
  • b:v 1M - if movement in the PiP is too much, up this
  • Skype - 1-1, MicroSoft data collection for the US Govt

video4linux2

This is indispensable for playing one's webcam on the desktop, but it tends to default to the highest possible framerates (14,000Kbs), and to a 640x480 window size, though the latter is resizeable. The thing is, it's unclear whether this is due to the video4linux2 codec settings, or to the ffplay which uses it. So is there a solid configuration file to reset these? This site does show a file to do this.

scripting

You might want to run a series of commands. The key issue is figuring out the chaining. Do you want to start 3 programs at once, one after the other, one after the other as each one finishes, or one after the other with the output of the prior program as the input for the next?

Bash Scripting (59:11) Derek Banas, 2016. Full tutorial on Bash scripting.
Linking commands in a script (Website) Ways to link commands.

$ nano pauseandtalk.sh (don't need sh, btw)
#!/bin/bash

There are several types of scripts. You might want a file that sequentially runs a series of ffmpeg commands, or you might want to just have a list of files for ffmpeg to look at to do a concatenation, etc.

Sample Video Editing Workflow using FFmpeg (19:33) Rick Makes, 2019. Covers de-interlacing to get rid of lines, cropping, and so on.
Video Editing Comparison: Final Cut Pro vs. FFmpeg (4:44) Rick Makes, 2019. Compares editing on the two interfaces, using scripts for FFmpeg

audio and narration/voiceover

Text-to-speech has been covered in another post; however, there are often times when a person wants to talk over some silent video. $ yay -S audio-recorder. How to pause the video and speak at a point, and still be able to concatenate?

inputs

If you've got a desktop with HDMI output, a 3.5mm hands-free mic won't go into the video card; use the RED 3.5mm mic input, then filter out the 60Hz hum. There are ideal mics with phantom power supplies, but even a decent USB mic is $50.

For syncing, you're going to want to have your audio editor and Xplayer running on the same desktop. This is because it's easier to edit the audio than the video; there's no rendering needed to edit audio.

Using only Free Software (12:42) Chris Titus Tech, 2020. Plenty of good audio information, including Auphonic (starting at 4:20), mics (don't use the Yeti - 10:42), and how to sync (9:40) at .4 speed.
Best for less than $50 (9:52) GearedInc, 2019. FifinePNP, Blue Snowball. Points out that once we get to $60, it's an "XLR" situation with preamps and so forth to mitigate background noise.
Top 5 Mics under $50 (7:41) Obey Jc, 2020. Neewer NW-7000.

find the microphone - 3.5mm

Suppose we know we're using card 0

$ amixer -c0
$ aplay -l
These give us plenty of information. However, it's still likely in an HDMI setup to hit the following problem
$ arecord -Ddefault test-mic.wav
ALSA lib pcm_dsnoop.c:641:(snd_pcm_dsnoop_open) unable to open slave
arecord: main:830: audio open error: No such file or directory

This means there is no "default" configured in ~/.asoundrc. There would be other errors too, if it's not specified. The minimum command specifies the card, coding, number of channels, and rate.

$ arecord -D hw:0,0 -f S16_LE -c 2 -r 44100 test-mic.wav

subtitles/captions

Sunday, March 1, 2020

toolbox - data

Statistics knowledge comes first, of course. Looking at Python apart from Jupyter, we can make some data-related assumptions about modules. Now of course, Google Colab is even easier than arranging Jupyter or Virtual Environments on my own system, so let's leave aside system setups and sandboxes which I cover in another post on environments and their variables.

Data Science, which I think of as "dynamic statistics", is overtaking classical statistics. In classical stats, we had to accurately create hypotheses before null-testing them. In dynamic statistics one must have accurate code to let the data bubble up its own conclusions. In software, Python is rapidly overtaking R, especially since Google made Colab and TensorFlow available via browsers. On a single system, it's more complex, as noted above. For learning Python in a Data Science-centric manner, practicals can be stock or derivatives market, climate, or epidemiological information. To try models against Wall Street quants, one can play with models at quantopian.com.

Getting Started with Colab (7:17) ProgrammingKnowledge, 2020. Intro to what is essentially Jupyter notebook, cloud version, which Google now hosts.
TensorFlow in 10 minutes (9:00) edureka, 2019. Google recently began including TensorFlow into its Colab, so we now have a complete machine learning environment.
How I Would Learn Data Science (8:35) Ken Jee, 2020. Several websites and specific methods, 5 years for this guy. He emphasizes practical projects.
Practical Scraping (31:56) Computer Science, 2020. Colab project. How to work the practical on Python. He gets to his function around 15:00, prior is visualizing.

type      note
Geany     text-based for nearly any code via plugins
Jupyter   web-integrated Python, designed to display output in a browser
Eclipse   java-based, takes plugins for, eg, RStudio
RStudio   R-specific IDE
PSPP      GNU version of SPSS. Does most. $ yay -S pspp. GUI: psppire
gretl     econometrics. $ yay -S gretl. GUI: gretl
octave    GNU version of MATLAB. $ pacman -S octave. GUI: octave

Data and Statistics (code)

10 Python Tips (39:20) Corey Schafer, 2019
Python NumPy overview w/arrays (58:40) Keith Galli, 2019
Jupyter - Python Sales Analysis Example (1:26:07) Keith Galli, 2020
Pandas - Data Science Tutorial (1:00:27) Keith Galli, 2018 CSV reading, beautiful soup
Python Stock Prediction (49:48) Computer Science, 2020
Beautiful Soup stock prices (10:47) straight_code, 2019
Options analysis in Python (1:02:22) Engineers.SG, 2016 Black-Scholes (emotional volatility) in Pandas.
Derivative analytics in Python (1:29:27) O'Reilly, 2014 Data frames and Monte Carlo (brownian).

Data and Statistics (classic)

Combinations vs. Permutations (20:59) Brandon Foltz, 2012 For either finite math or stats
Linear Regression Playlist (multiple) Brandon Foltz, 2013
Covariance basics (26:22) Brandon Foltz, 2013 stock examples, vs correlation/linear regression

Saturday, June 1, 2019

XML information retrieval and display (python, XSL)

In most cases, we want to retrieve unformatted XML files, create a stylesheet (XSL, XSD), and view the XML file in a browser.

In a few cases, we may want to automate reading-in the XML (eg. w/Python), strip key information, and place the information into a database (eg. PostgreSQL). We can then create a template for recalling the XML info into a browser.
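
A rough sketch of that second case, assuming SMS-style attribute names ("address", "body", "readable_date") like the backup discussed further down, and a PostgreSQL table that already exists; every name here is a placeholder:

import xml.etree.ElementTree as et
import psycopg2

tree = et.parse("/home/foo/py/sms-backup.xml")  # placeholder path
conn = psycopg2.connect(dbname="test01", user="foo", host="localhost")
cur = conn.cursor()

# pull a few attributes from each child label and insert them as rows
# (attribute and column names are assumptions for illustration)
for sms in tree.getroot().findall("sms"):
    cur.execute(
        "INSERT INTO sms_backup (phone, content, sent) VALUES (%s, %s, %s)",
        (sms.get("address"), sms.get("body"), sms.get("readable_date")))

conn.commit()
cur.close()
conn.close()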

The simple display situation is similar to what LEO agencies do forensically for court. The latter case of data retrieval is more similar to what intelligence agencies do with electronic records. There's overlap of course. At any rate, citizens are forced to almost create their own wheel on this issue, probably b/c there's zero government/business incentive to allow citizens access to information (since c. 2001).


In my computer filing system, I put each XML project in a folder, because inside each project there may be several different applications used as well as the HTML, XML, and XSL files. There's too large a variety of software -- python, postgresql, etc -- to store projects under applications.

1. stylesheet solution (Geany)

We can add style information into an XML file, same as we do inside an HTML file if we add a "style" section in its header. Or we can have an XML file call a separate style-centric XML file, which we usually re-suffix as an XSL file to point out its style usage. The latter is similar to putting all HTML formatting into a separate CSS file.

We can use any plain text editor; I often use Geany. We want our displayed HTML, XSL, or XSD files to be standalone, so we don't run SMS files through outside servers to re-format. Tightly constructed headers are necessary for this.

  • XML (input): the basic XML file which may not be conceptually clear or human readable, possibly with many attributes
  • XSL: the map we create formatting these tags, similar to a CSS in HTML, ie called or married to the document. More here. Many browsers consider XSL a security risk if placed in same directory as XML.
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    Document
    </xsl:stylesheet>
    An optimization will be to place the XSL style information into the HTML or XML header.
  • XSD
  • XML (output): our finished file structured in a way we can read and with a dictionary for future use.

XSLT to HTML (15:57) Brandon Jones, 2018. Skip first 4 minutes. Uses an intermediate step of an XSL file. 5:30: "marry" XML to an XSD to tell it how to display, but actually uses an XSL for the browser?
attribute notes (19:06) Kent D. Lee, 2013. important XML header information first couple of minutes. TCX file is Garmin proprietary XML.

2. python manipulation

python attribute retrieval (19:06) Kent D. Lee, 2013. teacher at luther.edu uses a proprietary TCX Garmin file, their tagged XML, and harvests information for his own use. 8:50 how to retrieve dictionary of attributes for an element (tag).
XSLT to HTML (15:57) Brandon Jones, 2018. Skip first 4 minutes. Uses an intermediate step of an XSL file. 5:30: "marry" XML to an XSD to tell it how to display, but actually uses an XSL for the browser?
basic extraction w/python (14:51) Extreme Automation Kamal Girdher, 2019. inflected English narrator. simplistic extraction of tree items.

XML considerations

SMS B&R's backups are standalone XML's with two label types: root labels ("smses"), and child labels. There is of course only one root label per XML file, but the child ("sms") labels number in the hundreds, depending on how many texts were backed-up. Each SMS or MMS is another child label, with its data stored in child label's attributes. I must write Python and/or XSL which harvests the information in the child labels, and then chronologically assembles texts beneath the correct phone number(s).

coding considerations

In Python it's trivial to create an output ASCII file; just open a file with "w+", print to it, then close the file. For the data extraction though, there are thousands of approaches. Given my limited Python ability, I considered roughly two:

Schema1: XML is in date order. Open the XML, read all the cell numbers into a set (removes duplicates). The first cell number in the set is tested against each line in the XML: if the cell numbers match, the XML row is written to the text file. Then repeat this process with the second cell number in the set, and the third, and so on through the cell number set. Continue until each cell number in the set has been matched against the XML, and its rows written to the output file.

Schema2: XML is in date order. Keeping date order, sort all instances of same number. So number order, then date order within each grouping of numbers. Write all of this to a file with a header between each grouping of numbers.

Schema 1

1) Create a set of all the cell numbers. We want a set, not a list, b/c Python sets exclude any non-unique items, unlike lists. We only want one instance of each cell number. Sets are iterable, but have no indices.
import os
import xml.etree.ElementTree as et

# read-in the XML file and get the root
xml_file="/home/foo/py/sms-20190303033634.xml"
tree=et.parse(xml_file)
root=tree.getroot()

# iterate through the XML file and create a set of
# telephone numbers
nu = set()
for sms in root.findall('sms'):
    numb=sms.get('address')[-10:]
    nu.add(numb)
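2) A possible next step -- purely a sketch; the 'date' and 'body' attribute names and the output filename are assumptions -- walks the set and writes each number's matching rows to the text file, as Schema 1 describes:
# step 2 (sketch): test every XML row against each number in the set
with open("sms-output.txt", "w+") as out:
    for numb in nu:
        out.write("\n===== " + numb + " =====\n")
        for sms in root.findall('sms'):
            if sms.get('address')[-10:] == numb:
                out.write(sms.get('date') + "  " + sms.get('body') + "\n")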

XSL

Links: StackOverflow :: 2 conditions select (union) :: YouTube (8:17)
The Chromium team at Google apparently disallows XSL files in the same directory as an XML. I wasted a day writing reliable XSL files, but Chromium wouldn't open the XML. Finally, I remembered to tap F12 and look for errors: the "security" violation warning was immediately obvious. This Chromium XSL policy is the subject of a heated bug discussion on the Web. Whatever; I simply downloaded Firefox and the XSL displayed the XML. The basic template for an XSL is below:
<?xml version='1.0' encoding='UTF-8'?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="[rootlabel]">

[Code to transform the text]

</xsl:template>
</xsl:stylesheet>
A line (an xml-stylesheet processing instruction) is also added to the source XML file pointing to this XSL file, similar to how HTML and CSS are used together. XHTML, for instance, is just HTML expressed as XML.

project specific

As noted at the top, I wanted XSL-transformed XML to approximate the EMT text format. I started to build the XSL: 1) extract phone numbers, 2) test each XML row against a phone number, 3) write to text, 4) repeat with the next phone number, until all phone numbers were tested and printed.

background

There are certain benign actions which the powers-that-be nevertheless make difficult for citizens, either by security design or by the reverse invisible hand of the DMCA or other rent-seekers. One of these is an easy-to-read record of SMS's.

SMS retention is trivial for forensics and surveillance; they use SMS's regularly against citizens. What smells bad to me about that? 1) citizens should be on an equal playing field; 2) SMS retention should be easy (ASCII) for any citizen; 3) SMS retention used to be easy and mysteriously became difficult. For example, there used to be a simple app in the Android Play Store called Email My Texts. EMT backed up SMS's and MMS's in a simple, intuitive, ASCII format (screenshot below). As you can see, EMT did not back up MMS attachments, but it added a line of text to media MMS's noting that a media file had been attached. The ASCII format made the backup file on Dropbox easily searchable via a browser.

Try to find something like this nowadays that doesn't go through a questionable server somewhere. The closest you can get today is to download from your phone in some XML format. This of course means you'll have all the "user-friendliness" of inscrutable XML tags and no way to format your files. Let's see if we can find some way to harvest all the tags and reformat them into, eg, an HTML page. Probably we can't. This will be a time-consuming process with the necessary addition of a huge CSS file or an immense "style" header.

In about 2018, EMT disappeared from the Android store. The developer's website noted EMT was discontinued, but did not provide an explanation. I could find no apps in 2019 which produced a similar result to EMT -- the current batch of back-up apps produce annoying formats: PDFs, XML, CSV, proprietary formats. All of these are undesirable compared to ASCII. After some research, XML seemed the simplest format to convert to ASCII. So, a few weeks ago, I purchased the ad-free XML backup client SMS Backup and Restore Pro ($5). Its higher price seemed worthwhile insofar as, by avoiding ads, I could likely avoid Google or app developers parsing my private texts for ad relevance.

As for XML, I figured I could create a layman/amateurish Python script or XSL file to parse the XML and convert it to a text file. I had no intention of using, eg, the timeit module or of otherwise optimizing the code.

objective

Simple: translate SMS B&R's XML backup file into an ASCII format, similar to that in the screenshot above.

Saturday, November 12, 2016

openshot-qt fickleness

I recently attended a wedding anniversary and downloaded openshot-qt soon after, from the Arch repository. Openshot didn't open on the first attempt, instead spawning some window errors. To be sure which version(s) of QT the package was compiled with, I ran ldd against it, but got the following (surprising1) result:
$ ldd /bin/openshot-qt
not a dynamic executable
I then went over to read-up on any idiosyncrasies about the installation, and noticed...
"bwaynej commented on 2016-09-26 02:37
openshot-qt depends on python-pip in order to run"
...ergo...
# pacman -S python-pip desktop-file-utils libopenshot python python-httplib2 python-pyqt5 qt5-webkit shared-mime-info
Most of these were re-installs. Thereafter, to determine where everything was placed:
$ pacman -Ql openshot

solution

Went into yaourt and updated openshot-qt, during which I was queried about deleting the "conflicting" package openshot2. I authorized this and openshot-qt finally produced a display, albeit the window errors continued.


1 for a presumably large application like a video editor. Eg, I would expect openshot-qt to be a ridiculously, possibly even unusably, massive app if compiled statically.
2 In spite of significant searching, I never found an installed or tarball version of openshot on my system -- the "conflict" warning was the only clue of interference from the old openshot package.

Sunday, August 30, 2015

python and sqlite

Links:
A previous post discussed the huge (250MB) installation of PostgreSQL for something as robust as a LAMP stack. However, suppose I just need a small database to keep track of business cards or some such?

SQLite CLI

Unlike PostgreSQL, SQLite has no server: everything is contained in a local "db" file you create. You can easily back it up by just backing up the file. To "connect" to the database, just go into the directory containing that database file, let's say it's "sample.db". Then...
$ sqlite3 sample.db
... and just do your business. Once complete, it's ".exit". Much easier for this kind of thing is a simple GUI, like sqliteman (Qt based).

Python access - APSW

But in our Python app, we'll want to access this database directly from the app, so we need a set of Python commands. If we want, we can use SQL commands and just import the sqlite3 module into our code...
#!/usr/bin/python
import os,sys,time
import sqlite3
... and use SQL, intermixed with Python.
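A minimal sketch of that approach (the "cards.db" file and "cards" table are just illustrative names):
#!/usr/bin/python
import sqlite3

# open (or create) the local database file
conn = sqlite3.connect("cards.db")
cur = conn.cursor()

# plain SQL, intermixed with Python
cur.execute("CREATE TABLE IF NOT EXISTS cards (name TEXT, phone TEXT)")
cur.execute("INSERT INTO cards VALUES (?, ?)", ("Foo Bar", "555-1234"))
conn.commit()

for row in cur.execute("SELECT * FROM cards"):
    print(row)

conn.close()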

But if we want a wrapper with a set of Python classes, the easiest Arch package to install is APSW (# pacman -S apsw), which is technically a wrapper rather than a plain module, but still provides a class for interacting with SQLite databases.

Once we have that, we can start our code with, say...
#!/usr/bin/python
import os,sys,time
import apsw
...and go on from there. However, I don't need any SQLite wrappers for the level of programming that I do --- I just write the database-related commands in SQL. It's not that hard.
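If we did want to try the wrapper, a minimal sketch might look like this (same illustrative "cards.db" and "cards" table as above):
import apsw

con = apsw.Connection("cards.db")
cur = con.cursor()

# unlike the sqlite3 module, APSW does not open implicit transactions,
# so these statements commit as they run (SQLite autocommit)
cur.execute("CREATE TABLE IF NOT EXISTS cards (name TEXT, phone TEXT)")
cur.execute("INSERT INTO cards VALUES (?, ?)", ("Foo Bar", "555-1234"))

for row in cur.execute("SELECT * FROM cards"):
    print(row)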

Python application

Remember that standalone Python programs are relatively simple to accomplish in Linux, but it depends on whether one compiles against a Python release and set of libs, or attempts to include these all in the package. Since most of what you run on your machine are scripts, like Bash scripts, the only thing you need to worry about among Python releases is which modules are natively available. The way to check is to first determine the highest version of Python on your system, and then see if it has the modules you need to run your scripts.

Example

Suppose I'm building a script that I want to create a GTK window for (or Qt, but let's use GTK in this example), and so I've got gtk3 and pygtk installed. I'm going to check the version of Python and then list its modules and see if it has the gtk and pygtk modules.
$ python --version
Python 3.4.3
$ python3
Python 3.4.3 (default, Mar 25 2015, 17:13:50)
[GCC 4.9.2 20150304 (prerelease)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> help('modules')
[all modules are listed]
>>> exit()
I check the modules list and note gtk and pygtk are in the Python 2 modules list, but are not in the Python 3 modules list. Some Googling shows that Python 3 has a compatibility module, pygtkcompat, which emulates pygtk. All I need to do then is import pygtkcompat and configure it. Following that, I treat my code as if gtk and pygtk existed. For example, the script might begin...
#! /usr/bin/python3

import os, sys, time
import pygtkcompat

# enable "gobject" and "glib" modules
pygtkcompat.enable()
pygtkcompat.enable_gtk(version='3.0') #matches shebang
import gtk
import glib
...and then whatever code comes subsequently. Normal GTK window commands are then available.
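For instance, continuing the snippet above, a minimal window (just a sketch) might be:
# create and show an empty window, then hand control to the GTK main loop
win = gtk.Window()
win.set_title("Example")
win.connect("destroy", gtk.main_quit)
win.show_all()
gtk.main()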

Monday, March 29, 2010

Python - PHP - PostgreSQL - Android - C++

Links: Python GUI thread   wxPython tutorial

A lot to learn, but probably the only way to properly get to the point where I can catalog files. PHP first (web-based), Python next (desktop), PostgreSQL (desktop, but growing on the web), and then a step up from Python to C++. C++ is apparently more efficient than Python, but takes deeper knowledge. I'm not fond of Java until its Runtime Environment becomes more refined and stable.

Python flavors
The issue with Python is apparently that, once one gets to a GUI level with it, there are two main flavors. One, Tkinter, is essentially Python using the Tk libraries. wxPython has its own library set. It seems that wxPython is on more of a growth path than the older Tkinter; I read one description that said so. So far, I've been happy with wxPython.

Python/Postgresql
This does not seem problematic. It uses a module called psycopg2, which has hooks for the DB. At this site, there is a simple tutorial on how to make the connection.
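A minimal connection sketch (the database name, user, and credentials here are purely illustrative):
#!/usr/bin/python
import psycopg2

# connect to a local PostgreSQL instance
conn = psycopg2.connect(dbname="filedb", user="foo",
                        password="secret", host="localhost")
cur = conn.cursor()

cur.execute("SELECT version();")
print(cur.fetchone())

cur.close()
conn.close()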

For programming, using arrays well is the real challenge.

Python/MySQL
Don't care, but have to learn it.

PHP
The difficult part is using arrays in the best possible ways.