Saturday, September 25, 2021

csv, sms, postgres, dbeaver, python project?

I began this post ten years ago thinking only about a LAPP or LAMP stack. Even that was too wide a scope. This post covers only importing a CSV into a stand-alone PostgreSQL instance. We can then take the data for additional processing, practice, or whatever. Or for Business Intelligence (BI), since every database is an opportunity for BI meta-analysis. There are several pre-made tools for this, but they may not address our focus. They're still worth a look.

dash and plotly introduction  (29:20) Charming Data, 2020. A bit outdated, but the fundamental concepts are still helpful.
PBX - on-site or cloud  (35:26) Louis Rossmann, 2016. Cited mostly for the schematic breakdown at 17:45.
PBX - true overhead costs  (11:49) Rich Technology Center, 2020. Average vid, but tells hard facts. Asterisk server ($180) discussed.

file/record plan issues

We can save psql and SQL scripts as *.psql and *.sql files respectively. If we use, e.g., DBeaver, it will generate SQL scripts, which we can subsequently transform, with some work, into psql scripts. We may also wish to learn more math, though this can be difficult on one's own.

Back to our project. At the lowest levels, we need a system which

  • provides document storage. Folders receive a number relevant to their path, and we save documents in them without changing the file names. Of course, this has to account for updates as new files are added or old ones are retired/deleted.
  • queries either on meta information (dates) or on document content (word search); see the sketch after this list
  • reports on the database itself: its size, category structures, number of files pointed to, and so on. This is good BI app territory.
  • configurable back-up, or cloud hosting
  • if office staff need access, possibly a GUI to delimit employees' queries. Browser-friendly or desktop development is required for such a GUI (PyGObject/GTK, or Qt), maybe even Colab.
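A minimal sketch of what the storage and query bullets might look like in PostgreSQL, driven from Python with Psycopg2. The table layout, column names, and database name here are hypothetical placeholders, not a finished design:

import psycopg2

# hypothetical layout: each folder gets a number, file names stored unchanged
conn = psycopg2.connect(dbname="docstore", user="foo")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id       SERIAL PRIMARY KEY,
        folder   INTEGER NOT NULL,      -- the folder's assigned number
        filename VARCHAR NOT NULL,      -- original file name, unchanged
        added    TIMESTAMP DEFAULT now(),
        retired  BOOLEAN DEFAULT FALSE, -- flags retired/deleted documents
        content  TEXT
    )""")

# query on meta information (dates)
cur.execute("SELECT filename FROM documents WHERE added > %s", ("2021-01-01",))

# query on document content (word search), using PostgreSQL full-text search
cur.execute(
    "SELECT filename FROM documents "
    "WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)",
    ("invoice",))
print(cur.fetchall())

conn.commit()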

Python has become advanced enough to establish client-side connections with a database, something which used to require PHP.

Grafana apparently works best with time series, but it could perhaps also let us display counts of minutes or numbers of texts tied to various phone numbers. Pandas and even R have some visualization options.
Dash is another API similar to Grafana. We can use one or the other, or GraphQL.
Fluentd can help us if we need to manage several relevant logs around this database project.
Logs are somewhat specific, but Prometheus monitors selected system traits - memory use, HDD access. It can also do this over HTTP in a distributed system, as described here. These kinds of time series integrate well with Grafana.
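Before committing to Grafana, a quick pandas sketch can confirm such counts are derivable from the raw backup; the column names here ("phone", "date") are assumptions about the CSV described below:

import pandas as pd

df = pd.read_csv("backup.csv")              # the SMS backup described below
print(df["phone"].value_counts().head(3))   # top 3 numbers by text count

# a simple time series: texts per day, the sort of thing Grafana would chart
df["date"] = pd.to_datetime(df["date"])
print(df.set_index("date").resample("D").size())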


No particular logo, but SMS related: the ability to run programs against the database to find mean, mode, and standard deviation, even within a table. For example, a report that shows the average number of texts, word length, top 3 recipients and their tallies, etc. Easily run queries against a phone number and report to screen. Use a browser to access the DB.
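As a sketch of that report, most of it is a handful of aggregate queries; PostgreSQL supplies avg() and stddev() directly, and mode() as an ordered-set aggregate. Table and column names are the ones used in the import attempts below:

import psycopg2

conn = psycopg2.connect(dbname="test01", user="foo")
cur = conn.cursor()

# mean and standard deviation of message length
cur.execute("SELECT avg(length(content)), stddev(length(content)) FROM try01")
print(cur.fetchone())

# mode: the single most common correspondent
cur.execute("SELECT mode() WITHIN GROUP (ORDER BY phone) FROM try01")
print(cur.fetchone())

# top 3 recipients and their tallies
cur.execute("""SELECT phone, count(*) FROM try01
               GROUP BY phone ORDER BY count(*) DESC LIMIT 3""")
print(cur.fetchall())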

SMS

A possible project for document management may be to establish SMS message storage, query, and retrieval. The data are row entries from a CSV, rather than disparate files with file names. A reasonable test case.

  • pacman -S postgresql dbeaver python. By using DBeaver, we don't have to install PHP and pgAdmin. Saves an extra server.
  • activate the postgresql server
  • import the CSV into Gnumeric and note column names and data types. This can even be codified with UML
  • populate the database with correct column names and formats. Also check the data file with a text editor to verify comma counts are accurate, etc.
  • Python script (see the sketch after this list) or manual DBeaver import of the CSV file(s).
  • rclone backup of data files to the Cloud
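For the scripted variant of the import step, Psycopg2's copy_expert() can drive the same COPY command that appears in the psql sessions below; the path and table name match those sessions and are otherwise assumptions:

import psycopg2

conn = psycopg2.connect(dbname="test01", user="foo")
cur = conn.cursor()

# COPY ... FROM STDIN reads the file client-side,
# so the server never needs access to /home/foo
with open("/home/foo/backup01.csv") as f:
    cur.copy_expert("COPY try01 FROM STDIN WITH (FORMAT csv, HEADER)", f)

conn.commit()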

PostgreSQL - install and import CSV  (11:19) Railsware Product Academy, 2020. Also PgAdmin install and walkthrough.

1 & 2. postgresql install/activate (1.5 hrs)

Link: Derek Banas' 3.75 hr PostgreSQL leviathan

See my earlier post on the steps for this. I rated the process at 2 hrs at the time. However, with the steps open in a separate page, 1.5 hrs seems reasonable.

At this point, we should have a PostgreSQL server (cluster) operating on a stand-alone system out of our home directory, with our own username, e.g., "foo". We can CREATE or DROP databases as needed as we attempt to hit upon a workable configuration.

$ postgres -D /home/foo/postgres/data
$ psql -U foo postgres
postgres=# create database example;
postgres=# drop database example;
OR
postgres=# drop database if exists example;
postgres=# exit

It would be nice to have a separate cluster for each import attempt, so I wouldn't have to carefully name databases, but I don't want to run "initdb" a bunch of times and create a different data directory for each cluster, which that would require. So I'm leaving everything under the "postgres" cluster, and when I get the final set of databases I like, I'll write a script (e.g., *.psql) or some UML so I can re-install the solution after deleting the test project.
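A rebuild script along those lines could be as small as the following sketch; the final database names are hypothetical, and autocommit is needed because CREATE DATABASE refuses to run inside a transaction block:

import psycopg2

# connect to the maintenance database; CREATE/DROP DATABASE need autocommit
conn = psycopg2.connect(dbname="postgres", user="foo")
conn.autocommit = True
cur = conn.cursor()

for db in ("sms", "documents"):      # hypothetical final database names
    cur.execute(f"DROP DATABASE IF EXISTS {db}")
    cur.execute(f"CREATE DATABASE {db}")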

3. csv import

Most things on a computer can be done 100 different ways, so I started with the simplest -- psql -- and moved on to other ways. This is all covered below. But my first step was to pick a CSV of backed-up SMSes for the process, and clean it.

3a. csv cleaning

I first selected "backup.csv", with 7 columns and 456 rows. I simply opened it with Gnumeric and had a look. I noted advertising in the first and last rows and stripped those rows out in Geany. This left me a 7x454 sheet, with the first row holding the column titles. What's interesting here is that some of the texts had hard returns in their message contents, so that the file ran to 581 lines. I therefore made a second version, 7x10, with no returns in the message contents, "backup02.csv", for the simplest processing.

3b. data types

The efforts below have taught me that we need to understand several basic data types; there were a lot of failures until I did. I found this video and this post helpful. Matt was only making a table for a two-column price list, but it gives us a start.

And here is the more complex scenario, with multiple tables, which Raghav skillfully addresses.

3c. monolithic import attempt

Without considering keys or any normalization, let's try to just bring in an entire CSV, along the lines of Matt's import above. Of note, the times were in the format "2021-03-07 21:40:25", apparently in the timezone where the call occurred.

$ psql -U foo postgres
postgres=# create database test01;
postgres=# \c test01
test01=# create table try01 (between VARCHAR, name VARCHAR(30), phone VARCHAR(20), content VARCHAR, date TIMESTAMP, mms_subject VARCHAR(10), mms_link VARCHAR(20));
test01=# COPY try01 FROM '/home/foo/backup01.csv' DELIMITER ',' CSV HEADER;
test01=# select * from try01;

This accurately brought in and formatted the data; however, it also brought in a first row that was all dashes, and I don't really need the 1st, 6th, and 7th columns. I looked at the PostgreSQL documentation for COPY. The columns looked easiest to fix, so I created a smaller table without them.

test01=# drop table try01;
test01=# create table try01 (name VARCHAR(30), phone VARCHAR(20), content VARCHAR, date TIMESTAMP);
test01=# COPY try01 FROM '/home/foo/backup01.csv' DELIMITER ',' CSV HEADER;
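That last COPY will balk, though: the file still has seven columns while the table now has four. One workaround, sketched here with Psycopg2, is to trim the unwanted columns client-side and feed the result to COPY over STDIN (the path and column positions are the ones from this backup file):

import csv, io
import psycopg2

conn = psycopg2.connect(dbname="test01", user="foo")
cur = conn.cursor()

# cut the CSV down to name, phone, content, date (columns 2-5) client-side
buf = io.StringIO()
writer = csv.writer(buf)
with open("/home/foo/backup01.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                   # skip the header row
    for row in reader:
        writer.writerow(row[1:5])  # drop the 1st, 6th, and 7th columns
buf.seek(0)

cur.copy_expert("COPY try01 FROM STDIN WITH (FORMAT csv)", buf)
conn.commit()
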
4. dbeaver tweaks (1 hr)

The most important thing in DBeaver, after we've connected to our server, is to remember to right-click on the server, go into settings, and select "Show all databases". If you forget this step, you will go insane. I didn't know about that step and... just do it. The other helpful thing is the "Test Connection" button down in the lower-left corner.

Some columns hold dates, times, and numbers, and we need to decide questions such as whether to use the telephone number as the primary key. Once we have a working concept, we'll want to UML it.

dbeaver - some basics  (16:23) QAFox, 2020. Begins about 10:00. Windoze install but still useful.

python

As noted above, however, we are now interested in letting Python do some of the scripting, so that we don't need two languages. To do this, we install PL/Python on the PostgreSQL side. Other options are available for other languages too -- for example, if we want to run statistical "R" programs against a PostgreSQL database, we'd install PL/R. Or we can write out PL/pgSQL commands and put them into Python or R scripts if we wish.

On the Python side, we obtain modules from PyPI, such as Psycopg2 (also available in the pacman repos on Arch). With PL/Python in the database, and Psycopg2 in our Python programs, our connections and commands to PostgreSQL become much simpler. And of course, one can still incorporate as much or as little PHP as one wishes, and we'd still use judicious amounts of JavaScript in our Apache-served HTML pages. To summarize, in a LAMP we might want the following (a minimal connection sketch follows the list):

server-side

  1. PostgreSQL - open source database.
  2. Python - general purpose language, but with modules such as Psycopg2, we can talk to the database and also serve dynamic webpages in HTML, replacing PHP, or even create some GUI application which does this from our laptop.
  3. Apache - still necessary, since this is what interacts with the client. Python or PHP feeds the dynamic content into the HTML, but Apache feeds these pages to the client (browser).
  4. PHP - still available, if we want to use it.
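
In that arrangement, item 2's conversation with item 1 is short and direct. A minimal Psycopg2 connection sketch, assuming the stand-alone cluster and "foo" user from earlier and the test01 database from section 3:

import psycopg2

# connect to the local cluster started earlier (postgres -D ~/postgres/data)
conn = psycopg2.connect(dbname="test01", user="foo", host="localhost")
cur = conn.cursor()
cur.execute("SELECT count(*) FROM try01")  # any SQL the page needs
(texts,) = cur.fetchone()
print(f"{texts} texts on file")            # hand this off to the HTML template
conn.close()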
