Saturday, June 1, 2019

XML information retrieval and display (python, XSL)

In most cases, we want to retrieve unformatted XML files, create a stylesheet (XSL) and perhaps a schema (XSD), and view the XML file in a browser.

In a few cases, we may want to automate reading-in the XML (eg. w/Python), extract the key information, and place that information into a database (eg. PostgreSQL). We can then create a template for recalling the XML info into a browser.
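
As a rough sketch of that second case, the snippet below reads a tiny XML string, pulls attributes out of each child element, and stores them in a database table. Everything named here is an assumption for illustration: the `records`/`record` elements, the `id` and `title` attributes, and the table layout are made up, and the standard-library `sqlite3` module stands in for PostgreSQL so that the example runs without a server.

```python
import sqlite3
import xml.etree.ElementTree as et

# Hypothetical sample XML; element and attribute names are made up
SAMPLE_XML = """<records>
  <record id="1" title="first" />
  <record id="2" title="second" />
</records>"""


def load_records(xml_text, conn):
    """Copy the attributes of each <record> element into a table."""
    root = et.fromstring(xml_text)
    conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER, title TEXT)")
    rows = [(int(r.get("id")), r.get("title")) for r in root.findall("record")]
    conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# sqlite3 (in memory) is a stand-in here, not the PostgreSQL setup itself
conn = sqlite3.connect(":memory:")
count = load_records(SAMPLE_XML, conn)
stored = conn.execute("SELECT id, title FROM records ORDER BY id").fetchall()
print(count, stored)
```

With PostgreSQL the same shape should carry over, but the connection call and the parameter placeholders would depend on whichever driver is used.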

The simple display situation is similar to what LEO agencies do forensically for Court. The latter case of data retrieval is more similar to what intelligence agencies do with electronic records. There's overlap of course. At any rate, citizens are forced to almost create their own wheel on this issue, probably b/c there's been zero government/business incentive to allow citizens access to information (since c. 2001).


In my computer filing system, I put each XML project in its own folder, because inside each project there may be several different applications used as well as the HTML, XML, and XSL files. There's too large a variety of software (Python, PostgreSQL, etc.) to store projects under applications.

1. stylesheet solution (Geany)

We can add style information into an XML file, same as we do inside an HTML file if we add a "style" section in its header. Or we can have the XML file call a separate style-centric XML file, which we usually re-suffix as an XSL file to point out its style usage. The latter is similar to putting all HTML formatting into a separate CSS file.

We can use any plain text editor; I often use Geany. We want our displayed HTML, XSL, or XSD files to be standalone, so that we never have to run SMS files through outside servers to re-format them. Tightly constructed headers are necessary for this.

  • XML (input): the basic XML file which may not be conceptually clear or human readable, possibly with many attributes
  • XSL: the map we create for formatting these tags, similar to a CSS file in HTML, ie called by or married to the document. Some browsers (Chromium, in my experience) treat a local XSL file as a security risk, even when it is in the same directory as the XML.
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    Document
    </xsl:stylesheet>
    A possible optimization is to place the XSL style information directly into the HTML or XML header.
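  • XSD: the schema, which describes which tags and attributes the XML may contain (validation rather than display)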
  • XML (output): our finished file structured in a way we can read and with a dictionary for future use.

XSLT to HTML (15:57) Brandon Jones, 2018. Skip first 4 minutes. Uses an intermediate step of an XSL file. 5:30: "marry" XML to an XSD to tell it how to display, but actually uses an XSL for the browser?
attribute notes (19:06) Kent D. Lee, 2013. important XML header information first couple of minutes. TCX file is Garmin proprietary XML.

2. python manipulation

python attribute retrieval (19:06) Kent D. Lee, 2013. teacher at luther.edu uses a proprietary TCX Garmin file, their tagged XML, and harvests information for his own use. 8:50 how to retrieve dictionary of attributes for an element (tag).
basic extraction w/python (14:51) Extreme Automation Kamal Girdher, 2019. inflected English narrator. simplistic extraction of tree items.

XML considerations

SMS B&R's backups are standalone XML files with two label (element) types: a root label ("smses") and child labels ("sms"). There is of course only one root label per XML file, but the child labels number in the hundreds, depending on how many texts were backed up. Each SMS or MMS is another child label, with its data stored in the child label's attributes. I must write Python and/or XSL which harvests the information in the child labels, and then chronologically assembles texts beneath the correct phone number(s).
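
To make that structure concrete, here is a small sketch that parses a made-up backup in the shape just described and lists each child label's attributes. The sample messages are invented, and apart from "address" (which my own script uses) the attribute names `date`, `type`, and `body` are assumptions about the backup format.

```python
import xml.etree.ElementTree as et

# Invented sample in the shape described above. Attribute names other than
# "address" are assumptions about the backup format.
SAMPLE_XML = """<smses>
  <sms address="+15555550101" date="1000" type="1" body="Running late" />
  <sms address="5555550102" date="2000" type="2" body="See you at 6" />
  <sms address="5555550101" date="3000" type="2" body="No problem" />
</smses>"""

root = et.fromstring(SAMPLE_XML)
print(root.tag)  # the single root label

# Each child's data lives in its attributes, exposed as a dictionary
messages = [dict(sms.attrib) for sms in root.findall("sms")]
for message in messages:
    print(message["address"], message["body"])
```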

coding considerations

In Python it's trivial to create an output ASCII file: open the file with mode "w" (or "w+"), point the print statements at it, then close the file. For the data extraction though, there are thousands of approaches. Given my limited Python ability, I considered roughly two:
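
A minimal sketch of the file-writing part, assuming a throwaway file name in the system temp directory (the name `sms_output_example.txt` and the two placeholder lines are mine, not part of the project):

```python
import os
import tempfile

lines = ["first line", "second line"]  # placeholder text

# Hypothetical output path in the system temp directory
out_path = os.path.join(tempfile.gettempdir(), "sms_output_example.txt")

# "w" creates or truncates the file; the with-block closes it for us.
# errors="replace" turns any non-ASCII character into "?" instead of failing.
with open(out_path, "w", encoding="ascii", errors="replace") as out:
    for line in lines:
        print(line, file=out)

# Read the file back to confirm what was written
with open(out_path, encoding="ascii") as check:
    written = check.read()
print(written)
```

Whether to replace or drop non-ASCII characters (emoji, accented names) is a judgment call; the `errors="replace"` choice above is only one option.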

Schema1: XML is in date order. Open the XML, read all the cell numbers into a set (removes duplicates). The first cell number in the set is tested against each line in the XML: if the cell numbers match, the XML row is written to the text file. Repeat with the second cell number, then the third, and so on, until every cell number in the set has been matched against the XML and its rows written to the output file.

Schema2: XML is in date order. Keeping date order, group all instances of the same number. So number order, then date order within each grouping of numbers. Write all of this to a file with a header between each grouping of numbers.
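
Schema 2 might be sketched roughly as follows. This is one possible reading of it, not tested against a real backup: the sample messages are invented, the `date` and `body` attribute names are assumptions, and the header style is arbitrary.

```python
import xml.etree.ElementTree as et
from collections import defaultdict

# Invented sample; "date" and "body" attribute names are assumptions
SAMPLE_XML = """<smses>
  <sms address="+15555550101" date="1000" body="Running late" />
  <sms address="5555550102" date="2000" body="See you at 6" />
  <sms address="5555550101" date="3000" body="No problem" />
</smses>"""


def group_by_number(xml_text):
    """Group sms elements by the last ten characters of their address."""
    root = et.fromstring(xml_text)
    groups = defaultdict(list)
    for sms in root.findall("sms"):
        number = (sms.get("address") or "")[-10:]
        groups[number].append(sms)
    # The file should already be in date order; sorting is only a safeguard
    for rows in groups.values():
        rows.sort(key=lambda s: int(s.get("date", "0")))
    return groups


def render(groups):
    """Return the text output: a header per number, then its messages."""
    out = []
    for number in sorted(groups):
        out.append(f"=== {number} ===")
        for sms in groups[number]:
            out.append(sms.get("body", ""))
        out.append("")
    return "\n".join(out)


groups = group_by_number(SAMPLE_XML)
text = render(groups)
print(text)
```

This reads the XML once and does the grouping in memory, which is the main difference from Schema 1's repeated passes.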

Schema 1

1) Create a set of all the cell numbers. We want a set, not a list, b/c Python sets exclude any non-unique items, unlike lists. We only want one instance of each cell number. Sets are iterable, but have no indices.
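import xml.etree.ElementTree as et

# read-in the XML file and get the root
xml_file = "/home/foo/py/sms-20190303033634.xml"
tree = et.parse(xml_file)
root = tree.getroot()

# iterate through the XML file and create a set of telephone numbers,
# keeping only the last ten characters of each address
# (this drops a leading country code such as +1)
nu = set()
for sms in root.findall('sms'):
    address = sms.get('address')
    if address:  # skip any entry that has no address attribute
        nu.add(address[-10:])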
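
The remaining steps of Schema 1 (test every row against each number, write the matches) could look something like the sketch below. It is self-contained, so it rebuilds the set from an invented sample rather than from the real backup; the `body` attribute name is an assumption, the function name `write_by_number` is mine, and an in-memory `io.StringIO` buffer stands in for the output text file.

```python
import io
import xml.etree.ElementTree as et

# Invented sample; the "body" attribute name is an assumption
SAMPLE_XML = """<smses>
  <sms address="+15555550101" date="1000" body="Running late" />
  <sms address="5555550102" date="2000" body="See you at 6" />
  <sms address="5555550101" date="3000" body="No problem" />
</smses>"""

root = et.fromstring(SAMPLE_XML)

# Step 1: the set of unique ten-character numbers
numbers = {(sms.get("address") or "")[-10:] for sms in root.findall("sms")}


def write_by_number(root, numbers, out):
    """For each number, scan every row and write the rows that match."""
    for number in sorted(numbers):  # sets have no order, so impose one
        out.write(f"{number}\n")
        for sms in root.findall("sms"):
            if (sms.get("address") or "")[-10:] == number:
                out.write(f"  {sms.get('body', '')}\n")


buffer = io.StringIO()  # stands in for an open text file
write_by_number(root, numbers, buffer)
result = buffer.getvalue()
print(result)
```

Each number costs one full pass over the XML, which is fine for a few hundred texts but is the obvious thing to revisit if the backup grows.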

XSL

Links: StackOverflow :: 2 conditions select (union) :: YouTube (8:17)
The Chromium team at Google apparently disallows applying an XSL file to an XML file opened from the local disk, even when both are in the same directory. I wasted a day writing reliable XSL files that Chromium wouldn't use to display the XML. Finally, I remembered to tap F12 and look for errors: the "security" violation warning was obvious immediately. This Chromium XSL policy is a heated bug discussion on the Web. Whatever, I simply downloaded Firefox and the XSL displayed the XML. The basic template for an XSL is below:
<?xml version='1.0' encoding='UTF-8'?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match= "[rootlabel]">

[Code to transform the text]

</xsl:template>
</xsl:stylesheet>
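A line is also added near the top of the source XML file pointing to this XSL file (an xml-stylesheet processing instruction such as <?xml-stylesheet type="text/xsl" href="[stylesheet].xsl"?>), similarly to how HTML and CSS are used together. HTML is a close cousin of XML; XHTML is the variant that is strictly XML.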

project specific

As noted at the top, I wanted XSL-transformed XML to approximate the EMT text format. I started to build the XSL: 1) extract phone numbers, 2) test each XML row against a phone number, 3) write to text, 4) repeat with the next phone number, until all phone numbers were tested and printed.

background

There are certain benign actions which the powers-that-be nevertheless make difficult for citizens, either by security design or by the reverse invisible hand of the DMCA or other rent-seekers. One of these is keeping an easy-to-read record of SMS's.

SMS retention is trivial for forensics and surveillance; they use SMS's regularly against citizens. What smells bad to me about that? 1) citizens should be on an equal playing field. 2) SMS retention should be easy (ASCII) for any citizen, 3) SMS retention used to be easy and mysteriously became difficult. For example, there used to be a simple app in the Android Play Store called Email My Texts. EMT backed-up SMS's and MMS's in a simple, intuitive, ASCII format (screenshot below). As you can see, EMT did not back-up MMS attachments, but it added a line of text to media MMS's noting that a media file had been attached. The ASCII format made the backup file on Dropbox easily searchable via a browser.

Try to find something like this nowadays that doesn't go through a questionable server somewhere. The closest you can get today is to download from your phone in some XML format. This of course means you'll have all the "user-friendliness" of inscrutable XML tags and no way to format your files. Let's see if we can find some way to harvest all the root tags and reformat them into eg, an HTML page, etc. Probably we can't. This will be a time-consuming process with the necessary addition of a huge CSS file or an immense "style" header.

In about 2018, EMT disappeared from the Android store. The developer's website noted EMT was discontinued, but did not provide an explanation. I could find no apps in 2019 which produced a similar result to EMT -- the current batch of back-up apps produce annoying formats: PDF's, XML, CSV, proprietaries. All of these are undesirable compared to ASCII. After some research, the format which looked simplest to convert to ASCII was XML. So, a few weeks ago, I purchased the ad-free XML backup client SMS Backup and Restore Pro ($5). Its high price seemed worth it insofar as, by avoiding ads, I could likely avoid Google or app developers parsing my private texts for ad relevance.

As for XML, I figured I could create a layman/amateurish Python script or XSL file to parse the XML and convert it to a text file. I had no intention of using, eg., the timeit module or of optimizing the code.

objective

Simple: translate SMS B&R's XML backup file into an ASCII format, similar to that in the screenshot above.
