Preparing news articles downloaded from Factiva for corpus linguistics analysis in TXM

With the onset of the COVID-19 pandemic, colleagues in the Cambridge Geography Department got together to write a Remote Fieldwork booklet aimed at undergraduate geographers faced with the prospect of researching and writing their final-year dissertations entirely from their desks. I contributed a short piece on how to integrate corpus linguistic techniques into critical discourse analysis.

You can read that piece here: Resources for remote research in Human Geography

While it offers some good introductory advice and suggestions for further reading, it is a little short on technical details. I am writing this tutorial series so that anyone wishing to experiment with corpus linguistic tools can get set up to do so, even with very little programming experience (just like me!). The hope is that after reading this, you’ll be able to put together a corpus of news articles with as little technical hassle as possible.

In this post, I focus specifically on how to download news articles from the Factiva database, and prepare them for analysis in the corpus-linguistic software TXM.

Install pre-requisites

For this particular method of news corpus creation, you need:

(1) Access to the Factiva database: https://global.factiva.com/

(2) A version of TXM installed on your computer, to do your analysis: http://textometrie.ens-lyon.fr/?lang=en

(3) Python installed on your computer

You can download the latest version of Python (3.9.2 in March 2021) from this page: https://www.python.org/downloads/

Instructions for installation can be found here: https://docs.python.org/3/using/index.html.

You also need to install a couple of packages, which are required to run the Python script that converts your HTML files into a format (TXT + CSV) that can be imported into TXM. Once Python is installed, you can add the packages by opening your command prompt (Windows) or terminal (Mac/Linux) and typing:

pip install pandas

and then

pip install lxml

NOTE: if the above installation commands do not work and you are on Mac OS, it’s likely because you already have version 2.7 of Python pre-installed (you can type python --version in your terminal to find out). If that’s the case, use the same commands but with pip3 instead.
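
If you are unsure whether the two packages ended up in the Python installation you are actually running, a short check like the one below can help. This is only a sketch: the function name missing_packages is my own, and it simply asks Python whether it can locate pandas and lxml without importing them.

```python
import importlib.util
import sys

# The two packages this tutorial asks you to install.
REQUIRED = ["pandas", "lxml"]


def missing_packages(names):
    """Return the names that this Python interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Show which Python is running, e.g. to spot an old 2.7 installation.
    print("Python version:", sys.version.split()[0])
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("pandas and lxml are both installed.")
```

Save it under any name you like (for example check_packages.py) and run it with the same python or python3 command you plan to use for the parser later on.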

Download articles from Factiva as indexed HTML files & merge them

This gif outlines how to download your articles from Factiva. Make sure to download them in the correct “indexed” format, and as HTML files.

Factiva only allows you to download articles in batches of 100. Do this as many times as is necessary. Save all the HTML files into the same folder.

You will then need to merge all your HTML files into a single file.

This online tool can do this for you: http://www.html-merge.com/. Drag and drop all the downloaded files, click merge, and download the file it produces. HTML Merge limits both the number of files you can merge and the total size of the output. If you exceed these limits, you need an alternative method:

[alternative merging method]

Using your command prompt (Windows) or terminal (Mac/Linux), navigate to the directory where you’ve saved all of your html files. In my case, I’ve placed them in a directory (= folder) called “C:\Users\user\Documents\factiva\html”. The command for this is cd followed by the file path:

cd C:\Users\user\Documents\factiva\html

Then, in Windows, type the following command. It will merge all of the separate HTML files into a single new one called factiva-all.html:

copy *.html factiva-all.html

The equivalent command in the Mac/Linux terminal is:

cat *.html >factiva-all.html
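
Since Python is already installed, the same merge can also be done with a few lines of Python that behave identically on Windows, Mac and Linux. The sketch below is not part of the original workflow: the function name merge_html is hypothetical, and it assumes all your downloads sit in one folder and end in .html. It joins the files in alphabetical order and skips the output file itself, so running it a second time does not merge the previous result into the new one.

```python
from pathlib import Path


def merge_html(folder, output_name="factiva-all.html"):
    """Concatenate every .html file in `folder` into one file, in name order.

    Returns the number of files that were merged.
    """
    folder = Path(folder)
    output = folder / output_name
    # List the sources first, leaving out the output file from any earlier run.
    sources = sorted(p for p in folder.glob("*.html") if p.name != output_name)
    with output.open("wb") as merged:
        for source in sources:
            # Copy the raw bytes so the original encoding is left untouched.
            merged.write(source.read_bytes())
    return len(sources)
```

To use it, start Python in the folder containing your downloads, paste the function in, and call merge_html("."). The number it returns should match the number of batches you downloaded from Factiva.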

Convert HTML file to a format usable by TXM

To convert the HTML file into a usable format that can be imported into TXM, you can use this Python script kindly made for this specific purpose by Lane Atmore. Download it here: https://github.com/laneatmore/factiva_html_parser.

Select the version that works for your operating system (either Windows or Mac/Linux) and save the html_parser.py script file to a directory of your choice. In my case, I place it in the directory above the one that contains my merged HTML file, that is: “C:\Users\user\Documents\factiva\”

Then, open your command prompt (Windows) or terminal (Mac/Linux). Navigate to the sub-directory where you have saved the HTML file. To do this, type cd followed by the file path, so in my case:

cd C:\Users\user\Documents\factiva\html

Then, run the parser! To do this, type python (or python3 if you had to use pip3 earlier), followed by the path to the html_parser.py file, and then the path to the merged HTML file, which I’ve called factiva-all.html.

In my case:

python C:\Users\user\Documents\factiva\html_parser.py C:\Users\user\Documents\factiva\html\factiva-all.html

And that should do it!

(If you have a large database with hundreds or thousands of articles, it might take a few minutes.)

You should now have, in the “C:\Users\user\Documents\factiva\html\” directory, a TXT file for each article in the HTML file, numbered from 0 to however many articles there are in the HTML file.

There should also be a metadata.csv file that contains key information about each article: an identification number (starting at 0 and corresponding to the numbers of the TXT files), the article headline, author/byline, publication date, news source, and word count.
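
Before importing, it can be worth checking that the number of TXT files matches the number of entries in metadata.csv. The sketch below makes a few assumptions that I have not verified against the parser: that metadata.csv is saved as UTF-8, and that it sits in the same folder as the TXT files. The function name summarise_output is my own.

```python
import csv
from pathlib import Path


def summarise_output(folder, metadata_name="metadata.csv"):
    """Count the article TXT files and the non-empty rows of the metadata file."""
    folder = Path(folder)
    n_txt = len(list(folder.glob("*.txt")))
    # Assumption: the CSV is UTF-8 encoded; change `encoding` if it is not.
    with (folder / metadata_name).open(newline="", encoding="utf-8") as handle:
        n_rows = sum(1 for row in csv.reader(handle) if row)
    return n_txt, n_rows
```

Calling summarise_output(".") from the output folder returns two numbers: the count of TXT files and the count of rows in the CSV. If the CSV starts with a header row, the second number should be exactly one more than the first. Any other gap suggests that some articles were not converted.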

You can now import the database into TXM using the TXT + CSV import module.