Bulk Import HTML into Joomla

I have a site I’m migrating from straight HTML to Joomla, and I have a bunch of HTML files I want to load into it. I wrote a Perl script that automates all this. It reads each HTML file to determine the title of the page, looks at the file’s modification date to determine the “publication date” for Joomla, and then makes a MySQL database connection and executes the query.

This is a quick script I hacked together in about an hour. It probably works, but it is not for the faint of heart. If you barely understand Joomla, you’ve never looked at Perl before, and you’re working from a Windows PC, go elsewhere. This will be a lot of trouble for you and I won’t help you. If you’re using a Mac or Linux, or you’re comfortable running Perl on Windows, this is pretty straightforward.
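To make the first two steps concrete, here is a minimal sketch of how a page title and a MySQL-style date could be pulled from a file. It is not the script's own code: the helper names extract_title and mtime_as_datetime are made up for illustration, and the regular expression is a dependency-free stand-in for HTML::Parser, which the real script lists as a prerequisite.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: return the text of the first <title> element in an
# HTML string, or the single word "Article" when there is none.
sub extract_title {
    my ($html) = @_;
    if ($html =~ m{<title[^>]*>(.*?)</title>}is) {
        my $title = $1;
        $title =~ s/\s+/ /g;        # collapse newlines and runs of spaces
        $title =~ s/^\s+|\s+$//g;   # trim both ends
        return $title if length $title;
    }
    return "Article";
}

# Hypothetical helper: format a file's modification time as
# YYYY-MM-DD HH:MM:SS (local time), the shape a MySQL DATETIME accepts.
sub mtime_as_datetime {
    my ($path) = @_;
    my $mtime = (stat($path))[9];
    die "Cannot stat $path: $!" unless defined $mtime;
    return strftime("%Y-%m-%d %H:%M:%S", localtime($mtime));
}

# Try it on a file named on the command line, if one was given.
my $file = shift @ARGV;
if (defined $file && -f $file) {
    open(my $fh, '<', $file) or die "Cannot read $file: $!";
    my $html = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    print extract_title($html), " (", mtime_as_datetime($file), ")\n";
}
```

A regular expression is fragile on messy markup, so treat this only as a way to see what the two values look like before they reach the database.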

Preparation

  1. Download the script.
  2. Make sure you have the Perl prerequisites:
    • HTML::Parser
    • DBI
    • DBD::mysql
  3. You must edit the script itself and change about ten things. That sounds like a lot, but they’re all obvious and straightforward. You edit these 10 lines:

    $db::user      = "dbuser";
    $db::passwd    = "dbpass";
    $db::database  = "joomla";
    $db::hostname  = "localhost";
    $db::port      = "3306";
    $db::tablename = "jos_content";
    $j::state    = 1;
    $j::section  = 1;
    $j::category = 1;
    $j::creator  = 62;
    

The database stuff is obvious: you have to tell it what database server to connect to, what the database name is, the table name, and the credentials to use. The “state” is whether you want the article published or not. I don’t know what the value needs to be for “not published.” I’m going to guess 0. The category and section numbers should be easy to figure out from the category and section lists in your Joomla administrator interface. Finally, you need to figure out what your creator ID is. In a stock install, 62 is admin. I don’t know how to figure out the right number for your particular user from the Joomla interface. Inspecting your MySQL database and looking at an existing article that is similar is a good way to find all of these values.
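As a rough illustration of how those ten settings could end up in a database call, here is a sketch that builds a DBD::mysql connection string and a parameterized INSERT. The helper names (build_dsn, build_insert, connect_db) and the column list are assumptions based on a stock Joomla 1.x jos_content table, not something taken from the script; check them against your own schema before relying on them.

```perl
use strict;
use warnings;

# The same values you edit at the top of the script, gathered into hashes.
my %db = (
    user     => "dbuser",    passwd => "dbpass",
    database => "joomla",    hostname => "localhost",
    port     => "3306",      tablename => "jos_content",
);
my %j = (state => 1, section => 1, category => 1, creator => 62);

# Build the data source name that DBD::mysql expects.
sub build_dsn {
    my ($cfg) = @_;
    return "DBI:mysql:database=$cfg->{database};host=$cfg->{hostname};port=$cfg->{port}";
}

# Build an INSERT with placeholders plus the matching bind values.
# ASSUMPTION: these column names match a stock Joomla 1.x content table.
sub build_insert {
    my ($cfg, $joomla, $article) = @_;
    my @columns = qw(title introtext state sectionid catid created created_by);
    my $sql = sprintf "INSERT INTO %s (%s) VALUES (%s)",
        $cfg->{tablename},
        join(", ", @columns),
        join(", ", ("?") x @columns);
    my @bind = (
        $article->{title},   $article->{html},
        $joomla->{state},    $joomla->{section}, $joomla->{category},
        $article->{created}, $joomla->{creator},
    );
    return ($sql, @bind);
}

# Open the connection. DBI is loaded only here, so the helpers above can
# be tried out on a machine that has no database driver installed.
sub connect_db {
    my ($cfg) = @_;
    require DBI;
    return DBI->connect(build_dsn($cfg), $cfg->{user}, $cfg->{passwd},
                        { RaiseError => 1 });
}

print build_dsn(\%db), "\n";
```

With a live database, the pieces would go together as `my $dbh = connect_db(\%db);` followed by `$dbh->do($sql, undef, @bind);`. Placeholders let the driver do the quoting, which matters when the bound value is an entire HTML file. For the creator ID, a query along the lines of `SELECT id, username FROM jos_users` should list the candidates, assuming the default table prefix.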

Create a directory

This script takes exactly one directory name (not file names) as its one and only argument. Put each file that you want uploaded into the directory. If you want some files going into one category and other files going into another category, create several directories. Run the script on one directory, then edit it to change the category, and run it on the other directory. See? It’s crude, but effective.
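If you want to see which files a directory would contribute before touching the database, a small sketch like the following can help. The helper name files_in_dir is hypothetical; it mirrors the behaviour described in this post (plain files only, no symbolic links, no subdirectories) rather than reproducing the script's own loop.

```perl
use strict;
use warnings;

# Hypothetical helper: return the plain files directly inside $dir, in
# sorted order, skipping symbolic links and subdirectories.
sub files_in_dir {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @entries = readdir($dh);
    closedir($dh);

    my @files;
    for my $entry (sort @entries) {
        my $path = "$dir/$entry";
        next if -l $path;        # symbolic links are ignored
        next unless -f $path;    # drops ".", "..", and subdirectories
        push @files, $path;
    }
    return @files;
}

# Dry run: list what would be uploaded from the directory given as the
# one and only argument.
my $dir = shift @ARGV;
if (defined $dir && -d $dir) {
    print "$_\n" for files_in_dir($dir);
}
```

Running it on each directory first is a cheap way to confirm that every file is where you think it is before you change the category and run the real upload.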

Run the script

The script is called UploadToJoomla.pl. When you’re ready, invoke it by typing:

perl UploadToJoomla.pl directory

(where directory is the location of the files to upload).

Limitations and Things to Know

  • It takes the file’s modification date (the date that you see when you run ls or DIR) and makes that the creation date for the article. I.e., in Joomla, that’s the date that the article will appear to have been written. If you do as I suggest above and sort files into different directories by making copies of them, you’ll create a bunch of new files whose modification date is right now. Moving files rather than copying them is the simplest way to preserve their modification dates. That’s mv on FreeBSD / UNIX / Linux, MOVE on DOS / Windows, and dragging and dropping stuff around on MacOS and Windows.
  • For those of you on a UNIX-like system, it does not handle symbolic links. They’re ignored.
  • It does not go into subdirectories. That was a conscious choice on my part. It would be quite easy to modify the script to check if the entry in a directory is another directory, and then descend into it by calling processDir(). That’s left as an exercise for the reader.
  • The script looks for the <title> tag to determine the article title. If it doesn’t find that tag, then it just assigns the single word “Article” as the title and summary.
  • The article Title and Summary will be the same. All of the file’s content goes into the Intro Text. There will be no additional text.
  • The entire content of the HTML file, even bits that Joomla doesn’t want (e.g., <body> tags, <head> tags, etc.) will get sucked into the database. As far as I can tell, Joomla handles this very gracefully. None of those things come out when the article is displayed on the site.

Closing Thoughts

I don’t do a lot of programming for the public domain. Mainly I hack stuff until it works for me, then I move along. I hope this is helpful for people, but understand that it’s not production-quality software. If you have trouble, you can email me (blog@filter.paco.to), but I might not be able to help you. Please realize that I literally never use Windows. So if you email with a problem about running it on Windows, I’ll be more lost than you are.

The code is released under the Perl Artistic License, which pretty much means you can do lots of stuff with it as long as you keep it all free (as in beer) and open.

© 2018 Paco Hope. All Rights Reserved except where explicitly stated.