# This script is based on two scripts by Jarle Ebeling and Vladislav Dorokhin.
# It gathers in one place all the functionality we need to prepare a text for alignment.
# The script contains a component that removes junk characters (such as trailing spaces), a tagging
# component (for the two kinds of tags it inserts), a component that adds an id to each of those tags,
# and, eventually, its own XML validator (so that we don't need to run Oxygen or another console
# validator to validate the XML). Unfortunately, the validator is not done yet.
# If you install the free ActivePerl distribution (http://www.activestate.com/activeperl), which makes .pl files
# executable, you can run this script with a single click. That means you don't need to use the command line or
# enter any arguments (including the file name) to process your text.
# Usage. The script takes the plain text file located in its own directory. If there are several
# plain text files, the script takes the first one in alphabetical order. If there are no .txt files
# in the directory, the script does nothing. Next, the script creates an XML file and writes the header
# into it, followed by the processed text.
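#
# The file selection described above can be sketched as follows. This is only an illustrative sketch,
# not necessarily how the script itself does it: the name first_txt_file is hypothetical, and "alphabetical
# order" is assumed here to mean Perl's default string sort (so upper-case names sort before lower-case ones).
{
    use strict;
    use warnings;
    use FindBin qw($Bin);    # $Bin holds the directory the running script is located in

    # Hypothetical helper: return the name of the alphabetically first *.txt file
    # in $dir (default: the script's own directory), or undef if there is none.
    sub first_txt_file {
        my ($dir) = @_;
        $dir = $Bin unless defined $dir;
        opendir(my $dh, $dir) or return;
        my @txt = grep { /\.txt\z/i && -f "$dir/$_" } readdir $dh;
        closedir $dh;
        @txt = sort @txt;
        return $txt[0];      # undef when the directory contains no .txt files
    }
}
# Example call (commented out so that it has no effect on the script):
#   my $input = first_txt_file();
#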
# Please note that the ID generated for each tag depends heavily on the file name. The recommended
# format of the file name is: a) the initials of the author, b) the number of the text in the corpus, c) the letter "T" if
# it is a translation, d) the letter that marks the language of the text. For example, "EH1E" is the first text by Ernest
# Hemingway, written in English; "EH1TR" is the same text, except that it is a Russian translation
# of Hemingway's text.
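#
# To make the recommended naming scheme concrete, here is a small sketch that splits such a name into its
# four parts. It is an illustration only: the name parse_text_name and the returned hash keys are hypothetical,
# and the pattern assumes upper-case letters and a single-letter language code, as in the examples above.
{
    use strict;
    use warnings;

    # Hypothetical helper: split "EH1E" or "EH1TR" (with or without ".txt")
    # into initials, number, translation flag and language letter.
    # Returns a hash reference, or undef if the name does not fit the scheme.
    sub parse_text_name {
        my ($base) = @_;
        $base =~ s/\.txt\z//i;    # drop the extension if present
        return unless $base =~ /\A([A-Z]+)(\d+)(T?)([A-Z])\z/;
        return {
            initials    => $1,
            number      => $2,
            translation => ($3 eq 'T') ? 1 : 0,
            language    => $4,
        };
    }
}
# Example call (commented out so that it has no effect on the script):
#   my $parts = parse_text_name('EH1TR.txt');   # initials "EH", number 1, translation, language "R"
#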
# Please pay attention to these four things. They are really important!
# First. Your text file must be true UTF-8.
# Second. The file name must be correct (see above). The extension ".txt" is required. There must be only one .txt file in
# the script directory.
# Third. XML header. After the script has finished, you must open the XML file in your text editor and replace the
# asterisks (***) in the header with the relevant information.
# Fourth.