javaScanner

Headline

lexer-based text processing in Language:Java

A simple custom-made lexer is used to process a text-based representation of companies. The lexer leverages Java's basic Scanner API. Hence, it uses a delimiter pattern to chop the input into candidate tokens; it then uses regular expressions to recognize specific tokens. The default delimiter pattern is used: whitespace. This also means that whitespace itself is not reported as a token. Feature:Total is implemented by means of finding token sequences consisting of keyword "salary" followed by a number. (Just looking for a number would be sufficient for the situation at hand because numbers are used for salaries only, but the extra test makes the point that ad hoc tests may be needed when lexers are used for data processing.) Feature:Cut copies lexemes to an output stream while modifying salaries and performing some ad hoc pretty printing. Such a combination of lexer and pretty printing implements Feature:Parsing and Feature:Unparsing.

Given java.util.Scanner's approach to scanning with its reliance on delimiters, it is not straightforward to support proper string literals. The problem is that the straightforward token delimiter, whitespace, can also occur inside (proper) strings. No other definition of delimiter, not even a dynamically changing definition seem to be applicable here. Hence, the present implementation simply does not allow spaces in string literals---which is clearly a major limitation.

Note: Because of this issue, this is essentially a suboptimal implementation. See Contribution:javaLexer for a more robust lexer-based implementation in Java.

Illustration

The data model is implemented as plain textual files:

company "ACMECorporation" { department "Research" { manager "Craig" { address "Redmond" salary 123456 } employee "Erik" { address "Utrecht" salary 12345 } employee "Ralf" { address "Koblenz" salary 1234 } } department "Development" { manager "Ray" { address "Redmond" salary 234567 } department "Dev1" { manager "Klaus" { address "Boston" salary 23456 } department "Dev1.1" { manager "Karl" { address "Riga" salary 2345 } employee "Joe" { address "WifiCity" salary 2344 } } } } }

Feature:Parsing is implemented using the helper class Recognizer to enable step-by-step lexing:

public class Parsing {

    public static void parse(String file) throws FileNotFoundException, RecognitionException {
        Recognizer lexer = new Recognizer(file);
        lexer.lexall();
    }

}

Feature:Unparsing demonstrates the use of the Recognizer to execute semantic actions (only write lexemes) during Feature:Parsing

public class Unparsing {

    // Unparsing is implemented as part of Cut

}

Feature:Total and Feature:Cut are implemented using Feature:Parsing with semantic actions:

    public static double total(String s) throws FileNotFoundException {
        double total = 0;
        Recognizer recognizer = new Recognizer(s);
        Token current = null;
        Token previous = null;
        while (recognizer.hasNext()) {
            current = recognizer.next();
            if (current==FLOAT && previous==SALARY) 
                total += Double.parseDouble(recognizer.getLexeme());
            previous = current;
        }
        return total;
    }

    public void cut(String in, String out) throws IOException {
        recognizer = new Recognizer(in);
        writer = new OutputStreamWriter(new FileOutputStream(out));
        Token current = null;
        Token previous = null;
        String lexeme = null;
        while (recognizer.hasNext()) {
            current = recognizer.next();
            lexeme = recognizer.getLexeme();

            // Cut salary in half
            if (current==FLOAT && previous==SALARY)
                lexeme = Double.toString(
                            (Double.parseDouble(recognizer.getLexeme())
                                / 2.0d));

            // Adjust indentation
            if (current==OPEN) right();
            if (current==CLOSE) left();

            // Add linebreaks
            if (current==DEPARTMENT
            ||  current==MANAGER
            ||  current==EMPLOYEE
            ||  current==ADDRESS
            ||  current==SALARY
            ||  current==CLOSE) {
                nl();
                indent();
            }

            // Copy possibly modified lexeme
            write(lexeme + " ");

            previous = current;
        }
        writer.close();
    }

Test cases are implemented for all Namespace:Features.

Relationships

For plain syntax checking with Technology:ANTLR see Contribution:antlrAcceptor.

For lexer-based text processing in pure Language:Java see Contribution:javaScanner.

For lexing/tokenization with Technology:ANTLR see Contribution:antlrLexer.

For a custom made lexer in pure Language:Java see Contribution:javaLexer.

For parsing with semantic actions with Technology:ANTLR see Contribution:antlrParser.

For recursive-descent parsing in pure Language:Java] see Contribution:javaParser.

For parser combinators in pure Language:Java] see Contribution:javaParseLib.

For object/text mapping from test to companies with Technology:ANTLR see Contribution:antlrObjects.

For object/text mapping from text to trees with Technology:ANTLR see Contribution:antlrTrees.

Architecture

The contribution follows a standardized structure:

inputs contains input files for tests
src/main/java contains the following packages:
- org.softlang.company.features for implementations of Functional requirements.
  - org.softlang.company.features.recognizer for helper classes.
src/test/java contains the following packages:
- org.softlang.company.tests for Technology:JUnit test cases for Namespace:Features.