Headline

lexer-based text processing in Language:Java

Characteristics

A simple custom-made lexer is used to process a text-based representation of companies. The lexer uses a lookahead of 1 and reports all tokens, including whitespace. Such processing implements Feature:Parsing. Feature:Total is implemented by finding token sequences that consist of the keyword "salary" followed by a number, while ignoring any whitespace in between. (Just looking for a number would be sufficient for the situation at hand because numbers are used for salaries only, but the extra test makes the point that ad hoc tests may be needed when lexers are used for data processing.) Feature:Cut copies lexemes to an output stream while modifying salaries. The lexemes of whitespace tokens carry the layout from input to output. Such processing implements Feature:Unparsing.

Illustration

The data model is implemented as plain textual files:

company "ACME Corporation" {
    department "Research" {
        manager "Craig" {
            address "Redmond"
            salary 123456
        }
        employee "Erik" {
            address "Utrecht"
            salary 12345
        }
        employee "Ralf" {
            address "Koblenz"
            salary 1234
        }
    }
    department "Development" {
        manager "Ray" {
            address "Redmond"
            salary 234567
        }
        department "Dev1" {
            manager "Klaus" {
                address "Boston"
                salary 23456
            }
            department "Dev1.1" {
                manager "Karl" {
                    address "Riga"
                    salary 2345
                }
                employee "Joe" {
                    address "Wifi City"
                    salary 2344
                }
            }
        }
    }
}

Feature:Parsing is implemented using the helper class Recognizer to enable step-by-step lexing:

public class Parsing {

    public static Recognizer recognizeCompany(String in) throws IOException {
        Recognizer recognizer = new Recognizer(in);
        return recognizer;
    }

}
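The Recognizer class itself is not shown on this page. As a rough illustration of the idea only, the following sketch lexes an in-memory string with one character of lookahead and reports whitespace as a token. The class name LexerSketch, the String-based constructor, the helper methods tokens and roundTrip, and all token names other than FLOAT, SALARY and WS are assumptions made for this sketch; the actual Recognizer may differ, for instance by reading from a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a Recognizer-like lexer; lexes a String, not a file. */
final class LexerSketch {

    // FLOAT, SALARY and WS appear in the text; the other names are assumptions.
    enum Token { KEYWORD, SALARY, STRING, FLOAT, OPEN, CLOSE, WS }

    private final String input;
    private int pos = 0;        // index of the single lookahead character
    private String lexeme = ""; // text of the most recently returned token

    LexerSketch(String input) { this.input = input; }

    boolean hasNext() { return pos < input.length(); }

    String getLexeme() { return lexeme; }

    /** Returns the next token; one character of lookahead selects the token class. */
    Token next() {
        if (!hasNext()) throw new NoSuchElementException("end of input");
        int start = pos;
        char c = input.charAt(pos);
        Token token;
        if (Character.isWhitespace(c)) {
            while (hasNext() && Character.isWhitespace(input.charAt(pos))) pos++;
            token = Token.WS;
        } else if (Character.isDigit(c)) {
            // Sketch only: digits and dots are consumed without validating the number format.
            while (hasNext() && (Character.isDigit(input.charAt(pos)) || input.charAt(pos) == '.')) pos++;
            token = Token.FLOAT;
        } else if (Character.isLetter(c)) {
            while (hasNext() && Character.isLetter(input.charAt(pos))) pos++;
            token = input.substring(start, pos).equals("salary") ? Token.SALARY : Token.KEYWORD;
        } else if (c == '"') {
            pos++; // opening quote
            while (hasNext() && input.charAt(pos) != '"') pos++;
            if (!hasNext()) throw new IllegalStateException("unterminated string at " + start);
            pos++; // closing quote
            token = Token.STRING;
        } else if (c == '{') {
            pos++;
            token = Token.OPEN;
        } else if (c == '}') {
            pos++;
            token = Token.CLOSE;
        } else {
            throw new IllegalStateException("unexpected character at " + pos);
        }
        lexeme = input.substring(start, pos);
        return token;
    }

    /** Lists the token kinds of the whole input, whitespace included. */
    static List<Token> tokens(String text) {
        List<Token> result = new ArrayList<>();
        LexerSketch lexer = new LexerSketch(text);
        while (lexer.hasNext()) result.add(lexer.next());
        return result;
    }

    /** Concatenates all lexemes; equals the input because whitespace is a token too. */
    static String roundTrip(String text) {
        StringBuilder out = new StringBuilder();
        LexerSketch lexer = new LexerSketch(text);
        while (lexer.hasNext()) {
            lexer.next();
            out.append(lexer.getLexeme());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "employee \"Joe\" {\n  salary 2344\n}";
        System.out.println(tokens(text));
        System.out.println(roundTrip(text).equals(text));
    }
}
```

Because every character of the input ends up in exactly one lexeme, writing the lexemes back in order reproduces the input, which is the property that unparsing relies on.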

Feature:Unparsing demonstrates the use of the Recognizer to execute semantic actions (here, merely writing lexemes) during Feature:Parsing:

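import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

/**
 * For clarification, this is a precise copy and
 * only shows the idea of Unparsing (no-op copy).
 */
public class Unparsing {

    // recognizeCompany is the static method of Parsing (static import assumed)
    public static void copy(String in, String out) throws IOException {
        Recognizer recognizer = recognizeCompany(in);
        // try-with-resources closes the writer even if lexing fails
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(out))) {
            while (recognizer.hasNext()) {
                // advance; the token kind is irrelevant for a plain copy
                recognizer.next();
                // no-op: write the lexeme unchanged
                writer.write(recognizer.getLexeme());
            }
        }
    }

}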

Feature:Total and Feature:Cut are implemented using Feature:Parsing with semantic actions:

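import java.io.FileNotFoundException;

// FLOAT, SALARY and WS are Token constants (static import assumed)
public class Total {

	private double total = 0;
	
	public double getTotal() {
		return total;
	}
	
	public Total(String s) throws FileNotFoundException {
		Recognizer recognizer = new Recognizer(s);
		Token current = null;
		Token previous = null; // most recent non-whitespace token
		while (recognizer.hasNext()) {
			current = recognizer.next();
			// A number counts as a salary only if the keyword "salary" precedes it
			if (current == FLOAT && previous == SALARY) 
				total += Double.parseDouble(recognizer.getLexeme());
			// Skip whitespace so that "salary 1234" still matches
			if (current != WS)
				previous = current;
		}
	}
	
}

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// FLOAT, SALARY and WS are Token constants (static import assumed)
public class Cut {
	
	public Cut(String in, String out) throws IOException {
		Recognizer recognizer = new Recognizer(in);
		// try-with-resources closes the writer even if lexing fails
		try (Writer writer = new OutputStreamWriter(new FileOutputStream(out))) {
			Token current = null;
			Token previous = null; // most recent non-whitespace token
			String lexeme = null;
			while (recognizer.hasNext()) {

				current = recognizer.next();
				lexeme = recognizer.getLexeme();

				// Cut salary in half
				if (current == FLOAT && previous == SALARY)
					lexeme = Double.toString(Double.parseDouble(lexeme) / 2.0d);

				// Copy possibly modified lexeme
				writer.write(lexeme);

				// Skip whitespace so that "salary 1234" still matches
				if (current != WS)
					previous = current;
			}
		}
	}
}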
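The test on the previous non-whitespace token used by Total and Cut above can be tried out without any file I/O. The following sketch applies the same logic to a token list that has already been lexed. The class SalaryPassSketch, the Lexed pair type, the OTHER token and the sample data are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the "salary followed by number" test over a pre-lexed token list. */
final class SalaryPassSketch {

    // OTHER is an assumption standing in for all remaining token kinds.
    enum Token { SALARY, FLOAT, WS, OTHER }

    /** A token paired with its lexeme; this pairing is an assumption of the sketch. */
    static final class Lexed {
        final Token token;
        final String lexeme;
        Lexed(Token token, String lexeme) {
            this.token = token;
            this.lexeme = lexeme;
        }
    }

    /** Sums only those numbers whose previous non-whitespace token is SALARY. */
    static double total(List<Lexed> tokens) {
        double total = 0;
        Token previous = null;
        for (Lexed t : tokens) {
            if (t.token == Token.FLOAT && previous == Token.SALARY)
                total += Double.parseDouble(t.lexeme);
            if (t.token != Token.WS)
                previous = t.token;
        }
        return total;
    }

    /** Copies all lexemes, halving the numbers that directly follow SALARY. */
    static String cut(List<Lexed> tokens) {
        StringBuilder out = new StringBuilder();
        Token previous = null;
        for (Lexed t : tokens) {
            String lexeme = t.lexeme;
            if (t.token == Token.FLOAT && previous == Token.SALARY)
                lexeme = Double.toString(Double.parseDouble(lexeme) / 2.0d);
            out.append(lexeme);
            if (t.token != Token.WS)
                previous = t.token;
        }
        return out.toString();
    }

    /** Made-up sample: one salary and one number that is not a salary. */
    static List<Lexed> sample() {
        return Arrays.asList(
            new Lexed(Token.SALARY, "salary"), new Lexed(Token.WS, "  "),
            new Lexed(Token.FLOAT, "1234"), new Lexed(Token.WS, "\n"),
            new Lexed(Token.OTHER, "address"), new Lexed(Token.WS, " "),
            new Lexed(Token.FLOAT, "42"));
    }

    public static void main(String[] args) {
        System.out.println(total(sample()));
        System.out.println(cut(sample()));
    }
}
```

In this sample the number after "address" is neither totaled nor cut, and the whitespace lexemes pass through unchanged. Note that Double.toString renders the halved salary with a decimal point (1234 becomes 617.0), so a cut output differs from the input in number format as well as in value.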

Test cases are implemented for all Namespace:Features.

Relationships

For plain syntax checking with Technology:ANTLR see Contribution:antlrAcceptor.

For lexer-based text processing in pure Language:Java see Contribution:javaScanner.

For lexing/tokenization with Technology:ANTLR see Contribution:antlrLexer.

For a custom-made lexer in pure Language:Java see Contribution:javaLexer.

For parsing with semantic actions with Technology:ANTLR see Contribution:antlrParser.

For recursive-descent parsing in pure Language:Java see Contribution:javaParser.

For parser combinators in pure Language:Java see Contribution:javaParseLib.

For object/text mapping from text to companies with Technology:ANTLR see Contribution:antlrObjects.

For object/text mapping from text to trees with Technology:ANTLR see Contribution:antlrTrees.

Architecture

The contribution follows a standardized structure:

  • inputs contains input files for tests
  • src/main/java contains the following packages:
  • src/test/java contains the following packages:

Usage

This contribution uses Technology:Gradle for building. Technology:Eclipse is supported.

See https://github.com/101companies/101simplejava/blob/master/README.md
