Skip to main content

Lexical Analysis

tip

Tokenizer

  • Maximal match
  • Higher priority match

转移图

token nextToken(void) {
char c = getChar();
switch(c) {
case '<':
c = getChar();
switch (c) {
case '=':
return LE;
case '>':
return NE;
default:
rollback();
return LT;
}

case '=':
return EQ;
case '>':
c = getChar();
switch (c) {
case '=':
return GE;
case '<':
return NE;
default:
rollback();
return GT;
}
}
}

关键字处理

  • 根据 完美哈希算法(无冲突哈希函数) , 建立所有关键字对应的关键字完美哈希表
  • 读入有效标识符(字符串型)后, 查询关键字哈希表, 检查当前标识符是否为关键字
#define KEYWORD_MAX_LEN 10

hash_one(char *str, int len) {
unsigned int keyValue = 0;

for (int i = 0; i < len; i++) {
keyValue += str[i] * ((int)pow(33, len - i));
}

keyValue = (keyValue * 3 + 7) % KEYWORD_MAX_LEN;
return keyValue;
}
#define KEYWORD_HASH_SEED 131

hash_two(char *str, int len) {
unsigned int keyValue = 0,
hash = 0;

for (int i = 0; i < len; i++) {
hash = hash * KEYWORD_HASH_SEED + str[i];
}

keyValue = hash & 0x7fffffff;
return keyValue;
}

有限状态自动机

确定

Deterministic Finite Automaton (DFA):

  • Only a transition for a state with a input
  • No epsilon moves

M = (AlphaSet/InputSet, StateSet, currentState, FiniteStateSet, transferFunction)

A = {a, b}, SS = {0, 1, 2}, cS = 0, FS = {2},
transferFunction = {
(cS0, a) -> cS1, (cS0, b) -> cS0,
(cS1, a) -> cS2, (cS1, b) -> cS1,
(cS2, a) -> cS2, (cS2, b) -> cS2,
}

状态转移表

状态\字符ab
010
121
222

非确定

Nondeterministic Finite Automaton (NFA):

transferFunction 中的次态不确定/不唯一(为一个开集合):

  • Multiple transitions for a state with a input
  • can epsilon moves

(cS0, a) -> {cS1, cS2}

自动词法分析器

Thompson

RegExp --> NFA:

  • 直接构造基本 RegExp
  • 递归构造复合 RegExp
  • epsilon : i --epsilon--> f
  • RegExp : i --NFA(RegExp)--> f
  • 选择 : i --NFA(RegExp1)--> f, i --NFA(RegExp2)--> f
  • 连接 : i --NFA(RegExp1)--> m --NFA(RegExp2)--> f
  • 闭包 : i --epsilon--> m --epsilon--> f, m --RegExp--> m

子集构造

NFA --> DFA:

由 Thompson 算法生成的 NFA, 当且仅当输入为 epsilon 时, 次态不唯一

  • 将所有可达到次态作为一个集合 s, 视为单一次态 s
  • delta(Sigma) + epsilon-closure(深度/广度优先遍历找寻可达到次态边界)
DFA subset_construction(NFA nfa) {
s0 = eps_closure(n0);

StateSet += s0;
enqueue(s0);

while (work_queue != []) {
dequeue(s);

foreach (ch in InputSet) {
next_state = eps_closure(delta(s, ch));
Fn[s, ch] = next_state; // DFA 中的转移函数

if (next_state not in StateSet) {
StateSet += next_state;
enqueue(next_state);
}
}
}

return DFA(StateSet, Fn);
}

Hopcroft

最小化 DFA(数字逻辑中的最简状态表), 合并等价状态(等价类)

split(StateSet S) {
foreach (char ch) {
if (ch can split S) {
split S into S1, ..., Sk;
}
}
}

hopcroft(DFA) {
split all nodes into InitStateSet and FiniteStateSet (Two State Sets);

while (set is still changes) {
split(S);
}
}

DFA

有向图

转移表

  • 行: 现态
  • 列: 输入
  • 值: 次态/ERROR/-1

驱动代码: table 用于实现 switch/case, stack 用于实现最长匹配

next_token() {
state = 0;
stack = [];

while (state != ERROR) {
c = getChar();

if (state is ACCEPT/FINITE) {
clear(stack);
}

push(state);
state = table[state][c];
}

while (state is not ACCEPT/FINITE) {
state = pop();
rollback();
}
}