python - How to get a syntax tree with comments?


I'm trying to create a documentation generator for several languages. For this I need an AST, in order to know that, for instance, this comment belongs to a class and that one belongs to a method of the class.

I started by writing a simple Python script that displays the tree by walking it recursively:

    import sys
    import antlr4
    from ECMAScriptLexer import ECMAScriptLexer
    from ECMAScriptParser import ECMAScriptParser

    def handleTree(tree, lvl=0):
        # Print terminal nodes; recurse into rule nodes one level deeper.
        for child in tree.getChildren():
            if isinstance(child, antlr4.tree.Tree.TerminalNode):
                print(lvl*'│ ' + '└─', child)
            else:
                handleTree(child, lvl+1)

    input = antlr4.FileStream(sys.argv[1])
    lexer = ECMAScriptLexer(input)
    stream = antlr4.CommonTokenStream(lexer)
    parser = ECMAScriptParser(stream)
    tree = parser.program()
    handleTree(tree)

Then I tried to parse this JavaScript code with the ANTLR ECMAScript grammar:

    var a = 52; // inline comment

    function foo() {
      /** foo documentation */
      console.log('hey');
    }

This outputs:

    │ │ │ │ └─ var
    │ │ │ │ │ │ └─ a
    │ │ │ │ │ │ │ └─ =
    │ │ │ │ │ │ │ │ │ │ └─ 52
    │ │ │ │ │ └─ ;
    │ │ │ └─ function
    │ │ │ └─ foo
    │ │ │ └─ (
    │ │ │ └─ )
    │ │ │ └─ {
    │ │ │ │ │ │ │ │ │ │ │ │ └─ console
    │ │ │ │ │ │ │ │ │ │ │ └─ .
    │ │ │ │ │ │ │ │ │ │ │ │ └─ log
    │ │ │ │ │ │ │ │ │ │ │ └─ (
    │ │ │ │ │ │ │ │ │ │ │ │ │ │ └─ 'hey'
    │ │ │ │ │ │ │ │ │ │ │ └─ )
    │ │ │ │ │ │ │ │ │ └─ ;
    │ │ │ └─ }
    └─ <EOF>

All comments are ignored, because of the presence of channel(HIDDEN) in the grammar.
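To make the channel idea concrete, here is a minimal standalone sketch in plain Python. It is not ANTLR's implementation: the `Token` class, the `tokenize` function, the regex, and the channel numbers are all hypothetical and made up for illustration. The point it shows is that the lexer still produces the comment tokens; they are only tagged with a hidden channel that the parser never reads.

```python
import re
from dataclasses import dataclass

DEFAULT, HIDDEN = 0, 1  # channel numbers chosen for this sketch only

@dataclass
class Token:
    index: int
    text: str
    channel: int

# Hypothetical toy lexer: block comments, line comments, words, single punctuation.
TOKEN_RE = re.compile(r"/\*.*?\*/|//[^\n]*|\w+|[^\s\w]", re.S)

def tokenize(source):
    """Split source into tokens, putting comments on the hidden channel."""
    tokens = []
    for i, match in enumerate(TOKEN_RE.finditer(source)):
        text = match.group()
        channel = HIDDEN if text.startswith(("/*", "//")) else DEFAULT
        tokens.append(Token(i, text, channel))
    return tokens

def parser_view(tokens):
    """Return only the token texts a default-channel parser would see."""
    return [t.text for t in tokens if t.channel == DEFAULT]

tokens = tokenize("var a = 52; // inline comment")
visible = parser_view(tokens)
hidden = [t.text for t in tokens if t.channel == HIDDEN]
print(visible)
print(hidden)
```

The comment is still in `tokens`, so anything that has access to the full token list can recover it even though the tree built from `visible` cannot.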

After some googling I found this answer:

Unless you have a compelling reason to put the comment inside the parser (which I'd like to hear), you should put it in the lexer.

So, why should comments not be included in the parser, and how can I get a tree that includes comments?

So, why should comments not be included in the parser, and how can I get a tree that includes comments?

If you remove -> channel(HIDDEN) from the rule MultiLineComment

    MultiLineComment
     : '/*' .*? '*/' -> channel(HIDDEN)
     ;

then MultiLineComment tokens will end up in the parser. Each of your parser rules then needs to include these tokens wherever they are allowed.

For example, take the arrayLiteral parser rule:

    /// ArrayLiteral :
    ///     [ Elision? ]
    ///     [ ElementList ]
    ///     [ ElementList , Elision? ]
    arrayLiteral
     : '[' elementList? ','? elision? ']'
     ;

Since this is a valid array literal in JavaScript:

    [/* ... */ 1, 2 /* ... */ , 3 /* ... */ /* ... */]

it would mean you'd need to litter all your parser rules with MultiLineComment tokens like this:

    /// ArrayLiteral :
    ///     [ Elision? ]
    ///     [ ElementList ]
    ///     [ ElementList , Elision? ]
    arrayLiteral
     : '[' MultiLineComment* elementList? MultiLineComment* ','? MultiLineComment* elision? MultiLineComment* ']'
     ;

It would become one big mess.

EDIT

From the comments:

So it's not possible to generate a tree including comments with ANTLR? Are there any hacks or other libraries to do this?

And GRosenberg's answer:

ANTLR provides a convenience method for this task: BufferedTokenStream#getHiddenTokensToLeft. While walking the parse tree, access the stream to obtain the comment associated with a node, if any. Use BufferedTokenStream#getHiddenTokensToRight to get any trailing comment.
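A minimal sketch of what such a lookup does, written in plain Python over a hand-built token list. It only models the idea: `Token`, `hidden_tokens_to_left`, `hidden_tokens_to_right`, and the channel numbers are hypothetical names for illustration, not the ANTLR runtime's classes. With the real runtime you would call the methods named above on the token stream instead, giving them the index of a token taken from the node you are visiting.

```python
from dataclasses import dataclass

DEFAULT, HIDDEN = 0, 1  # channel numbers chosen for this sketch only

@dataclass
class Token:
    text: str
    channel: int

def hidden_tokens_to_left(tokens, index):
    """Collect the run of hidden tokens immediately before tokens[index]."""
    found = []
    i = index - 1
    while i >= 0 and tokens[i].channel == HIDDEN:
        found.append(tokens[i].text)
        i -= 1
    found.reverse()  # restore source order
    return found

def hidden_tokens_to_right(tokens, index):
    """Collect the run of hidden tokens immediately after tokens[index]."""
    found = []
    i = index + 1
    while i < len(tokens) and tokens[i].channel == HIDDEN:
        found.append(tokens[i].text)
        i += 1
    return found

# Hand-built stream for the start of the JavaScript sample (whitespace omitted).
texts = ["var", "a", "=", "52", ";", "// inline comment",
         "function", "foo", "(", ")", "{",
         "/** foo documentation */", "console"]
tokens = [Token(t, HIDDEN if t.startswith("/") else DEFAULT) for t in texts]

doc_comment = hidden_tokens_to_left(tokens, texts.index("console"))
trailing = hidden_tokens_to_right(tokens, texts.index(";"))
print(doc_comment)
print(trailing)
```

Note that a grammar may also send whitespace and line terminators to a hidden channel, so in practice the returned tokens may need filtering by token type before being treated as comments.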

