I'm trying to extract firms' names from sentences (millions of sentences).
For example, I have a bunch of pairs of a firm's name and a sentence, as below (in Excel files):
        column 1    column 2
row 1   firm a      sentence 1
row 2   firm b      sentence 2
row 3   firm c      sentence 3
Examples of the sentences are below:
Verizon accounted for 12.6%, 17.8%, and 17.9% of our net sales in fiscal 2010, 2009, and 2008, respectively.
SBC Communications, Inc., accounted for 11.2% of our sales in fiscal 2002.
In fiscal 2006, ATT, BellSouth, and Cingular (who combined in a merger) collectively represented approximately 14.9% of our net sales.
Sales to Krone customers represented 21.8% of our net sales in fiscal 2004.
I want to extract Verizon from e.g. 1), SBC Communications, Inc. from e.g. 2), ATT, BellSouth, and Cingular from e.g. 3), and Krone from e.g. 4).
(Hopefully, if I can also extract the year data and the % of sales accounted for by the firms, that would be best!)
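For the year and percentage part, a plain regular expression may already be enough, since those values follow a fairly regular pattern in the examples. The following is only a minimal sketch: the names `PERCENT_RE`, `YEAR_RE`, and `extract_shares_by_year` are made up for illustration, and the pairing assumes the percentages and fiscal years are listed in matching order (as in the "respectively" sentences), which will not hold for every sentence.

```python
import re

# Hypothetical patterns: a number followed by "%", and a 4-digit year (19xx/20xx).
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_shares_by_year(sentence):
    """Return (percents, years, pairs) found in one sentence.

    pairs is filled only when the number of percentages equals the
    number of years; otherwise it is left empty rather than guessed.
    """
    percents = [float(p) for p in PERCENT_RE.findall(sentence)]
    years = [int(y) for y in YEAR_RE.findall(sentence)]
    if percents and len(percents) == len(years):
        pairs = list(zip(years, percents))
    else:
        pairs = []
    return percents, years, pairs


example = ("verizon accounted 12.6%, 17.8% , 17.9% of our net sales "
           "in fiscal 2010, 2009 , 2008, respectively.")
percents, years, pairs = extract_shares_by_year(example)
print(pairs)
```

Sentences with a percentage but no year (or the other way round) come back with an empty `pairs` list, so they can be set aside and checked by hand.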
However, there are many variations in the sentences.
Some of them contain regions' names:
- Our EMEA region (Europe, Middle East, and Africa) accounted for the largest percentage of sales outside of North America and represented 20.6%, 19.0%, and 22.6% of our net sales in fiscal 2008, 2007, and 2006, respectively.
And some of them do not contain a proper noun:
- We estimate that products obtained from outsourced manufacturers accounted for approximately 19% of our net sales broadband
To achieve this goal, I'm using StanfordNERTagger
and extracting the words tagged "ORGANIZATION".
However, the recall and precision of the tagger are not quite good enough,
and the biggest problem is that it takes a long time.
I think that is because StanfordNERTagger runs through Java,
so it takes time every time I use StanfordNERTagger to analyze a single sentence.
So, my questions are:
Q1. Is there a better way to achieve this goal?
(I also used the NLTK POS tagger; however, that tagger does not provide information about the exact type of NNP (for example, organization or person), and its precision was worse.)
Q2. Is there a way to analyze all the sentences at once?
(I considered joining all the sentences into one string; however, the division between sentences (and so between the extracted firm names) would be lost, and I could not link the extracted data to the firm's name in column 1.)
Thanks for reading this long, stupid question! Thanks!
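One way Q2 might be approached: NLTK's Stanford taggers expose a `tag_sents` method that takes a list of tokenized sentences and returns one tagged list per sentence, so a whole batch goes through a single tagger call while the row order is preserved. The sketch below relies on that behaviour; `tag_rows_in_batches`, `fake_tag_sents`, and the batch size are hypothetical names and values chosen for illustration, and the stand-in tagger exists only so the example runs without Java.

```python
def tag_rows_in_batches(rows, tokenize, tag_sents, batch_size=1000):
    """Tag many (firm, sentence) rows using one tagger call per batch.

    rows      : list of (firm, sentence) pairs, e.g. read from the Excel file
    tokenize  : callable turning a sentence into a list of tokens
    tag_sents : callable taking a list of token lists and returning one
                tagged list per input (assumed to keep the input order)
    Returns a list of (firm, tagged_tokens), so each result stays linked
    to the firm name from column 1.
    """
    results = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        tokenized = [tokenize(sentence) for _, sentence in batch]
        tagged = tag_sents(tokenized)  # one tagger call for the whole batch
        if len(tagged) != len(batch):
            raise ValueError("tagger returned a different number of sentences")
        for (firm, _), tags in zip(batch, tagged):
            results.append((firm, tags))
    return results


def fake_tag_sents(token_lists):
    """Hypothetical stand-in for a real tagger's tag_sents, for demonstration."""
    known = {"verizon", "krone"}
    return [[(tok, "ORGANIZATION" if tok.lower() in known else "O")
             for tok in tokens] for tokens in token_lists]


rows = [("firm a", "verizon accounted 12.6% of our net sales"),
        ("firm b", "sales krone customers represented 21.8%")]
linked = tag_rows_in_batches(rows, str.split, fake_tag_sents, batch_size=1)
for firm, tags in linked:
    print(firm, [word for word, tag in tags if tag == "ORGANIZATION"])
```

With the real tagger, the call would presumably look like `tag_rows_in_batches(rows, nltk.word_tokenize, st.tag_sents)`, where `st` is the `StanfordNERTagger` instance. Batching keeps memory bounded for millions of rows while still avoiding a Java start-up per sentence.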
My code is below:

    import os
    import nltk
    from nltk.tag import StanfordNERTagger

    java_path = "c:/program files (x86)/java/jdk1.8.0_131/bin/java.exe"
    os.environ['JAVAHOME'] = java_path

    st = StanfordNERTagger(
        'c:/python/stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz',
        'c:/python/stanford-ner-2017-06-09/stanford-ner.jar')

    # sentence: one sentence string taken from column 2
    data2 = nltk.word_tokenize(sentence)
    tags = st.tag(data2)

    # Group consecutive ORGANIZATION tokens into one chunk
    cp = nltk.RegexpParser('ORGANIZATION: {<ORGANIZATION>+}')
    tree = cp.parse(tags)
    iob = nltk.chunk.tree2conlltags(tree)

    company = []  # extracted firm names
    a = ''        # name currently being built
    for (word, chunk, iob_tag) in iob:
        if iob_tag == "B-ORGANIZATION":
            if a:
                company.append(a)
            a = word
        elif iob_tag == "I-ORGANIZATION":
            if word == ',':
                a = a + word
            else:
                a = a + " " + word
        else:
            # Outside an organization: store the finished name once
            if a:
                company.append(a)
                a = ''
    if a:
        company.append(a)  # name that ends the sentence
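The merging of consecutive "ORGANIZATION" tokens could also be done without the chunk parser, directly on the (word, tag) pairs the tagger returns. This is just a sketch using `itertools.groupby`; the function name `group_organizations` is made up, and it assumes the tagger labels every token of a firm name with the same tag, so a comma tagged as "O" inside a name like "SBC Communications, Inc." would still split it.

```python
from itertools import groupby


def group_organizations(tagged_tokens, label="ORGANIZATION"):
    """Join each run of consecutive tokens carrying `label` into one name.

    tagged_tokens: list of (word, tag) pairs for a single sentence.
    """
    names = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == label:
            names.append(" ".join(word for word, _ in group))
    return names


tagged = [("sbc", "ORGANIZATION"), ("communications", "ORGANIZATION"),
          (",", "O"), ("accounted", "O"), ("att", "ORGANIZATION")]
print(group_organizations(tagged))
```

Because it works on one sentence's tags at a time, the result can be stored next to the firm name from column 1 for that row.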