I need to do some text processing with the NLTK module, and I'm getting the following error: AttributeError: 'tuple' object has no attribute 'isdigit'
Does anyone know how to handle this error?
Traceback (most recent call last):
  File "preprocessing-edit.py", line 36, in <module>
    postoks = nltk.tag.pos_tag(tok)
NameError: name 'tok' is not defined

PS C:\Users\moham\Desktop\Presentation> python preprocessing-edit.py
Traceback (most recent call last):
  File "preprocessing-edit.py", line 37, in <module>
    postoks = nltk.tag.pos_tag(tok)
  File "c:\python34\lib\site-packages\nltk-3.1-py3.4.egg\nltk\tag\__init__.py", line 111, in pos_tag
    return _pos_tag(tokens, tagset, tagger)
  File "c:\python34\lib\site-packages\nltk-3.1-py3.4.egg\nltk\tag\__init__.py", line 82, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "c:\python34\lib\site-packages\nltk-3.1-py3.4.egg\nltk\tag\perceptron.py", line 153, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "c:\python34\lib\site-packages\nltk-3.1-py3.4.egg\nltk\tag\perceptron.py", line 153, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "c:\python34\lib\site-packages\nltk-3.1-py3.4.egg\nltk\tag\perceptron.py", line 228, in normalize
    elif word.isdigit() and len(word) == 4:
AttributeError: 'tuple' object has no attribute 'isdigit'
import nltk

with open("SHORT-LIST.txt", "r", encoding='utf8') as myfile:
    text = myfile.read().replace('\n', '')
#text = "program managment is complicated issue for human workers"

# Used when tokenizing words
sentence_re = r'''(?x)      # set flag to allow verbose regexps
      ([A-Z])(\.[A-Z])+\.?  # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*            # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():-_`]      # these are separate tokens
'''

lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()

grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)

tok = nltk.regexp_tokenize(text, sentence_re)
postoks = nltk.tag.pos_tag(tok)
#print (postoks)

tree = chunker.parse(postoks)

from nltk.corpus import stopwords
stopwords = stopwords.words('english')

def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        yield subtree.leaves()

def normalise(word):
    """Normalises words to lowercase and stems and lemmatizes it."""
    word = word.lower()
    word = stemmer.stem_word(word)
    word = lemmatizer.lemmatize(word)
    return word

def acceptable_word(word):
    """Checks conditions for acceptable word: length, stopword."""
    accepted = bool(2 <= len(word) <= 40
                    and word.lower() not in stopwords)
    return accepted

def get_terms(tree):
    for leaf in leaves(tree):
        term = [normalise(w) for w, t in leaf if acceptable_word(w)]
        yield term

terms = get_terms(tree)

with open("results.txt", "w+") as logfile:
    for term in terms:
        for word in term:
            result = word
            logfile.write("%s\n" % str(word))
            # print (word),
# (print)
logfile.close()
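For what it's worth, the failure is reproducible with the standard `re` module alone (which, as far as I can tell, `nltk.regexp_tokenize` builds on): when a pattern contains capturing groups, `findall` returns a tuple of the group contents for each match instead of the matched string, and a tuple has no `isdigit` method. A minimal sketch using a simplified version of the `sentence_re` above:

```python
import re

# Simplified version of sentence_re above: the parentheses are
# *capturing* groups, which changes what findall returns.
pattern = r'([A-Z])(\.[A-Z])+\.?|\w+(-\w+)*'

tokens = re.findall(pattern, "U.S.A. program management")
print(tokens[0])   # ('U', '.A', '') -- a tuple, not a string

try:
    tokens[0].isdigit()
except AttributeError as err:
    print(err)     # 'tuple' object has no attribute 'isdigit'
```

This is the same AttributeError that the perceptron tagger raises when it tries to normalize each "word".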
Ramtin M. Se.. 5
Another simple approach is to take this part:
tok = nltk.regexp_tokenize(text, sentence_re)
postoks = nltk.tag.pos_tag(tok)
and replace it with NLTK's standard word tokenizer:
toks = nltk.word_tokenize(text)
postoks = nltk.tag.pos_tag(toks)
In theory, there should not be much difference in either performance or results.
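If you would rather keep the custom pattern, another option is to rewrite every group as non-capturing with `(?:...)`, so that `findall` (whose semantics `regexp_tokenize` follows, to my knowledge) yields plain strings again. A sketch of the rewritten pattern, using `re` directly:

```python
import re

# Same token pattern as in the question, but with all groups
# made non-capturing via (?: ... )
sentence_re = r'''(?x)             # verbose regexp
      (?:[A-Z])(?:\.[A-Z])+\.?    # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*                # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?          # currency, e.g. $12.40
    | \.\.\.                      # ellipsis
    | [][.,;"'?():_`-]            # these are separate tokens
'''

toks = re.findall(sentence_re, "U.S.A. program costs $12.40 ...")
print(toks)   # ['U.S.A.', 'program', 'costs', '$12.40', '...']
```

With no capturing groups in the pattern, every token is a string, so `pos_tag` and the perceptron tagger's `normalize` work as expected.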