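Say I have a string of words: 'a b c d e f'. I want to generate a list of multi-word terms from this string.

Word order matters. The term 'f e d' should not be generated from the example above.

Edit: Also, words should not be skipped. 'a c' or 'b d f' should not be generated.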
What I have right now:

    doc = 'a b c d e f'
    terms = []
    one_before = None
    two_before = None

    for word in doc.split():
        terms.append(word)
        if one_before:
            terms.append(' '.join([one_before, word]))
        if two_before:
            terms.append(' '.join([two_before, one_before, word]))
        # Shift the two-word history window forward by one word.
        two_before = one_before
        one_before = word

    for term in terms:
        print(term)
Prints:

    a
    b
    a b
    c
    b c
    a b c
    d
    c d
    b c d
    e
    d e
    c d e
    f
    e f
    d e f
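How would I make this a recursive function, so that I can pass it a variable maximum number of words per term?

Application:

I'll be using this to generate multi-word terms from the readable text in HTML documents. The overall goal is a latent semantic analysis of a large corpus (about two million documents). This is why keeping word order matters (natural language processing and whatnot).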
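This isn't recursive, but I think it does what you want.

    doc = 'a b c d e f'
    words = doc.split()
    max_words = 3  # maximum number of words per term

    for index in range(len(words)):
        for n in range(max_words):
            # Only emit terms that fit inside the remaining words.
            if index + n < len(words):
                print(' '.join(words[index:index + n + 1]))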
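Since the question asks for a configurable maximum, the same sliding-window loop can be wrapped in a function. The following is only a minimal sketch, assuming Python 3; the names `iter_terms` and `max_words` are illustrative and do not come from the original posts. It is written as a generator so that the full list of terms never has to be held in memory, which may matter on a large corpus.

```python
def iter_terms(words, max_words):
    """Yield every run of 1..max_words consecutive words, in order.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    for index in range(len(words)):
        # Terms starting at `index` cannot run past the end of the list.
        longest = min(max_words, len(words) - index)
        for n in range(1, longest + 1):
            yield ' '.join(words[index:index + n])


doc = 'a b c d e f'
terms = list(iter_terms(doc.split(), 3))
print(terms)
# ['a', 'a b', 'a b c', 'b', 'b c', 'b c d', 'c', 'c d', 'c d e',
#  'd', 'd e', 'd e f', 'e', 'e f', 'f']
```

Note that the terms come out grouped by their first word, whereas the loop in the question groups them by their last word. The set of terms is the same.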
Here is a recursive solution:

    def find_terms(words, max_words_per_term):
        if len(words) == 0:
            return []
        return [" ".join(words[:i + 1])
                for i in range(min(len(words), max_words_per_term))] \
            + find_terms(words[1:], max_words_per_term)

    doc = 'a b c d e f'
    words = doc.split()
    for term in find_terms(words, 3):
        print(term)
Here is the recursive function again, with some explaining variables and comments.

    def find_terms(words, max_words_per_term):
        # If there are no words, you've reached the end. Stop.
        if len(words) == 0:
            return []

        # What's the max term length you could generate from the remaining
        # words? It's the lesser of max_words_per_term and how many words
        # you have left.
        max_term_len = min(len(words), max_words_per_term)

        # Find all the terms that start with the first word.
        initial_terms = [" ".join(words[:i + 1]) for i in range(max_term_len)]

        # Here's the recursion. Find all of the terms in the list
        # of all but the first word.
        other_terms = find_terms(words[1:], max_words_per_term)

        # Now put the two lists of terms together to get the answer.
        return initial_terms + other_terms
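One caveat worth checking before using the recursive version on whole documents: it recurses once per word, so a document with more words than Python's recursion limit (1000 by default in CPython) will raise an error. The sketch below assumes Python 3.5 or later, where `RecursionError` exists. It repeats a compact `find_terms` so the example runs on its own, and the helper `exceeds_recursion_limit` is a hypothetical name introduced only for this demonstration.

```python
import sys


def find_terms(words, max_words_per_term):
    # Compact copy of the recursive function, repeated so this runs alone.
    if not words:
        return []
    max_term_len = min(len(words), max_words_per_term)
    initial_terms = [" ".join(words[:i + 1]) for i in range(max_term_len)]
    return initial_terms + find_terms(words[1:], max_words_per_term)


def exceeds_recursion_limit(n_words, max_words_per_term=3):
    """Return True if find_terms fails on a document of n_words words.

    Hypothetical helper, for illustration only.
    """
    try:
        find_terms(["w"] * n_words, max_words_per_term)
    except RecursionError:
        return True
    return False


# A short input works as expected.
print(find_terms("a b c d".split(), 2))
# ['a', 'a b', 'b', 'b c', 'c', 'c d', 'd']

# A document slightly longer than the current limit does not.
print(exceeds_recursion_limit(sys.getrecursionlimit() + 100))
# True
```

For long documents, a loop-based version avoids this limit entirely. Raising the limit with `sys.setrecursionlimit` is also possible, but each recursive call still copies the remaining word list via `words[1:]`.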