Searching with the whoosh module [Python 3.X]

**thais781** · 08/02/2021, 16h45

Hello,

Does anyone here know the Python module whoosh?
It lets you index files and then run searches on them.

For those who know it: I have a problem when my query contains special characters, for example when I search for the language "C++".

Code :

    iy = open_dir(my_index_path)
    with iy.searcher(weighting=scoring.Frequency) as searcher:
        query = QueryParser("content", iy.schema).parse("C++")
        print(query)

I end up with "<_NullQuery>" as the result.

Whereas for "Java", for example, it works perfectly...

Any ideas?

PS: I also tried "C\+\+", but got the same result...

Thanks in advance for your help; I am available to run tests...

Thais

**jurassic pork** · 09/02/2021, 08h23

Hello,
From what I understand, whoosh's standard analyzer uses a RegexTokenizer that discards all punctuation characters (most likely including +).
There are two ways to avoid this:

1- Use a KEYWORD field instead of the TEXT field, because it does not use the RegexTokenizer; the drawback is that you can no longer search for a phrase made of several words.
2- Use a TEXT field with a StandardAnalyzer and a custom RegexTokenizer.
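To make the tokenizer behaviour concrete, here is a rough sketch using only Python's `re` module, not whoosh itself. It assumes that whoosh's default token pattern is `\w+(\.?\w+)*` and that the standard analyzer lowercases tokens and drops those shorter than 2 characters (stop words are ignored here); check both against your whoosh version. `CUSTOM_PATTERN` and `tokenize` are hypothetical names made up for the illustration.

```python
import re

# Assumed to mirror whoosh's default token pattern; verify for your version.
DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")
# Hypothetical custom pattern that also accepts '+' and '#' inside a token.
CUSTOM_PATTERN = re.compile(r"[\w+#]+(\.?[\w+#]+)*")


def tokenize(text, pattern, minsize=2):
    """Lowercase each regex match and drop tokens shorter than minsize."""
    tokens = [m.group(0).lower() for m in pattern.finditer(text)]
    return [t for t in tokens if len(t) >= minsize]


default_tokens = tokenize("C++ or Java", DEFAULT_PATTERN)
custom_tokens = tokenize("C++ or Java", CUSTOM_PATTERN)
print(default_tokens)  # "C++" is reduced to "c", which is then dropped as too short
print(custom_tokens)   # "c++" survives as a single token
```

If these assumptions hold, this would also explain the `<_NullQuery>`: the query "C++" leaves no token at all after analysis. For the second solution, an expression of this kind could be passed to the field, for instance `TEXT(analyzer=StandardAnalyzer(expression=...))`, and the index would need to be rebuilt afterwards.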

For the first solution, if I have a schema like this one:

Code :

from whoosh.fields import ID, TEXT, Schema

# Schema definition: title (file name), path (as ID), content (indexed
# but not stored), textdata (stored text content)
schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
                content=TEXT, textdata=TEXT(stored=True))

the following search in files containing c++ returns nothing:

Code :

qp = QueryParser('textdata', schema=idx.schema)
query = qp.parse("c++")

With this schema, on the other hand:

Code :

schema = Schema(title=TEXT(stored=True),path=ID(stored=True),\
                content=TEXT,textdata=KEYWORD(stored=True))

the search does return all the files that contain c++

Regards, J.P

**thais781** · 09/02/2021, 10h59

Hello JP,

And thanks for your help.
I must be missing something in how I build my schema, since I end up with this response ;-)

Code :

textdata:C++
<Top 0 Results for Term('textdata', 'C++') runtime=6.432802183553576e-05>

In short, here is what I do (maybe my problem is in the text in red??):

Code :

# Initialization / construction
my_index_path = "/users/.....whoosh_dir/"

schema = Schema(title=TEXT(stored=True),path=ID(stored=True), content=TEXT,textdata=KEYWORD(stored=True))
ix = create_in(my_index_path, schema)
writer = ix.writer()

liste_txt_files = [filename for filename in os.listdir("/users/.....txt_files/")]
liste_txt_files = ["/users/.....txt_files/" + x for x in liste_txt_files]

for path in liste_txt_files:
    print(path)
    with open(path, 'r') as fp: writer.add_document(title=os.path.basename(path), path=path, content=fp.read(), textdata=fp.read())
writer.commit()

Code :

# Search
my_index_path = "/users/.....whoosh_dir/"
query_str = "C++"
 
my_return_string = ""
iy = open_dir(my_index_path)
with iy.searcher(weighting=scoring.Frequency) as searcher:
    qp = QueryParser('textdata', schema=iy.schema)
    query = qp.parse(query_str)    
    print(query)
    results = searcher.search(query,limit=None)
    print(results)

Thanks for your help

**jurassic pork** · 09/02/2021, 12h11

Hello,
It is up to you to find what is wrong in your code. With this one:

Code :

import os
import os.path
from whoosh import index
from whoosh import scoring
from whoosh import highlight
from whoosh.fields import ID, TEXT,KEYWORD,  Schema
from whoosh.reading import TermNotFound
from whoosh.qparser import QueryParser
 
def createSearchableData(indexdir, root):   
    '''
    Schema definition: title(name of file), path(as ID), content(indexed
    but not stored),textdata (stored text content)
    '''
    schema = Schema(title=TEXT(stored=True),path=ID(stored=True),\
              content=TEXT,textdata=KEYWORD(stored=True))
    if not os.path.exists(indexdir):
        os.mkdir(indexdir)
    # Creating a index writer to add document as per schema
    ix = index.create_in(indexdir,schema)
    writer = ix.writer()
 
    filepaths = [os.path.join(root,i) for i in os.listdir(root)]
    for path in filepaths:
        fp = open(path,'r',  encoding='utf-8')
        print(path)
        text = fp.read()
        writer.add_document(title=os.path.basename(path), path=path,\
          content=text,textdata=text)
        fp.close()
    writer.commit()
    return ix
if __name__ == '__main__':
    #init
    my_index_path = "d:/temp/whoosh_dir/"
    search_dir = "d:/temp/txt_files/"
    iy = createSearchableData(my_index_path, search_dir)
    # Search
    query_str = "C++"
 
    my_return_string = ""
   # iy = index.open_dir(my_index_path)
    with iy.searcher(weighting=scoring.Frequency) as searcher:
        qp = QueryParser('textdata', schema=iy.schema)
        query = qp.parse(query_str)    
        print(query)
        results = searcher.search(query,limit=None)
        print(results)
        for hit in results:
            print("  * {}".format(hit['title']))

I do get the expected number of files containing C++
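One difference from the earlier script is worth pointing out, and it is most likely what produced the "Top 0 Results": that script calls `fp.read()` twice on the same file object, once for `content` and once for `textdata`, whereas the code above reads the file once into `text`. A second `read()` starts at the end of the file and returns an empty string, so `textdata` would have been indexed empty. A small sketch of that behaviour, where `io.StringIO` stands in for an opened text file (an assumption made to keep the example self-contained):

```python
import io

# StringIO stands in for a file opened with open(path, 'r').
fp = io.StringIO("I write C++ and Java")

first = fp.read()   # consumes the whole stream
second = fp.read()  # the position is now at the end, nothing is left

print(repr(first))
print(repr(second))
```

Reading once into a variable and passing that variable to both fields avoids the problem.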

Regards, J.P

**thais781** · 09/02/2021, 17h16

Great, it works perfectly ;-)

Thanks a lot
