关于CHESS模型的一些学习（1）

CHESS的原理

CHESS论文：[2405.16755] CHESS: Contextual Harnessing for Efficient SQL Synthesis

CHESS 的实现过程

（1）Information Retriever (IR) - 信息检索器

从数据库模式中提取与问题相关的实体和上下文。

The Information Retriever (IR) agent aims to retrieve the relevant entities (values in the database) and context (schema descriptions provided in the database catalog). To achieve this, we present scalable and efficient methods using locality-sensitive hashing to retrieve database values from millions of rows, leveraging keyword detection, and vector databases to extract contextual information from database catalogs. Our approach utilizes both semantic and syntactic similarities between the database content and the user’s query to enhance the retrieval quality.

（2）Schema Selector (SS) - 模式选择器

进一步筛选出与问题最相关的表和列。

The goal of the Schema Selector (SS) agent is to reduce the schema size by selecting only the necessary tables and columns required for generating the SQLquery. To achieve this, the SS agent is equipped with three tools, filter column, select tables, and select columns.

（3）Candidate Generator (CG) - 候选生成器

生成多个候选 SQL 查询。

The Candidate Generator (CG) is responsible for synthesizing SQL query that answers the question asked from the database. To accomplish this, the CG agent first calls the generate candidate query tool to generate candidate queries. It then executes these candidates on the database, checks the result, and identifies any faulty queries (those containing syntactic errors or empty result). For each faulty candidate, the agent repeatedly attempts to revise the query, until the issue is resolved or a maximum number of allowed revisions is reached.

（4）Unit Tester (UT) - 单元测试器

功能：通过自然语言单元测试评估生成的 SQL 查询。

The Unit Tester (UT) agent is responsible for selecting the most accurate SQL query from the pool of candidates generated by the CG agent. UT identifies the best candidate by 1) generating multiple unit tests that highlight differences between the candidate queries and 2) evaluating the candidates against these unit tests. It then assigns a score to each query based on the number of unit tests it passes, selecting the top-scoring candidate as the final SQL query for the given question.

实例分析

以california school 0为例，在bird测试集中:

1
2
3
4
5
6
7
8


{
    "question_id": 0,
    "db_id": "california_schools",
    "question": "What is the highest eligible free rate for K-12 students in the schools in Alameda County?",
    "evidence": "Eligible free rate for K-12 = `Free Meal Count (K-12)` / `Enrollment (K-12)`",
    "SQL": "SELECT `Free Meal Count (K-12)` / `Enrollment (K-12)` FROM frpm WHERE `County Name` = 'Alameda' ORDER BY (CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) DESC LIMIT 1",
    "difficulty": "simple"
  },

“question_id"是测试集问题编号，该测试集一共有1534个问题
“db_id"是数据库名称
“evidence"是给AI的提示，用于辅助生成结果
“SQL"是标准查询语句，用于获取正确的查询结果
“difficulty"是BIRD给出的问题难度评级，“simple”<“moderate”<“challenging”

通过california_school_0.json与california_school_0.log逐级分析：

首先用户给出提示词，在log中有记录：

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149


############################## Human at step revise_sql ##############################

Objective: Your objective is to make sure a query follows the database admin instructions and use the correct conditions.

Database Schema:    
CREATE TABLE frpm
(
	CDSCode TEXT not null primary key,
	`Academic Year` TEXT null,
	`County Code` TEXT null,
	`District Code` INTEGER null,
	`School Code` TEXT null,
	`County Name` TEXT null, -- examples: `Alameda`
	`District Name` TEXT null,
	`School Name` TEXT null, -- examples: `Alameda County Community`, `Alameda High`
	`District Type` TEXT null,
	`School Type` TEXT null,
	`Educational Option Type` TEXT null,
	`NSLP Provision Status` TEXT null,
	`Charter School (Y/N)` INTEGER null,
	`Charter School Number` TEXT null,
	`Charter Funding Type` TEXT null,
	IRC INTEGER null,
	`Low Grade` TEXT null,
	`High Grade` TEXT null,
	`Enrollment (K-12)` REAL null, -- description: Enrollment (K-12)
	`Free Meal Count (K-12)` REAL null, -- description: Free Meal Count (K-12)
	`Percent (%) Eligible Free (K-12)` REAL null,
	`FRPM Count (K-12)` REAL null, -- description: Free or Reduced Price Meal Count (K-12)
	`Percent (%) Eligible FRPM (K-12)` REAL null,
	`Enrollment (Ages 5-17)` REAL null, -- description: Enrollment (Ages 5-17)
	`Free Meal Count (Ages 5-17)` REAL null, -- description: Free Meal Count (Ages 5-17)
	`Percent (%) Eligible Free (Ages 5-17)` REAL null,
	`FRPM Count (Ages 5-17)` REAL null,
	`Percent (%) Eligible FRPM (Ages 5-17)` REAL null,
	`2013-14 CALPADS Fall 1 Certification Status` INTEGER null,
	foreign key (CDSCode) references schools (CDSCode),
);

CREATE TABLE satscores
(
	cds TEXT not null primary key,
	rtype TEXT not null,
	sname TEXT null, -- examples: `Alameda High`
	dname TEXT null, -- examples: `Alameda Unified`
	cname TEXT null, -- examples: `Alameda`
	enroll12 INTEGER not null,
	NumTstTakr INTEGER not null,
	AvgScrRead INTEGER null,
	AvgScrMath INTEGER null,
	AvgScrWrite INTEGER null,
	NumGE1500 INTEGER null,
	foreign key (cds) references schools (CDSCode),
);

CREATE TABLE schools
(
	CDSCode TEXT not null primary key,
	NCESDist TEXT null,
	NCESSchool TEXT null,
	StatusType TEXT not null,
	County TEXT not null, -- examples: `Alameda`
	District TEXT not null, -- examples: `Alameda County Office of Education`
	School TEXT null, -- examples: `Alameda County Community`, `Alameda High`
	Street TEXT null,
	StreetAbr TEXT null,
	City TEXT null, -- examples: `Alameda`
	Zip TEXT null,
	State TEXT null,
	MailStreet TEXT null,
	MailStrAbr TEXT null,
	MailCity TEXT null, -- examples: `Alameda`
	MailZip TEXT null,
	MailState TEXT null,
	Phone TEXT null,
	Ext TEXT null,
	Website TEXT null,
	OpenDate DATE null, -- examples: `1997-09-01`
	ClosedDate DATE null, -- examples: `1984-07-24`
	Charter INTEGER null,
	CharterNum TEXT null,
	FundingType TEXT null,
	DOC TEXT not null,
	DOCType TEXT not null,
	SOC TEXT null,
	SOCType TEXT null,
	EdOpsCode TEXT null,
	EdOpsName TEXT null,
	EILCode TEXT null,
	EILName TEXT null,
	GSoffered TEXT null, -- examples: `K-12`
	GSserved TEXT null, -- examples: `K-12`
	Virtual TEXT null,
	Magnet INTEGER null,
	Latitude REAL null,
	Longitude REAL null,
	AdmFName1 TEXT null,
	AdmLName1 TEXT null, -- examples: `Free`
	AdmEmail1 TEXT null,
	AdmFName2 TEXT null,
	AdmLName2 TEXT null,
	AdmEmail2 TEXT null,
	AdmFName3 TEXT null,
	AdmLName3 TEXT null,
	AdmEmail3 TEXT null,
	LastUpdate DATE not null, -- examples: `2015-07-01`
);

Database admin instructions:
1. When you need to find the highest or lowest values based on a certain condition, using ORDER BY + LIMIT 1 is prefered over using MAX/MIN within sub queries.
2. If predicted query includes an ORDER BY clause to sort the results, you should only include the column(s) used for sorting in the SELECT clause if the question specifically ask for them. Otherwise, omit these columns from the SELECT.
3. If the question doesn't specify exactly which columns to select, between name column and id column, prefer to select id column.
4. Make sure you only output the information that is asked in the question. If the question asks for a specific column, make sure to only include that column in the SELECT clause, nothing more.
5. Predicted query should return all of the information asked in the question without any missing or extra information.
7. For key phrases mentioned in the question, we have provided the most similar values within the columns denoted by "-- examples" in front of the corresponding column names. This is a crucial hint indicating the correct columns to use for your SQL query.
8. No matter of how many things the question asks, you should only return one SQL query as the answer having all the information asked in the question, seperated by a comma.
9. Using || ' ' ||  to concatenate is string is banned and using that is punishable by death. Never concatenate columns in the SELECT clause.
10. If you are joining multiple tables, make sure to use alias names for the tables and use the alias names to reference the columns in the query. Use T1, T2, T3, ... as alias names.
11. If you are doing a logical operation on a column, such as mathematical operations and sorting, make sure to filter null values within those columns.
12. When ORDER BY is used, just include the column name in the ORDER BY in the SELECT clause when explicitly asked in the question. Otherwise, do not include the column name in the SELECT clause.


Value Alameda that you used in the query appears in the column City of table schools.
Value Alameda that you used in the query appears in the column MailCity of table schools.
Value Alameda that you used in the query appears in the column County of table schools.
Value Alameda that you used in the query appears in the column cname of table satscores.



Question:
What is the highest eligible free rate for K-12 students in the schools in Alameda County?

Hint:
Eligible free rate for K-12 = `Free Meal Count (K-12)` / `Enrollment (K-12)`

Predicted query:
SELECT MAX(CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) FROM frpm WHERE `County Name` = 'Alameda' 

Query result:
[(1.0,)]

Please respond with a JSON object structured as follows (if the sql query is correct, return the query as it is):

{
    "chain_of_thought_reasoning": "Your thought process on how you arrived at the solution. You don't need to explain the instructions that are satisfied.",
    "revised_SQL": "Your revised SQL query."
}

Take a deep breath and think step by step to find the correct sqlite SQL query. If you follow all the instructions and generate the correct query, I will give you 1 million dollars.

AI通过分析用户的要求，得到如下结果：

1
2
3
4
5
6


############################## AI at step revise_sql ##############################

{
    "chain_of_thought_reasoning": "To find the highest eligible free rate for K-12 students in schools located in Alameda County, we need to calculate the ratio of `Free Meal Count (K-12)` to `Enrollment (K-12)` for each school and then determine the maximum value of this ratio. The database schema indicates that these fields are located in the `frpm` table, and the county information is stored in the `County Name` column. The query filters records where `County Name` is 'Alameda' to focus on schools in Alameda County. The calculation for the eligible free rate is performed by dividing `Free Meal Count (K-12)` by `Enrollment (K-12)`, and the maximum value of this calculation is obtained using the MAX function. The query correctly handles the division by casting `Free Meal Count (K-12)` to REAL to ensure accurate division results. The query also ensures that only the maximum value of the calculated ratio is returned, adhering to the instruction to output only the information asked in the question.",
    "revised_SQL": "SELECT MAX(CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) FROM frpm WHERE `County Name` = 'Alameda'"
}

To find the highest eligible free rate for K-12 students in schools located in Alameda County, we need to calculate the ratio of Free Meal Count (K-12) to Enrollment (K-12) for each school and then determine the maximum value of this ratio.

The database schema indicates that these fields are located in the frpm table, and the county information is stored in the County Name column.

The query filters records where County Name is ‘Alameda’ to focus on schools in Alameda County.

The calculation for the eligible free rate is performed by dividing Free Meal Count (K-12) by Enrollment (K-12), and the maximum value of this calculation is obtained using the MAX function.

The query correctly handles the division by casting Free Meal Count (K-12) to REAL to ensure accurate division results.

The query also ensures that only the maximum value of the calculated ratio is returned, adhering to the instruction to output only the information asked in the question.

AI给出的结果写入到了revised_SQL中：

1

SELECT MAX(CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) FROM frpm WHERE `County Name` = 'Alameda'

同一个问题会被询问2-3次，之后程序对进行评判回答是否正确：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


############################## Admin at step Evaluation ##############################

{
    "node_type": "evaluate_sql",
    "evaluation_for": "nl_to_sql",
    "exec_res": 1,
    "exec_err": "--",
    "Question": "What is the highest eligible free rate for K-12 students in the schools in Alameda County?",
    "Evidence": "Eligible free rate for K-12 = `Free Meal Count (K-12)` / `Enrollment (K-12)`",
    "GOLD_SQL": "SELECT `Free Meal Count (K-12)` / `Enrollment (K-12)` FROM frpm WHERE `County Name` = 'Alameda' ORDER BY (CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) DESC LIMIT 1",
    "PREDICTED_SQL": "SELECT MAX(CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) FROM frpm WHERE `County Name` = 'Alameda' "
}

############################## Admin at step Evaluation ##############################

{
    "node_type": "evaluate_sql",
    "evaluation_for": "revise_sql",
    "exec_res": 1,
    "exec_err": "--",
    "Question": "What is the highest eligible free rate for K-12 students in the schools in Alameda County?",
    "Evidence": "Eligible free rate for K-12 = `Free Meal Count (K-12)` / `Enrollment (K-12)`",
    "GOLD_SQL": "SELECT `Free Meal Count (K-12)` / `Enrollment (K-12)` FROM frpm WHERE `County Name` = 'Alameda' ORDER BY (CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) DESC LIMIT 1",
    "PREDICTED_SQL": "SELECT MAX(CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) FROM frpm WHERE `County Name` = 'Alameda'"
}

其中：

“GOLD_SQL"是在BIRD数据集中出现的标准答案
“PREDICTED_SQL"是AI产生的答案
“exec_res"是判断结果，如果结果正确则返回1，错误则返回0
“exec_err"是产生错误的原因，如果结果正确则返回”–”
“evaluation_for"是状态，evaluation_for: nl_to_sql 表示当前评估的目标是检查自然语言到 SQL 转换的准确性；evaluation_for: revise_sql 表示当前评估的目标是检查对原始 SQL 查询的修正是否正确。

下面是一个出现错误但最后被修正的日志：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


############################## Admin at step Evaluation ##############################

{
    "node_type": "evaluate_sql",
    "evaluation_for": "nl_to_sql",
    "exec_res": 0,
    "exec_err": "incorrect answer",
    "Question": "Please list the lowest three eligible free rates for students aged 5-17 in continuation schools.",
    "Evidence": "Eligible free rates for students aged 5-17 = `Free Meal Count (Ages 5-17)` / `Enrollment (Ages 5-17)`",
    "GOLD_SQL": "SELECT `Free Meal Count (Ages 5-17)` / `Enrollment (Ages 5-17)` FROM frpm WHERE `Educational Option Type` = 'Continuation School' AND `Free Meal Count (Ages 5-17)` / `Enrollment (Ages 5-17)` IS NOT NULL ORDER BY `Free Meal Count (Ages 5-17)` / `Enrollment (Ages 5-17)` ASC LIMIT 3",
    "PREDICTED_SQL": "SELECT T2.`Free Meal Count (Ages 5-17)` / T2.`Enrollment (Ages 5-17)` FROM schools AS T1 INNER JOIN frpm AS T2 ON T1.CDSCode = T2.CDSCode WHERE T2.`Educational Option Type` = 'Continuation School' AND T2.`Enrollment (Ages 5-17)` > 0 ORDER BY T2.`Free Meal Count (Ages 5-17)` / T2.`Enrollment (Ages 5-17)` ASC LIMIT 3 "
}

############################## Admin at step Evaluation ##############################

{
    "node_type": "evaluate_sql",
    "evaluation_for": "revise_sql",
    "exec_res": 1,
    "exec_err": "--",
    "Question": "Please list the lowest three eligible free rates for students aged 5-17 in continuation schools.",
    "Evidence": "Eligible free rates for students aged 5-17 = `Free Meal Count (Ages 5-17)` / `Enrollment (Ages 5-17)`",
    "GOLD_SQL": "SELECT `Free Meal Count (Ages 5-17)` / `Enrollment (Ages 5-17)` FROM frpm WHERE `Educational Option Type` = 'Continuation School' AND `Free Meal Count (Ages 5-17)` / `Enrollment (Ages 5-17)` IS NOT NULL ORDER BY `Free Meal Count (Ages 5-17)` / `Enrollment (Ages 5-17)` ASC LIMIT 3",
    "PREDICTED_SQL": "SELECT T2.`Free Meal Count (Ages 5-17)` / T2.`Enrollment (Ages 5-17)` AS EligibleFreeRate FROM schools AS T1 INNER JOIN frpm AS T2 ON T1.CDSCode = T2.CDSCode WHERE T1.EdOpsName = 'Continuation School' AND T2.`Enrollment (Ages 5-17)` > 0 AND T2.`Free Meal Count (Ages 5-17)` IS NOT NULL AND T2.`Enrollment (Ages 5-17)` IS NOT NULL ORDER BY EligibleFreeRate ASC LIMIT 3"
}

在所给的error_logs中，出错的回答日志被筛选出，可以通过检索**chain_of_thought_reasoning**查询思维链从而尝试找出失误的原因。

此外，论文从29页开始，给出了一些错误回答与错误原因：

failed_summary.json

CHESS项目给出了他们在运行BIRD测试集时产生的中间结果，存放在CHESS/results/chess_on_dev.zip at chess-v1 · ShayanTalaei/CHESS，经过合并，所有产生失误的数据被单独筛选出来，放入failed_summary.json中。

为了方便查找与定位，所有产生失误的数据都被添加了单独的标签failure_annotations，可以快速查找标签以定位问题产生的原因。

如：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


                "missing_table_status": "success",
                "missing_tables": [],
                "missing_column_status": "missing_column",
                "missing_columns": [
                    "'schools'.'county'"
                ],
                "correct_columns": {
                    "frpm": [
                        "School Name",
                        "cdscode",
                        "High Grade",
                        "Low Grade"
                    ],
                    "schools": [
                        "cdscode",
                        "county"
                    ]
                },
                "failure_annotations": [
                    "Field 'missing_column_status' failed with value: missing_column"
                ]
            },

产生了"missing_column"问题

又如：

1
2
3
4
5
6
7
8


            {
                "node_type": "nl_to_sql",
                "status": "error",
                "error": "<class 'TypeError'>: <'NoneType' object is not subscriptable>",
                "failure_annotations": [
                    "Field 'status' failed with value: error"
                ]
            },

在"nl_to_sql"环节发生错误。